Browsing by Subject "Statistics"
Item Open Access A Bayesian Dirichlet-Multinomial Test for Cross-Group Differences (2016), Chen, Yuhan. Testing for differences within data sets is an important issue across various applications. Our work is primarily motivated by the analysis of microbiome composition, which has become increasingly relevant and important with the rise of DNA sequencing. We first review classical frequentist tests that are commonly used in tackling such problems. We then propose a Bayesian Dirichlet-multinomial framework for modeling metagenomic data and for testing underlying differences between the samples. A parametric Dirichlet-multinomial model uses an intuitive hierarchical structure that allows for flexibility in characterizing both the within-group variation and the cross-group difference and provides very interpretable parameters. A computational method for evaluating the marginal likelihoods under the null and alternative hypotheses is also given. Through simulations, we show that our Bayesian model performs competitively against frequentist counterparts. We illustrate the method by analyzing metagenomic applications using Human Microbiome Project data.
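A minimal sketch of the kind of Dirichlet-multinomial Bayes factor described in the abstract above, reduced to two pooled count vectors with a fixed symmetric Dirichlet prior (the thesis uses a richer hierarchical model); the function names and example counts are hypothetical:

```python
import numpy as np
from scipy.special import gammaln

def log_multivariate_beta(a):
    """log B(a) = sum(log Gamma(a_k)) - log Gamma(sum(a))."""
    return np.sum(gammaln(a)) - gammaln(np.sum(a))

def log_bayes_factor(x1, x2, alpha=1.0):
    """log BF_10 comparing 'two separate compositions' (H1) against
    'one shared composition' (H0) for two vectors of category counts,
    under symmetric Dirichlet(alpha) priors."""
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    a = np.full_like(x1, alpha)
    log_m1 = (log_multivariate_beta(a + x1) + log_multivariate_beta(a + x2)
              - 2 * log_multivariate_beta(a))
    log_m0 = log_multivariate_beta(a + x1 + x2) - log_multivariate_beta(a)
    return log_m1 - log_m0

# Hypothetical taxon counts from two groups of samples, pooled within group
group1 = [120, 45, 30, 5]
group2 = [60, 90, 35, 15]
print(log_bayes_factor(group1, group2))  # > 0 favours a cross-group difference
```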
Item Open Access A Bayesian Forward Simulation Approach to Establishing a Realistic Prior Model for Complex Geometrical Objects (2018), Wang, Yizheng. Geology is a descriptive science, which makes quantification difficult. We develop a Bayesian forward simulation approach to formulate a realistic prior model for geological images using the approximate Bayesian computation (ABC) method. In other words, our approach aims to select a set of representative images from a larger list of complex geometrical objects and provide a probability distribution on it. This allows geologists to start contributing their perspectives to the specification of a realistic prior model. We examine the proposed ABC approach on an experimental Delta dataset and show that, on the basis of selected representative images, the nature of the variability of the Delta can be statistically reproduced by means of IQSIM, a state-of-the-art multiple-point geostatistical (MPS) simulation algorithm. The results demonstrate that the proposed approach may have a broader spectrum of application. In addition, two different choices for the size of the prior, i.e., the number of representative images, are compared and discussed.
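A minimal sketch of the rejection-style ABC loop that underlies this kind of approach, with a toy simulator and summary statistic standing in for the geological image simulation; the simulator, summary, and tolerance here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(theta, n=200):
    """Toy forward simulator standing in for a geological image simulator."""
    return rng.gamma(shape=theta, scale=1.0, size=n)

def summary(x):
    """Low-dimensional summary of a simulated (or observed) dataset."""
    return np.array([x.mean(), x.std()])

def abc_rejection(observed, prior_draw, n_sim=5000, tol=0.2):
    """Keep prior draws whose simulated summaries land within `tol`
    of the observed summary; the kept draws approximate the posterior."""
    s_obs = summary(observed)
    kept = []
    for _ in range(n_sim):
        theta = prior_draw()
        s_sim = summary(simulate(theta))
        if np.linalg.norm(s_sim - s_obs) / np.linalg.norm(s_obs) < tol:
            kept.append(theta)
    return np.array(kept)

observed = simulate(theta=3.0)  # pretend this is the field data
posterior = abc_rejection(observed, lambda: rng.uniform(0.5, 10.0))
print(len(posterior), posterior.mean() if len(posterior) else None)
```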
Item Open Access A Bayesian Model for Nucleosome Positioning Using DNase-seq Data (2015), Zhong, Jianling. As fundamental structural units of the chromatin, nucleosomes are involved in virtually all aspects of genome function. Different methods have been developed to map genome-wide nucleosome positions, including MNase-seq and a recent chemical method requiring genetically engineered cells. However, these methods are either low resolution and prone to enzymatic sequence bias or require genetically modified cells. The DNase I enzyme has been used to probe nucleosome structure since the 1960s, but in the current high throughput sequencing era, DNase-seq has mainly been used to study regulatory sequences known as DNase hypersensitive sites. This thesis shows that DNase-seq data is also very informative about nucleosome positioning. The distinctive oscillatory DNase I cutting patterns on nucleosomal DNA are shown and discussed. Based on these patterns, a Bayes factor is proposed for distinguishing nucleosomal and non-nucleosomal genome positions. The results show that this approach is highly sensitive and specific. A Bayesian method that simulates the data generation process and can provide more interpretable results is further developed based on the Bayes factor investigations. Preliminary results on a test genomic region show that the Bayesian model works well in identifying nucleosome positioning. Estimated posterior distributions also agree with some known biological observations from external data. Taken together, methods developed in this thesis show that DNase-seq can be used to identify nucleosome positioning, adding great value to this widely utilized protocol.
Item Open Access A Bayesian Strategy to the 20 Question Game with Applications to Recommender Systems (2017), Suresh, Sunith Raj. In this paper, we develop an algorithm that utilizes a Bayesian strategy to determine a sequence of questions to play the 20 Question game. The algorithm is motivated by an application to active recommender systems. We first develop an algorithm that constructs a sequence of questions where each question inquires only about a single binary feature. We test the performance of the algorithm using simulation studies and find that it performs relatively well under an informed prior. We then modify the algorithm to construct a sequence of questions where each question inquires about two binary features via AND conjunction. We test the performance of the modified algorithm via simulation studies and find that it does not significantly improve performance.
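A minimal sketch of one common Bayesian question-selection rule for this kind of game: maintain a prior over candidate items and ask about the binary feature with the largest expected entropy reduction. The items, features, and noise-free answers here are hypothetical, and the thesis's strategy may differ in its selection criterion:

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def pick_question(prior, features):
    """Pick the binary feature whose (noise-free) answer gives the largest
    expected reduction in entropy over the candidate items."""
    best, best_gain = None, -np.inf
    for j in range(features.shape[1]):
        p_yes = np.sum(prior * features[:, j])
        if p_yes in (0.0, 1.0):
            continue  # answer is already determined, no information gain
        post_yes = prior * features[:, j] / p_yes
        post_no = prior * (1 - features[:, j]) / (1 - p_yes)
        gain = entropy(prior) - (p_yes * entropy(post_yes) + (1 - p_yes) * entropy(post_no))
        if gain > best_gain:
            best, best_gain = j, gain
    return best

# Hypothetical setup: 4 items, 3 binary features, informed prior over items
features = np.array([[1, 0, 1],
                     [1, 1, 0],
                     [0, 1, 1],
                     [0, 0, 0]], dtype=float)
prior = np.array([0.4, 0.3, 0.2, 0.1])
j = pick_question(prior, features)
print("ask about feature", j)
# After observing answer "yes", the posterior is prior * features[:, j], renormalised.
```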
Item Open Access A Comparison Of Multiple Imputation Methods For Categorical Data (2015), Akande, Olanrewaju Michael. This thesis evaluates the performance of several multiple imputation methods for categorical data, including multiple imputation by chained equations using generalized linear models, multiple imputation by chained equations using classification and regression trees, and non-parametric Bayesian multiple imputation for categorical data (using the Dirichlet process mixture of products of multinomial distributions model). The performance of each method is evaluated with repeated sampling studies using housing unit data from the American Community Survey 2012. These data afford exploration of practical problems such as multicollinearity and large dimensions. This thesis highlights some advantages and limitations of each method compared to others. Finally, it provides suggestions on which method should be preferred, and conditions under which the suggestions hold.
Item Open Access A Comparison of Serial & Parallel Particle Filters for Time Series Analysis (2014), Klemish, David. This paper discusses the application of parallel programming techniques to the estimation of hidden Markov models via the use of a particle filter. It highlights how the Thrust parallel programming library can be used to implement a particle filter in parallel. The impact of a parallel particle filter on the running times of three different models is investigated. For particle filters using a large number of particles, Thrust provides a speed-up of five to ten times over a serial C++ implementation, which is less than reported in other research.
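For reference, a minimal serial bootstrap particle filter for a toy state-space model; the propagate/weight/resample loops over particles are the parts a Thrust (GPU/C++) implementation would parallelize. The model, parameter values, and particle count are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_particle_filter(y, n_particles=10_000, phi=0.9, sigma_x=1.0, sigma_y=0.5):
    """Serial bootstrap particle filter for a toy AR(1)-plus-noise model:
    x_t = phi * x_{t-1} + N(0, sigma_x^2),  y_t = x_t + N(0, sigma_y^2)."""
    x = rng.normal(0.0, sigma_x, n_particles)
    loglik, means = 0.0, []
    for yt in y:
        x = phi * x + rng.normal(0.0, sigma_x, n_particles)        # propagate
        logw = -0.5 * ((yt - x) / sigma_y) ** 2                    # weight (log, unnormalised)
        w = np.exp(logw - logw.max())
        loglik += np.log(w.mean()) + logw.max() - 0.5 * np.log(2 * np.pi * sigma_y**2)
        w /= w.sum()
        means.append(np.sum(w * x))                                # filtered mean
        x = x[rng.choice(n_particles, n_particles, p=w)]           # resample
    return np.array(means), loglik

# Simulate data from the same toy model and run the filter
T, phi = 100, 0.9
x_true = np.zeros(T)
for t in range(1, T):
    x_true[t] = phi * x_true[t - 1] + rng.normal()
y = x_true + rng.normal(0.0, 0.5, T)
means, loglik = bootstrap_particle_filter(y)
print(loglik)
```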
Item Open Access A Comparison of Strategies for Generating Synthetic Data for Complex Survey (2024), Chen, Min. Synthetic data generation is a method for protecting data privacy. In the context of disseminating confidential data for public use, some statistical agencies generate fully synthetic datasets; this practice is applied to census and administrative records. Importantly, many research datasets come from surveys with complex sampling designs, which cannot be ignored when constructing synthetic data. This thesis presents a comparison of three synthetic data strategies, each with a different procedure for generating the synthetic data. Two are based on bootstrap methods, one the Bayesian bootstrap and the other the regular bootstrap; the third is based on posterior inference with a pseudo-likelihood. Using simulation studies with probability proportional to size sampling, we show that all three methods can produce accurate estimates of the mean of a finite population. However, when estimating the variance of the sampling statistic, only the Bayesian bootstrap method provides an approximately unbiased estimate in these simulations.
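A minimal sketch of the Bayesian bootstrap idea for a survey-weighted mean: each posterior draw re-weights the sampled units with Dirichlet(1, ..., 1) weights and combines them with the design weights. This is an illustration under simplifying assumptions, not the exact synthesis strategies compared in the thesis; the data, weights, and draw count are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

def bayesian_bootstrap_mean(y, w_design, n_draws=2000):
    """Posterior draws for a finite-population mean under the Bayesian bootstrap,
    combining Dirichlet(1, ..., 1) weights with design (inverse-inclusion) weights."""
    n = len(y)
    draws = np.empty(n_draws)
    for b in range(n_draws):
        g = rng.dirichlet(np.ones(n))          # Bayesian bootstrap weights
        w = g * w_design
        draws[b] = np.sum(w * y) / np.sum(w)   # weighted (Hajek-style) mean
    return draws

# Hypothetical sample where larger units were over-sampled, so design weights differ
y = rng.gamma(shape=2.0, scale=10.0, size=200)
incl_prob = np.clip(y / y.max(), 0.05, 1.0)    # inclusion roughly proportional to size
w_design = 1.0 / incl_prob
draws = bayesian_bootstrap_mean(y, w_design)
print(draws.mean(), np.quantile(draws, [0.025, 0.975]))
```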
Item Open Access A COST-SENSITIVE, SEMI-SUPERVISED, AND ACTIVE LEARNING APPROACH FOR PRIORITY OUTLIER INVESTIGATION (2023), Song, Xinran. This master's thesis presents a novel approach to address the problem of balancing the cost of investigating suspected cases with the potential gain of detecting an outlier, particularly in the context of fraud detection. The proposed approach is a cost-sensitive, semi-supervised, and active-learning priority outlier investigation model, which aims to identify the top-k unlabeled cases that maximize the overall expected gain.
The proposed approach is developed based on a comprehensive review of related work on cost-sensitive and active learning in outlier detection. We formulate the problem as a maximization problem that utilizes kernel density estimation to estimate the probability that each unlabeled case is an outlier. To improve the model's accuracy and efficiency, we employ a graph representation that takes into account the similarities and relationships among cases. Furthermore, we utilize the neighborhood of cases for efficient kernel density estimation. The performance of the proposed approach is evaluated using both synthetic data and a real-world credit card fraud detection dataset.
The contributions of this thesis include the development of effective and efficient outlier investigation strategies with practical applications in various domains, particularly in the context of fraud detection. The proposed approach offers a promising solution to the challenge of balancing the cost of investigating suspected cases with the potential gain of detecting an outlier.
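A minimal sketch of the kind of expected-gain ranking described above: estimate the probability that each unlabeled case is an outlier from kernel density estimates fit to labeled outliers and labeled normals, then rank cases by expected net gain of investigation. The gain, cost, bandwidth, and toy data are hypothetical, and the thesis's graph-based and neighborhood refinements are omitted:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def expected_gain_topk(X_unlabeled, X_outlier, X_normal, k,
                       gain=100.0, cost=5.0, bandwidth=0.5):
    """Rank unlabeled cases by expected net gain of investigation:
    P(outlier | x) * gain - cost, with P(outlier | x) from two kernel
    density estimates fit to the labeled outlier and normal cases."""
    kde_out = KernelDensity(bandwidth=bandwidth).fit(X_outlier)
    kde_norm = KernelDensity(bandwidth=bandwidth).fit(X_normal)
    log_f_out = kde_out.score_samples(X_unlabeled)
    log_f_norm = kde_norm.score_samples(X_unlabeled)
    prior_out = len(X_outlier) / (len(X_outlier) + len(X_normal))
    # Bayes rule on the two class-conditional densities
    post_out = 1.0 / (1.0 + np.exp(log_f_norm - log_f_out) * (1 - prior_out) / prior_out)
    net_gain = post_out * gain - cost
    return np.argsort(-net_gain)[:k], net_gain

# Hypothetical 2-D toy data: a few labeled frauds far from the bulk of normals
rng = np.random.default_rng(2)
X_normal = rng.normal(0, 1, size=(500, 2))
X_outlier = rng.normal(4, 1, size=(10, 2))
X_unlabeled = rng.normal(0, 2, size=(200, 2))
idx, score = expected_gain_topk(X_unlabeled, X_outlier, X_normal, k=10)
print(idx)
```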
Item Open Access A Data-Retaining Framework for Tail Estimation (2020), Cunningham, Erika. Modeling of extreme data often involves thresholding, or retaining only the most extreme observations, in order that the tail may "speak" and not be overwhelmed by the bulk of the data. We describe a transformation-based framework that allows univariate density estimation to smoothly transition from a flexible, semi-parametric estimation of the bulk into a parametric estimation of the tail without thresholding. In the limit, this framework has desirable theoretical tail-matching properties to the selected parametric distribution. We develop three Bayesian models under the framework: one using a logistic Gaussian process (LGP) approach; one using a Dirichlet process mixture model (DPMM); and one using a predictive recursion approximation of the DPMM. Models produce estimates and intervals for density, distribution, and quantile functions across the full data range and for the tail index (inverse-power-decay parameter), under an assumption of heavy tails. For each approach, we carry out a simulation study to explore the model's practical usage in non-asymptotic settings, comparing its performance to methods that involve thresholding.
Among the three models proposed, the LGP has lowest bias through the bulk and highest quantile interval coverage generally. Compared to thresholding methods, its tail predictions have lower root mean squared error (RMSE) in all scenarios but the most complicated, e.g. a sharp bulk-to-tail transition. The LGP's consistent underestimation of the tail index does not hinder tail estimation in pre-extrapolation to moderate-extrapolation regions but does affect extreme extrapolations.
An interplay between the parametric transform and the natural sparsity of the DPMM sometimes causes the DPMM to favor estimation of the bulk over estimation of the tail. This can be overcome by increasing prior precision on less sparse (flatter) base-measure density shapes. A finite mixture model (FMM), substituted for the DPMM in simulation, proves effective at reducing tail RMSE over thresholding methods in some, but not all, scenarios and quantile levels.
The predictive recursion marginal posterior (PRMP) model is fast and does the best job among proposed models of estimating the tail-index parameter. This allows it to reduce RMSE in extrapolation over thresholding methods in most scenarios considered. However, bias from the predictive recursion contaminates the tail, casting doubt on the PRMP's predictions in tail regions where data should still inform estimation. We recommend the PRMP model as a quick tool for visualizing the marginal posterior over transformation parameters, which can aid in diagnosing multimodality and informing the precision needed to overcome sparsity in the mixture model approach.
In summary, there is not enough information in the likelihood alone to prevent the bulk from overwhelming the tail. However, a model that harnesses the likelihood with a carefully specified prior can allow both the bulk and tail to speak without an explicit separation of the two. Moreover, retaining all of the data under this framework reduces quantile variability, improving prediction in the tails compared to methods that threshold.
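For context, a minimal sketch of the thresholding baseline that the framework above is compared against: discard all but the largest observations and fit a generalized Pareto distribution to the exceedances to estimate the tail index and extrapolate extreme quantiles. The data, threshold choice, and target quantile are hypothetical:

```python
import numpy as np
from scipy.stats import genpareto, pareto

rng = np.random.default_rng(3)

# Heavy-tailed sample with true tail index xi = 0.5 (Pareto shape 2)
x = pareto.rvs(b=2.0, size=5000, random_state=rng)

# Thresholding baseline: keep only the top 5% and fit a GPD to the exceedances
u = np.quantile(x, 0.95)
exc = x[x > u] - u
xi, loc, sigma = genpareto.fit(exc, floc=0.0)   # xi is the tail-index (shape) estimate
print("estimated xi:", xi, "(true 0.5)")

# Extrapolated 0.999 quantile from the fitted tail
p_exc = np.mean(x > u)
q = u + genpareto.ppf(1 - 0.001 / p_exc, xi, scale=sigma)
print("estimated 0.999 quantile:", q)
```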
Item Open Access A Differentially Private Bayesian Approach to Replication Analysis (2022), Yang, Chengxin. Replication analysis is widely used in many fields of study. Once a study is published, other researchers conduct analyses to assess the reliability of the published research. However, what if the data are confidential? In particular, if the data sets used for the studies are confidential, we cannot release the results of replication analyses to any entity without permission to access the data sets; otherwise it may result in privacy leakage, especially when the published study and replication studies use similar or common data sets. In this paper, we present two methods for replication analysis. We illustrate the properties of our methods by a combination of theoretical analysis and simulation.
Item Open Access A Geometric Approach for Inference on Graphical Models (2009), Lunagomez, Simon. We formulate a novel approach to infer conditional independence models or Markov structure of a multivariate distribution. Specifically, our objective is to place informative prior distributions over graphs (decomposable and unrestricted) and sample efficiently from the induced posterior distribution. We also explore the idea of factorizing according to complete sets of a graph, which implies working with a hypergraph that cannot be retrieved from the graph alone. The key idea we develop in this paper is a parametrization of hypergraphs using the geometry of points in R^m. This induces informative priors on graphs from specified priors on finite sets of points. Constructing hypergraphs from finite point sets has been well studied in the fields of computational topology and random geometric graphs. We develop the framework underlying this idea and illustrate its efficacy using simulations.
Item Open Access A High-Tech Solution for the Low Resource Setting: A Tool to Support Decision Making for Patients with Traumatic Brain Injury (2019), Elahi, Cyrus. Background. The confluence of a capacity-exceeding disease burden and persistent resource shortages has resulted in traumatic brain injury's (TBI) devastating impact in low and middle income countries (LMIC). Lifesaving care for TBI depends on accurate and timely decision making within the hospital. As a result of shortages in technology and highly skilled providers, treatment delays are common in low resource settings. This reality demands a low cost, scalable, and accurate alternative to support decision making. Decision support tools leveraging the accuracy of modern prognostic modeling techniques represent one possible solution. This thesis is a collation of research dedicated to the advancement of TBI decision support technology in low resource settings. Methods. The study locations included three national and referral hospitals in Uganda and Tanzania. We performed a survival analysis, externally validated existing TBI prognostic models, developed our own prognostic model, and performed a feasibility study for TBI decision support tools in an LMIC. Results. The survival analysis revealed a greater surgical benefit for mild and moderate head injuries compared to severe injuries. However, severe injury patients experienced a higher surgery rate than mild and moderate injury patients. We developed a prognostic model using machine learning with a good level of accuracy. This model outperformed existing TBI models in terms of discrimination but not calibration. Our feasibility study captured the need for improved prognostication of TBI patients in the hospital. Conclusions. This pioneering work has provided a foundation for further investigation and implementation of TBI decision support technologies in low resource settings.
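For context on the validation language in the TBI abstract above, a minimal sketch of how discrimination (AUC) and calibration (calibration slope and intercept) are commonly computed for a prognostic model's predicted risks; the predictions and outcomes below are simulated stand-ins, not the study's registry data:

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)

# Simulated predicted risks and observed binary outcomes (deliberately miscalibrated)
p_hat = rng.uniform(0.05, 0.95, size=1000)
outcome = rng.binomial(1, p_hat ** 1.3)

# Discrimination: ability to rank events above non-events
auc = roc_auc_score(outcome, p_hat)

# Calibration: regress the outcome on the logit of the predicted risk;
# a well-calibrated model has slope close to 1 and intercept close to 0
logit = np.log(p_hat / (1 - p_hat)).reshape(-1, 1)
cal = LogisticRegression(C=1e6).fit(logit, outcome)  # large C: essentially no penalty
print(f"AUC={auc:.3f}, calibration slope={cal.coef_[0][0]:.2f}, "
      f"intercept={cal.intercept_[0]:.2f}")
```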
Item Open Access A Privacy Preserving Algorithm to Release Sparse High-dimensional Histograms (2017), Li, Bai. Differential privacy (DP) aims to design methods and algorithms that satisfy rigorous notions of privacy while simultaneously providing utility with valid statistical inference. More recently, an emphasis has been placed on combining notions of statistical utility with algorithmic approaches to address privacy risk in the presence of big data, with differential privacy emerging as a rigorous notion of risk. While DP provides strong guarantees for privacy, there are often tradeoffs regarding data utility and computational scalability. In this paper, we introduce a categorical data synthesizer that releases high-dimensional sparse histograms, illustrating its ability to overcome limitations of data synthesizers in the current literature. Specifically, we combine a differential privacy algorithm, the stability-based algorithm, with feature hashing, which allows for dimension reduction of the histograms, and with Gibbs sampling. As a result, our proposed algorithm is differentially private, offers similar or better statistical utility, and is scalable to large databases. In addition, we give an analytical result for the error caused by the stability-based algorithm, which allows us to control the loss of utility. Finally, we study the behavior of our algorithm on both simulated and real data.
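A minimal sketch of one common form of the stability-based histogram release: add Laplace noise only to the non-empty cells and suppress any cell whose noisy count falls below a data-independent threshold, giving (epsilon, delta)-DP for sparse histograms. The threshold constant, records, and parameters below are illustrative assumptions, and the thesis's combination with feature hashing and Gibbs sampling is not shown:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(4)

def stability_histogram(records, epsilon, delta):
    """(epsilon, delta)-DP release of a sparse histogram via a stability-based
    mechanism: perturb only non-empty cells with Laplace(2/epsilon) noise and
    suppress any cell whose noisy count falls below a fixed threshold."""
    threshold = 1.0 + 2.0 * np.log(2.0 / delta) / epsilon
    counts = Counter(records)
    released = {}
    for cell, c in counts.items():
        noisy = c + rng.laplace(scale=2.0 / epsilon)
        if noisy > threshold:
            released[cell] = noisy
    return released

# Hypothetical high-dimensional categorical records, flattened to tuples (cells)
records = [tuple(rng.integers(0, 3, size=5)) for _ in range(10_000)]
print(len(stability_histogram(records, epsilon=1.0, delta=1e-6)))
```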
Item Open Access A Tapered Pareto-Poisson Model for Extreme Pyroclastic Flows: Application to the Quantification of Volcano Hazards (2015), Dai, Fan. This paper discusses parameter estimation in a proposed tapered Pareto-Poisson model for the assessment of large pyroclastic flows, which are essential in quantifying the size and risk of volcanic hazards. For the tapered Pareto distribution, the paper applies both maximum likelihood estimation and a Bayesian framework with objective priors and the Metropolis algorithm. The techniques are illustrated by an example of modeling extreme flow volumes at Soufriere Hills Volcano, and the corresponding simulation results are presented.
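A minimal sketch of maximum likelihood estimation for the tapered Pareto distribution with survival function S(x) = (x_min/x)^beta * exp((x_min - x)/theta) for x >= x_min, the first of the two estimation strategies mentioned above; the simulated data and starting values are hypothetical stand-ins for the flow-volume record:

```python
import numpy as np
from scipy.optimize import minimize

def neg_loglik(params, x, x_min):
    """Negative log-likelihood of the tapered Pareto distribution with lower
    cutoff x_min, shape beta, and taper (corner) parameter theta, whose density
    is f(x) = (beta/x + 1/theta) * (x_min/x)^beta * exp((x_min - x)/theta)."""
    beta, theta = params
    if beta <= 0 or theta <= 0:
        return np.inf
    logf = np.log(beta / x + 1.0 / theta) + beta * np.log(x_min / x) + (x_min - x) / theta
    return -np.sum(logf)

# Hypothetical flow-volume data above a known cutoff (not the Soufriere record)
rng = np.random.default_rng(5)
x_min = 1.0
x = x_min * (1.0 + rng.pareto(1.2, size=300))
res = minimize(neg_loglik, x0=[1.0, 50.0], args=(x, x_min), method="Nelder-Mead")
beta_hat, theta_hat = res.x
print(beta_hat, theta_hat)
```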
Item Open Access A Theory of Statistical Inference for Ensuring the Robustness of Scientific Results (2018), Coker, Beau. Inference is the process of using facts we know to learn about facts we do not know. A theory of inference gives assumptions necessary to get from the former to the latter, along with a definition for and summary of the resulting uncertainty. Any one theory of inference is neither right nor wrong, but merely an axiom that may or may not be useful. Each of the many diverse theories of inference can be valuable for certain applications. However, no existing theory of inference addresses the tendency to choose, from the range of plausible data analysis specifications consistent with prior evidence, those that inadvertently favor one's own hypotheses. Since the biases from these choices are a growing concern across scientific fields, and in a sense the reason the scientific community was invented in the first place, we introduce a new theory of inference designed to address this critical problem. From this theory, we derive "hacking intervals," which are the range of summary statistics one may obtain given a class of possible endogenous manipulations of the data. They make no appeal to hypothetical data sets drawn from imaginary superpopulations. A scientific result with a small hacking interval is more robust to researcher manipulation than one with a larger interval, and is often easier to interpret than a classic confidence interval. Hacking intervals turn out to be equivalent to classical confidence intervals under the linear regression model, and are equivalent to profile likelihood confidence intervals under certain other conditions, which means they may sometimes provide a more intuitive and potentially more useful interpretation of classical intervals.
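As a rough illustration of the "range of a summary statistic over a class of analysis choices" idea, a minimal sketch that computes the range of a regression coefficient over every subset of optional control covariates; the data, the hacking class, and the variable names are hypothetical, and the thesis's formal hacking intervals cover richer manipulation classes:

```python
from itertools import combinations
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical data: outcome y, exposure of interest x, three optional controls
n = 500
Z = rng.normal(size=(n, 3))
x = 0.5 * Z[:, 0] + rng.normal(size=n)
y = 1.0 * x + 0.8 * Z[:, 0] - 0.3 * Z[:, 1] + rng.normal(size=n)

def coef_of_x(controls):
    """OLS coefficient on x when adjusting for the given subset of controls."""
    X = np.column_stack([np.ones(n), x] + [Z[:, j] for j in controls])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

# "Hacking class": the analyst may include or exclude any of the three controls
estimates = [coef_of_x(c) for r in range(4) for c in combinations(range(3), r)]
print(f"specification range for the x coefficient: "
      f"[{min(estimates):.2f}, {max(estimates):.2f}]")
```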
Item Open Access Advanced Topics in Introductory Statistics (2023), Bryan, Jordan Grey. It is now common practice in many scientific disciplines to collect large amounts of experimental or observational data in the course of a research study. The abundance of such data creates a circumstance in which even simply posed research questions may, or sometimes must, be answered using multivariate datasets with complex structure. Introductory-level statistical tools familiar to practitioners may be applied to these types of data, but inference will either be sub-optimal or invalid if properties of the data violate the assumptions made by these statistical procedures. In this thesis, we provide examples of how basic statistical procedures may be adapted to suit the complexity of modern datasets while preserving the simplicity of low-dimensional parametric models. In the context of genomics studies, we propose a frequentist-assisted-by-Bayes (FAB) method for conducting hypothesis tests for the means of normal models when auxiliary information about the means is available. If the auxiliary information accurately describes the means, then the proposed FAB hypothesis tests may be more powerful than the corresponding classical t-tests. If the information is not accurate, then the FAB tests retain type-I error control. For multivariate financial and climatological data, we develop a semiparametric model in order to characterize the dependence between two sets of random variables. Our approach is inspired by a multivariate notion of the sample rank and extends classical concepts such as canonical correlation analysis (CCA) and the Gaussian copula model. The proposed model allows for the analysis of multivariate dependence between variable sets with arbitrary marginal distributions. Motivated by fluorescence spectroscopy data collected from sites along the Neuse River, we also propose a least squares estimator for quantifying the contribution of various land-use sources to the water quality of the river. The estimator can be computed quickly relative to estimators derived using parallel factor analysis (PARAFAC) and it performs favorably in two source apportionment tasks.
Item Open Access Advancements in Probabilistic Machine Learning and Causal Inference for Personalized Medicine (2019), Lorenzi, Elizabeth Catherine. In this dissertation, we present four novel contributions to the field of statistics with the shared goal of personalizing medicine to individual patients. These methods are developed to directly address problems in health care through two subfields of statistics: probabilistic machine learning and causal inference. These projects include improving predictions of adverse events after surgeries and learning the effectiveness of treatments for specific subgroups and for individuals. We begin the dissertation in Chapter 1 with a discussion of personalized medicine, the use of electronic health record (EHR) data, and a brief discussion of learning heterogeneous treatment effects. In Chapter 2, we present a novel algorithm, Predictive Hierarchical Clustering (PHC), for agglomerative hierarchical clustering of current procedural terminology (CPT) codes. Our predictive hierarchical clustering aims to cluster subgroups, not individual observations, found within our data, such that the clusters discovered result in optimal performance of a classification model, specifically for predicting surgical complications. In Chapter 3, we develop a hierarchical infinite latent factor model (HIFM) to appropriately account for the covariance structure across subpopulations in data. We propose a novel Hierarchical Dirichlet Process shrinkage prior on the loadings matrix that flexibly captures the underlying structure of our data across subpopulations while sharing information to improve inference and prediction. We apply this work to the problem of predicting surgical complications using electronic health record data for geriatric patients at Duke University Health System (DUHS). The last chapters of the dissertation address personalized medicine from a causal perspective, where the goal is to understand how interventions affect individuals rather than full populations. In Chapter 4, we address heterogeneous treatment effects across subgroups, where guidance for observational comparisons within subgroups is lacking, as is a connection to classic design principles for propensity score (PS) analyses. We address these shortcomings by proposing a novel propensity score method for subgroup analysis (SGA) that seeks to balance existing strategies in an automatic and efficient way. With the use of overlap weights, we prove that an over-specified propensity model including interactions between subgroups and all covariates results in exact covariate balance within subgroups. This is paired with variable selection approaches to adjust for a possibly overspecified propensity score model. Finally, Chapter 5 discusses our final contribution, a longitudinal matching algorithm aiming to predict individual treatment effects of a medication change for diabetes patients. This project aims to develop a novel and generalizable causal inference framework for learning heterogeneous treatment effects from electronic health record (EHR) data. The key methodological innovation is to cast the sparse and irregularly-spaced EHR time series into functional data analysis in the design stage to adjust for confounding that changes over time. We conclude the dissertation and discuss future work in Chapter 6, outlining many directions for continued research on these topics.
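A minimal sketch of the overlap-weighting idea referenced above: fit a propensity model, weight treated units by 1 - e(x) and controls by e(x), and check covariate balance. The simulated data and the simple (no-subgroup) propensity specification are stand-ins, not the dissertation's EHR analysis:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(8)

# Simulated covariates and a confounded treatment assignment
n = 2000
X = rng.normal(size=(n, 4))
p_true = 1 / (1 + np.exp(-(0.8 * X[:, 0] - 0.5 * X[:, 1])))
treat = rng.binomial(1, p_true)

# Estimated propensity scores e(x); large C makes the fit essentially unpenalized
e_hat = LogisticRegression(C=1e6).fit(X, treat).predict_proba(X)[:, 1]

# Overlap weights: treated get 1 - e(x), controls get e(x)
w = np.where(treat == 1, 1 - e_hat, e_hat)

def weighted_mean_diff(x, treat, w):
    """Difference in weighted covariate means between treated and control."""
    m1 = np.sum(w[treat == 1] * x[treat == 1]) / np.sum(w[treat == 1])
    m0 = np.sum(w[treat == 0] * x[treat == 0]) / np.sum(w[treat == 0])
    return m1 - m0

for j in range(X.shape[1]):
    raw = X[treat == 1, j].mean() - X[treat == 0, j].mean()
    print(f"covariate {j}: unweighted diff = {raw:+.3f}, "
          f"overlap-weighted diff = {weighted_mean_diff(X[:, j], treat, w):+.3f}")
```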
Item Open Access Advances in Bayesian Factor Modeling and Scalable Gaussian Process Regression (2020), Moran, Kelly R. Correlated measurements arise across a diverse array of disciplines such as epidemiology, toxicology, genomics, economics, and meteorology. Factor models describe the association between variables by assuming some latent factors drive structured variation therein. Gaussian process (GP) models, on the other hand, describe the association between variables using a distance-based covariance kernel. This dissertation introduces two novel extensions of Bayesian factor models driven by applied problems, and then proposes an algorithm to allow for scalable approximate Bayesian GP sampling. First, the FActor Regression for Verbal Autopsy (FARVA) model is developed for predicting the cause of death and cause-specific mortality fraction in low-resource settings based on verbal autopsies. Both the mean and the association between symptoms provide information used to differentiate decedents across cause of death groups. This class of hierarchical factor regression models avoids restrictive assumptions of standard methods, allows both the mean and covariance to vary with COD category, and can include covariate information on the decedent, region, or events surrounding death. Next, the Bayesian partially Supervised Sparse and Smooth Factor Analysis (BS3FA) model is developed to enable toxicologists, who are faced with a rising tide of chemicals under regulation and in use, to choose which chemicals to prioritize for screening and to predict the toxicity of as-yet-unscreened chemicals based on their molecular structure. Latent factors driving structured variability are assumed to be shared between the molecular structure observations and dose-response observations from high-throughput screening. These shared latent factors allow the model to learn a distance between chemicals targeted to toxicity, rather than one based on molecular structure alone. Finally, the Fast Increased Fidelity Approximate GP (FIFA-GP) allows the association between observations to be modeled by a high fidelity Gaussian process approximation even when the number of observations is on the order of 10^5. A sampling algorithm that scales at O(n log^2(n)) time is described, and a proof showing that the approximation's Kullback-Leibler divergence to the true posterior can be made arbitrarily small is provided.
Item Open Access Advances in Bayesian Hierarchical Modeling with Tree-based Methods (2020), Mao, Jialiang. Developing flexible tools that apply to datasets with large size and complex structure while providing interpretable outputs is a major goal of modern statistical modeling. A family of models that are especially suitable for this task is the Pólya tree type models. Following a divide-and-conquer strategy, these tree-based methods transform the original task into a series of tasks that are smaller in size and easier to solve while their nonparametric nature guarantees the modeling flexibility to cope with datasets with a complex structure. In this work, we develop three novel tree-based methods that tackle different challenges in Bayesian hierarchical modeling. Our first two methods are designed specifically for the microbiome sequencing data, which consists of high dimensional counts with a complex, domain-specific covariate structure and exhibits large cross-sample variations. These features limit the performance of generic statistical tools and require special modeling considerations. Both methods inherit the flexibility and computation efficiency from the general tree-based methods and directly utilize the domain knowledge to help infer the complex dependency structure among different microbiome categories by bringing the phylogenetic tree into the modeling framework. An important task in microbiome research is to compare the composition of the microbial community of groups of subjects. We first propose a model for this classic two-sample problem in the microbiome context by transforming the original problem into a multiple testing problem, with a series of tests defined at the internal nodes of the phylogenetic tree. To improve the power of the test, we use a graphical model to allow information sharing among the tests. A regression-type adjustment is also considered to reduce the chance of false discovery. Next, we introduce a model-based clustering method for the microbiome count data with a Dirichlet process mixtures setup. The phylogenetic tree is used for constructing the mixture kernels to offer a flexible covariate structure. To improve the ability to detect clusters determined not only by the dominating microbiome categories, a subroutine is introduced in the clustering procedure that selects a subset of internal nodes of the tree which are relevant for clustering. This subroutine is also important in avoiding potential overfitting. Our third contribution proposes a framework for causal inference through Bayesian recursive partitioning that allows joint modeling of the covariate balancing and the potential outcome. With a retrospective perspective, we model the covariates and the outcome conditioning on the treatment assignment status. For the challenging multivariate covariate modeling, we adopt a flexible nonparametric prior that focuses on the relation of the covariate distributions under the two treatment groups, while integrating out other aspects of these distributions that are irrelevant for estimating the causal effect.
Item Open Access Advances in Bayesian Hierarchical Models for Complex Health Data (2023), Nguyen, Phuc Hong. With the advancement of technology in screening and tracking risk factors as well as human health outcomes, there is increasing richness and complexity in health data. This dissertation presents methodological and applied work using Bayesian hierarchical models to exploit dependency structure in the data to improve estimation efficiency, and sometimes also reduce computational cost and increase interpretability. In Chapter 2, we present a multivariate factor analysis model with time-varying effects to assess the longitudinal effects of prenatal exposure to phthalates on the risk of childhood obesity in children aged 4 to 10. In Chapter 3, we present a framework and package for power analysis using Monte Carlo simulation for study design as well as model comparison of complex models for correlated chemical mixture exposure data. In Chapter 4, we introduce a new way to characterize bias due to unmeasured confounding using a set of imperfect negative control outcomes, taking advantage of the knowledge that they share common unobserved causes. Finally, in Chapter 5, we present a new tree representation of brain connectomes based on the biological hierarchy of brain regions. In all these applications, we use Bayesian hierarchical models for borrowing information across related observations and enforcing latent structures.