Browsing by Author "Ma, Li"
Item Open Access A Bayesian Dirichlet-Multinomial Test for Cross-Group Differences (2016) Chen, Yuhan
Testing for differences within data sets is an important issue across various applications. Our work is primarily motivated by the analysis of microbial composition, which has become increasingly relevant and important with the rise of DNA sequencing. We first review classical frequentist tests that are commonly used in tackling such problems. We then propose a Bayesian Dirichlet-multinomial framework for modeling the metagenomic data and for testing underlying differences between the samples. A parametric Dirichlet-multinomial model uses an intuitive hierarchical structure that allows for flexibility in characterizing both the within-group variation and the cross-group difference and provides very interpretable parameters. A computational method for evaluating the marginal likelihoods under the null and alternative hypotheses is also given. Through simulations, we show that our Bayesian model performs competitively against frequentist counterparts. We illustrate the method by analyzing metagenomic data from the Human Microbiome Project.
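To make the comparison of marginal likelihoods under the null and alternative hypotheses concrete, here is a minimal Python sketch. It is a deliberately simplified illustration and not the hierarchical model or the computational method of the thesis: it assumes one pooled count vector per group, a fixed Dirichlet prior `alpha`, and no within-group variation. The function names `log_beta` and `log_bayes_factor` are hypothetical.

```python
import math

def log_beta(a):
    """Log of the multivariate Beta function B(a) = prod Gamma(a_k) / Gamma(sum a_k)."""
    return sum(math.lgamma(v) for v in a) - math.lgamma(sum(a))

def log_bayes_factor(x1, x2, alpha):
    """Log Bayes factor of H1 (each group has its own composition) against
    H0 (both groups share one composition), for two count vectors under a
    fixed Dirichlet(alpha) prior. The multinomial coefficients appear in both
    marginal likelihoods and cancel, leaving only Beta-function terms.
    Simplifying assumption: no within-group variation is modeled."""
    post1 = [a + c for a, c in zip(alpha, x1)]
    post2 = [a + c for a, c in zip(alpha, x2)]
    pooled = [a + c1 + c2 for a, c1, c2 in zip(alpha, x1, x2)]
    log_m1 = log_beta(post1) + log_beta(post2) - 2.0 * log_beta(alpha)
    log_m0 = log_beta(pooled) - log_beta(alpha)
    return log_m1 - log_m0

if __name__ == "__main__":
    # Toy counts over three taxa; positive values favour a cross-group difference.
    print(log_bayes_factor([30, 5, 5], [5, 5, 30], [1.0, 1.0, 1.0]))
    print(log_bayes_factor([30, 5, 5], [28, 6, 6], [1.0, 1.0, 1.0]))
```

A positive log Bayes factor favours a cross-group difference and a negative one favours a shared composition. A hierarchical version would additionally have to integrate over group-level dispersion, which this sketch omits.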
Item Open Access A plant genetic network for preventing dysbiosis in the phyllosphere. (Nature, 2020-04-08) Chen, Tao; Nomura, Kinya; Wang, Xiaolin; Sohrabi, Reza; Xu, Jin; Yao, Lingya; Paasch, Bradley C; Ma, Li; Kremer, James; Cheng, Yuti; Zhang, Li; Wang, Nian; Wang, Ertao; Xin, Xiu-Fang; He, Sheng Yang
The aboveground parts of terrestrial plants, collectively called the phyllosphere, have a key role in the global balance of atmospheric carbon dioxide and oxygen. The phyllosphere represents one of the most abundant habitats for microbiota colonization. Whether and how plants control phyllosphere microbiota to ensure plant health is not well understood. Here we show that the Arabidopsis quadruple mutant (min7 fls2 efr cerk1; hereafter, mfec)1, simultaneously defective in pattern-triggered immunity and the MIN7 vesicle-trafficking pathway, or a constitutively activated cell death1 (cad1) mutant, carrying a S205F mutation in a membrane-attack-complex/perforin (MACPF)-domain protein, harbour altered endophytic phyllosphere microbiota and display leaf-tissue damage associated with dysbiosis. The Shannon diversity index and the relative abundance of Firmicutes were markedly reduced, whereas Proteobacteria were enriched in the mfec and cad1S205F mutants, bearing cross-kingdom resemblance to some aspects of the dysbiosis that occurs in human inflammatory bowel disease. Bacterial community transplantation experiments demonstrated a causal role of a properly assembled leaf bacterial community in phyllosphere health.
Pattern-triggered immune signalling, MIN7 and CAD1 are found in major land plant lineages and are probably key components of a genetic network through which terrestrial plants control the level and nurture the diversity of endophytic phyllosphere microbiota for survival and health in a microorganism-rich environment.
Item Open Access Advances in Bayesian Hierarchical Modeling with Tree-based Methods (2020) Mao, Jialiang
Developing flexible tools that apply to datasets with large size and complex structure while providing interpretable outputs is a major goal of modern statistical modeling. A family of models that are especially suitable for this task is the Pólya tree type models. Following a divide-and-conquer strategy, these tree-based methods transform the original task into a series of tasks that are smaller in size and easier to solve while their nonparametric nature guarantees the modeling flexibility to cope with datasets with a complex structure. In this work, we develop three novel tree-based methods that tackle different challenges in Bayesian hierarchical modeling. Our first two methods are designed specifically for the microbiome sequencing data, which consists of high dimensional counts with a complex, domain-specific covariate structure and exhibits large cross-sample variations. These features limit the performance of generic statistical tools and require special modeling considerations. Both methods inherit the flexibility and computation efficiency from the general tree-based methods and directly utilize the domain knowledge to help infer the complex dependency structure among different microbiome categories by bringing the phylogenetic tree into the modeling framework. An important task in microbiome research is to compare the composition of the microbial community of groups of subjects.
We first propose a model for this classic two-sample problem in the microbiome context by transforming the original problem into a multiple testing problem, with a series of tests defined at the internal nodes of the phylogenetic tree. To improve the power of the test, we use a graphical model to allow information sharing among the tests. A regression-type adjustment is also considered to reduce the chance of false discovery. Next, we introduce a model-based clustering method for the microbiome count data with a Dirichlet process mixtures setup. The phylogenetic tree is used for constructing the mixture kernels to offer a flexible covariate structure. To improve the ability to detect clusters determined not only by the dominating microbiome categories, a subroutine is introduced in the clustering procedure that selects a subset of internal nodes of the tree which are relevant for clustering. This subroutine is also important in avoiding potential overfitting. Our third contribution proposes a framework for causal inference through Bayesian recursive partitioning that allows joint modeling of the covariate balancing and the potential outcome. With a retrospective perspective, we model the covariates and the outcome conditioning on the treatment assignment status. For the challenging multivariate covariate modeling, we adopt a flexible nonparametric prior that focuses on the relation of the covariate distributions under the two treatment groups, while integrating out other aspects of these distributions that are irrelevant for estimating the causal effect.
Item Open Access Applications and Computation of Stateful Polya Trees (2017) Christensen, Jonathan
Polya trees are a class of nonparametric priors on distributions which are able to model absolutely continuous distributions directly, rather than modeling a discrete distribution over parameters of a mixing kernel to obtain an absolutely continuous distribution. The Polya tree discretizes the state space with a recursive partition, generating a distribution by assigning mass to the child elements at each level of the recursive partition according to a Beta distribution. Stateful Polya trees are an extension of the Polya tree where each set in the recursive partition has one or more discrete state variables associated with it. We can learn the posterior distributions of these state variables along with the posterior of the distribution. State variables may be of interest in their own right, or may be nuisance parameters which we use to achieve more flexible models but wish to integrate out in the posterior. We discuss the development of stateful Polya trees and discuss the Hierarchical Adaptive Polya Tree, which uses state variables to flexibly model the concentration parameter of Polya trees in a hierarchical Bayesian model. We also consider difficulties with the use of marginal likelihoods to determine posterior probabilities of states.
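The recursive mass-splitting construction described in the abstract above can be sketched in a few lines of Python. This is only an illustration of a plain (stateless) Polya tree truncated at a finite depth on [0, 1), not of the stateful extension or the Hierarchical Adaptive Polya Tree. The dyadic partition, the symmetric Beta(c·m², c·m²) split at level m, and the names `sample_polya_tree` and `c` are assumptions chosen for the example.

```python
import random

def sample_polya_tree(depth, c=1.0, seed=None):
    """Draw one random distribution from a finite-depth Polya tree on [0, 1).

    Returns the probability masses of the 2**depth dyadic bins, ordered left
    to right. At level m every set is split into two halves, and the share of
    mass sent to the left half is drawn from Beta(c * m**2, c * m**2)
    (an assumed, commonly used choice of Beta parameters).
    """
    rng = random.Random(seed)
    masses = [1.0]  # level 0: the whole interval carries all the mass
    for level in range(1, depth + 1):
        a = c * level ** 2
        next_masses = []
        for mass in masses:
            left_share = rng.betavariate(a, a)
            next_masses.append(mass * left_share)          # left child
            next_masses.append(mass * (1.0 - left_share))  # right child
        masses = next_masses
    return masses

if __name__ == "__main__":
    bins = sample_polya_tree(depth=3, c=1.0, seed=0)
    print(len(bins), sum(bins))
```

Larger values of `c` pull each split toward one half and so give draws closer to the uniform distribution. In the stateful extension, per-set state variables would modulate quantities such as this concentration.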
Item Open Access Bayesian Methods for Two-Sample Comparison (2015) Soriano, Jacopo
Two-sample comparison is a fundamental problem in statistics. Given two samples of data, the interest lies in understanding whether the two samples were generated by the same distribution or not. Traditional two-sample comparison methods are not suitable for modern data where the underlying distributions are multivariate and highly multi-modal, and the differences across the distributions are often locally concentrated. The focus of this thesis is to develop novel statistical methodology for two-sample comparison which is effective in such scenarios. Tools from the nonparametric Bayesian literature are used to flexibly describe the distributions. Additionally, the two-sample comparison problem is decomposed into a collection of local tests on individual parameters describing the distributions. This strategy not only yields high statistical power, but also allows one to identify the nature of the distributional difference. In many real-world applications, detecting the nature of the difference is as important as the existence of the difference itself. Generalizations to multi-sample comparison and more complex statistical problems, such as multi-way analysis of variance, are also discussed.
Item Open Access Logistic Tree Gaussian Processes (LoTGaP) for Microbiome Dynamics and Treatment Effects (2021) Greenberg, Morris
With advancements in and increased access to next-generation sequencing technology, hospitals (such as Duke Medical Center) have started to track the microbiomes of at-risk patients over time, but at inconsistently measured points across patients. Modeling the trajectories of high-throughput microbiome data proves difficult, due to inconsistent data collection, as well as a collection of analytical obstacles such as compositional data, sparsity, high dimensionality, and phylogenetic covariance structure. As a result, few methods allow us to capture uncertainty in the microbiome over time using increasingly standard data collection and processing methods.
Here, we develop a novel hierarchical model to measure dynamics of the microbiome across cohorts of patients measured inconsistently, which we call logistic-tree Gaussian processes for the microbiome (LoTGaP). LoTGaP adds to the existing microbiome literature through (1) using Gaussian processes to flexibly estimate the evolution of the microbiome over a finite set of days to handle missing/inconsistently measured data, (2) transforming operational taxonomic units (OTUs) to their internal nodes on the phylogenetic tree to accelerate computation and preserve biological relationships, and (3) building functionality to estimate the influence of covariates on microbiome dynamics across patients, which can allow for hospitals to link treatment regimens to microbiome dynamics, or make direct connections between microbiome data and other measurements, such as demographic information.
We demonstrate that LoTGaP produces uncertainty bands that reflect both within-person variation over time and across-person variation while comparing favorably in computation time to existing methods that are narrower in scope.
Item Open Access Logistic-tree Normal Mixture for Clustering Microbiome Compositions (2023) Wang, Jiongran
The human microbiome has become an interesting research topic in recent years, and a common task in the analysis of these data is to cluster microbiome compositions into subtypes. This task serves as an intermediary step in achieving personalized diagnosis and treatment. However, this seemingly standard task is very challenging in the microbiome composition context due to several key features of such data. Common distance-based algorithms cannot produce reliable results as they do not take into account the heterogeneity of the cross-sample variability among the bacterial taxa. In addition, existing model-based approaches are not flexible enough to distinguish the complex within-cluster variation from cross-cluster variation. A useful Bayesian generative model, Dirichlet-tree multinomial mixtures (DTMM), has been proposed to overcome these challenges. DTMM indeed achieves reliable results, but it is still not flexible enough in characterizing the covariance structure among taxa and lacks the scalability to higher dimensions. Hence we propose another generative model, called the "Logistic-tree normal mixture" (LTNM), that addresses this need. The LTN kernel incorporates the tree-based decomposition as the Dirichlet-tree does, but it also models the branching probability using a multivariate logistic-normal distribution. Hence it has a rich covariance structure along with computational efficiency through the Pólya-Gamma data augmentation technique. This thesis is organized as follows: first we briefly review some popular existing algorithms; then we introduce LTNM in detail; then we conduct an extensive simulation study to compare LTNM with other existing methods; finally we apply LTNM to a real microbiome study, the American Gut Project (AGP), and analyze the inference results of LTNM.
Item Open Access Nonparametric Methods for Analysis and Modeling of Complex Multivariate Distributions (2020) Gorsky, Shai
Modern statistical science is challenged by data sets that grow rapidly in both size and complexity. These data sets are very often multivariate, including, for instance, continuous and categorical variables. In addition, such data may encode information about multiple data-generative mechanisms. Traditional, "parametric" statistical models and methods are limited, either in their ability to capture nuances that cannot be generated by low dimensional models or in applying restrictive assumptions to inferential procedures that are rarely met. In this work, we present three novel nonparametric methods we developed which tackle different challenges that large and complex multivariate data sets present. Our first contribution introduces a scalable method to test the independence between two random vectors by breaking down the task into simple univariate tests of independence, transforming the inference task into a multiple testing problem that can be completed with almost linear complexity with respect to the sample size. To address increasing dimensionality, we introduce a coarse-to-fine sequential adaptive procedure that exploits the spatial features of dependency structures to examine the sample space more effectively. We derive a finite-sample theory that guarantees the inferential validity of our adaptive procedure at any given sample size. We demonstrate the substantial computational advantage of the procedure in comparison with existing approaches as well as its decent statistical power under various dependency scenarios through an extensive simulation study. We illustrate how the divide-and-conquer nature of the procedure can be exploited not only to test independence but to learn the nature of the underlying dependency. Our second method is motivated by the task of classification and calibration of flow cytometry observations.
An important step in comparative analyses of multi-sample flow cytometry data is cross-sample calibration, whose goal is to align cell subsets across multiple samples in the presence of variations in locations, so that variation due to technical reasons is minimized and true biological variation can be meaningfully compared. We introduce a Bayesian nonparametric hierarchical modeling approach for accomplishing both calibration and cell classification jointly in a unified probabilistic manner. Three important features of our method make it particularly effective for analyzing multi-sample flow cytometry data: a nonparametric mixture avoids prespecifying the number of cell clusters; the hierarchical skew normal kernels allow flexibility in the shapes of the cell subsets and cross-sample variation in their locations; and finally the "coarsening" strategy makes inference robust to small departures from the model, a feature that becomes crucial with massive numbers of observations such as can be encountered in flow cytometry data. Our third contributed method concerns hierarchical modeling of weights of a Dirichlet Process Mixture. We build on the Hierarchical Dirichlet Process where an infinite-parameter mean measure is taken as a Dirichlet Process Mixture and child measures are drawn as Dirichlet Process Mixtures with the base distribution taken as the above mean measure. The Hierarchical Dirichlet Process only admits a scalar dispersion parameter, a formulation that prevents it from capturing structures that may have been generated from different data-generating mechanisms. Our approach is based on mixing over latent classes of Hierarchical Dirichlet Processes where each class corresponds to a certain level of dispersion and a portion of the shared sample space, which allows heterogeneous variation among multiple distributions over it.
We demonstrate the strengths of our three methods through extensive simulation studies and case studies that can yield valuable scientific insights.
Item Open Access Pyramid Multi-resolution Scanning for Two-sample Comparison (2016) Mao, Jialiang
Testing for two-sample differences is challenging when the differences are local and only involve a small portion of the data. To solve this problem, we apply a multi-resolution scanning framework that performs dependent local tests on subsets of the sample space. We use a nested dyadic partition of the sample space to get a collection of windows and test for sample differences within each window. We put a joint prior on the states of local hypotheses that allows both vertical and horizontal message passing among the partition tree to reflect the spatial dependency features among windows. This information passing framework is critical to detect local sample differences. We use both the loopy belief propagation algorithm and MCMC to get the posterior null probability on each window. These probabilities are then used to report sample differences based on decision procedures. Simulation studies are conducted to illustrate the performance. Multiple testing adjustment and convergence of the algorithms are also discussed.
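As a rough illustration of the first step in the abstract above, the following Python sketch builds the windows of a nested dyadic partition of a one-dimensional sample space and tabulates how many observations from each sample fall in each window. It covers only the windowing and counting. The joint prior on local hypotheses, the message passing, loopy belief propagation, and MCMC are not implemented here. The unit interval [0, 1) and the names `dyadic_windows` and `window_counts` are assumptions made for the example.

```python
def dyadic_windows(depth):
    """List (level, lo, hi) for every window of a nested dyadic partition of
    [0, 1), from the root (level 0) down to the given depth."""
    windows = []
    for level in range(depth + 1):
        n = 2 ** level
        for k in range(n):
            windows.append((level, k / n, (k + 1) / n))
    return windows

def window_counts(sample_a, sample_b, depth):
    """Count the observations of each sample inside each half-open window
    [lo, hi). These per-window counts are the raw input a local two-sample
    test would use; the local tests themselves are not performed here."""
    rows = []
    for level, lo, hi in dyadic_windows(depth):
        n_a = sum(lo <= x < hi for x in sample_a)
        n_b = sum(lo <= x < hi for x in sample_b)
        rows.append((level, lo, hi, n_a, n_b))
    return rows

if __name__ == "__main__":
    for row in window_counts([0.1, 0.6], [0.7], depth=1):
        print(row)
```

Because the windows are nested, each parent's counts are the sum of its two children's counts, which is the structure that lets information be shared up and down the partition tree.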
Item Open Access Topics in Applied Statistics (2023) LeBlanc, Patrick M
One of the fundamental goals of statistics is to develop methods which provide improved inference in applied problems. This dissertation will introduce novel methodology and review state-of-the-art existing methods in three different areas of applied statistics. Chapter 2 focuses on modelling subcommunity dynamics in gut microbiome data. Existing methods ignore cross-sample heterogeneity in subcommunity composition; we propose a novel mixed-membership model which models cross-sample heterogeneity using the phylogenetic tree and as a result is robust to misspecifying the number of subcommunities. Chapter 3 reviews state-of-the-art methods in recommender systems, including collaborative filtering, content-based filtering, hybrid recommenders, and active recommender systems. Existing literature has focused primarily on bespoke applications; statisticians have an opportunity to build recommender system theory. Chapter 4 proposes a novel method of accounting for time-based design inconsistencies in Bayesian network meta-analysis models and discovers non-linear time trends in the effectiveness of vancomycin as an MRSA treatment. Chapter 5 provides some concluding remarks.
Item Open Access Tree-based Methods for Learning Probability Distributions (2022) Awaya, Naoki
Learning probability distributions is a fundamental inferential task in statistics, but it is challenging when the data distribution of interest is complicated and high-dimensional. Addressing this challenging problem is the main topic of this thesis, which mainly discusses two types of new tree-based methods: a single-tree method and an ensemble method. The new single-tree method, the main topic of Chapter 2, is introduced by constructing a generalized Polya tree process, that is, a new Bayesian nonparametric model, equipped with a new flexible tree prior. With this new prior we can find trees that represent the distributional structures well, and the tree space is efficiently explored with a new sequential Monte Carlo algorithm. The new ensemble method discussed in Chapter 3 is proposed under a new addition rule defined for probability distributions. The new rule, based on cumulative distribution functions and their generalizations, enables us to smoothly introduce a new efficient boosting algorithm, inheriting important notions such as "residuals" and "zeros". The thesis closes with Chapter 4, which provides concluding remarks.
Item Open Access Wavelet Regression using MapReduce and Analysis of Multiple Sclerosis Clinical Data (2017) Song, Hanyu
Two problems, one related to scalable methods and the other to the application of statistical methods to clinical data, are addressed in this thesis. In the first chapter, motivated by growing numbers of "large p" datasets, we present a novel MapReduce framework for handling multivariate wavelet regression. We compare the time complexity of the proposed and conventional methods and show that the novel framework scales linearly in the dimension $p$ of the response matrix. Empirical results are consistent with our complexity analysis. This work has potential applications in analysing image data or genomic data where the dimensions are huge.
In the second chapter, we explore a clinical dataset of Multiple Sclerosis (MS) provided by Biogen, which comprises 579 actively managed MS patients enrolled at a single center for up to 5 years. Since no cure for MS is known, Biogen and we are developing statistical models to predict the progression of disability level as a therapeutic guide. Such disability can be roughly quantified by EDSS (Expanded Disability Status Scale), and as such we conduct predictive modelling of EDSS. Before we arrive at these models, we perform exploratory data analysis and conduct predictive modelling of current EDSS based on measurements in the same year.