Bayesian Multivariate Count Models for the Analysis of Microbiome Studies
Advances in high-throughput DNA sequencing allow for rapid and affordable surveys of thousands of bacterial taxa across thousands of samples. The exploding availability of sequencing data has poised microbiota research to advance our understanding of fields as diverse as ecology, evolution, medicine, and agriculture. Yet, while microbiota data are now ubiquitous, methods for the analysis of such data remain underdeveloped. This gap reflects the challenge of analyzing sparse, high-dimensional count data that contain compositional (relative abundance) information. To address these challenges, this dissertation introduces a number of tools for Bayesian inference applied to microbiome data. A central theme throughout this work is the use of multinomial logistic-normal models, which are found to concisely address these challenges. In particular, the connection between the logistic-normal distribution and the Aitchison geometry of the simplex is used repeatedly to develop interpretable tools for the analysis of microbiome data.
The structure of this dissertation is as follows. Chapter 1 introduces key challenges in the analysis of microbiome data. Chapter 2 introduces a novel log-ratio transform between the simplex and real space to enable the development of statistical tools for compositional data with phylogenetic structure. Chapter 3 introduces a multinomial logistic-normal generalized dynamic linear modelling framework for the analysis of microbiome time-series data. Chapter 4 explores the analysis of zero values in sequence count data from a stochastic process perspective and demonstrates that zero-inflated models often produce counter-intuitive results in this regime. Finally, Chapter 5 introduces the theory of Marginally Latent Matrix-T Processes as a means of developing efficient, accurate inference for a large class of multinomial logistic-normal models, including linear regression, non-linear regression, and dynamic linear models. Notably, the inference schemes developed in Chapter 5 are often found to be orders of magnitude faster than Hamiltonian Monte Carlo without sacrificing accuracy in point estimation or uncertainty quantification.
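As an illustration of the multinomial logistic-normal models that recur throughout this abstract, the following is a minimal sketch of the generative process: a multivariate normal draw in log-ratio coordinates is mapped to the simplex, and counts are then drawn from a multinomial. It assumes the additive log-ratio (ALR) transform purely for simplicity, which is not necessarily the transform introduced in Chapter 2. The function names (`alr_inverse`, `sample_counts`) and all parameter values are hypothetical and are not taken from the dissertation.

```python
import numpy as np


def alr_inverse(eta):
    """Map D-1 additive log-ratio coordinates to a point on the D-part simplex."""
    z = np.concatenate([eta, [0.0]])  # the last part is the reference (log-ratio 0)
    z = z - z.max()                   # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()


def sample_counts(mu, sigma, depth, rng):
    """Draw eta ~ N(mu, sigma), set pi = alr_inverse(eta), then y ~ Multinomial(depth, pi)."""
    eta = rng.multivariate_normal(mu, sigma)
    pi = alr_inverse(eta)
    return rng.multinomial(depth, pi), pi


# Illustrative (made-up) parameters: 4 taxa, so 3 log-ratio coordinates.
rng = np.random.default_rng(0)
mu = np.array([1.0, 0.0, -1.0])
sigma = 0.5 * np.eye(3)
counts, pi = sample_counts(mu, sigma, depth=1000, rng=rng)
```

Here `counts` is a single simulated sample of sequence counts whose total equals the assumed sequencing depth, and `pi` holds the latent relative abundances. The counts carry multinomial sampling noise around `pi`, while the normal distribution on the log-ratio scale models variation in the composition itself.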
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Rights for Collection: Duke Dissertations