# Browsing by Author "Dunson, David B."


Item Open Access: **Bayesian Modeling and Computation for Mixed Data** (2012), Cui, Kai

Multivariate or high-dimensional data with mixed types are ubiquitous in many fields of study, including science, engineering, social science, finance, health, and medicine, and joint analysis of such data requires both statistical models flexible enough to accommodate them and novel methodologies for computationally efficient inference. Such joint analysis is potentially advantageous in many statistical and practical respects, including shared information, dimension reduction, efficiency gains, increased power, and better control of error rates.

This thesis mainly focuses on two types of mixed data: (i) mixed discrete and continuous outcomes, especially in a dynamic setting; and (ii) multivariate or high-dimensional continuous data with potential non-normality, where each dimension may have a different degree of skewness and different tail behavior. Flexible Bayesian models are developed to jointly model these types of data, with particular interest in exploring and utilizing the factor modeling framework. Much emphasis is also placed on scaling the statistical approaches and computation efficiently to problems with long mixed time series or increasingly high-dimensional heavy-tailed and skewed data.

To this end, Chapter 1 reviews the challenges posed by mixed data. Chapter 2 develops generalized dynamic factor models for mixed-measurement time series. The framework allows mixed-scale measurements in different time series, with the different measurements having distributions in the exponential family conditional on time-specific dynamic latent factors. Efficient computational algorithms for Bayesian inference are developed that can easily be extended to long time series. Chapter 3 focuses on jointly modeling high-dimensional data with potential non-normality, where the mixed skewness and/or tail behaviors in different dimensions are accurately captured via the proposed heavy-tailed and skewed factor models. Chapter 4 further explores the properties of, and efficient Bayesian inference for, the generalized semiparametric Gaussian variance-mean mixture family, and introduces it as a potentially useful family for modeling multivariate heavy-tailed and skewed data.

Item Open Access: **Clustering-Enhanced Stochastic Gradient MCMC for Hidden Markov Models** (2019), Ou, Rihui

MCMC algorithms for hidden Markov models, which often rely on the forward-backward sampler, become computationally burdensome for large sample sizes due to the temporal dependence inherent in the data. Recently, a number of approaches have been developed for posterior inference that exploit the mixing of the hidden Markov process to approximate the full posterior using small chunks of the data. However, in the presence of imbalanced data resulting from rare latent states, minibatch estimates will often exclude rare-state data, resulting in poor inference for the associated emission parameters and inaccurate prediction or detection of rare events. Here, we propose to use a preliminary clustering to over-sample the rare clusters and reduce variance in gradient estimation within stochastic gradient MCMC. We demonstrate very substantial gains in predictive and inferential accuracy on real and synthetic examples.
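The reweighting idea behind this kind of stratified minibatching can be sketched as follows (my illustration based on the abstract, not the authors' algorithm; all names are hypothetical): stratify per-observation gradients by the preliminary clustering, draw a fixed quota from each stratum, and reweight so the estimate stays unbiased for the full-data gradient.

```python
import numpy as np

def stratified_gradient(grads_by_cluster, n_per_cluster, rng):
    """Unbiased full-data gradient estimate from a stratified minibatch.

    grads_by_cluster: list of per-observation gradient arrays, one per
    preliminary cluster; n_per_cluster: draws taken from each stratum.
    """
    total = sum(len(g) for g in grads_by_cluster)
    est = 0.0
    for g, m in zip(grads_by_cluster, n_per_cluster):
        idx = rng.choice(len(g), size=m, replace=True)  # over-sample rare strata
        est += (len(g) / total) * g[idx].mean(axis=0)   # stratum weight N_c / N
    return total * est  # scale back to the full-data (summed) gradient
```

Because rare strata receive a guaranteed quota in every minibatch, their emission parameters get informative gradients even under heavy class imbalance, while the reweighting keeps the estimator unbiased.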

Item Open Access: **Distributed Feature Selection in Large n and Large p Regression Problems** (2016), Wang, Xiangyu

Fitting statistical models is computationally challenging when the sample size or the dimension of the dataset is huge. An attractive approach for down-scaling the problem size is to first partition the dataset into subsets and then fit using distributed algorithms. The dataset can be partitioned either horizontally (in the sample space) or vertically (in the feature space), and the challenge lies in defining an algorithm with low communication cost, theoretical guarantees, and excellent practical performance in general settings. For sample space partitioning, I propose a MEdian Selection Subset AGgregation Estimator ({\em message}) algorithm to address these issues. The algorithm applies feature selection in parallel for each subset using regularized regression or Bayesian variable selection methods, calculates the `median' feature inclusion index, estimates coefficients for the selected features in parallel for each subset, and then averages these estimates. The algorithm is simple, involves minimal communication, scales efficiently in sample size, and has theoretical guarantees. I provide extensive experiments showing excellent performance in feature selection, estimation, prediction, and computation time relative to the usual competitors.
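The pipeline just described can be sketched schematically (a minimal illustration, not the thesis implementation; `select` and `estimate` are placeholders for any feature-selection and estimation routines):

```python
import numpy as np

def message_sketch(X, y, m, select, estimate, rng):
    """Median Selection Subset Aggregation, schematically.

    X: (n, p) design; y: (n,) response; m: number of subsets.
    `select` returns a boolean inclusion vector for one subset;
    `estimate` fits coefficients on the selected features.
    """
    subsets = np.array_split(rng.permutation(len(y)), m)
    votes = np.array([select(X[i], y[i]) for i in subsets])   # (m, p) inclusion
    keep = np.median(votes, axis=0) >= 0.5                    # 'median' index
    betas = [estimate(X[i][:, keep], y[i]) for i in subsets]  # parallel in practice
    return keep, np.mean(betas, axis=0)                       # averaged estimates
```

The only quantities that must cross machine boundaries are the inclusion vectors and the per-subset coefficient estimates, which is what keeps communication minimal.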

While sample space partitioning is useful in handling datasets with a large sample size, feature space partitioning is more effective when the data dimension is high. Existing methods for partitioning features, however, are either vulnerable to high correlations or inefficient in reducing the model dimension. In this thesis, I propose a new embarrassingly parallel framework named {\em DECO} for distributed variable selection and parameter estimation. In {\em DECO}, variables are first partitioned and allocated to m distributed workers. The decorrelated subset data within each worker are then fitted via any algorithm designed for high-dimensional problems. We show that by incorporating the decorrelation step, DECO can achieve consistent variable selection and parameter estimation on each subset with (almost) no assumptions. In addition, the convergence rate is nearly minimax optimal for both sparse and weakly sparse models and does not depend on the partition number m. Extensive numerical experiments are provided to illustrate the performance of the new framework.
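One schematic reading of the decorrelation step (an illustration under my own assumptions, with a ridge-regularized inverse square root; `fit` is a placeholder for any high-dimensional regression routine) is to whiten the observations via the Gram matrix so that feature blocks become nearly orthogonal, then fit each block independently:

```python
import numpy as np

def deco_sketch(X, y, m, fit, ridge=1e-3):
    """Decorrelate, partition features into m blocks, fit blocks separately."""
    n, p = X.shape
    G = X @ X.T / n + ridge * np.eye(n)          # regularized Gram matrix
    vals, vecs = np.linalg.eigh(G)
    G_inv_sqrt = (vecs * vals ** -0.5) @ vecs.T  # symmetric G^{-1/2}
    X_dec, y_dec = G_inv_sqrt @ X, G_inv_sqrt @ y
    beta = np.zeros(p)
    for block in np.array_split(np.arange(p), m):  # distributed workers
        beta[block] = fit(X_dec[:, block], y_dec)
    return beta
```

After whitening, a worker holding only its own feature block can recover the coefficients for signals living in that block without ever seeing the other blocks.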

For datasets with both a large sample size and high dimensionality, I propose a new "divide-and-conquer" framework, {\em DEME} (DECO-message), that leverages both the {\em DECO} and {\em message} algorithms. The new framework first partitions the dataset in the sample space into row cubes using {\em message} and then partitions the feature space of the cubes using {\em DECO}. This procedure is equivalent to partitioning the original data matrix into multiple small blocks, each with a feasible size that can be stored and fitted on a single computer in parallel. The results are then synthesized via the {\em DECO} and {\em message} algorithms in reverse order to produce the final output. The whole framework is extremely scalable.

Item Open Access: **Ecological Modeling via Bayesian Nonparametric Species Sampling Priors** (2023), Zito, Alessandro

Species sampling models are a broad class of discrete Bayesian nonparametric priors that model the sequential appearance of distinct tags, called species or clusters, in a sequence of labeled objects. Over the last 50 years, species sampling priors have found much success in a variety of settings, including clustering and density estimation. However, despite the rich theoretical and methodological developments, these models have rarely been used as tools by applied ecologists, even though their primary investigations often involve the modeling of actual species. This dissertation aims to partially fill this gap by elucidating how species sampling models can be useful to scientists and practitioners in the ecological field. Our emphasis is on clustering and on the species discovery properties linked to species sampling models. In particular, Chapter 2 illustrates how a Dirichlet process mixture model with a random precision parameter leads to greater robustness when inferring the number of clusters, or communities, in a given population. We specifically introduce a novel prior for the precision, called the Stirling-gamma distribution, which allows for transparent elicitation supported by theoretical findings. We illustrate its advantages when detecting communities in a colony of ant workers. Chapter 3 presents a general Bayesian framework to model accumulation curves, which summarize the sequential discovery of distinct species over time. This work is inspired by traditional species sampling models such as the Dirichlet process and the Pitman--Yor process. By modeling the discovery probability as a survival function of some latent variables, a flexible specification that can account for both finite and infinite species richness is developed. We apply our model to a large fungal biodiversity study from Finland.
Finally, Chapter 4 presents a novel Bayesian nonparametric taxonomic classifier called BayesANT. Here, the goal is to predict the taxonomy of DNA sequences sampled from the environment. The difficulty of this task is that the vast majority of species do not have a reference barcode or are yet unknown to science. Hence, species novelty needs to be accounted for in classification. BayesANT builds upon Dirichlet-multinomial kernels to model DNA sequences, and upon species sampling models to account for such potential novelty. We show that it attains excellent classification performance, especially when the true taxa of the test sequences are not observed in the training set. All methods presented in this dissertation are freely available as R packages. Our hope is that these contributions will pave the way for future utilization of Bayesian nonparametric methods in applied ecological analyses.
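To make the accumulation-curve idea concrete, here is the classical sequential-discovery mechanism of the Dirichlet process (a textbook fact that the framework above generalizes, not the thesis's own model): the i-th observation is a new species with probability alpha/(alpha + i - 1).

```python
import numpy as np

def dp_accumulation(n, alpha, rng):
    """Simulate one accumulation curve under a Dirichlet process prior."""
    k, curve = 0, []
    for i in range(1, n + 1):
        if rng.random() < alpha / (alpha + i - 1):
            k += 1                       # a new, previously unseen species
        curve.append(k)
    return curve

def expected_species(n, alpha):
    """E[K_n] = sum_{i=1}^n alpha / (alpha + i - 1), roughly alpha * log(n)."""
    return sum(alpha / (alpha + i - 1) for i in range(1, n + 1))
```

Under the Pitman--Yor generalization the discovery probability gains a term in the current number of species, producing power-law rather than logarithmic growth of the curve.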

Item Embargo: **Incorporating Scalability and Structural Constraints in Bayesian Modeling** (2023), Chattopadhyay, Shounak

Real-life modeling of probabilistic events often involves incorporating constraints on quantities of interest. Broadly, such constraints can be classified as either computational, when facing limitations on computational feasibility or budget, or structural, when facing limitations in modeling a desirable quantity of interest due to its inherent nature. To that end, this work focuses on incorporating computational and structural constraints into modeling real-life data from a Bayesian perspective. In Chapter 2, we focus on the problem of Bayesian nonparametric density estimation. Although well studied and highly regarded in the existing literature for their flexibility, adaptability, accuracy, and uncertainty quantification when estimating probability density functions, Bayesian nonparametric approaches often face major computational roadblocks in the form of cumbersome Markov chain Monte Carlo (MCMC) algorithms. By leveraging aspects of nearest neighbor allocation and Bayesian mixture models, we engineer a highly effective hybrid density estimation approach called Nearest Neighbor Dirichlet Mixtures (NN-DM). NN-DM completely avoids MCMC and is embarrassingly parallel, providing substantial computational gains over existing approaches, along with accurate point estimation and uncertainty quantification, both theoretically and empirically. In Chapter 3, we consider the problem of dose-response modeling in a public health scenario in which individuals are exposed to toxic chemicals. An overwhelming portion of current approaches focus only on quantifying the marginal effects of these exposures on the response, ignoring possible interactions.
As an alternative, our focus is on incorporating structural constraints in the form of modeling synergistic and antagonistic interactions between the chemicals. We develop Synergistic Antagonistic Interaction Detection (SAID), a novel Bayesian approach that shrinks interactions toward synergistic or antagonistic forms. Instead of focusing only on linear effects, our model is flexible enough to allow non-linearity and scales well computationally with a moderate number of exposures. We apply our approach to an NHANES data set and uncover interactions between heavy metals affecting kidney function. Finally, in Chapter 4, we focus on the problem of Bayesian factor analysis. Bayesian factor models provide an elegant framework for modeling high-dimensional covariance matrices as the sum of two components, one low rank and the other diagonal. Existing approaches that use MCMC to obtain posterior draws of the covariance matrix face significant challenges in terms of slow convergence and mode switching, due to the non-identifiability of the factor model arising from rotational invariance. As both the sample size and the number of dimensions increase, we exploit a blessing-of-dimensionality phenomenon that allows us to effectively obtain a plug-in estimate of the latent factors. Using this plug-in estimate, our proposed Factor Analysis with BLEssing of dimensionality (FABLE) approach provides a pseudo-posterior for the covariance matrix. FABLE is an embarrassingly parallel technique with substantial computational benefits, completely bypassing MCMC and thus its pitfalls. We provide theoretical guarantees on the performance of FABLE, along with evaluations of the approach in numerous simulation studies.
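The plug-in construction can be caricatured as follows (my reading of the abstract, not the authors' code, with the number of factors treated as known; the actual FABLE approach builds a pseudo-posterior around this kind of point estimate):

```python
import numpy as np

def plug_in_covariance(Y, k):
    """Low-rank-plus-diagonal covariance estimate via plug-in factors.

    Y: (n, p) centered data; k: number of latent factors (assumed known).
    """
    n, _ = Y.shape
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    F_hat = np.sqrt(n) * U[:, :k]            # plug-in estimate of latent factors
    Lam_hat = Y.T @ F_hat / n                # loadings via least squares (F'F = n I)
    resid = Y - F_hat @ Lam_hat.T
    psi_hat = (resid ** 2).mean(axis=0)      # diagonal idiosyncratic variances
    return Lam_hat @ Lam_hat.T + np.diag(psi_hat)
```

Fixing the factors at a point estimate sidesteps the rotational non-identifiability that causes mode switching in MCMC, since no posterior exploration over rotations is needed.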

Item Open Access: **Nonparametric Bayes for Big Data** (2014), Yang, Yun

Classical asymptotic theory deals with models in which the sample size $n$ goes to infinity with the number of parameters $p$ held fixed. However, rapid advances in technology have empowered today's scientists to collect a huge number of explanatory variables to predict a response. Many modern applications in science and engineering belong to the ``big data" regime in which both $p$ and $n$ may be very large; a variety of genomic applications even have $p$ substantially greater than $n$. With the advent of MCMC, Bayesian approaches have exploded in popularity, and Bayesian inference often allows easier interpretability than frequentist inference. It therefore becomes important to understand and evaluate Bayesian procedures for ``big data" from a frequentist perspective.

In this dissertation, we address a number of questions related to solving large-scale statistical problems via Bayesian nonparametric methods.

It is well known that classical estimators can be inconsistent in the high-dimensional regime without any constraints on the model. Therefore, imposing additional low-dimensional structure on the high-dimensional ambient space becomes inevitable. In the first two chapters of the thesis, we study the prediction performance of high-dimensional nonparametric regression from a minimax point of view. We consider two different low-dimensional constraints: (i) the response depends only on a small subset of the covariates; and (ii) the covariates lie on a low-dimensional manifold in the original high-dimensional ambient space. We also provide Bayesian nonparametric methods based on Gaussian process priors that are shown to be adaptive to unknown smoothness or low-dimensional manifold structure by attaining minimax convergence rates up to log factors. In Chapter 3, we consider high-dimensional classification problems where all data are of a categorical nature. We build a parsimonious model based on Bayesian tensor factorization for classification while performing inference on the important predictors.

It is generally believed that ensemble approaches, which combine multiple algorithms or models, can outperform any single algorithm at machine learning tasks such as prediction. In Chapter 5, we propose Bayesian convex and linear aggregation approaches motivated by regression applications. We show that the proposed approach is minimax optimal when the true data-generating model is a convex or linear combination of models in the list. Moreover, the method can adapt to sparsity structure in which certain models should receive zero weights, and it is tuning-parameter free, unlike competitors. More generally, under an M-open view, when the truth falls outside the space of all convex/linear combinations, our theory suggests that the posterior measure tends to concentrate on the best approximation of the truth at the minimax rate.

Chapter 6 is devoted to sequential Markov chain Monte Carlo algorithms for Bayesian on-line learning from big data. The last chapter attempts to justify the use of the posterior distribution to conduct statistical inference for semiparametric estimation problems (the semiparametric Bernstein--von Mises theorem) from a frequentist perspective.

Item Open Access: **Random Orthogonal Matrices with Applications in Statistics** (2019), Jauch, Michael

This dissertation focuses on random orthogonal matrices with applications in statistics. While Bayesian inference for statistical models with orthogonal matrix parameters is a recurring theme, several of the results on random orthogonal matrices may be of interest to the broader probability and random matrix theory communities. In Chapter 2, we parametrize the Stiefel and Grassmann manifolds, represented as subsets of orthogonal matrices, in terms of Euclidean parameters using the Cayley transform, and then derive Jacobian terms for change-of-variables formulas. This allows for Markov chain Monte Carlo simulation from probability distributions defined on the Stiefel and Grassmann manifolds. We also establish an asymptotic independent normal approximation for the distribution of the Euclidean parameters corresponding to the uniform distribution on the Stiefel manifold. In Chapter 3, we present polar expansion, a general approach to Monte Carlo simulation from probability distributions on the Stiefel manifold. When combined with modern Markov chain Monte Carlo software, polar expansion allows for routine and flexible posterior inference in models with orthogonal matrix parameters. Chapter 4 addresses prior distributions for structured orthogonal matrices. We introduce an approach to constructing prior distributions for structured orthogonal matrices that leads to tractable posterior simulation via polar expansion. We state two main results which support our approach and offer a new perspective on approximating the entries of random orthogonal matrices.
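The Cayley transform underlying the Chapter 2 parametrization maps skew-symmetric matrices to orthogonal ones; a minimal square-case illustration follows (the thesis handles the rectangular Stiefel and Grassmann cases, which this sketch does not):

```python
import numpy as np

def cayley(A):
    """Map a skew-symmetric matrix A to an orthogonal matrix (I + A)^{-1}(I - A)."""
    I = np.eye(A.shape[0])
    return np.linalg.solve(I + A, I - A)

def skew(B):
    """Unconstrained Euclidean parameters -> skew-symmetric matrix."""
    return B - B.T
```

Because a skew-symmetric matrix has purely imaginary eigenvalues, I + A is always invertible, so the map is defined on all of Euclidean parameter space; this is what lets generic MCMC samplers operate on unconstrained parameters while targeting a distribution on orthogonal matrices.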