New tools for Bayesian clustering and factor analysis

Traditional model-based clustering faces challenges when applied to mixed-scale multivariate data, consisting of both categorical and continuous variables. In such cases, certain variables tend to overly influence the clustering. In addition, as dimensionality increases, clustering can become more sensitive to kernel misspecification and less reliable. In Chapter 1, we propose a simple local-global Bayesian clustering framework designed to address both of these problems. The model assigns a separate cluster ID to each variable from each subject to define the local component of the model. These local clustering IDs depend on a global clustering ID for each subject through a simple hierarchical model. The proposed framework builds on related ideas including consensus clustering, the enriched Dirichlet process, and mixed membership models. We show that the framework borrows information between the local and global levels and easily handles missing data. As a canonical special case, we focus on a simple Dirichlet over-fitted local-global mixture, for which we show that the extra global components of the posterior can be emptied asymptotically. This is the first such result applicable to a broad class of over-fitted finite mixtures of mixtures. We also propose a kernel and prior specification for the canonical case and show that it leads to a simple Gibbs sampler for posterior computation. We illustrate the approach in simulation studies and applications, through which we see that the model is able to identify the variables relevant for clustering.

Large data have become the norm in many modern applications; they often cannot be easily moved across computers or loaded into memory on a single computer. In such cases, model-based clustering, which typically relies on the inherently serial Markov chain Monte Carlo for computation, faces challenges.
Existing distributed algorithms have emphasized nonparametric Bayesian mixture models and typically require moving raw data across workers. In Chapter 2, we introduce a nearly embarrassingly parallel algorithm for clustering under a Bayesian overfitted finite mixture of Gaussian mixtures, which we term distributed Bayesian clustering (DIB-C). DIB-C can flexibly accommodate data sets with various shapes (e.g., skewed or multi-modal). With data randomly partitioned and distributed, we first run Markov chain Monte Carlo in an embarrassingly parallel manner to obtain local clustering draws, and then refine across workers for a final clustering estimate based on any loss function on the space of partitions. DIB-C can also estimate cluster densities, quickly classify new subjects, and provide a posterior predictive distribution. Both simulation studies and real data applications show superior performance of DIB-C in terms of robustness and computational efficiency.
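The divide-and-conquer idea behind DIB-C can be illustrated with a toy sketch (not the dissertation's implementation): each worker clusters only its own randomly assigned shard, and only labels and summary statistics, never raw data, leave the worker. Here a simple k-means routine stands in for the local MCMC clustering draws; all names and parameters below are ours.

```python
import numpy as np

def local_cluster(shard, k=2, iters=25, seed=0):
    """Cluster one worker's shard. A k-means stand-in for the local
    MCMC clustering draws in DIB-C (illustrative only). Only labels
    and cluster centers are returned, never the raw data."""
    rng = np.random.default_rng(seed)
    centers = shard[rng.choice(len(shard), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center, then update centers
        dists = np.linalg.norm(shard[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = shard[labels == j].mean(axis=0)
    return labels, centers

# two well-separated Gaussian clusters, randomly partitioned into 4 shards
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0.0, 0.3, (60, 2)),
                  rng.normal(3.0, 0.3, (60, 2))])
shards = np.array_split(rng.permutation(data), 4)

# in practice each call would run on a separate worker (e.g. via
# multiprocessing.Pool or a cluster scheduler); a loop keeps the sketch simple
results = [local_cluster(s) for s in shards]
local_labels = [lab for lab, _ in results]
local_centers = np.vstack([cen for _, cen in results])
```

A full DIB-C run would replace the k-means step with posterior draws under the overfitted mixture of Gaussian mixtures and then align the local partitions across workers by minimizing the chosen loss on the space of partitions.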

Chapter 3 develops a simple factor analysis model motivated by the need for new models for characterizing dependence in multivariate data. The multivariate Gaussian distribution is routinely used, but cannot characterize nonlinear relationships in the data. Most nonlinear extensions tend to be highly complex, for example involving estimation of a nonlinear regression model in latent variables. We propose a relatively simple class of Ellipsoid-Gaussian multivariate distributions, derived from a Gaussian linear factor model whose latent variables follow a von Mises-Fisher distribution on a unit hypersphere. We show that the Ellipsoid-Gaussian distribution can flexibly model curved relationships among variables with lower-dimensional structure. Taking a Bayesian approach, we propose a hybrid of gradient-based geodesic Monte Carlo and adaptive Metropolis for posterior sampling. We derive basic properties and illustrate the utility of the Ellipsoid-Gaussian distribution in a variety of simulated and real data applications.
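The generative construction described above can be sketched in a few lines: draw a latent factor on the unit hypersphere, map it through a linear factor loading, and add Gaussian noise. This is a toy instance under assumptions: the function name is ours, and for simplicity the latent draw uses the uniform-on-sphere special case (von Mises-Fisher with concentration zero); a general concentration parameter would require, e.g., Wood's algorithm or SciPy's vonmises_fisher.

```python
import numpy as np

def rellipsoid_gaussian(n, center, Lambda, sigma, rng=None):
    """Toy draw from an Ellipsoid-Gaussian-style model:
        x = center + Lambda @ eta + noise,
    with eta uniform on the unit hypersphere (the vMF kappa = 0 case).
    Illustrative sketch; the name and signature are not from the dissertation."""
    rng = np.random.default_rng(rng)
    k = Lambda.shape[1]
    z = rng.standard_normal((n, k))
    # normalizing Gaussian draws gives the uniform distribution on S^{k-1}
    eta = z / np.linalg.norm(z, axis=1, keepdims=True)
    noise = sigma * rng.standard_normal((n, Lambda.shape[0]))
    return center + eta @ Lambda.T + noise

# 3-dimensional data driven by a 2-dimensional spherical latent factor
c = np.zeros(3)
Lam = np.array([[2.0, 0.0],
                [0.0, 1.0],
                [0.5, 0.5]])
x = rellipsoid_gaussian(1000, c, Lam, sigma=0.1, rng=42)
```

The draws concentrate near a curved, ellipsoid-like surface rather than an elliptical cloud, which is the kind of lower-dimensional nonlinear structure the chapter targets; a nonzero vMF concentration would instead place mass near one region of that surface.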

Song, Hanyu (2022). New tools for Bayesian clustering and factor analysis. Dissertation, Duke University. Retrieved from


Duke's student scholarship is made available to the public under a Creative Commons Attribution-NonCommercial-NoDerivatives (CC BY-NC-ND) license.