New tools for Bayesian clustering and factor analysis

dc.contributor.advisor

Dunson, David B

dc.contributor.author

Song, Hanyu

dc.date.accessioned

2022-09-21T13:55:21Z

dc.date.available

2022-09-21T13:55:21Z

dc.date.issued

2022

dc.department

Statistical Science

dc.description.abstract

Traditional model-based clustering faces challenges when applied to mixed-scale multivariate data consisting of both categorical and continuous variables. In such cases, certain variables tend to overly influence the clustering. In addition, as dimensionality increases, clustering can become more sensitive to kernel misspecification and less reliable. In Chapter 1, we propose a simple local-global Bayesian clustering framework designed to address both of these problems. The model assigns a separate cluster ID to each variable from each subject to define the local component of the model. These local clustering IDs depend on a global clustering ID for each subject through a simple hierarchical model. The proposed framework builds on related ideas including consensus clustering, the enriched Dirichlet process, and mixed membership models. We show that it borrows information between the local and global levels and handles missing data easily. As a canonical special case, we focus on a simple Dirichlet overfitted local-global mixture, for which we show that the extra global components of the posterior can be emptied asymptotically. This is the first such result applicable to a broad class of overfitted finite mixtures of mixtures. We also propose kernel and prior specifications for the canonical case and show that they lead to a simple Gibbs sampler for posterior computation. We illustrate the approach using simulation studies and applications, through which we see that the model is able to identify the variables relevant for clustering.

Large data sets have become the norm in many modern applications; they often cannot be easily moved across computers or loaded into memory on a single computer. In such cases, model-based clustering, which typically relies on inherently serial Markov chain Monte Carlo for computation, faces challenges. Existing distributed algorithms have emphasized nonparametric Bayesian mixture models and typically require moving raw data across workers. In Chapter 2, we introduce a nearly embarrassingly parallel algorithm for clustering under a Bayesian overfitted finite mixture of Gaussian mixtures, which we term distributed Bayesian clustering (DIB-C). DIB-C can flexibly accommodate data sets with various shapes (e.g., skewed or multi-modal). With the data randomly partitioned and distributed, we first run Markov chain Monte Carlo in an embarrassingly parallel manner to obtain local clustering draws and then refine across workers for a final clustering estimate based on any loss function on the space of partitions. DIB-C can also estimate cluster densities, quickly classify new subjects and provide a posterior predictive distribution. Both simulation studies and real data applications show the superior performance of DIB-C in terms of robustness and computational efficiency.
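As a rough illustration of the local-global structure described for Chapter 1 above, the sketch below simulates mixed-scale data in which each subject has a global cluster ID and each variable receives its own local cluster ID. The specific conditional used (the local ID copies the global ID with probability nu and is otherwise redrawn) and all parameter values are illustrative assumptions, not the dissertation's exact hierarchical specification.

import numpy as np

rng = np.random.default_rng(1)
n, K, nu = 200, 3, 0.8            # subjects, clusters, local-global adherence (assumed)
pi = np.full(K, 1.0 / K)          # global cluster weights

z_global = rng.choice(K, size=n, p=pi)   # global cluster ID per subject

def local_id(z):
    # assumed form: local ID follows the global ID with probability nu, else is redrawn
    return z if rng.random() < nu else rng.choice(K, p=pi)

# two variables of mixed scale: one continuous, one categorical
mu = np.array([-3.0, 0.0, 3.0])                                    # cluster means, continuous variable
cat_probs = np.array([[.7, .2, .1], [.2, .6, .2], [.1, .2, .7]])   # cluster-specific categorical kernels

z_cont = np.array([local_id(z) for z in z_global])                 # local IDs, continuous variable
z_cat = np.array([local_id(z) for z in z_global])                  # local IDs, categorical variable
x_cont = rng.normal(mu[z_cont], 1.0)                               # continuous observations
x_cat = np.array([rng.choice(3, p=cat_probs[z]) for z in z_cat])   # categorical observations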

Chapter 3 develops a simple factor analysis model in light of the need for new models for characterizing dependence in multivariate data. The multivariate Gaussian distribution is routinely used, but it cannot characterize nonlinear relationships in the data. Most nonlinear extensions are highly complex, for example involving estimation of a nonlinear regression model in the latent variables. We propose a relatively simple class of Ellipsoid-Gaussian multivariate distributions, derived from a Gaussian linear factor model whose latent variables follow a von Mises-Fisher distribution on a unit hypersphere. We show that the Ellipsoid-Gaussian distribution can flexibly model curved relationships among variables with lower-dimensional structure. Taking a Bayesian approach, we propose a hybrid of gradient-based geodesic Monte Carlo and adaptive Metropolis for posterior sampling. We derive basic properties and illustrate the utility of the Ellipsoid-Gaussian distribution on a variety of simulated and real data applications.
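The following is a minimal sketch (not the author's code) of how draws from an Ellipsoid-Gaussian distribution can be generated from the construction stated above: a Gaussian linear factor model whose latent factors have a von Mises-Fisher distribution on the unit hypersphere. The parameter values, the isotropic noise, and the use of scipy.stats.vonmises_fisher (SciPy 1.11 or later) are illustrative assumptions.

import numpy as np
from scipy.stats import vonmises_fisher  # available in SciPy >= 1.11

rng = np.random.default_rng(0)

k, p, n = 2, 3, 500                  # latent dimension, observed dimension, sample size
mu = np.array([1.0, 0.0])            # vMF mean direction on the unit circle (k = 2)
tau = 5.0                            # vMF concentration; larger values pull factors toward mu
Lambda = rng.normal(size=(p, k))     # factor loadings
c = np.zeros(p)                      # center
sigma = 0.1                          # noise scale (assumed isotropic here for simplicity)

eta = vonmises_fisher(mu, tau).rvs(n, random_state=rng)       # latent factors on the unit hypersphere
x = c + eta @ Lambda.T + sigma * rng.standard_normal((n, p))  # Ellipsoid-Gaussian draws

For small sigma the simulated points concentrate near a curved, ellipsoid-like surface (the linear image of the hypersphere), illustrating the kind of curved, lower-dimensional structure described above.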

dc.identifier.uri

https://hdl.handle.net/10161/25850

dc.subject

Statistics

dc.subject

Bayesian

dc.subject

Ellipsoid

dc.subject

Factor analysis

dc.subject

Model-based clustering

dc.title

New tools for Bayesian clustering and factor analysis

dc.type

Dissertation

Files

Original bundle

Name: Song_duke_0066D_16981.pdf
Size: 24.88 MB
Format: Adobe Portable Document Format
