Browsing by Author "Song, Hanyu"
Item Open Access New tools for Bayesian clustering and factor analysis (2022) Song, Hanyu
Traditional model-based clustering faces challenges when applied to mixed-scale multivariate data, consisting of both categorical and continuous variables. In such cases, there is a tendency for certain variables to overly influence clustering. In addition, as dimensionality increases, clustering can become more sensitive to kernel misspecification and less reliable. In Chapter 1, we propose a simple local-global Bayesian clustering framework designed to address both of these problems. The model assigns a separate cluster ID to each variable from each subject to define the local component of the model. These local clustering IDs are dependent on a global clustering ID for each subject through a simple hierarchical model. The proposed framework builds on previous related ideas including consensus clustering, the enriched Dirichlet process, and mixed membership models. We show that it borrows information between the local and global levels and handles missing data easily. As a canonical special case, we focus on a simple Dirichlet over-fitted local-global mixture, for which we show that the extra global components of the posterior can be emptied asymptotically. This is the first such result applicable to a broad class of over-fitted finite mixture of mixtures models. We also propose kernel and prior specifications for the canonical case and show that they lead to a simple Gibbs sampler for posterior computation. We illustrate the approach using simulation studies and applications, which show that the model can identify the variables relevant for clustering.
Large data have become the norm in many modern applications; they often cannot be easily moved across computers or loaded into memory on a single computer. In such cases, model-based clustering, which typically uses the inherently serial Markov chain Monte Carlo for computation, faces challenges.
Existing distributed algorithms have emphasized nonparametric Bayesian mixture models and typically require moving raw data across workers. In Chapter 2, we introduce a nearly embarrassingly parallel algorithm for clustering under a Bayesian overfitted finite mixture of Gaussian mixtures, which we term distributed Bayesian clustering (DIB-C). DIB-C can flexibly accommodate data sets with various shapes (e.g., skewed or multi-modal). With data randomly partitioned and distributed, we first run Markov chain Monte Carlo in an embarrassingly parallel manner to obtain local clustering draws and then refine across workers for a final clustering estimate based on any loss function on the space of partitions. DIB-C can also estimate cluster densities, quickly classify new subjects and provide a posterior predictive distribution. Both simulation studies and real data applications show superior performance of DIB-C in terms of robustness and computational efficiency.
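The idea of a clustering estimate based on a loss function on the space of partitions can be illustrated with a minimal sketch. It is not the DIB-C algorithm: it assumes Binder's pairwise loss as one possible choice, restricts the search to the sampled partitions, and uses hypothetical label vectors and function names.

```python
from itertools import combinations


def binder_loss(a, b):
    """Count pairs of subjects co-clustered in one partition but not in the other."""
    return sum(
        (a[i] == a[j]) != (b[i] == b[j])
        for i, j in combinations(range(len(a)), 2)
    )


def point_estimate(draws, loss=binder_loss):
    """Return the sampled partition with the smallest average loss to all draws."""
    return min(draws, key=lambda c: sum(loss(c, d) for d in draws) / len(draws))


# Hypothetical clustering draws for four subjects (not output from the thesis).
# The first two draws describe the same partition under different labels.
draws = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 1, 1, 1]]
estimate = point_estimate(draws)
```

Because the loss compares only which pairs of subjects share a cluster, it is unaffected by label switching, and any other loss on partitions could be passed through the `loss` argument.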
Chapter 3 develops a simple factor analysis model in light of the need for new models for characterizing dependence in multivariate data. The multivariate Gaussian distribution is routinely used, but cannot characterize nonlinear relationships in the data. Most nonlinear extensions tend to be highly complex; for example, involving estimation of a nonlinear regression model in latent variables. We propose a relatively simple class of Ellipsoid-Gaussian multivariate distributions, which are derived by using a Gaussian linear factor model involving latent variables having a von Mises-Fisher distribution on a unit hyper-sphere. We show that the Ellipsoid-Gaussian distribution can flexibly model curved relationships among variables with lower-dimensional structures. Taking a Bayesian approach, we propose a hybrid of gradient-based geodesic Monte Carlo and adaptive Metropolis for posterior sampling. We derive basic properties and illustrate the utility of the Ellipsoid-Gaussian distribution on a variety of simulated and real data applications.
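The construction described for Chapter 3 can be sketched as a small simulation. This is only an illustration under stated assumptions: the latent dimension is fixed at two (where the von Mises-Fisher distribution on the unit circle reduces to the von Mises distribution), the noise is isotropic, and the function and parameter names (`center`, `loadings`, `noise_sd`, and so on) are hypothetical rather than the thesis's notation.

```python
import math
import random


def sample_ellipsoid_gaussian(n, center, loadings, noise_sd,
                              mean_angle, concentration, seed=0):
    """Draw n points x = center + loadings * eta + noise, with eta on the unit circle."""
    rng = random.Random(seed)
    etas, xs = [], []
    for _ in range(n):
        # Assumption: 2-D latent factor, so a von Mises angle gives a vMF draw.
        theta = rng.vonmisesvariate(mean_angle, concentration)
        eta = (math.cos(theta), math.sin(theta))
        # Gaussian linear factor model applied row by row of the loading matrix.
        x = [
            c + row[0] * eta[0] + row[1] * eta[1] + rng.gauss(0.0, noise_sd)
            for c, row in zip(center, loadings)
        ]
        etas.append(eta)
        xs.append(x)
    return etas, xs


# Hypothetical settings: three observed variables driven by one point on a circle.
etas, xs = sample_ellipsoid_gaussian(
    n=500,
    center=[0.0, 0.0, 0.0],
    loadings=[[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    noise_sd=0.05,
    mean_angle=math.pi / 2,
    concentration=2.0,
)
```

With small noise, the simulated points concentrate around an ellipse embedded in three dimensions, which is the kind of curved, lower-dimensional relationship a multivariate Gaussian cannot capture.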
Item Open Access Wavelet Regression using MapReduce and Analysis of Multiple Sclerosis Clinical Data (2017) Song, Hanyu
This thesis addresses two problems, one related to scalable methods and the other to the application of statistical methods to clinical data. In the first chapter, motivated by growing numbers of "large p" datasets, we present a novel MapReduce framework for handling multivariate wavelet regression. We compare the time complexity of the proposed and conventional methods and show that the novel framework scales linearly in the dimension $p$ of the response matrix. Empirical results are consistent with our complexity analysis. This work has potential applications in analysing image or genomic data, where the dimension is very large.
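One way to picture why a column-wise decomposition scales linearly in $p$ is the sketch below. It is not the thesis's framework: it assumes a Haar wavelet, hard thresholding of detail coefficients, and a plain in-process `map`/`reduce` standing in for a MapReduce cluster; the function names, the threshold, and the small response matrix are all hypothetical.

```python
import math
from functools import reduce


def haar(signal):
    """Full orthonormal Haar transform of a signal whose length is a power of two."""
    approx, details = list(signal), []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        # Coarser-level details are placed in front of finer-level ones.
        details = [(a - b) / math.sqrt(2) for a, b in pairs] + details
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return approx + details


def map_column(item):
    """Map step: transform one response column and zero out small detail coefficients."""
    j, column, threshold = item
    coeffs = haar(column)
    kept = [coeffs[0]] + [c if abs(c) > threshold else 0.0 for c in coeffs[1:]]
    return {j: kept}


def reduce_columns(acc, part):
    """Reduce step: merge per-column results into one dictionary keyed by column index."""
    acc.update(part)
    return acc


# Hypothetical 4 x 2 response matrix (n = 4 observations, p = 2 columns).
Y = [[1.0, 5.0], [1.0, 5.0], [3.0, 5.0], [3.0, 5.1]]
columns = [(j, [row[j] for row in Y], 0.5) for j in range(len(Y[0]))]
result = reduce(reduce_columns, map(map_column, columns), {})
```

Each column is processed independently, so the number of map tasks, and hence the total work in this sketch, grows in proportion to the number of columns.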
In the second chapter, we explore a clinical dataset of Multiple Sclerosis (MS) provided by Biogen, which comprises 579 actively managed MS patients enrolled at a single center for up to 5 years. Since no cure for MS is known, Biogen and we are developing statistical models to predict the progression of disability level as a therapeutic guide. Such disability can be roughly quantified by the EDSS (Expanded Disability Status Scale), so we conduct predictive modelling of EDSS. Before arriving at these models, we perform exploratory data analysis and conduct predictive modelling of current EDSS based on measurements in the same year.