Duke University Libraries
DukeSpace Scholarship by Duke Authors

Bayesian Variable Selection in Clustering and Hierarchical Mixture Modeling

Date
2012
Author
Lin, Lin
Advisor
West, Mike
Abstract

Clustering methods are designed to separate heterogeneous data into groups such that objects within a group are similar and objects in different groups are dissimilar. From the machine learning perspective, clustering is one of the most important topics in unsupervised learning, which involves finding structure in a collection of unlabeled data. Various clustering methods have been developed for different problem contexts. In particular, high-dimensional data have stimulated strong interest in combining clustering algorithms with variable selection procedures, and large data sets with expanding dimension have created an increasing need for customized clustering algorithms that can detect low-probability clusters.

This dissertation focuses on the model-based Bayesian approach to clustering. I first develop a new Bayesian Expectation-Maximization algorithm for fitting Dirichlet process mixture models, and an algorithm that identifies clusters under mixture models by aggregating mixture components. These two algorithms are used extensively throughout the dissertation. I then develop the concept and theory of a new variable selection method that evaluates subsets of variables for the discriminatory evidence they provide in multivariate mixture modeling. This new approach to discriminative information analysis uses a natural measure of concordance between mixture component densities. The approach is both effective and computationally attractive for routine use in assessing and prioritizing subsets of variables according to their roles in the discrimination of one or more clusters. I demonstrate that the approach provides an objective basis for including or excluding specific variables in flow cytometry data analysis. These studies show how ranked sets of such variables can be used to optimize clustering strategies and to selectively visualize identified clusters of interest.
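As a rough illustration of what a concordance measure between mixture component densities can look like, the sketch below computes a normalized overlap between two Gaussian components restricted to a subset of variables. The abstract does not define the measure, so the specific form used here (the integral of the product of the two densities, divided by the geometric mean of their self-overlaps), the diagonal-covariance simplification, and the function names `overlap_1d` and `concordance` are all assumptions made for illustration, not the dissertation's method.

```python
import math


def overlap_1d(m1, v1, m2, v2):
    """Integral of N(x | m1, v1) * N(x | m2, v2) over x.

    Uses the standard Gaussian identity: the integral equals the
    N(m1 | m2, v1 + v2) density evaluated at m1.
    """
    v = v1 + v2
    return math.exp(-0.5 * (m1 - m2) ** 2 / v) / math.sqrt(2.0 * math.pi * v)


def concordance(means1, vars1, means2, vars2, subset):
    """Normalized overlap of two diagonal-covariance Gaussian components,
    restricted to the variables in `subset` (ASSUMED measure, see lead-in).

    With diagonal covariances the integrals factorize over variables,
    so the result is a product of per-variable ratios. The value is 1
    for identical components and shrinks toward 0 as they separate.
    """
    c = 1.0
    for j in subset:
        num = overlap_1d(means1[j], vars1[j], means2[j], vars2[j])
        self1 = overlap_1d(means1[j], vars1[j], means1[j], vars1[j])
        self2 = overlap_1d(means2[j], vars2[j], means2[j], vars2[j])
        c *= num / math.sqrt(self1 * self2)
    return c


# Two hypothetical components that differ only in variable 0.
means_a, vars_a = [0.0, 1.0], [1.0, 1.0]
means_b, vars_b = [3.0, 1.0], [1.0, 1.0]

# Lower concordance means the subset discriminates the components better.
scores = {s: concordance(means_a, vars_a, means_b, vars_b, s)
          for s in [(0,), (1,), (0, 1)]}
```

In this toy setup, variable 1 carries no discriminatory information (its concordance stays at 1), while any subset containing variable 0 yields a low concordance, which is the kind of ranking signal the paragraph above describes.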

Next, I create a new approach to Bayesian mixture modeling with large data sets for a specific, important class of problems in biological subtype identification. The context, that of combinatorial encoding in flow cytometry, naturally introduces the hierarchical structure that these new models are designed to incorporate. I describe these novel classes of Bayesian mixture models with hierarchical structures that reflect the underlying problem context. The Bayesian analysis involves structured priors and computations using customized Markov chain Monte Carlo methods for model fitting that exploit a distributed GPU (graphics processing unit) implementation. The hierarchical mixture model is applied in the novel use of automated flow cytometry technology to measure levels of protein markers on thousands to millions of cells.

Finally, I develop a new approach to clustering high-dimensional data based on Kingman's coalescent tree modeling ideas. Under traditional clustering models, the number of parameters required to construct the model increases exponentially with the number of dimensions. This phenomenon can lead to model overfitting and an enormous computational search challenge. The approach addresses these issues by learning the data structure in each individual dimension and combining these dimensions in a flexible tree-based model class. The new tree-based mixture model is studied extensively in a range of simulation studies, in which it outperforms traditional mixture models.

Type
Dissertation
Department
Statistical Science
Subject
Statistics
Permalink
https://hdl.handle.net/10161/5846
Citation
Lin, Lin (2012). Bayesian Variable Selection in Clustering and Hierarchical Mixture Modeling. Dissertation, Duke University. Retrieved from https://hdl.handle.net/10161/5846.
Collections
  • Duke Dissertations
Creative Commons License
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.

Rights for Collection: Duke Dissertations


Works are deposited here by their authors and represent their research and opinions, not those of Duke University. Some materials and descriptions may include offensive content.
