New Advancements of Scalable Statistical Methods for Learning Latent Structures in Big Data

dc.contributor.advisor

Mukherjee, Sayan

dc.contributor.author

Zhao, Shiwen

dc.date.accessioned

2016-06-06T14:37:19Z

dc.date.available

2016-06-06T14:37:19Z

dc.date.issued

2016

dc.department

Computational Biology and Bioinformatics

dc.description.abstract

Continual advances in technology have caused a data explosion in recent years. Accordingly, modern statistical and machine learning methods must be adapted to deal with complex and heterogeneous data types. This is particularly true when analyzing biological data. For example, DNA sequence data can be viewed as categorical variables, with each nucleotide taking one of four categories. Gene expression data, depending on the quantitative technology, could be continuous numbers or counts. With the advancement of high-throughput technology, such data have become unprecedentedly abundant. Efficient statistical approaches are therefore crucial in this big data era.

Previous statistical methods for big data often aim to find low-dimensional structures in the observed data. For example, a factor analysis model assumes a latent multivariate vector that follows a Gaussian distribution. With this assumption, a factor model produces a low-rank estimate of the covariance of the observed variables. Another example is the latent Dirichlet allocation model for documents, which assumes that the mixture proportions of topics are represented by a Dirichlet-distributed variable. This dissertation proposes several novel extensions to these previous statistical methods, developed to address challenges in big data. The new methods are applied in multiple real-world applications, including construction of condition-specific gene co-expression networks, estimating shared topics among newsgroups, analysis of promoter sequences, analysis of political-economics risk data, and estimating population structure from genotype data.

dc.identifier.uri

https://hdl.handle.net/10161/12187

dc.subject

Statistics

dc.subject

Bioinformatics

dc.subject

Mathematics

dc.subject

Bayesian statistics

dc.subject

Big data

dc.subject

Dimension reduction

dc.subject

Latent Structure

dc.subject

Method of Moment

dc.title

New Advancements of Scalable Statistical Methods for Learning Latent Structures in Big Data

dc.type

Dissertation

Files

Original bundle

Name:
Zhao_duke_0066D_13376.pdf
Size:
15.54 MB
Format:
Adobe Portable Document Format

Collections