Browsing by Subject "Dimension reduction"
Item Open Access: Bayesian Computation for High-Dimensional Continuous & Sparse Count Data (2018) by Wang, Ye

Probabilistic modeling of multidimensional data is a common problem in practice. When the data are continuous, one common approach is to suppose that the observed data lie close to a lower-dimensional smooth manifold. There is a rich variety of manifold learning methods available, which allow mapping of data points to the manifold. However, there is a clear lack of probabilistic methods that allow learning of the manifold along with the generative distribution of the observed data. The best attempt is the Gaussian process latent variable model (GP-LVM), but identifiability issues lead to poor performance. We solve these issues by proposing a novel Coulomb repulsive process (Corp) for the locations of points on the manifold, inspired by physical models of electrostatic interactions among particles. Combining this process with a GP prior for the mapping function yields a novel electrostatic GP (electroGP) process.
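The abstract does not give the Corp density, so the following is only a toy numpy sketch of the electrostatic intuition: latent locations on [0, 1] sampled under a Coulomb-like pairwise repulsion, which keeps them from collapsing onto each other (the identifiability failure mode of the plain GP-LVM). The potential, temperature `beta`, and sampler below are illustrative choices, not the thesis's construction.

```python
import numpy as np

rng = np.random.default_rng(0)

def coulomb_potential(x):
    # Pairwise Coulomb-like repulsion energy: sum_{i<j} 1 / |x_i - x_j|.
    diffs = np.abs(x[:, None] - x[None, :])
    iu = np.triu_indices(len(x), k=1)
    return np.sum(1.0 / (diffs[iu] + 1e-12))

def sample_repulsive(n=20, beta=0.05, steps=5000, prop_sd=0.02):
    """Random-walk Metropolis targeting exp(-beta * potential) on [0, 1]^n."""
    x = rng.uniform(0, 1, n)
    e = coulomb_potential(x)
    for _ in range(steps):
        i = rng.integers(n)
        prop = x.copy()
        prop[i] = np.clip(prop[i] + rng.normal(0, prop_sd), 0, 1)
        e_prop = coulomb_potential(prop)
        if np.log(rng.uniform()) < -beta * (e_prop - e):  # Metropolis step
            x, e = prop, e_prop
    return np.sort(x)

print(sample_repulsive())  # locations spread out rather than clumping
```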
Another popular approach is to suppose that the observed data lie close to one, or a union, of lower-dimensional linear subspaces. However, popular methods such as probabilistic principal component analysis scale poorly computationally. We introduce a novel empirical Bayesian method, which we term geometric density estimation (GEODE), that assumes the data are centered near a low-dimensional linear subspace. We show that, under mild assumptions on the prior, the posterior mode is the subspace spanned by the principal axes of the data. Hence, by leveraging this geometric information, GEODE scales easily to massively high-dimensional problems. It is also capable of learning the intrinsic dimension via a novel shrinkage prior. Finally, we mix GEODE across a dyadic clustering tree to handle nonlinear cases.
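A minimal sketch of the geometric result the abstract states: if the posterior mode of the subspace is spanned by the principal axes, fitting it reduces to a thin SVD of the centered data. The function below is plain SVD bookkeeping; GEODE's empirical-Bayes machinery and shrinkage prior for the intrinsic dimension are not reproduced here.

```python
import numpy as np

def principal_subspace(Y, d):
    """Return the top-d principal axes of data Y (n x p) and all singular values.

    Per the abstract's result, the posterior-mode subspace under GEODE's
    prior is spanned by these axes; here we simply compute them with a
    thin SVD of the centered data matrix.
    """
    Yc = Y - Y.mean(axis=0)
    _, svals, Vt = np.linalg.svd(Yc, full_matrices=False)
    return Vt[:d].T, svals  # columns of the first output span the subspace

# Toy check: data near a 2-d subspace of R^50 plus small noise.
rng = np.random.default_rng(1)
basis = np.linalg.qr(rng.normal(size=(50, 2)))[0]
Y = rng.normal(size=(500, 2)) @ basis.T + 0.01 * rng.normal(size=(500, 50))
W, svals = principal_subspace(Y, d=2)
print(svals[:4])  # two large singular values, then a sharp drop
```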
When the data are discrete, a common strategy is to define a generalized linear model (GLM) for each variable, with dependence among the variables induced by including multivariate latent variables in the GLMs. Bayesian inference for these models usually relies on data-augmentation Markov chain Monte Carlo (DA-MCMC), which has a provably slow mixing rate when the data are imbalanced. For more scalable inference, we propose Bayesian mosaic, a parallelizable composite posterior, for a subclass of multivariate discrete data models. Sampling is embarrassingly parallel since Bayesian mosaic is a product of component posteriors that can be sampled from independently. Analogous to composite likelihood methods, these component posteriors are based on univariate or bivariate marginal densities. Using the fact that the score functions of these densities are unbiased, we show that Bayesian mosaic is consistent and asymptotically normal under mild conditions. Since the univariate or bivariate marginal densities can be evaluated via numerical integration, sampling from Bayesian mosaic completely bypasses DA-MCMC. Moreover, we show that sampling from Bayesian mosaic scales better to large sample sizes than DA-MCMC.
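A skeleton of the embarrassingly parallel structure the abstract describes, under strong simplifying assumptions: each component posterior here is a univariate Poisson marginal with a flat prior, sampled by random-walk Metropolis on its own worker. The thesis's actual component posteriors, models, and combination rule differ; this only illustrates why the components need no communication.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def component_posterior(args):
    """Metropolis sampler for one univariate marginal: y_j ~ Poisson(exp(theta_j)).

    Each component touches only its own column of the data, so all
    components can run on separate workers independently.
    """
    y, steps, seed = args
    rng = np.random.default_rng(seed)
    def loglik(t):
        return np.sum(y * t - np.exp(t))  # Poisson log-likelihood, flat prior
    theta, ll, draws = 0.0, loglik(0.0), []
    for _ in range(steps):
        prop = theta + rng.normal(0, 0.1)
        ll_prop = loglik(prop)
        if np.log(rng.uniform()) < ll_prop - ll:  # accept/reject
            theta, ll = prop, ll_prop
        draws.append(theta)
    return np.array(draws[steps // 2:])  # discard burn-in

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    Y = rng.poisson(lam=[2.0, 7.0, 0.5], size=(1000, 3))
    tasks = [(Y[:, j], 4000, j) for j in range(Y.shape[1])]
    with ProcessPoolExecutor() as pool:  # embarrassingly parallel
        posts = list(pool.map(component_posterior, tasks))
    print([float(np.exp(p.mean())) for p in posts])  # roughly [2.0, 7.0, 0.5]
```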
The performance of the proposed methods and models is demonstrated via both simulation studies and real-world applications.
Item Open Access: Computational Methods For Functional Motif Identification and Approximate Dimension Reduction in Genomic Data (2011) by Georgiev, Stoyan

Uncovering the DNA regulatory logic of complex organisms has been one of the important goals of modern biology in the post-genomic era. The sequencing of multiple genomes, combined with the advent of DNA microarrays and, more recently, of massively parallel high-throughput sequencing technologies, has made it possible to adopt a global perspective on inferring the regulatory rules that govern the context-specific interpretation of the genetic code, complementing more focused classical experimental approaches. Extracting useful information and managing the complexity arising from the sheer volume and high dimensionality of the data produced by these genomic assays has emerged as a major challenge, which we address in this work by developing computational methods and tools specifically designed for studying gene regulatory processes in this new global genomic context.
First, we focus on the genome-wide discovery of physical interactions between regulatory sequence regions and their cognate proteins at both the DNA and RNA level. We present a motif analysis framework that leverages genome-wide evidence for sequence-specific interactions between trans-acting factors and their preferred cis-acting regulatory regions. The utility of the proposed framework is demonstrated on DNA and RNA cross-linking high-throughput data.
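The framework itself is not spelled out in the abstract; the following is only a toy illustration of the underlying motif-enrichment idea: score k-mers by their log-frequency ratio between factor-bound sequences and background sequences. The sequence sets, k, and pseudocounts are hypothetical.

```python
import math
from collections import Counter

def kmer_counts(seqs, k):
    """Count all overlapping k-mers in a collection of DNA sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts

def enrichment(bound, background, k=4, pseudo=1.0):
    """Log-ratio enrichment of each k-mer in bound vs. background sequences,
    with pseudocounts so unseen k-mers do not produce infinities."""
    fg, bg = kmer_counts(bound, k), kmer_counts(background, k)
    fg_tot = sum(fg.values()) + pseudo * 4 ** k
    bg_tot = sum(bg.values()) + pseudo * 4 ** k
    return {m: math.log((fg[m] + pseudo) / fg_tot)
               - math.log((bg.get(m, 0) + pseudo) / bg_tot)
            for m in fg}

bound = ["ACGTACGTGGAC", "TTACGTACGTAA"]      # hypothetical cross-linked reads
background = ["ACACACACACAC", "GGGTTTAAACCC"]  # hypothetical background
scores = enrichment(bound, background, k=4)
print(max(scores, key=scores.get))  # ACGT-containing k-mers score highest
```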
A second goal of this thesis is the development of scalable approaches to dimension reduction based on spectral decomposition, and their application to the study of population structure in massive high-dimensional genetic data sets. We have developed computational tools and performed theoretical and empirical analyses of their statistical properties, with particular emphasis on the analysis of individual genetic variation measured by Single Nucleotide Polymorphism (SNP) microarrays.
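A sketch of the kind of spectral computation involved, assuming the standard PCA-on-genotypes setup (individuals by SNPs, genotypes coded 0/1/2): a randomized SVD approximates the top principal components at a cost that scales to large SNP matrices. This is generic randomized PCA, not the thesis's specific approximation or its error analysis.

```python
import numpy as np

def randomized_pca(G, k, oversample=10, seed=0):
    """Approximate top-k PC scores of a genotype matrix G
    (individuals x SNPs, entries 0/1/2) via randomized range finding."""
    rng = np.random.default_rng(seed)
    X = G - G.mean(axis=0)                  # center each SNP column
    Omega = rng.normal(size=(X.shape[1], k + oversample))
    Q, _ = np.linalg.qr(X @ Omega)          # orthonormal basis for the range of X
    Ub, s, _ = np.linalg.svd(Q.T @ X, full_matrices=False)
    return (Q @ Ub)[:, :k] * s[:k]          # PC scores per individual

# Toy data: two populations with shifted allele frequencies.
rng = np.random.default_rng(3)
f1 = rng.uniform(0.1, 0.9, 2000)
f2 = np.clip(f1 + rng.normal(0, 0.08, 2000), 0.02, 0.98)
G = np.vstack([rng.binomial(2, f1, (100, 2000)),
               rng.binomial(2, f2, (100, 2000))])
scores = randomized_pca(G, k=2)
print(scores[:3, 0], scores[-3:, 0])  # PC1 should separate the two groups
```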
Item Open Access: New Advancements of Scalable Statistical Methods for Learning Latent Structures in Big Data (2016) by Zhao, Shiwen

Constant technological advances have caused a data explosion in recent years. Accordingly, modern statistical and machine learning methods must be adapted to deal with complex and heterogeneous data types. This is particularly true for analyzing biological data. For example, DNA sequence data can be viewed as categorical variables, with each nucleotide taking one of four categories. Gene expression data, depending on the quantification technology, may be continuous numbers or counts. With the advancement of high-throughput technology, such data have become unprecedentedly rich. Efficient statistical approaches are therefore crucial in this big-data era.
Previous statistical methods for big data often aim to find low-dimensional structures in the observed data. For example, a factor analysis model assumes a latent Gaussian-distributed multivariate vector, and under this assumption produces a low-rank estimate of the covariance of the observed variables. Another example is the latent Dirichlet allocation model for documents, in which the mixture proportions of topics are represented by a Dirichlet-distributed variable. This dissertation proposes several novel extensions of these statistical methods to address challenges in big data. The new methods are applied in multiple real-world applications, including the construction of condition-specific gene co-expression networks, estimating shared topics among newsgroups, analysis of promoter sequences, analysis of political-economic risk data, and estimating population structure from genotype data.
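For concreteness, a small sketch of the factor-analysis example: with a k-factor model, the implied covariance is the low-rank-plus-diagonal matrix Lambda Lambda^T + Psi. The snippet fits the model with scikit-learn (an illustrative choice, not the dissertation's algorithm) and reconstructs that covariance estimate.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(4)

# Simulate data from a k=3 factor model: y = Lambda @ eta + noise.
p, k, n = 30, 3, 2000
Lambda = rng.normal(size=(p, k))
Y = rng.normal(size=(n, k)) @ Lambda.T + rng.normal(scale=0.5, size=(n, p))

fa = FactorAnalysis(n_components=k).fit(Y)
# Implied low-rank covariance estimate: Lambda Lambda^T + diag(psi).
Sigma_hat = fa.components_.T @ fa.components_ + np.diag(fa.noise_variance_)
print(np.abs(Sigma_hat - np.cov(Y.T)).max())  # close to the sample covariance
```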
Item Open Access: Probabilistic Models on Fibre Bundles (2019) by Shan, Shan

In this thesis, we propose probabilistic models on fibre bundles for learning the generative process of data. The main tool we use is the diffusion kernel, and we use it in two ways. First, from the diffusion kernel on a fibre bundle we build a projected kernel that generates robust representations of the data, and we show that it outperforms regular diffusion maps under noise. Second, this diffusion kernel gives rise to a natural covariance function for defining Gaussian processes (GPs) on the fibre bundle. To demonstrate the use of GPs on a fibre bundle, we apply them to simulated data on a Möbius strip for prediction and regression problems. Parameter tuning can also be guided by a novel semi-group test arising from the geometric properties of the diffusion kernel. As a real-world application, we use probabilistic models on fibre bundles to study evolutionary processes on anatomical surfaces. In a separate chapter, we propose a robust algorithm (ariaDNE) for computing curvature on each individual surface. The proposed machinery, relating diffusion processes to probabilistic models on fibre bundles, provides a unified framework for ideas from a variety of topics, including geometric operators, dimension reduction, regression, and Bayesian statistics.
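A minimal sketch of the diffusion-kernel-as-covariance idea on a simple closed manifold (a circle rather than a fibre bundle): build a graph heat kernel exp(-tL) from sampled points, use it as a GP covariance for regression, and check the semi-group property K(t1) K(t2) = K(t1 + t2) that the abstract's tuning test exploits. The bandwidths, diffusion times, and Laplacian construction are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import expm
from scipy.spatial.distance import cdist

def heat_kernel(X, t=0.1, eps=0.1):
    """Graph heat kernel exp(-t L) from a Gaussian-weighted graph on points X."""
    W = np.exp(-cdist(X, X) ** 2 / eps)
    L = np.diag(W.sum(axis=1)) - W       # unnormalized graph Laplacian
    return expm(-t * L)                  # symmetric positive definite

def gp_predict(K, y, train_idx, test_idx, noise=1e-3):
    """Standard GP regression posterior mean using K as the covariance."""
    Ktt = K[np.ix_(train_idx, train_idx)] + noise * np.eye(len(train_idx))
    Kst = K[np.ix_(test_idx, train_idx)]
    return Kst @ np.linalg.solve(Ktt, y[train_idx])

# Points on a circle with a smooth signal to regress.
theta = np.linspace(0, 2 * np.pi, 200, endpoint=False)
X = np.c_[np.cos(theta), np.sin(theta)]
y = np.sin(3 * theta)
train, test = np.arange(0, 200, 4), np.arange(2, 200, 4)
pred = gp_predict(heat_kernel(X), y, train, test)
print(np.abs(pred - y[test]).max())  # interpolation error; small for smooth y

# Semi-group check in the spirit of the abstract: K(t1) K(t2) = K(t1 + t2).
K1, K2, K3 = heat_kernel(X, 0.2), heat_kernel(X, 0.3), heat_kernel(X, 0.5)
print(np.abs(K1 @ K2 - K3).max())  # near machine precision
```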