dc.description.abstract |
<p>Capturing high-dimensional, complex ensembles of data is becoming commonplace in
a variety of application areas. Examples include biological studies exploring
relationships between genetic mutations and diseases, atmospheric and spatial data,
and internet usage and online behavioral data. These large, complex data sets present many
challenges for modeling and statistical analysis. Motivated by high-dimensional
data applications, this thesis focuses on building scalable Bayesian nonparametric
regression algorithms and on developing models for joint distributions of complex
object ensembles.</p><p>We begin with a scalable method for Gaussian process regression,
a commonly used tool for nonparametric regression, prediction and spatial modeling.
A common bottleneck for large data sets is the need for repeated inversion of
a large covariance matrix, which is required for likelihood evaluation and inference.
Such inversion can be practically infeasible and, even when implemented, highly numerically
unstable. We propose an algorithm based on random projection ideas that yields flexible,
computationally efficient, and easy-to-implement approaches for generic scenarios.
We then further improve the algorithm by incorporating structure and blocking ideas
into the random projections, and we demonstrate their applicability in other contexts requiring
inversion of large covariance matrices. We provide theoretical guarantees on performance,
as well as substantial improvements over existing methods on simulated and real
data. A by-product of this work is the discovery of hitherto unknown equivalences between
approaches in machine learning, randomized linear algebra, and Bayesian statistics. Finally,
we connect random projection methods for high-dimensional predictors and for large
sample sizes under a unifying theoretical framework.</p>
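<p>To make the covariance-inversion bottleneck concrete, the following minimal sketch illustrates one generic way random projections can be used for this task: the kernel matrix is compressed through a Gaussian random projection into a low-rank factorization, and the Woodbury identity then reduces the solve to small linear systems. This is an illustrative instance only; the function name rp_gp_solve, the Gaussian sketch, and the Nystrom-style low-rank form are assumptions made for the example rather than the specific construction developed in the thesis.</p>
<pre><code>
import numpy as np

def rp_gp_solve(K, y, sigma2, m, seed=None):
    """Approximately solve (K + sigma2 * I) x = y using a random-projection,
    low-rank (Nystrom-style) approximation of the kernel matrix K.

    Illustrative sketch: cost is O(n m^2 + m^3) instead of O(n^3).
    """
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    Omega = rng.standard_normal((n, m))   # Gaussian random projection
    C = K @ Omega                         # n x m sketch of K
    W = Omega.T @ C                       # m x m core matrix
    # K is approximated by C W^{-1} C^T; the Woodbury identity turns the
    # n x n solve into an m x m solve:
    # (sigma2 I + C W^{-1} C^T)^{-1} y = (y - C (sigma2 W + C^T C)^{-1} C^T y) / sigma2
    A = sigma2 * W + C.T @ C
    return (y - C @ np.linalg.solve(A, C.T @ y)) / sigma2
</code></pre>
<p>For n observations and a projection dimension m much smaller than n, only m-dimensional systems are ever factorized, which is what makes schemes of this kind attractive when forming or inverting the full n x n covariance matrix is infeasible.</p>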
<p>The other focus of this thesis is the joint modeling of complex ensembles of data from different domains. This
goes beyond traditional relational modeling of ensembles of one type of data and relies
on probability mixing measures over tensors. These models add flexibility over
some existing product mixture model approaches by letting each component of the ensemble
have its own dependent cluster structure. We further investigate the question of measuring
dependence between variables of different types and propose a novel, very general scaled
measure based on divergences between the joint and the marginal distributions of the objects.
Once again, we show excellent performance in both simulated and real data scenarios.</p>
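<p>As one concrete illustration of such a scaled dependence measure, the sketch below computes, for two discrete variables, the Kullback-Leibler divergence between the empirical joint distribution and the product of its marginals (i.e., the mutual information) and rescales it to lie in [0, 1]. The choice of the KL divergence, the normalization by the smaller marginal entropy, and the name scaled_dependence are assumptions made for this example; the measure developed in the thesis is more general and applies to objects of different types.</p>
<pre><code>
import numpy as np

def scaled_dependence(x, y, eps=1e-12):
    """Illustrative scaled dependence measure for two discrete variables:
    KL divergence between the empirical joint distribution and the product
    of its marginals (mutual information), rescaled to [0, 1] by the
    smaller marginal entropy."""
    x_codes = np.unique(x, return_inverse=True)[1]
    y_codes = np.unique(y, return_inverse=True)[1]
    joint = np.zeros((x_codes.max() + 1, y_codes.max() + 1))
    np.add.at(joint, (x_codes, y_codes), 1.0)   # empirical joint counts
    joint /= joint.sum()                        # empirical joint distribution
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    prod = np.outer(px, py)                     # product of the marginals
    mask = joint > 0
    mi = np.sum(joint[mask] * np.log(joint[mask] / (prod[mask] + eps)))
    hx = -np.sum(px[px > 0] * np.log(px[px > 0]))
    hy = -np.sum(py[py > 0] * np.log(py[py > 0]))
    denom = min(hx, hy)
    return mi / denom if denom > 0 else 0.0
</code></pre>
<p>Up to sampling noise, the value is near 0 for independent variables and equals 1 when one variable is a deterministic function of the other.</p>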