Nonparametric Bayes for Big Data
Classical asymptotic theory deals with models in which the sample size $n$ goes to infinity with the number of parameters $p$ being fixed. However, rapid advancement of technology has empowered today's scientists to collect a huge number of explanatory variables
to predict a response. Many modern applications in science and engineering belong to the ``big data" regime in which both $p$ and $n$ may be very large. A variety of genomic applications even have $p$ substantially greater than $n$. With the advent of MCMC, Bayesian approaches exploded in popularity. Bayesian inference often allows easier interpretability than frequentist inference. Therefore, it becomes important to understand and evaluate
Bayesian procedures for ``big data" from a frequentist perspective.
In this dissertation, we address a number of questions related to solving large-scale statistical problems via Bayesian nonparametric methods.
It is well-known that classical estimators can be inconsistent in the high-dimensional regime without any constraints on the model. Therefore, imposing additional low-dimensional structures on the high-dimensional ambient space becomes inevitable. In the first two chapters of the thesis, we study the prediction performance of high-dimensional nonparametric regression from a minimax point of view. We consider two different low-dimensional constraints: 1. the response depends only on a small subset of the covariates; 2. the covariates lie on a low dimensional manifold in the original high dimensional ambient space. We also provide Bayesian nonparametric methods based on Gaussian process priors that are shown to be adaptive to unknown smoothness or low-dimensional manifold structure by attaining minimax convergence rates up to log factors. In chapter 3, we consider high-dimensional classification problems where all data are of categorical nature. We build a parsimonious model based on Bayesian tensor factorization for classification while doing inferences on the important predictors.
It is generally believed that ensemble approaches, which combine multiple algorithms or models, can outperform any single algorithm at machine learning tasks, such as prediction. In chapter 5, we propose Bayesian convex and linear aggregation approaches motivated by regression applications. We show that the proposed approach is minimax optimal when the true data-generating model is a convex or linear combination of models in the list. Moreover, the method can adapt to sparsity structure in which certain models should receive zero weights, and the method is tuning parameter free unlike competitors. More generally, under an M-open view when the truth falls outside the space of all convex/linear combinations, our theory suggests that the posterior measure tends to concentrate on the best approximation of the truth at the minimax rate.
Chapter 6 is devoted to sequential Markov chain Monte Carlo algorithms for Bayesian on-line learning of big data. The last chapter attempts to justify the use of posterior distribution to conduct statistical inferences for semiparametric estimation problems (the semiparametric Bernstein von-Mises theorem) from a frequentist perspective.
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Rights for Collection: Duke Dissertations