# Browsing by Author "Dunson, David B"

Item Open Access A Foxp2 Mutation Implicated in Human Speech Deficits Alters Sequencing of Ultrasonic Vocalizations in Adult Male Mice. (Front Behav Neurosci, 2016) Chabout, Jonathan; Sarkar, Abhra; Patel, Sheel R; Radden, Taylor; Dunson, David B; Fisher, Simon E; Jarvis, Erich D

Development of proficient spoken language skills is disrupted by mutations of the FOXP2 transcription factor. A heterozygous missense mutation in the KE family causes speech apraxia, involving difficulty producing words with complex learned sequences of syllables. Manipulations in songbirds have helped to elucidate the role of this gene in vocal learning, but findings in non-human mammals have been limited or inconclusive. Here, we performed a systematic study of ultrasonic vocalizations (USVs) of adult male mice carrying the KE family mutation. Using novel statistical tools, we found that Foxp2 heterozygous mice did not have detectable changes in USV syllable acoustic structure, but produced shorter sequences and did not shift to more complex syntax in social contexts where wildtype animals did. Heterozygous mice also displayed a shift in the position of their rudimentary laryngeal motor cortex (LMC) layer-5 neurons. Our findings indicate that although mouse USVs are mostly innate, the underlying contributions of FoxP2 to sequencing of vocalizations are conserved with humans.

Item Open Access Bayes High-Dimensional Density Estimation Using Multiscale Dictionaries (2014) Wang, Ye

Although Bayesian density estimation using discrete mixtures has good performance in modest dimensions, there is a lack of statistical and computational scalability to high-dimensional multivariate cases. To combat the curse of dimensionality, it is necessary to assume the data are concentrated near a lower-dimensional subspace. However, Bayesian methods for learning this subspace along with the density of the data scale poorly computationally. To solve this problem, we propose an empirical Bayes approach, which estimates a multiscale dictionary using geometric multiresolution analysis in a first stage. We use this dictionary within a multiscale mixture model, which allows uncertainty in component allocation, mixture weights and scaling factors over a binary tree. A computational algorithm is proposed, which scales efficiently to massive dimensional problems. We provide some theoretical support for this method, and illustrate the performance through simulated and real data examples.

Item Open Access Bayesian Computation for High-Dimensional Continuous & Sparse Count Data (2018) Wang, Ye

Probabilistic modeling of multidimensional data is a common problem in practice. When the data are continuous, one common approach is to suppose that the observed data are close to a lower-dimensional smooth manifold. There is a rich variety of manifold learning methods available, which allow mapping of data points to the manifold. However, there is a clear lack of probabilistic methods that allow learning of the manifold along with the generative distribution of the observed data. The best attempt is the Gaussian process latent variable model (GP-LVM), but identifiability issues lead to poor performance. We solve these issues by proposing a novel Coulomb repulsive process (Corp) for locations of points on the manifold, inspired by physical models of electrostatic interactions among particles. Combining this process with a GP prior for the mapping function yields a novel electrostatic GP (electroGP) process.

Another popular approach is to suppose that the observed data are close to one or a union of lower-dimensional linear subspaces. However, popular methods such as probabilistic principal component analysis scale poorly computationally. We introduce a novel empirical Bayesian method that we term geometric density estimation (GEODE), which assumes the data are centered near a low-dimensional linear subspace. We show that, under mild assumptions on the prior, the posterior mode is attained at the subspace spanned by the principal axes of the data. Hence, by leveraging the geometric information in the data, GEODE easily scales to massive dimensional problems. It is also capable of learning the intrinsic dimension via a novel shrinkage prior. Finally, we mix GEODE across a dyadic clustering tree to account for nonlinear cases.
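To make the geometric step concrete, here is a minimal, hypothetical sketch (not the GEODE implementation itself) of the kind of empirical-Bayes plug-in the abstract alludes to: the principal axes of the centered data, obtained from a truncated SVD, serve as the estimate of the low-dimensional subspace near which the data concentrate.

```python
import numpy as np

def principal_subspace(X, d):
    """Estimate a d-dimensional linear subspace near which the data concentrate.

    A rough empirical-Bayes-style plug-in: center the data and take the top-d
    right singular vectors as the subspace basis (the principal axes). This is
    only an illustration of the geometric step, not the full GEODE model.
    """
    mu = X.mean(axis=0)
    Xc = X - mu
    # exact SVD used here for clarity; randomized solvers scale better in huge dimensions
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    basis = Vt[:d].T          # ambient_dim x d orthonormal basis
    scores = Xc @ basis       # low-dimensional coordinates of each observation
    return mu, basis, scores

# toy usage: 1000 points near a 3-dimensional subspace of R^50
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3)) @ rng.normal(size=(3, 50)) + 0.05 * rng.normal(size=(1000, 50))
mu, basis, scores = principal_subspace(X, d=3)
```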

When data are discrete, a common strategy is to define a generalized linear model (GLM) for each variable, with dependence across the variables induced by including multivariate latent variables in the GLMs. Bayesian inference for these models usually relies on data-augmented Markov chain Monte Carlo (DA-MCMC), which has a provably slow mixing rate when the data are imbalanced. For more scalable inference, we propose Bayesian mosaic, a parallelizable composite posterior, for scalable Bayesian inference on a subclass of multivariate discrete data models. Sampling is embarrassingly parallel since Bayesian mosaic is a multiplication of component posteriors that can be independently sampled from. Analogous to composite likelihood methods, these component posteriors are based on univariate or bivariate marginal densities. Utilizing the fact that the score functions of these densities are unbiased, we show that Bayesian mosaic is consistent and asymptotically normal under mild conditions. Since the evaluation of univariate or bivariate marginal densities can be done via numerical integration, sampling from Bayesian mosaic completely bypasses the traditional DA-MCMC approach. Moreover, we show that sampling from Bayesian mosaic also scales better to large sample sizes than DA-MCMC.

The performance of the proposed methods and models is demonstrated via both simulation studies and real-world applications.

Item Open Access Bayesian Gaussian Copula Factor Models for Mixed Data. (J Am Stat Assoc, 2013-06-01) Murray, Jared S; Dunson, David B; Carin, Lawrence; Lucas, Joseph E

Gaussian factor models have proven widely useful for parsimoniously characterizing dependence in multivariate data. There is a rich literature on their extension to mixed categorical and continuous variables, using latent Gaussian variables or through generalized latent trait models accommodating measurements in the exponential family. However, when generalizing to non-Gaussian measured variables the latent variables typically influence both the dependence structure and the form of the marginal distributions, complicating interpretation and introducing artifacts. To address this problem we propose a novel class of Bayesian Gaussian copula factor models which decouple the latent factors from the marginal distributions. A semiparametric specification for the marginals based on the extended rank likelihood yields straightforward implementation and substantial computational gains. We provide new theoretical and empirical justifications for using this likelihood in Bayesian inference. We propose new default priors for the factor loadings and develop efficient parameter-expanded Gibbs sampling for posterior computation. The methods are evaluated through simulations and applied to a dataset in political science. The models in this paper are implemented in the R package bfa.

Item Open Access Bayesian inference for genomic data integration reduces misclassification rate in predicting protein-protein interactions. (PLoS Comput Biol, 2011-07) Xing, Chuanhua; Dunson, David B

Protein-protein interactions (PPIs) are essential to most fundamental cellular processes. There has been increasing interest in reconstructing PPI networks. However, several critical difficulties exist in obtaining reliable predictions. Noticeably, false positive rates can be as high as >80%. Error correction from each generating source can be both time-consuming and inefficient due to the difficulty of covering the errors from multiple levels of data processing procedures within a single test. We propose a novel Bayesian integration method, termed nonparametric Bayes ensemble learning (NBEL), to lower the misclassification rate (both false positives and negatives) by automatically up-weighting data sources that are most informative, while down-weighting less informative and biased sources. Extensive studies indicate that NBEL is significantly more robust than the classic naïve Bayes to unreliable, error-prone and contaminated data. On a large human data set our NBEL approach predicts many more PPIs than naïve Bayes. This suggests that previous studies may have large numbers of not only false positives but also false negatives. Validation on two high-quality human PPI datasets supports our observations. Our experiments demonstrate that it is feasible to predict high-throughput PPIs computationally with substantially reduced false positives and false negatives. The ability to predict large numbers of PPIs both reliably and automatically may inspire people to use computational approaches to correct data errors in general, and may speed up PPI prediction with high quality. Such reliable prediction may provide a solid platform for other studies, such as prediction of protein functions and of the roles of PPIs in disease susceptibility.

Item Open Access Bayesian Inference in Large-scale Problems (2016) Johndrow, James Edward

Many modern applications fall into the category of "large-scale" statistical problems, in which both the number of observations n and the number of features or parameters p may be large. Many existing methods focus on point estimation, despite the continued relevance of uncertainty quantification in the sciences, where the number of parameters to estimate often exceeds the sample size, despite huge increases in the value of n typically seen in many fields. Thus, the tendency in some areas of industry to dispense with traditional statistical analysis on the basis that "n=all" is of little relevance outside of certain narrow applications. The main result of the Big Data revolution in most fields has instead been to make computation much harder without reducing the importance of uncertainty quantification. Bayesian methods excel at uncertainty quantification, but often scale poorly relative to alternatives. This conflict between the statistical advantages of Bayesian procedures and their substantial computational disadvantages is perhaps the greatest challenge facing modern Bayesian statistics, and is the primary motivation for the work presented here.

Two general strategies for scaling Bayesian inference are considered. The first is the development of methods that lend themselves to faster computation, and the second is design and characterization of computational algorithms that scale better in n or p. In the first instance, the focus is on joint inference outside of the standard problem of multivariate continuous data that has been a major focus of previous theoretical work in this area. In the second area, we pursue strategies for improving the speed of Markov chain Monte Carlo algorithms, and characterizing their performance in large-scale settings. Throughout, the focus is on rigorous theoretical evaluation combined with empirical demonstrations of performance and concordance with the theory.

One topic we consider is modeling the joint distribution of multivariate categorical data, often summarized in a contingency table. Contingency table analysis routinely relies on log-linear models, with latent structure analysis providing a common alternative. Latent structure models lead to a reduced rank tensor factorization of the probability mass function for multivariate categorical data, while log-linear models achieve dimensionality reduction through sparsity. Little is known about the relationship between these notions of dimensionality reduction in the two paradigms. In Chapter 2, we derive several results relating the support of a log-linear model to nonnegative ranks of the associated probability tensor. Motivated by these findings, we propose a new collapsed Tucker class of tensor decompositions, which bridge existing PARAFAC and Tucker decompositions, providing a more flexible framework for parsimoniously characterizing multivariate categorical data. Taking a Bayesian approach to inference, we illustrate empirical advantages of the new decompositions.
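To illustrate the kind of reduced-rank tensor factorization referred to above, here is a small, hedged sketch (a plain latent class / PARAFAC representation, purely illustrative and not the collapsed Tucker decomposition proposed in Chapter 2) of a rank-k factorization of a joint probability mass function for three categorical variables.

```python
import numpy as np

rng = np.random.default_rng(1)

def random_parafac_pmf(levels, k):
    """Build a joint pmf for three categorical variables as a rank-k PARAFAC tensor.

    p(y1, y2, y3) = sum_h lam[h] * psi_1[h, y1] * psi_2[h, y2] * psi_3[h, y3],
    i.e. a finite mixture of product-multinomial kernels (a latent class model).
    """
    lam = rng.dirichlet(np.ones(k))                              # mixture weights
    psis = [rng.dirichlet(np.ones(d), size=k) for d in levels]   # k x d_j conditionals
    pmf = np.zeros(levels)                                       # full tensor (small example only)
    for h in range(k):
        pmf += lam[h] * np.einsum('i,j,l->ijl', psis[0][h], psis[1][h], psis[2][h])
    return lam, psis, pmf

lam, psis, pmf = random_parafac_pmf(levels=(3, 4, 2), k=2)
assert np.isclose(pmf.sum(), 1.0)   # a valid joint pmf of nonnegative rank <= 2
```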

Latent class models for the joint distribution of multivariate categorical data, such as the PARAFAC decomposition, play an important role in the analysis of population structure. In this context, the number of latent classes is interpreted as the number of genetically distinct subpopulations of an organism, an important factor in the analysis of evolutionary processes and conservation status. Existing methods focus on point estimates of the number of subpopulations, and lack robust uncertainty quantification. Moreover, whether the number of latent classes in these models is even an identified parameter is an open question. In Chapter 3, we show that when the model is properly specified, the correct number of subpopulations can be recovered almost surely. We then propose an alternative method for estimating the number of latent subpopulations that provides good quantification of uncertainty, and provide a simple procedure for verifying that the proposed method is consistent for the number of subpopulations. The performance of the model in estimating the number of subpopulations and other common population structure inference problems is assessed in simulations and a real data application.

In contingency table analysis, sparse data is frequently encountered for even modest numbers of variables, resulting in non-existence of maximum likelihood estimates. A common solution is to obtain regularized estimates of the parameters of a log-linear model. Bayesian methods provide a coherent approach to regularization, but are often computationally intensive. Conjugate priors ease computational demands, but the conjugate Diaconis--Ylvisaker priors for the parameters of log-linear models do not give rise to closed form credible regions, complicating posterior inference. In Chapter 4 we derive the optimal Gaussian approximation to the posterior for log-linear models with Diaconis--Ylvisaker priors, and provide convergence rate and finite-sample bounds for the Kullback-Leibler divergence between the exact posterior and the optimal Gaussian approximation. We demonstrate empirically in simulations and a real data application that the approximation is highly accurate, even in relatively small samples. The proposed approximation provides a computationally scalable and principled approach to regularized estimation and approximate Bayesian inference for log-linear models.

Another challenging and somewhat non-standard joint modeling problem is inference on tail dependence in stochastic processes. In applications where extreme dependence is of interest, data are almost always time-indexed. Existing methods for inference and modeling in this setting often cluster extreme events or choose window sizes with the goal of preserving temporal information. In Chapter 5, we propose an alternative paradigm for inference on tail dependence in stochastic processes with arbitrary temporal dependence structure in the extremes, based on the idea that the information on strength of tail dependence and the temporal structure in this dependence are both encoded in waiting times between exceedances of high thresholds. We construct a class of time-indexed stochastic processes with tail dependence obtained by endowing the support points in de Haan's spectral representation of max-stable processes with velocities and lifetimes. We extend Smith's model to these max-stable velocity processes and obtain the distribution of waiting times between extreme events at multiple locations. Motivated by this result, a new definition of tail dependence is proposed that is a function of the distribution of waiting times between threshold exceedances, and an inferential framework is constructed for estimating the strength of extremal dependence and quantifying uncertainty in this paradigm. The method is applied to climatological, financial, and electrophysiology data.

The remainder of this thesis focuses on posterior computation by Markov chain Monte Carlo (MCMC), the dominant paradigm for posterior computation in Bayesian analysis. It has long been common to control computation time by making approximations to the Markov transition kernel, but comparatively little attention has been paid to convergence and estimation error in the resulting approximating Markov chains. In Chapter 6, we propose a framework for assessing when to use approximations in MCMC algorithms, and how much error in the transition kernel should be tolerated to obtain optimal estimation performance with respect to a specified loss function and computational budget. The results require only ergodicity of the exact kernel and control of the kernel approximation accuracy. The theoretical framework is applied to approximations based on random subsets of data, low-rank approximations of Gaussian processes, and a novel approximating Markov chain for discrete mixture models.

Data augmentation Gibbs samplers are arguably the most popular class of algorithm for approximately sampling from the posterior distribution for the parameters of generalized linear models. The truncated Normal and Polya-Gamma data augmentation samplers are standard examples for probit and logit links, respectively. Motivated by an important problem in quantitative advertising, in Chapter 7 we consider the application of these algorithms to modeling rare events. We show that when the sample size is large but the observed number of successes is small, these data augmentation samplers mix very slowly, with a spectral gap that converges to zero at a rate at least proportional to the reciprocal of the square root of the sample size up to a log factor. In simulation studies, moderate sample sizes result in high autocorrelations and small effective sample sizes. Similar empirical results are observed for related data augmentation samplers for multinomial logit and probit models. When applied to a real quantitative advertising dataset, the data augmentation samplers mix very poorly. Conversely, Hamiltonian Monte Carlo and a type of independence chain Metropolis algorithm show good mixing on the same dataset.
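As a concrete illustration of the kind of sampler being analyzed, here is a minimal sketch of standard Albert-Chib-style data augmentation for a probit model (written as a generic reconstruction, not taken from Chapter 7): with very few successes, the truncated-normal latent draws concentrate just below zero and the chain tends to move slowly, which is the regime the results above describe.

```python
import numpy as np
from scipy.stats import truncnorm

def probit_da_gibbs(X, y, n_iter=2000, tau2=100.0, seed=0):
    """Data-augmentation Gibbs sampler for probit regression with beta ~ N(0, tau2 * I).

    Alternates z_i | beta ~ N(x_i' beta, 1) truncated to (0, inf) if y_i = 1 and
    (-inf, 0) if y_i = 0, then beta | z ~ N(m, V) with V = (X'X + I/tau2)^{-1}.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    V = np.linalg.inv(X.T @ X + np.eye(p) / tau2)
    L = np.linalg.cholesky(V)
    beta = np.zeros(p)
    draws = np.empty((n_iter, p))
    for t in range(n_iter):
        mu = X @ beta
        # truncnorm takes standardized bounds (a, b) relative to loc and scale
        lo = np.where(y == 1, -mu, -np.inf)
        hi = np.where(y == 1, np.inf, -mu)
        z = truncnorm.rvs(lo, hi, loc=mu, scale=1.0, random_state=rng)
        m = V @ (X.T @ z)
        beta = m + L @ rng.normal(size=p)
        draws[t] = beta
    return draws
```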

Item Open Access Bayesian network-response regression. (Bioinformatics, 2017-01-06) Wang, Lu; Durante, Daniele; Jung, Rex E; Dunson, David B

Motivation: There is increasing interest in learning how human brain networks vary as a function of a continuous trait, but flexible and efficient procedures to accomplish this goal are limited. We develop a Bayesian semiparametric model, which combines low-rank factorizations and flexible Gaussian process priors to learn changes in the conditional expectation of a network-valued random variable across the values of a continuous predictor, while including subject-specific random effects. Results: The formulation leads to a general framework for inference on changes in brain network structures across human traits, facilitating borrowing of information and coherently characterizing uncertainty. We provide an efficient Gibbs sampler for posterior computation along with simple procedures for inference, prediction and goodness-of-fit assessments. The model is applied to learn how human brain networks vary across individuals with different intelligence scores. Results provide interesting insights on the association between intelligence and brain connectivity, while demonstrating good predictive performance. Availability and Implementation: Source code implemented in R and data are available at https://github.com/wangronglu/BNRR. Contact: rl.wang@duke.edu. Supplementary information: Supplementary data are available at Bioinformatics online.

Item Embargo Bayesian Nonparametric Methods for Epidemiology and Clustering (2023) Buch, David Anthony

Bayesian nonparametric methods employ prior distributions with large support in the space of probabilistic models. The flexibility of these methods enables them to address challenging inference tasks. This thesis develops Bayesian nonparametric methodologies for problems in epidemiology and clustering. During an infectious disease outbreak, there is interest in (1) understanding the impact of environmental conditions, human behavior, genetic variants, and public policy on transmission; (2) monitoring the rate of transmissions across regions and over time; and (3) producing short-term forecasts of disease incidence to facilitate decision making and planning by policy makers and members of the public. The data typically available to address these questions, incidence data (cases, hospitalizations, or deaths occurring in a certain population during a certain time interval), pose a challenge to methodology: they have an indirect and nonlinear relationship with the transmission rate, and they may suffer from artifacts and systematic biases that can vary across time and across regions. In Chapter 2 we exploit the flexibility of Bayesian nonparametric models to account for the many irregularities in these data.

Cluster analysis is the task of identifying meaningful subgroups in data. A large variety of algorithms for clustering have been developed, but within the Bayesian paradigm, clustering has nearly always been performed by associating observations with components of a mixture distribution. These mixture models are inherently limited by a tradeoff between component flexibility and identifiability. Thus, relatively inflexible components are used, often leading to disappointing results. In Chapter 3, we develop a decision theoretic framework for Bayesian Level-Set (BALLET) clustering, which exploits Bayesian nonparametric density posteriors. The approach avoids some pitfalls of classical Bayesian clustering methods by leveraging ideas from the algorithmic and frequentist literature. Finally, we note that level-set clustering represents a simple example of clustering into non-exchangeable subsets, since one part is designated as noise points. In Chapter 4, we develop loss functions for non-exchangeable partitions with an arbitrary number of categories. We show that the notion of Categorized Partitions (CaPos) is useful in practical situations and that our novel loss functions yield sensible decision-theoretic point estimates.
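Relating to the level-set clustering idea in Chapter 3 above, here is a small, generic sketch (not the BALLET procedure, which uses a Bayesian density posterior and a decision-theoretic loss): estimate a density, keep the points above a threshold, and report connected components of the high-density points as clusters, with the remainder designated as noise.

```python
import numpy as np
from sklearn.neighbors import KernelDensity
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial import distance_matrix

def level_set_clusters(X, bandwidth=0.3, level=0.05, link_radius=0.5):
    """Plug-in level-set clustering: connected components of {x : f_hat(x) > level}."""
    f_hat = np.exp(KernelDensity(bandwidth=bandwidth).fit(X).score_samples(X))
    keep = f_hat > level                       # high-density points
    labels = np.full(len(X), -1)               # -1 marks the noise part
    Xk = X[keep]
    if len(Xk) > 0:
        D = distance_matrix(Xk, Xk)
        adj = csr_matrix(D <= link_radius)     # link nearby high-density points
        _, comp = connected_components(adj, directed=False)
        labels[np.where(keep)[0]] = comp
    return labels

# toy usage: two well-separated blobs plus scattered noise points
rng = np.random.default_rng(2)
X = np.vstack([rng.normal([0, 0], 0.2, (100, 2)),
               rng.normal([3, 3], 0.2, (100, 2)),
               rng.uniform(-2, 5, (20, 2))])
labels = level_set_clusters(X)
```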

Item Open Access Bayesian Semi-parametric Factor Models (2012) Bhattacharya, Anirban

Identifying a lower-dimensional latent space for representation of high-dimensional observations is of significant importance in numerous biomedical and machine learning applications. In many such applications, it is now routine to collect data where the dimensionality of the outcomes is comparable to or even larger than the number of available observations. Motivated in particular by the problem of predicting the risk of impending diseases from massive gene expression and single nucleotide polymorphism profiles, this dissertation focuses on building parsimonious models and computational schemes for high-dimensional continuous and unordered categorical data, while also studying theoretical properties of the proposed methods. Sparse factor modeling is fast becoming a standard tool for parsimonious modeling of such massive dimensional data, and the content of this thesis is specifically directed towards methodological and theoretical developments in Bayesian sparse factor models.

The first three chapters of the thesis study sparse factor models for high-dimensional continuous data. A class of shrinkage priors on factor loadings is introduced with attractive computational properties, and operating characteristics are explored through a number of simulated and real data examples. In spite of the methodological advances over the past decade, theoretical justifications for high-dimensional factor models are scarce in the Bayesian literature. Part of the dissertation focuses on estimation of high-dimensional covariance matrices using a factor model, and on the rate of posterior contraction as both the sample size and the dimensionality increase.
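For intuition about the class of shrinkage priors on factor loadings mentioned above, here is a hedged sketch of one widely used prior of this flavor, a multiplicative-gamma-process-style construction; this is an illustration only and may differ in detail from the exact class introduced in the thesis.

```python
import numpy as np

def mgp_loadings(p, k, a1=2.0, a2=3.0, nu=3.0, seed=0):
    """Draw a p x k loadings matrix from a multiplicative-gamma-process-style prior.

    lambda[j,h] ~ N(0, 1 / (phi[j,h] * tau[h])),  phi[j,h] ~ Gamma(nu/2, rate=nu/2),
    tau[h] = prod_{l<=h} delta[l],  delta[1] ~ Gamma(a1, 1), delta[l>=2] ~ Gamma(a2, 1),
    so later columns are shrunk increasingly strongly toward zero.
    """
    rng = np.random.default_rng(seed)
    delta = np.concatenate([rng.gamma(a1, 1.0, size=1), rng.gamma(a2, 1.0, size=k - 1)])
    tau = np.cumprod(delta)                               # column precisions grow with h
    phi = rng.gamma(nu / 2.0, 2.0 / nu, size=(p, k))      # local precisions
    lam = rng.normal(scale=1.0 / np.sqrt(phi * tau), size=(p, k))
    return lam

lam = mgp_loadings(p=200, k=10)
print(np.abs(lam).mean(axis=0))   # column magnitudes typically decay with the column index
```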

To relax the usual assumption of a linear relationship between the latent and observed variables in a standard factor model, extensions to a non-linear latent factor model are also considered.

Although Gaussian latent factor models are routinely used for modeling dependence in continuous, binary and ordered categorical data, they lead to challenging computation and complex modeling structures for unordered categorical variables. As an alternative, a novel class of simplex factor models for massive-dimensional and enormously sparse contingency table data is proposed in the second part of the thesis. An efficient MCMC scheme is developed for posterior computation, and the methods are applied to modeling dependence in nucleotide sequences and prediction from high-dimensional categorical features. Building on a connection between the proposed model and sparse tensor decompositions, we propose new classes of nonparametric Bayesian models for testing associations between a massive-dimensional vector of genetic markers and a phenotypical outcome.

Item Open Access Bayesian Sparse Learning for High Dimensional Data (2011) Shi, Minghui

In this thesis, we develop Bayesian sparse learning methods for high dimensional data analysis. Two important topics are related to the idea of sparse learning -- variable selection and factor analysis. We start with the Bayesian variable selection problem in regression models. One challenge in Bayesian variable selection is to search the huge model space adequately while identifying high posterior probability regions. In past decades, the main focus has been on the use of Markov chain Monte Carlo (MCMC) algorithms for these purposes. In the first part of this thesis, instead of using MCMC, we propose a new computational approach based on sequential Monte Carlo (SMC), which we refer to as particle stochastic search (PSS). We illustrate PSS through applications to linear regression and probit models.

Besides Bayesian stochastic search algorithms, there is a rich literature on shrinkage and variable selection methods for high dimensional regression and classification with vector-valued parameters, such as the lasso (Tibshirani, 1996) and the relevance vector machine (Tipping, 2001). Compared with Bayesian stochastic search algorithms, these methods do not account for model uncertainty but are more computationally efficient. In the second part of this thesis, we generalize these ideas to matrix-valued parameters and focus on developing an efficient variable selection method for multivariate regression. We propose a Bayesian shrinkage model (BSM) and an efficient algorithm for learning the associated parameters.

In the third part of this thesis, we focus on factor analysis, which has been widely used in unsupervised learning. One central problem in factor analysis is the determination of the number of latent factors. We propose Bayesian model selection criteria for selecting the number of latent factors based on a graphical factor model. As illustrated in Chapter 4, our proposed method achieves good performance in correctly selecting the number of factors in several different settings. As for applications, we implement the graphical factor model for several different purposes, such as covariance matrix estimation, latent factor regression and classification.

Item Open Access Constrained Bayesian Inference through Posterior Projection with Applications (2019) Patra, Sayan

In a broad variety of settings, prior information takes the form of parameter restrictions. Bayesian approaches are appealing in parameter-constrained problems, allowing a probabilistic characterization of uncertainty in finite samples while providing computational machinery for incorporating complex constraints in hierarchical models. However, the usual Bayesian strategy of directly placing a prior measure on the constrained space, and then conducting posterior computation with Markov chain Monte Carlo algorithms, is often intractable. An alternative is to initially conduct computation for an unconstrained or less constrained posterior, and then project draws from this initial posterior to the constrained space through a minimal distance mapping. This approach has been successful in monotone function estimation but has not been considered in broader settings.
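A minimal sketch of the projection step in the monotone-function case (illustrative only, using least-squares projection onto monotone vectors via isotonic regression; the dissertation's general formulation covers much broader constraint sets):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def project_draws_to_monotone(draws):
    """Map each unconstrained posterior draw of (f(x_1), ..., f(x_m)) to the closest
    nondecreasing vector in Euclidean distance, i.e. its L2 projection onto the
    monotone cone, computed by isotonic regression."""
    iso = IsotonicRegression(increasing=True)
    grid = np.arange(draws.shape[1])
    return np.array([iso.fit_transform(grid, d) for d in draws])

# toy usage: noisy unconstrained draws around an increasing curve
rng = np.random.default_rng(3)
truth = np.linspace(0, 1, 50) ** 2
draws = truth + 0.1 * rng.normal(size=(100, 50))     # 100 unconstrained posterior draws
projected = project_draws_to_monotone(draws)          # projected posterior draws
assert np.all(np.diff(projected, axis=1) >= -1e-12)
```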

In this dissertation, we develop a unified theory to justify posterior projection in general Banach spaces including for infinite-dimensional functional parameter space. For tractability, in chapter 2 we focus on the case in which the constrained parameter space corresponds to a closed, convex subset of the original space. A special class of non-convex sets called Stiefel manifolds is explored later in chapter 3. Specifically, we provide a general formulation of the projected posterior and show that it corresponds to a valid posterior distribution on the constrained space for particular classes of priors and likelihood functions. We also show how the asymptotic properties of the unconstrained posterior are transferred to the projected posterior. We then illustrate our proposed methodology via multiple examples, both in simulation studies and real data applications.

In chapter 4, we extend the proposed posterior projection methodology to small area estimation (SAE), which focuses on estimating population parameters when there is little to no area-specific information. "Areas" are often spatial regions, but they might also be different demographic groups or experimental conditions. To improve the precision of estimates, a common strategy in SAE methods is to borrow information across several areas. This is generally achieved by using a hierarchical or empirical Bayesian model. However, parameter constraints arising naturally from surveys pose a challenge to the estimation procedure. Examples of such constraints include the variance of an area's estimate being proportional to the geographic size of the area, or the sum of county-level estimates being equal to the state-level estimate. We utilize and extend the posterior projection approach to facilitate such computation and reduce parameter uncertainty.

This dissertation develops fundamental new approaches for constrained Bayesian inference, and there are many possible directions for future work. One important generalization is considered in chapter 5, allowing conditional posterior projections; for example, applying projection to a subset of parameters immediately after each update step within a Markov chain Monte Carlo algorithm. We identify several scenarios where such a modified algorithm converges to the underlying true distribution and develop a general theory to ensure consistency. We conclude in chapter 6 by outlining directions for continued research on these topics.

Item Open Access Data augmentation for models based on rejection sampling. (Biometrika, 2016-06) Rao, Vinayak; Lin, Lizhen; Dunson, David B

We present a data augmentation scheme to perform Markov chain Monte Carlo inference for models where data generation involves a rejection sampling algorithm. Our idea is a simple scheme to instantiate the rejected proposals preceding each data point. The resulting joint probability over observed and rejected variables can be much simpler than the marginal distribution over the observed variables, which often involves intractable integrals. We consider three problems: modelling flow-cytometry measurements subject to truncation; the Bayesian analysis of the matrix Langevin distribution on the Stiefel manifold; and Bayesian inference for a nonparametric Gaussian process density model. The latter two are instances of doubly-intractable Markov chain Monte Carlo problems, where evaluating the likelihood is intractable. Our experiments demonstrate superior performance over state-of-the-art sampling algorithms for such problems.

Item Open Access Easy and Efficient Bayesian Infinite Factor Analysis (2020) Poworoznek, Evan

Bayesian latent factor models are key tools for modeling linear structure in data and performing dimension reduction for correlated variables. Recent advances in prior specification allow the estimation of semi- and non-parametric infinite factor models. These models provide significant theoretical and practical advantages at the cost of computationally intensive sampling and non-identifiability of some parameters. We provide a package for the R programming environment that includes functions for sampling from the posterior distributions of several recent latent factor models. These computationally efficient samplers are provided for R with C++ source code to facilitate fast sampling of standard models and provide component sampling functions for more complex models. We also present an efficient algorithm to remove the non-identifiability that results from the included shrinkage priors. The infinitefactor package is available in developmental version on GitHub at https://github.com/poworoznek/infinitefactor and in release version on the CRAN package repository.
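Relating to the data-augmentation-for-rejection-sampling entry above, here is a minimal hedged sketch of the basic idea (a generic illustration, not the paper's samplers): when each observation is produced by a rejection sampler, one can instantiate the rejected proposals that preceded it, so the joint over accepted and rejected draws has a simple product form even when the marginal of the accepted draw alone involves an intractable normalizing constant.

```python
import numpy as np

rng = np.random.default_rng(4)

def rejection_draw_with_rejections(propose, accept_prob):
    """Draw one accepted value and record the rejected proposals preceding it.

    The joint density of (rejected_1, ..., rejected_k, accepted) factorizes as
    prod_j q(r_j) * (1 - a(r_j)) times q(x) * a(x), avoiding the intractable
    normalizing constant of the marginal density of the accepted value alone.
    """
    rejected = []
    while True:
        x = propose()
        if rng.uniform() < accept_prob(x):
            return x, rejected
        rejected.append(x)

# toy example: standard normal proposals, accepted with a soft-truncation probability
propose = lambda: rng.normal()
accept_prob = lambda x: 1.0 / (1.0 + np.exp(-5 * x))
x, rejected = rejection_draw_with_rejections(propose, accept_prob)
```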

Item Open Access Efficient Gaussian process regression for large datasets. (Biometrika, 2013-03) Banerjee, Anjishnu; Dunson, David B; Tokdar, Surya T

Gaussian processes are widely used in nonparametric regression, classification and spatiotemporal modelling, facilitated in part by a rich literature on their theoretical properties. However, one of their practical limitations is expensive computation, typically on the order of n^3 where n is the number of data points, in performing the necessary matrix inversions. For large datasets, storage and processing also lead to computational bottlenecks, and numerical stability of the estimates and predicted values degrades with increasing n. Various methods have been proposed to address these problems, including predictive processes in spatial data analysis and the subset-of-regressors technique in machine learning. The idea underlying these approaches is to use a subset of the data, but this raises questions concerning sensitivity to the choice of subset and limitations in estimating fine-scale structure in regions that are not well covered by the subset. Motivated by the literature on compressive sensing, we propose an alternative approach that involves linear projection of all the data points onto a lower-dimensional subspace. We demonstrate the superiority of this approach from a theoretical perspective and through simulated and real data examples.

Item Open Access Exploiting Big Data in Logistics Risk Assessment via Bayesian Nonparametrics (2014) Shang, Yan

In cargo logistics, a key performance measure is transport risk, defined as the deviation of the actual arrival time from the planned arrival time. Neither earliness nor tardiness is desirable for the customer and freight forwarder. In this paper, we investigate ways to assess and forecast transport risks using a half-year of air cargo data, provided by a leading forwarder on 1336 routes served by 20 airlines. Interestingly, our preliminary data analysis shows a strong multimodal feature in the transport risks, driven by unobserved events, such as cargo missing flights. To accommodate this feature, we introduce a Bayesian nonparametric model -- the probit stick-breaking process (PSBP) mixture model -- for flexible estimation of the conditional (i.e., state-dependent) density function of transport risk. We demonstrate that using simpler methods, such as OLS linear regression, can lead to misleading inferences. Our model provides a tool for the forwarder to offer customized price and service quotes. It can also generate baseline airline performance to enable fair supplier evaluation. Furthermore, the method allows us to separate recurrent risks from disruption risks. This is important, because hedging strategies for these two kinds of risks are often drastically different.
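Relating to the Gaussian process regression entry above, a rough numpy sketch of the random-projection idea (my own illustrative reconstruction, assuming an RBF kernel and a Gaussian projection matrix; the paper's actual construction and theory are more careful): condition on m << n linear summaries Phi @ y of the data, so that only an m x m system has to be solved.

```python
import numpy as np

def rbf(A, B, ls=1.0):
    """Squared-exponential kernel matrix."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

def compressed_gp_predict(X, y, Xstar, m=50, sigma2=0.01, ls=1.0, seed=0):
    """GP predictive mean using m random linear summaries u = Phi @ y of the data.

    Conditioning the joint Gaussian of (f(Xstar), u) gives
    E[f* | u] = K*n Phi' (Phi (Knn + sigma2 I) Phi')^{-1} Phi y,
    so only an m x m linear system is solved instead of an n x n one.
    (The full n x n kernel is still formed here for clarity of exposition.)
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    Phi = rng.normal(size=(m, n)) / np.sqrt(m)       # random projection matrix
    Knn = rbf(X, X, ls)
    Ksn = rbf(Xstar, X, ls)
    C = Phi @ (Knn + sigma2 * np.eye(n)) @ Phi.T     # m x m covariance of the summaries
    alpha = np.linalg.solve(C, Phi @ y)
    return Ksn @ Phi.T @ alpha

# toy usage
rng = np.random.default_rng(5)
X = rng.uniform(0, 5, size=(500, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=500)
Xstar = np.linspace(0, 5, 100)[:, None]
mean = compressed_gp_predict(X, y, Xstar)
```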

Item Open Access Gaussian beta process (2014) Wang, Yingjian

This thesis presents a new framework for constituting a group of dependent completely random measures, unifying and extending methods in the literature. The dependent completely random measures are constructed based on a shared completely random measure, which is extended to the covariate space and further differentiated by the covariate information associated with the data for which the completely random measures serve as priors. As a concrete example of the flexibility provided by the framework, a group of dependent feature-learning measures are constructed based on a shared beta process, with Gaussian processes applied to build adaptive dependencies learnt from the data; the result is denoted the Gaussian beta process. Experimental results are presented for gene-expression series data (time as covariate), as well as digital image data (spatial location as covariate).

Item Open Access Generalized admixture mapping for complex traits. (G3 (Bethesda), 2013-07-08) Zhu, Bin; Ashley-Koch, Allison E; Dunson, David B

Admixture mapping is a popular tool to identify regions of the genome associated with traits in a recently admixed population. Existing methods have been developed primarily for identification of a single locus influencing a dichotomous trait within a case-control study design. We propose a generalized admixture mapping (GLEAM) approach, a flexible and powerful regression method for both quantitative and qualitative traits, which is able to test for association between the trait and local ancestries in multiple loci simultaneously and adjust for covariates. The new method is based on the generalized linear model and uses a quadratic normal moment prior to incorporate admixture prior information. Through simulation, we demonstrate that GLEAM achieves lower type I error rate and higher power than ANCESTRYMAP both for qualitative traits and more significantly for quantitative traits. We applied GLEAM to genome-wide SNP data from the Illumina African American panel derived from a cohort of black women participating in the Healthy Pregnancy, Healthy Baby study and identified a locus on chromosome 2 associated with the averaged maternal mean arterial pressure during 24 to 28 weeks of pregnancy.

Item Open Access Joint eQTL assessment of whole blood and dura mater tissue from individuals with Chiari type I malformation. (BMC Genomics, 2015-01-22) Lock, Eric F; Soldano, Karen L; Garrett, Melanie E; Cope, Heidi; Markunas, Christina A; Fuchs, Herbert; Grant, Gerald; Dunson, David B; Gregory, Simon G; Ashley-Koch, Allison E

BACKGROUND: Expression quantitative trait loci (eQTL) play an important role in the regulation of gene expression. Gene expression levels and eQTLs are expected to vary from tissue to tissue, and therefore multi-tissue analyses are necessary to fully understand complex genetic conditions in humans. Dura mater tissue likely interacts with cranial bone growth and thus may play a role in the etiology of Chiari Type I Malformation (CMI) and related conditions, but it is often inaccessible and its gene expression has not been well studied. A genetic basis to CMI has been established; however, the specific genetic risk factors are not well characterized. RESULTS: We present an assessment of eQTLs for whole blood and dura mater tissue from individuals with CMI. A joint-tissue analysis identified 239 eQTLs in either dura or blood, with 79% of these eQTLs shared by both tissues. Several identified eQTLs were novel and these implicate genes involved in bone development (IPO8, XYLT1, and PRKAR1A), and ribosomal pathways related to marrow and bone dysfunction, as potential candidates in the development of CMI. CONCLUSIONS: Despite strong overall heterogeneity in expression levels between blood and dura, the majority of cis-eQTLs are shared by both tissues. The power to detect shared eQTLs was improved by using an integrative statistical approach. The identified tissue-specific and shared eQTLs provide new insight into the genetic basis for CMI and related conditions.

Item Open Access Latent Stick-Breaking Processes. (J Am Stat Assoc, 2010-04-01) Rodríguez, Abel; Dunson, David B; Gelfand, Alan E

We develop a model for stochastic processes with random marginal distributions. Our model relies on a stick-breaking construction for the marginal distribution of the process, and introduces dependence across locations by using a latent Gaussian copula model as the mechanism for selecting the atoms. The resulting latent stick-breaking process (LaSBP) induces a random partition of the index space, with points closer in space having a higher probability of being in the same cluster. We develop an efficient and straightforward Markov chain Monte Carlo (MCMC) algorithm for computation and discuss applications in financial econometrics and ecology. This article has supplementary material online.

Item Open Access Learning and Exploiting Low-Dimensional Structure in High-Dimensional Data (2020) Li, Didong

Data lying in a high dimensional ambient space are commonly thought to have a much lower intrinsic dimension. In particular, the data may be concentrated near a lower-dimensional manifold. If one does not exploit the hidden geometry in the data but instead deals with the ambient high dimensional Euclidean space directly, both statistical and computational efficiency are extremely low. In contrast, an accurate approximation of the unknown manifold benefits a variety of tasks, including dimension reduction, feature selection, density estimation, classification, clustering, data denoising and data visualization. Most of the literature for data analysis relies on linear or locally linear methods. However, when the manifold has essential curvature, these linear methods suffer from low accuracy and efficiency. There is also an immense literature on non-linear methods, such as variational autoencoders and the Gaussian process latent variable model, aimed at improving approximation performance. However, these methods are complex black boxes lacking reproducibility, identifiability and interpretability. As a result, new non-linear tools need to be developed without introducing too much extra complexity.

This dissertation focuses on exploiting the geometry in the data, through the curvature of the unknown manifold, to estimate the manifold efficiently while retaining the simple closed forms of linear methods. In particular, a simple and general alternative to locally linear manifold learning is proposed, which instead uses pieces of spheres, or spherelets, to locally approximate the unknown manifold. Spherical principal components analysis (SPCA) is developed as a non-linear alternative to PCA, finding the best sphere fitting the data. SPCA provides simple tools that can be implemented efficiently for big and complex data and allows one to learn about geometric structure in the data, without introducing much more complexity than linear methods.
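A small sketch of the sphere-fitting step (a standard algebraic least-squares fit, used here purely for illustration; the SPCA construction in the dissertation combines this kind of fit with a subspace step): expanding ||x - c||^2 = r^2 makes the center and radius the solution of an ordinary linear least-squares problem.

```python
import numpy as np

def fit_sphere(X):
    """Algebraic least-squares sphere fit.

    From ||x - c||^2 = r^2 we get 2 x.c + (r^2 - ||c||^2) = ||x||^2,
    which is linear in (c, b) with b = r^2 - ||c||^2.
    """
    A = np.hstack([2 * X, np.ones((len(X), 1))])
    rhs = (X ** 2).sum(axis=1)
    sol, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    c, b = sol[:-1], sol[-1]
    r = np.sqrt(b + c @ c)
    return c, r

# toy usage: noisy points near a circle of radius 2 centered at (1, -1)
rng = np.random.default_rng(6)
theta = rng.uniform(0, 2 * np.pi, 300)
X = np.c_[1 + 2 * np.cos(theta), -1 + 2 * np.sin(theta)] + 0.05 * rng.normal(size=(300, 2))
center, radius = fit_sphere(X)   # approximately (1, -1) and 2
```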

Inspired by spherelets, a curved kernel called the Fisher-Gaussian (FG) kernel is introduced, which outperforms multivariate Gaussians for density estimation. In particular, the Dirichlet process mixture of FG kernels model is studied for density estimation and is proved to be posterior consistent. In addition, applications of spherelets to classification, geodesic distance estimation and clustering are considered, illustrated on a variety of real datasets.