# Browsing by Subject "Gaussian process"


Item Open Access Accommodating the ecological fallacy in disease mapping in the absence of individual exposures. (Stat Med, 2017-09-19) Wang, Feifei; Wang, Jian; Gelfand, Alan; Li, Fan

In health exposure modeling, in particular disease mapping, the ecological fallacy arises because the relationship between aggregated disease incidence on areal units and average exposure on those units differs from the relationship between the event of individual incidence and the associated individual exposure. This article presents a novel modeling approach to address the ecological fallacy in the least informative data setting. We assume a known population at risk with an observed incidence for a collection of areal units and, separately, environmental exposure recorded during the period of incidence at a collection of monitoring stations. We do not assume any partial individual-level information or random allocation of individuals to observed exposures. We specify a conceptual incidence surface over the study region as a function of an exposure surface, so that the block-average disease incidence becomes a stochastic integral. This true block-level incidence is unavailable in closed form, and we propose a manageable Monte Carlo integration to approximate it. Modeling in this setting is immediately hierarchical, and we fit our model within a Bayesian framework. To alleviate the resulting computational burden, we offer two strategies for efficient model fitting: one through modularization, the other through sparse or dimension-reduced Gaussian processes. We illustrate the performance of our model with simulations based on a heat-related mortality dataset in Ohio and then analyze associated real data.

Item Open Access Bayesian Nonparametric Modeling and Theory for Complex Data (2012) Pati, Debdeep

The dissertation focuses on solving some important theoretical and methodological problems associated with Bayesian modeling of infinite-dimensional 'objects', popularly called nonparametric Bayes. The term 'infinite-dimensional object' can refer to a density, a conditional density, a regression surface or even a manifold. Although Bayesian density estimation as well as function estimation are well justified in the existing literature, there has been little or no theory justifying the estimation of more complex objects (e.g. conditional density, manifold, etc.). Part of this dissertation focuses on exploring the structure of the spaces on which the priors for conditional densities and manifolds are supported, while studying how the posterior concentrates as increasing amounts of data are collected.

With the advent of new acquisition devices, there has been a need to model complex objects associated with complex data-types e.g. millions of genes affecting a bio-marker, 2D pixelated images, a cloud of points in the 3D space, etc. A significant portion of this dissertation has been devoted to developing adaptive nonparametric Bayes approaches for learning low-dimensional structures underlying higher-dimensional objects e.g. a high-dimensional regression function supported on a lower dimensional space, closed curves representing the boundaries of shapes in 2D images and closed surfaces located on or near the point cloud data. Characterizing the distribution of these objects has a tremendous impact in several application areas ranging from tumor tracking for targeted radiation therapy, to classifying cells in the brain, to model based methods for 3D animation and so on.

The first three chapters are devoted to Bayesian nonparametric theory and modeling in unconstrained Euclidean spaces e.g. mean regression and density regression, the next two focus on Bayesian modeling of manifolds e.g. closed curves and surfaces, and the final one on nonparametric Bayes spatial point pattern data modeling when the sampling locations are informative of the outcomes.

Item Open Access Computational Methods for Investigating Dendritic Cell Biology (2011) de Oliveira Sales, Ana Paula

The immune system is constantly faced with the daunting task of protecting the host from a large number of ever-evolving pathogens. In vertebrates, the immune response results from the interplay of two cellular systems: innate immunity and adaptive immunity. In the past decades, dendritic cells have emerged as major players in the modulation of the immune response, being one of the primary links between these two branches of the immune system.

Dendritic cells are pathogen-sensing cells that alert the rest of the immune system of the presence of infection. The signals sent by dendritic cells result in the recruitment of the appropriate cell types and molecules required for effectively clearing the infection. A question of utmost importance in our understanding of the immune response and our ability to manipulate it in the development of vaccines and therapies is: "How do dendritic cells translate the various cues they perceive from the environment into different signals that specifically activate the appropriate parts of the immune system that result in an immune response streamlined to clear the given pathogen?"

Here we have developed computational and statistical methods aimed at addressing specific aspects of this question. In particular, understanding how dendritic cells ultimately modulate the immune response requires an understanding of the subtleties of their maturation process in response to different environmental signals. Hence, the first part of this dissertation focuses on elucidating the changes in the transcriptional program of dendritic cells in response to the detection of two common pathogen-associated molecules, LPS and CpG. We have developed a method based on Langevin and Dirichlet processes to model and cluster gene expression temporal data, and have used it to identify, on a large scale, genes that present unique and common transcriptional behaviors in response to these two stimuli. Additionally, we have also investigated a different, but related, aspect of dendritic cell modulation of the adaptive immune response. In the second part of this dissertation, we present a method to predict peptides that will bind to MHC molecules, a requirement for the activation of pathogen-specific T cells. Together, these studies contribute to the elucidation of important aspects of dendritic cell biology.
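The Dirichlet process component of the clustering approach above can be illustrated with a short sketch (not the dissertation's code): drawing a random partition from the Chinese restaurant process representation, with a hypothetical concentration parameter `alpha`.

```python
import numpy as np

def crp_partition(n_items, alpha, rng):
    """Draw a random partition of n_items from a Chinese restaurant
    process with concentration parameter alpha."""
    assignments = [0]                      # first customer opens table 0
    counts = [1]                           # customers seated per table
    for _ in range(1, n_items):
        # P(existing table k) is proportional to counts[k]; P(new table) to alpha
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        table = rng.choice(len(probs), p=probs)
        if table == len(counts):           # open a new table
            counts.append(1)
        else:
            counts[table] += 1
        assignments.append(table)
    return assignments

rng = np.random.default_rng(0)
z = crp_partition(50, alpha=1.0, rng=rng)
```

Larger `alpha` favors more clusters; the number of occupied tables grows on the order of `alpha * log(n)`, which is what lets the model choose its own complexity.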

Item Open Access Data augmentation for models based on rejection sampling. (Biometrika, 2016-06) Rao, Vinayak; Lin, Lizhen; Dunson, David B

We present a data augmentation scheme to perform Markov chain Monte Carlo inference for models where data generation involves a rejection sampling algorithm. Our idea is a simple scheme to instantiate the rejected proposals preceding each data point. The resulting joint probability over observed and rejected variables can be much simpler than the marginal distribution over the observed variables, which often involves intractable integrals. We consider three problems: modelling flow-cytometry measurements subject to truncation; the Bayesian analysis of the matrix Langevin distribution on the Stiefel manifold; and Bayesian inference for a nonparametric Gaussian process density model. The latter two are instances of doubly-intractable Markov chain Monte Carlo problems, where evaluating the likelihood is intractable. Our experiments demonstrate superior performance over state-of-the-art sampling algorithms for such problems.

Item Open Access Development and Implementation of Bayesian Computer Model Emulators (2011) Lopes, Danilo Lourenco

Our interest is the risk assessment of rare natural hazards, such as large volcanic pyroclastic flows. Since catastrophic consequences of volcanic flows are rare events, our analysis benefits from the use of a computer model to provide information about these events under natural conditions that may not have been observed in reality.

A common problem in the analysis of computer experiments, however, is the high computational cost associated with each simulation of a complex physical process. We tackle this problem by using a statistical approximation (emulator) to predict the output of this computer model at untried values of inputs. Gaussian process response surface is a technique commonly used in these applications, because it is fast and easy to use in the analysis.

We explore several aspects of the implementation of Gaussian process emulators in a Bayesian context. First, we propose an improvement for the implementation of the plug-in approach to Gaussian processes. Next, we also evaluate the performance of a spatial model for large data sets in the context of computer experiments.
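As an illustration of the emulator idea (a minimal sketch, not the implementation developed in this work): a zero-mean Gaussian process with a squared-exponential kernel and fixed, plug-in hyperparameters can interpolate a handful of simulator runs and predict the output at untried inputs.

```python
import numpy as np

def sq_exp_kernel(A, B, length, amp):
    """Squared-exponential covariance between 1-D input sets A and B."""
    d2 = (A[:, None] - B[None, :]) ** 2
    return amp**2 * np.exp(-0.5 * d2 / length**2)

def gp_emulate(x_train, y_train, x_new, length=0.3, amp=1.0, jitter=1e-8):
    """Posterior mean and variance of a zero-mean GP at untried inputs."""
    K = sq_exp_kernel(x_train, x_train, length, amp)
    K += jitter * np.eye(len(x_train))          # numerical stabilization
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    Ks = sq_exp_kernel(x_new, x_train, length, amp)
    mean = Ks @ alpha
    v = np.linalg.solve(L, Ks.T)
    var = amp**2 - np.sum(v**2, axis=0)
    return mean, var

# A handful of "simulator" runs at design points; the sine is a
# stand-in for an expensive physics code.
x = np.linspace(0.0, 1.0, 8)
y = np.sin(2 * np.pi * x)
m, v = gp_emulate(x, y, np.array([0.5]))
```

The emulator reproduces the training runs exactly (up to the jitter) and reports growing predictive variance away from the design points, which is what makes it useful as a cheap surrogate in risk assessment.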

Computer model data can also be combined with field observations in order to calibrate the emulator and obtain statistical approximations to the computer model that are closer to reality. We present an application where we learn the joint distribution of inputs from field data and then bind this auxiliary information to the emulator in a calibration process.

One of the outputs of our computer model is a surface of maximum volcanic flow height over some geographical area. We show how the topography of the volcano area plays an important role in determining the shape of this surface, and we propose methods to incorporate geophysical information in the multivariate analysis of computer model output.

Item Open Access Efficient Bayesian Nonparametric Methods for Model-Free Reinforcement Learning in Centralized and Decentralized Sequential Environments (2014) Liu, Miao

As a growing number of agents are deployed in complex environments for scientific research and human well-being, there are increasing demands for designing efficient learning algorithms for these agents to improve their control policies. Such policies must account for uncertainties, including those caused by environmental stochasticity, sensor noise and communication restrictions. These challenges exist in missions such as planetary navigation, forest firefighting, and underwater exploration. Ideally, good control policies should allow the agents to deal with all the situations in an environment and enable them to accomplish their mission within the budgeted time and resources. However, a correct model of the environment is not typically available in advance, requiring the policy to be learned from data. Model-free reinforcement learning (RL) is a promising candidate for agents to learn control policies while engaged in complex tasks, because it allows the control policies to be learned directly from a subset of experiences, with time efficiency. Moreover, to ensure persistent performance improvement for RL, it is important that the control policies be concisely represented based on existing knowledge and have the flexibility to accommodate new experience. Bayesian nonparametric methods (BNPMs) both allow the complexity of models to be adaptive to data and provide a principled way for discovering and representing new knowledge.

In this thesis, we investigate approaches for RL in centralized and decentralized sequential decision-making problems using BNPMs, and we show how control policies can be learned efficiently under model-free RL schemes with BNPMs. Specifically, for centralized sequential decision-making, we study Q-learning with Gaussian processes to solve Markov decision processes, and we employ hierarchical Dirichlet processes as the prior for the control policy parameters to solve partially observable Markov decision processes. For decentralized partially observable Markov decision processes, we use stick-breaking processes as the prior for the controller of each agent. We develop efficient inference algorithms for learning the corresponding control policies. We demonstrate that by combining model-free RL and BNPMs with efficient algorithm design, we are able to scale up RL methods to complex problems that cannot otherwise be solved for lack of model knowledge, adaptively learning control policies with concise structure and high value from a relatively small amount of data.
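The stick-breaking prior mentioned above can be sketched in a few lines. This is the textbook truncated stick-breaking construction of a Dirichlet process, with a hypothetical truncation level, not the thesis's inference code:

```python
import numpy as np

def stick_breaking_weights(alpha, truncation, rng):
    """Draw mixture weights from a truncated stick-breaking
    construction of a Dirichlet process with concentration alpha."""
    betas = rng.beta(1.0, alpha, size=truncation)
    betas[-1] = 1.0                     # close the stick at the truncation
    # weight k = beta_k times the length of stick remaining after k-1 breaks
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas[:-1])))
    return betas * remaining

rng = np.random.default_rng(1)
w = stick_breaking_weights(alpha=2.0, truncation=20, rng=rng)
```

Smaller `alpha` concentrates mass on a few components, which is how the prior encodes a preference for concise controller representations while leaving room for new components as experience accumulates.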

Item Open Access Efficient Gaussian process regression for large datasets. (Biometrika, 2013-03) Banerjee, Anjishnu; Dunson, David B; Tokdar, Surya T

Gaussian processes are widely used in nonparametric regression, classification and spatiotemporal modelling, facilitated in part by a rich literature on their theoretical properties. However, one of their practical limitations is expensive computation, typically on the order of n^3 where n is the number of data points, in performing the necessary matrix inversions. For large datasets, storage and processing also lead to computational bottlenecks, and numerical stability of the estimates and predicted values degrades with increasing n. Various methods have been proposed to address these problems, including predictive processes in spatial data analysis and the subset-of-regressors technique in machine learning. The idea underlying these approaches is to use a subset of the data, but this raises questions concerning sensitivity to the choice of subset and limitations in estimating fine-scale structure in regions that are not well covered by the subset. Motivated by the literature on compressive sensing, we propose an alternative approach that involves linear projection of all the data points onto a lower-dimensional subspace. We demonstrate the superiority of this approach from a theoretical perspective and through simulated and real data examples.

Item Open Access Kernel Averaged Predictors for Space and Space-Time Processes (2011) Heaton, Matthew

In many spatio-temporal applications a vector of covariates is measured alongside a spatio-temporal response. In such cases, the purpose of the statistical model is to quantify the change, in expectation or otherwise, in the response due to a change in the predictors while adequately accounting for the spatio-temporal structure of the response, the predictors, or both.
The most common approach for building such a model is to confine the relationship between the response and the predictors to a single spatio-temporal coordinate. For spatio-temporal problems, however, the relationship between the response and predictors may not be so confined. For example, spatial models are often used to quantify the effect of pollution exposure on mortality. Yet, an unknown lag exists between time of exposure to pollutants and mortality. Furthermore, due to mobility and atmospheric movement, a spatial lag between pollution concentration and mortality may also exist (e.g. subjects may live in the suburbs where pollution levels are low but work in the city where pollution levels are high).

The contribution of this thesis is to propose a hierarchical modeling framework which captures complex spatio-temporal relationships between responses and covariates. Specifically, the models proposed here use kernels to capture spatial and/or temporal lagged effects. Several forms of kernels are proposed with varying degrees of complexity. In each case, however, the kernels are assumed to be parametric with parameters that are easily interpretable and estimable from the data. Full distributional results are given for the Gaussian setting along with consequences of model misspecification. The methods are shown to be effective in understanding the complex relationship between responses and covariates through various simulated examples and analyses of physical data sets.
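The kernel idea can be sketched as follows, assuming a hypothetical Gaussian kernel with normalized weights (the thesis develops richer, interpretable parametric kernel families): the predictor entering the regression at a location is a kernel-weighted average of the covariate surface around that location, rather than the covariate at the single co-located coordinate.

```python
import numpy as np

def kernel_averaged_predictor(s0, sites, x, bandwidth):
    """Gaussian-kernel-weighted average of a covariate surface x(.)
    observed at 2-D sites, evaluated at location s0."""
    d2 = np.sum((sites - s0) ** 2, axis=1)          # squared distances to s0
    w = np.exp(-0.5 * d2 / bandwidth**2)            # Gaussian kernel weights
    return np.sum(w * x) / np.sum(w)                # normalized average

rng = np.random.default_rng(2)
sites = rng.uniform(0, 1, size=(200, 2))
x = sites[:, 0] + sites[:, 1]                       # a simple covariate surface
s0 = np.array([0.5, 0.5])
xbar = kernel_averaged_predictor(s0, sites, x, bandwidth=0.2)
```

The `bandwidth` parameter plays the role of the estimable kernel parameters in the thesis: it controls how far a spatially lagged covariate (e.g. pollution exposure away from the residence) influences the response at `s0`.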

Item Open Access Multivariate Spatial Process Gradients with Environmental Applications (2014) Terres, Maria Antonia

Previous papers have elaborated formal gradient analysis for spatial processes, focusing on the distribution theory for directional derivatives associated with a response variable assumed to follow a Gaussian process model. In the current work, these ideas are extended to additionally accommodate one or more continuous covariate(s) whose directional derivatives are of interest and to relate the behavior of the directional derivatives of the response surface to those of the covariate surface(s). It is of interest to assess whether, in some sense, the gradients of the response follow those of the explanatory variable(s), thereby gaining insight into the local relationships between the variables. The joint Gaussian structure of the spatial random effects and associated directional derivatives allows for explicit distribution theory and, hence, kriging across the spatial region using multivariate normal theory. The gradient analysis is illustrated for bivariate and multivariate spatial models, non-Gaussian responses such as presence-absence and point patterns, and outlined for several additional spatial modeling frameworks that commonly arise in the literature. Working within a hierarchical modeling framework, posterior samples enable all gradient analyses to occur as post model fitting procedures.
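Because differentiation is linear, the gradient of a Gaussian process is itself a Gaussian process, and the posterior mean of a directional derivative is available in closed form. A minimal sketch with a squared-exponential kernel (a stand-in illustration, not the paper's multivariate machinery):

```python
import numpy as np

def gp_directional_derivative(x_train, y, t0, u, length=1.0, jitter=1e-6):
    """Posterior mean of the directional derivative D_u Y(t0) for a
    zero-mean GP with unit-variance squared-exponential kernel,
    conditioned on Y(x_train) = y. x_train is an n x d array."""
    X = np.atleast_2d(x_train)
    t0 = np.atleast_1d(np.asarray(t0, dtype=float))
    u = np.atleast_1d(np.asarray(u, dtype=float))
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-0.5 * d2 / length**2) + jitter * np.eye(len(X))
    k0 = np.exp(-0.5 * np.sum((X - t0) ** 2, axis=1) / length**2)
    # cross-covariance between Y(x_i) and the gradient at t0:
    # d/dt0 k(x_i, t0) = k(x_i, t0) * (x_i - t0) / length^2
    grad_k = k0[:, None] * (X - t0) / length**2   # n x d
    alpha = np.linalg.solve(K, y)
    return (grad_k.T @ alpha) @ u                 # project gradient onto u

# 1-D check: the derivative of a GP fit to sin should approximate cos.
x = np.linspace(0.0, 2 * np.pi, 15)[:, None]
y = np.sin(x).ravel()
slope = gp_directional_derivative(x, y, t0=[1.0], u=[1.0])
```

The same cross-covariance construction, applied jointly to response and covariate surfaces, is what yields the explicit multivariate normal theory used for kriging gradients in the paper.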

Item Open Access Nonlinear Prediction in Credit Forecasting and Cloud Computing Deployment Optimization (2015) Jarrett, Nicholas Walton Daniel

This thesis presents data analysis and methodology for two prediction problems. The first problem is forecasting midlife credit ratings from personality information collected during early adulthood. The second problem is analysis of matrix multiplication in cloud computing.

The goal of the credit forecasting problem is to determine whether there is a link between personality assessments of young adults and their propensity to develop credit in middle age. The data we use come from a longitudinal study spanning more than 40 years. We do find an association between credit risk and personality in this cohort. Such a link has obvious implications for lenders, but it can also be used to improve social utility via more efficient resource allocation.

We analyze matrix multiplication in the cloud and model I/O and local computation for individual tasks. We establish conditions under which the distribution of job completion times can be obtained explicitly, and we generalize these results to cases where analytic derivations are intractable.

We develop models that emulate the multiplication procedure, allowing job times for different deployment parameter settings to be emulated after only witnessing a subset of tasks, or subsets of tasks for nearby deployment parameter settings.

The modeling framework developed sheds new light on the problem of determining expected job completion time for sparse matrix multiplication.

Item Open Access Numerical method for parameter inference of systems of nonlinear ordinary differential equations with partial observations. (Royal Society Open Science, 2021-07-28) Chen, Yu; Cheng, Jin; Gupta, Arvind; Huang, Huaxiong; Xu, Shixin

Parameter inference of dynamical systems is a challenging task faced by many researchers and practitioners across various fields. In many applications, it is common that only limited variables are observable. In this paper, we propose a method for parameter inference of a system of nonlinear coupled ordinary differential equations with partial observations. Our method combines fast Gaussian process-based gradient matching and deterministic optimization algorithms. By using initial values obtained by Bayesian steps with low sampling numbers, our deterministic optimization algorithm is accurate, robust and efficient with partial observations and large noise.

Item Open Access Predictive Models for Point Processes (2015) Lian, Wenzhao

Point process data are commonly observed in fields like healthcare and social science. Designing predictive models for such event streams is an under-explored problem, due to often scarce training data. In this thesis, a multitask point process model via a hierarchical Gaussian process (GP) is proposed, to leverage statistical strength across multiple point processes. Nonparametric learning functions implemented by a GP, which map from past events to future rates, allow analysis of flexible arrival patterns. To facilitate efficient inference, a sparse construction for this hierarchical model is proposed, and a variational Bayes method is derived for learning and inference. Experimental results are shown on both synthetic and real electronic health-records data.

Item Open Access Scalable Nonparametric Bayes Learning (2013) Banerjee, Anjishnu

Capturing high dimensional complex ensembles of data is becoming commonplace in a variety of application areas. Some examples include biological studies exploring relationships between genetic mutations and diseases, atmospheric and spatial data, and internet usage and online behavioral data. These large complex data present many challenges in their modeling and statistical analysis. Motivated by high dimensional data applications, in this thesis, we focus on building scalable Bayesian nonparametric regression algorithms and on developing models for joint distributions of complex object ensembles.

We begin with a scalable method for Gaussian process regression, a commonly used tool for nonparametric regression, prediction and spatial modeling. A very common bottleneck for large data sets is the need for repeated inversions of a big covariance matrix, which is required for likelihood evaluation and inference. Such inversion can be practically infeasible and, even if implemented, highly numerically unstable. We propose an algorithm utilizing random projection ideas to construct flexible, computationally efficient and easy-to-implement approaches for generic scenarios. We then further improve the algorithm by incorporating structure and blocking ideas in our random projections, and demonstrate their applicability in other contexts requiring inversion of large covariance matrices. We show theoretical guarantees for performance as well as substantial improvements over existing methods with simulated and real data. A by-product of the work is the discovery of hitherto unknown equivalences between approaches in machine learning, randomized linear algebra and Bayesian statistics. We finally connect random projection methods for large dimensional predictors and large sample size under a unifying theoretical framework.

The other focus of this thesis is joint modeling of complex ensembles of data from different domains. This goes beyond traditional relational modeling of ensembles of one type of data and relies on probability mixing measures over tensors. These models have added flexibility over some existing product mixture model approaches in letting each component of the ensemble have its own dependent cluster structure. We further investigate the question of measuring dependence between variables of different types and propose a very general novel scaled measure based on divergences between the joint and marginal distributions of the objects. Once again, we show excellent performance in both simulated and real data scenarios.
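A toy version of the random projection idea for covariance matrices (an illustrative sketch, not the thesis's algorithm): pass an n x n covariance matrix through a random m x n Gaussian map so that only a small m x m system must be factored, in the spirit of subset-of-regressors but using all data points.

```python
import numpy as np

def projected_kernel_approx(K, m, rng):
    """Low-rank approximation of a covariance matrix via random
    projection: K is approximated by K P^T (P K P^T)^{-1} P K, where
    P is an m x n Gaussian projection matrix."""
    n = K.shape[0]
    P = rng.standard_normal((m, n)) / np.sqrt(m)
    PK = P @ K                                    # m x n sketch of K
    small = PK @ P.T                              # m x m, cheap to invert
    return PK.T @ np.linalg.solve(small, PK)

rng = np.random.default_rng(3)
x = np.linspace(0, 1, 30)
# squared-exponential covariance plus a nugget for conditioning
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.1**2) + 0.1 * np.eye(30)
K_hat = projected_kernel_approx(K, m=15, rng=rng)
```

When `m = n` the approximation is algebraically exact; for `m < n` it trades a controlled approximation error for solving only an m x m system, which is the source of the computational savings.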

Item Open Access Sensor Planning for Bayesian Nonparametric Target Modeling (2016) Wei, Hongchuan

Bayesian nonparametric models, such as the Gaussian process and the Dirichlet process, have been extensively applied for target kinematics modeling in various applications including environmental monitoring, traffic planning, endangered species tracking, dynamic scene analysis, autonomous robot navigation, and human motion modeling. As shown by these successful applications, Bayesian nonparametric models are able to adjust their complexity adaptively from data as necessary, and are resistant to overfitting or underfitting. However, most existing works assume that the sensor measurements used to learn the Bayesian nonparametric target kinematics models are obtained a priori, or that the target kinematics can be measured by the sensor at any given time throughout the task. Little work has been done on controlling a sensor with a bounded field of view to obtain measurements of mobile targets that are most informative for reducing the uncertainty of the Bayesian nonparametric models. To present a systematic sensor planning approach to learning Bayesian nonparametric models, the Gaussian process target kinematics model is introduced first; it is capable of describing time-invariant spatial phenomena, such as ocean currents, temperature distributions and wind velocity fields. The Dirichlet process-Gaussian process target kinematics model is subsequently discussed for modeling mixtures of mobile targets, such as pedestrian motion patterns.

Novel information theoretic functions are developed for these Bayesian nonparametric target kinematics models to represent the expected utility of measurements as a function of sensor control inputs and random environmental variables. A Gaussian process expected Kullback-Leibler (KL) divergence is developed as the expectation of the KL divergence between the current (prior) and posterior Gaussian process target kinematics models with respect to the future measurements. This approach is then extended to develop a new information value function that can be used to estimate target kinematics described by a Dirichlet process-Gaussian process mixture model. A theorem is proposed showing that the novel information theoretic functions are bounded. Based on this theorem, efficient estimators of the new information theoretic functions are designed; these are proved to be unbiased, with the variance of the resultant approximation error decreasing linearly as the number of samples increases. The computational complexity of optimizing the novel information theoretic functions under sensor dynamics constraints is studied and proved to be NP-hard. A cumulative lower bound is then proposed to reduce the computational complexity to polynomial time.
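The key ingredient of such expected-KL utilities, for any finite-dimensional Gaussian (GP marginal) pair, is the closed-form KL divergence between multivariate normals. A minimal sketch:

```python
import numpy as np

def gaussian_kl(mu0, S0, mu1, S1):
    """KL( N(mu0, S0) || N(mu1, S1) ), the closed-form divergence
    between two multivariate normal distributions."""
    mu0, mu1 = np.asarray(mu0, dtype=float), np.asarray(mu1, dtype=float)
    S0, S1 = np.asarray(S0, dtype=float), np.asarray(S1, dtype=float)
    d = len(mu0)
    S1_inv = np.linalg.inv(S1)
    diff = mu1 - mu0
    quad = diff @ S1_inv @ diff                  # mean-shift term
    trace = np.trace(S1_inv @ S0)                # covariance-mismatch term
    _, logdet0 = np.linalg.slogdet(S0)           # stable log-determinants
    _, logdet1 = np.linalg.slogdet(S1)
    return 0.5 * (trace + quad - d + logdet1 - logdet0)
```

Averaging this divergence between prior and posterior GP marginals over the distribution of future measurements gives an expected information gain, which is the quantity the sensor planner maximizes over control inputs.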

Three sensor planning algorithms are developed according to the assumptions on the target kinematics and the sensor dynamics. For problems where the control space of the sensor is discrete, a greedy algorithm is proposed. The efficiency of the greedy algorithm is demonstrated by a numerical experiment with data of ocean currents obtained by moored buoys. A sweep line algorithm is developed for applications where the sensor control space is continuous and unconstrained. Synthetic simulations as well as physical experiments with ground robots and a surveillance camera are conducted to evaluate the performance of the sweep line algorithm. Moreover, a lexicographic algorithm is designed based on the cumulative lower bound of the novel information theoretic functions, for the scenario where the sensor dynamics are constrained. Numerical experiments with real data collected from indoor pedestrians by a commercial pan-tilt camera are performed to examine the lexicographic algorithm. Results from both the numerical simulations and the physical experiments show that the three sensor planning algorithms proposed in this dissertation based on the novel information theoretic functions are superior at learning the target kinematics with little or no prior knowledge.

Item Open Access Statistical Analysis of Response Distribution for Dependent Data via Joint Quantile Regression (2021) Chen, Xu

Linear quantile regression is a powerful tool to investigate how predictors may affect a response heterogeneously across different quantile levels. Unfortunately, existing approaches find it extremely difficult to adjust for any dependency between observation units, largely because such methods are not based upon a fully generative model of the data. In this dissertation, we address this difficulty for analyzing spatial point-referenced data and hierarchical data. Several models are introduced by generalizing the joint quantile regression model of Yang and Tokdar (2017) and characterizing different dependency structures via a copula model on the underlying quantile levels of the observation units. A Bayesian semiparametric approach is introduced to perform inference of model parameters and carry out prediction. Multiple copula families are discussed for modeling response data with tail dependence and/or tail asymmetry. An effective model comparison criterion is provided for selecting between models with different combinations of sets of predictors, marginal base distributions and copula models.

Extensive simulation studies and real applications are presented to illustrate substantial gains of the proposed models in inference quality, prediction accuracy and uncertainty quantification over existing alternatives. Through case studies, we highlight that the proposed models admit great interpretability and are competent in offering insightful new discoveries of response-predictor relationship at non-central parts of the response distribution. The effectiveness of the proposed model comparison criteria is verified with both empirical and theoretical evidence.
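The copula device above can be sketched as follows: latent quantile levels that are marginally uniform but dependent across observation units. Here a hypothetical bivariate Gaussian copula stands in for the richer families (with tail dependence and asymmetry) considered in the dissertation:

```python
import numpy as np
from math import erf, sqrt

def std_normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def dependent_quantile_levels(corr, size, rng):
    """Draw pairs of latent quantile levels (u1, u2) in (0, 1) that are
    uniform marginally but dependent through a Gaussian copula."""
    cov = np.array([[1.0, corr], [corr, 1.0]])
    z = rng.multivariate_normal([0.0, 0.0], cov, size=size)
    return np.vectorize(std_normal_cdf)(z)      # probability-integral transform

rng = np.random.default_rng(4)
u = dependent_quantile_levels(0.8, size=2000, rng=rng)
```

Feeding each `u` through a unit's conditional quantile function then yields responses whose marginal quantile regressions are preserved while spatial or hierarchical dependence is induced entirely through the copula.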

Item Open Access Topics in Bayesian Spatiotemporal Prediction of Environmental Exposure (2019) White, Philip Andrew

We address predictive modeling for spatial and spatiotemporal data in a variety of settings. First, we discuss spatial and spatiotemporal data and the corresponding model types used in later chapters: Markov random fields, Gaussian processes, and Bayesian inference. Then, we outline the dissertation.

In Chapter 2, we consider the setting where areal unit data are only partially observed. First, we consider the case where a portion of the areal units have been observed and seek prediction of the remainder. Second, we leverage these ideas for model comparison, fitting models of interest to a portion of the data and holding out the rest for comparison.

In Chapters 3 and 4, we consider pollution data from Mexico City in 2017. In Chapter 3, we forecast pollution emergencies. Mexico City defines pollution emergencies using thresholds that rely on regional maxima for ozone and for particulate matter with diameter less than 10 micrometers (PM10). To predict local pollution emergencies and to assess compliance with Mexican ambient air quality standards, we analyze hourly ozone and PM10 measurements from 24 stations across Mexico City in 2017 using a bivariate spatiotemporal model. With this model, we predict future pollutant levels using current weather conditions and recent pollutant concentrations. Employing hourly pollutant projections, we predict the regional maxima needed to estimate the probability of future pollution emergencies. We discuss how predicted compliance with legislated pollution limits varied across regions within Mexico City in 2017.

In Chapter 4, we propose a continuous spatiotemporal model for Mexico City ozone levels that accounts for distinct daily seasonality, as well as variation across the city and over the peak ozone season (April and May) of 2017. To account for these patterns, we use covariance models over space, circles, and time. We review relevant existing covariance models and develop new classes of nonseparable covariance models appropriate for seasonal data collected at many locations. We compare the predictive performance of a variety of models that utilize various nonseparable covariance functions. We use the best model to predict hourly ozone levels at unmonitored locations in April and May to infer compliance with Mexican air quality standards and to estimate respiratory health risk associated with ozone exposure.
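The ingredients of such covariance models can be illustrated with a simple separable baseline (the chapter's contribution is precisely nonseparable generalizations of such products; the kernel forms and scale parameters below are hypothetical): exponential decay in spatial distance and in time lag, combined with a periodic kernel on the 24-hour circle for daily seasonality.

```python
import numpy as np

def space_hour_time_cov(s1, s2, h1, h2, t1, t2,
                        ls=1.0, lh=2.0, lt=24.0, amp=1.0):
    """Separable covariance over space x circle (hour of day) x time:
    exponential in spatial distance, periodic in hour, exponential
    in time lag."""
    spatial = np.exp(-np.linalg.norm(np.subtract(s1, s2)) / ls)
    # distance on the 24-hour circle enters through sin^2, so the
    # kernel repeats every 24 hours
    hour = np.exp(-2.0 * np.sin(np.pi * (h1 - h2) / 24.0) ** 2 / lh**2)
    temporal = np.exp(-abs(t1 - t2) / lt)
    return amp**2 * spatial * hour * temporal
```

A nonseparable model would let these factors interact, e.g. allowing the effective spatial range to vary with hour of day, rather than multiplying three independent correlation functions.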