# Browsing by Subject "Bayesian Statistics"


Item Open Access: Bayesian and Information-Theoretic Learning of High Dimensional Data (2012), Chen, Minhua

The concept of sparseness is harnessed to learn a low dimensional representation of high dimensional data. This sparseness assumption is exploited in multiple ways. In the Bayesian Elastic Net, a small number of correlated features are identified for the response variable. In the sparse Factor Analysis for biomarker trajectories, the high dimensional gene expression data is reduced to a small number of latent factors, each with a prototypical dynamic trajectory. In the Bayesian Graphical LASSO, the inverse covariance matrix of the data distribution is assumed to be sparse, inducing a sparsely connected Gaussian graph. In the nonparametric Mixture of Factor Analyzers, the covariance matrices in the Gaussian Mixture Model are forced to be low-rank, which is closely related to the concept of block sparsity.

Finally, in the information-theoretic projection design, a linear projection matrix is explicitly sought for information-preserving dimensionality reduction. All the methods mentioned above prove effective in learning from both simulated and real high dimensional datasets.
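The low-rank covariance structure shared by the factor-analysis methods above can be sketched in a few lines. This is an illustrative toy (the dimensions, loadings, and noise variances here are invented, not taken from the dissertation), showing how a factor model implies a low-rank-plus-diagonal covariance:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 50 observed variables, 3 latent factors.
p, k = 50, 3
W = rng.standard_normal((p, k))        # factor loadings
psi = np.full(p, 0.1)                  # diagonal idiosyncratic noise variances

# Covariance implied by a factor-analysis model: low-rank signal plus diagonal noise.
Sigma = W @ W.T + np.diag(psi)

# The signal part W W^T has rank k << p: the "low dimensional" structure.
print(np.linalg.matrix_rank(W @ W.T))  # 3
```

In a mixture of factor analyzers, each mixture component carries its own such low-rank covariance, which is what induces the block-sparsity mentioned above.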

Item Open Access: Bayesian Emulation for Sequential Modeling, Inference and Decision Analysis (2016), Irie, Kaoru

Advances in three related areas of state-space modeling, sequential Bayesian learning, and decision analysis are addressed, with the statistical challenges of scalability and associated dynamic sparsity. The key theme that ties the three areas together is Bayesian model emulation: solving challenging analytical and computational problems using creative model emulators. This idea defines theoretical and applied advances in non-linear, non-Gaussian state-space modeling, dynamic sparsity, decision analysis and statistical computation, across linked contexts of multivariate time series and dynamic network studies. Examples and applications in financial time series and portfolio analysis, macroeconomics, and internet studies from computational advertising demonstrate the utility of the core methodological innovations.

Chapter 1 summarizes the three areas/problems and the key idea of emulation in those areas. Chapter 2 discusses the sequential analysis of latent threshold models with use of emulating models that allow for analytical filtering to enhance the efficiency of posterior sampling. Chapter 3 examines the emulator model in decision analysis, or the synthetic model, that is equivalent to the loss function in the original minimization problem, and shows its performance in the context of sequential portfolio optimization. Chapter 4 describes a method for modeling streaming count data observed on a large network that relies on emulating the whole, dependent network model by independent, conjugate sub-models customized to each set of flows. Chapter 5 reviews those advances and offers concluding remarks.
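The conjugate sub-model idea of Chapter 4 can be illustrated with the standard Gamma-Poisson pair. This is a generic sketch, not the thesis's network model: a Gamma prior on a single flow's Poisson rate admits a closed-form sequential update as counts stream in, which is what makes per-flow emulators cheap:

```python
import numpy as np

# Hypothetical stream of counts on one network flow.
counts = [3, 5, 2, 7, 4]

# Gamma(a, b) prior on the Poisson rate; conjugacy gives a closed-form
# sequential update after each observation: a += y_t, b += 1.
a, b = 1.0, 1.0
for y in counts:
    a += y
    b += 1.0

posterior_mean = a / b   # (1 + 21) / (1 + 5) = 22/6 ≈ 3.67
print(posterior_mean)
```

Running one such independent conjugate recursion per flow avoids any joint posterior computation over the full dependent network.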

Item Open Access: Bayesian Statistical Analysis in Coastal Eutrophication Models: Challenges and Solutions (2014), Nojavan Asghari, Farnaz

Estuaries, interfacing with the land, atmosphere and open oceans, can be influenced in a variety of ways by anthropogenic activities. Centuries of overexploitation, habitat transformation, and pollution have degraded estuarine ecological health. Key concerns of the public and of environmental managers of estuaries include water quality, particularly the enrichment of nutrients, increased chlorophyll a concentrations, increased hypoxia/anoxia, and increased Harmful Algal Blooms (HABs). One reason for the increased nitrogen loading over the past two decades is the proliferation of concentrated animal feeding operations (CAFOs) in coastal areas. This dissertation documents a study of estuarine eutrophication modeling, including modeling of a major source of nitrogen in the watershed, the use of Bayesian Networks (BNs) for modeling eutrophication dynamics in an estuary, a documentation of potential problems of using BNs, and a continuous BN model for addressing these problems.

Environmental models have emerged as great tools to transform data into useful information for managers and policy makers. Environmental models contain uncertainty due to natural ecosystem variability, incomplete knowledge of environmental processes, modeling structure, computational restrictions, and problems with data/observations due to measurement error or missingness. Many methodologies capable of quantifying uncertainty have been developed in the scientific literature. Examples of such methods are BNs, which utilize conditional probability tables to describe the relationships among variables. This doctoral dissertation demonstrates how BNs, as probabilistic models, can be used to model eutrophication in estuarine ecosystems and to explore the effects of plausible future climatic and nutrient pollution management scenarios on water quality indicators. The results show interaction among various predictors and their impact on ecosystem health. The synergistic effects between nutrient concentrations and climate variability caution future management actions.

BNs have several distinct strengths, such as the ability to update knowledge based on Bayes' theorem, modularity, accommodation of various knowledge sources and data types, suitability to both data-rich and data-poor systems, and incorporation of uncertainty. Further, BNs' graphical representation facilitates communicating models and results with environmental managers and decision-makers. However, BNs have certain drawbacks as well. For example, they can only handle continuous variables under severe restrictions: (1) each continuous variable must be assigned a (linear) conditional Normal distribution; (2) no discrete variable may have continuous parents. The solution, thus far, to address this constraint has been discretizing variables. I designed an experiment to evaluate and compare the impact of common discretization methods on BNs. The results indicate that the choice of discretization method severely impacts the model results; however, I was unable to provide any criteria to select an optimal discretization method.
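Two of the discretization methods commonly compared in such experiments can be sketched quickly. This is a generic illustration on simulated skewed data (the bin count and distribution are invented, not from the dissertation), contrasting equal-width and equal-frequency cuts:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.lognormal(size=1000)   # a skewed variable, as is common for nutrient data

bins = 4

# Equal-width: cut the observed range into equally wide intervals.
width_edges = np.linspace(x.min(), x.max(), bins + 1)

# Equal-frequency: cut at quantiles so each bin holds about the same count.
freq_edges = np.quantile(x, np.linspace(0, 1, bins + 1))

w_counts = np.histogram(x, width_edges)[0]
f_counts = np.histogram(x, freq_edges)[0]
print(w_counts, f_counts)   # equal-width bins are badly unbalanced for skewed data
```

Because the resulting conditional probability tables depend entirely on these edges, the two choices can imply very different BN behavior, which is the sensitivity the experiment above documents.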

Finally, I propose a continuous variable Bayesian Network methodology and demonstrate its application for water quality modeling in estuarine ecosystems. The proposed method retains advantageous characteristics of BNs, while it avoids the drawbacks of discretization by specifying the relationships among the nodes using statistical and conditional probability models. The Bayesian nature of the proposed model enables prompt investigation of observed patterns, as new conditions unfold. The network structure presents the underlying ecological ecosystem processes and provides a basis for science communication. I demonstrate model development and temporal updating using the New River Estuary, NC data set and spatial updating using the Neuse River Estuary, NC data set.

Item Open Access: Finite Sample Bounds and Path Selection for Sequential Monte Carlo (2018), Marion, Joseph

Sequential Monte Carlo (SMC) samplers have received attention as an alternative to Markov chain Monte Carlo for Bayesian inference problems due to their strong empirical performance on difficult multimodal problems, natural synergy with parallel computing environments, and accuracy when estimating ratios of normalizing constants. However, while these properties have been demonstrated empirically, the extent of these advantages remains unexplored theoretically. Typical convergence results for SMC are limited to root-N results; they obscure the relationship between the algorithmic factors (weights, Markov kernels, target distribution) and the error of the resulting estimator. This limitation makes it difficult to compare SMC to other estimation methods and challenging to design efficient SMC algorithms from a theoretical perspective.

In this thesis, we provide conditions under which SMC provides a randomized approximation scheme, showing how to choose the number of particles and Markov kernel transitions at each SMC step in order to ensure an accurate approximation with bounded error. These conditions rely on the sequence of SMC interpolating distributions and the warm mixing times of the Markov kernels, explicitly relating the algorithmic choices to the error of the SMC estimate. This allows us to provide finite-sample complexity bounds for SMC in a variety of settings, including finite state-spaces, product spaces, and log-concave target distributions.

A key advantage of this approach is that the bounds provide insight into the selection of efficient sequences of SMC distributions. When the target distribution is spherical Gaussian or log-concave, we show that judicious selection of interpolating distributions results in an SMC algorithm with a smaller complexity bound than MCMC. These results are used to motivate the use of a well-known SMC algorithm that adaptively chooses interpolating distributions. We provide conditions under which the adaptive algorithm gives a randomized approximation scheme, providing theoretical validation for the automatic selection of SMC distributions.
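The adaptive selection of interpolating distributions can be illustrated with a standard tempered SMC sampler on a toy problem. This is a generic sketch (the target, thresholds, and kernel are all invented for illustration, not taken from the thesis): the next temperature is chosen by bisection so the effective sample size (ESS) never drops below half the particle count:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: move particles from a broad N(0, 3^2) start toward an
# unnormalized target p1(x) ∝ exp(-(x-3)^2/2), i.e. N(3, 1), through
# tempered bridges pi_beta ∝ p0^(1-beta) * p1^beta.
log_p0 = lambda x: -0.5 * (x / 3.0) ** 2
log_p1 = lambda x: -0.5 * (x - 3.0) ** 2

def ess(logw):
    w = np.exp(logw - logw.max())
    return w.sum() ** 2 / (w ** 2).sum()

n, beta = 4000, 0.0
x = rng.normal(0.0, 3.0, size=n)
while beta < 1.0:
    incr = log_p1(x) - log_p0(x)
    # Adaptive step: largest temperature increment keeping ESS >= n/2.
    if ess((1.0 - beta) * incr) >= n / 2:
        new_beta = 1.0
    else:
        lo, hi = beta, 1.0
        for _ in range(50):                 # bisection on the ESS criterion
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if ess((mid - beta) * incr) >= n / 2 else (lo, mid)
        new_beta = lo
    # Reweight and resample.
    logw = (new_beta - beta) * incr
    w = np.exp(logw - logw.max())
    w /= w.sum()
    x = x[rng.choice(n, size=n, p=w)]
    beta = new_beta
    # A few random-walk Metropolis moves targeting pi_beta.
    log_pi = lambda z: (1.0 - beta) * log_p0(z) + beta * log_p1(z)
    for _ in range(5):
        prop = x + rng.normal(0.0, 0.5, size=n)
        accept = np.log(rng.uniform(size=n)) < log_pi(prop) - log_pi(x)
        x = np.where(accept, prop, x)

print(x.mean(), x.std())   # final particle cloud should approximate N(3, 1)
```

The ESS rule is one common heuristic for "automatic selection of SMC distributions"; the thesis's contribution is the finite-sample theory validating schemes of this kind, not this particular recipe.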

Selecting efficient sequences of distributions is a problem that also arises in the estimation of normalizing constants using path sampling. In the final chapter of this thesis, we develop automatic methods for choosing sequences of distributions that provide low-variance path sampling estimators. These approaches are motivated by properties of the theoretically optimal, lowest-variance path, which is given by the geodesic of the Riemann manifold associated with the path sampling family. For one-dimensional paths we provide a greedy approach to step size selection that has good empirical performance. For multidimensional paths, we present an approach using Gaussian process emulation that efficiently finds low-variance paths in this more complicated setting.
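A minimal path sampling (thermodynamic integration) estimator makes the role of the path concrete. This is a textbook-style illustration with an invented example, not the thesis's method: along a geometric path between two Gaussians, the log normalizing-constant ratio is the integral over the path of the expected derivative of the log density:

```python
import numpy as np

rng = np.random.default_rng(0)

# Geometric path q_t(x) ∝ exp(-x^2/2)^(1-t) * exp(-x^2/8)^t, t in [0, 1],
# between N(0, 1) and N(0, 4). The exact log ratio of normalizing constants
# is log(sqrt(2*pi*4)/sqrt(2*pi*1)) = log 2.
ts = np.linspace(0.0, 1.0, 21)
means = []
for t in ts:
    lam = (1.0 - t) * 1.0 + t * 0.25            # precision of q_t
    x = rng.normal(0.0, lam ** -0.5, size=20000)
    # d/dt log q_t(x) = log p1(x) - log p0(x) = (1/2 - 1/8) * x^2
    means.append(np.mean((0.5 - 0.125) * x ** 2))

means = np.array(means)
# Trapezoidal rule along the path.
log_ratio = np.sum(0.5 * (means[:-1] + means[1:]) * np.diff(ts))
print(log_ratio)   # ≈ log 2 ≈ 0.693
```

The estimator's variance depends on the grid of `t` values, which is exactly why step-size selection and, in higher dimensions, the geodesic path matter.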

Item Open Access: New Advancements of Scalable Statistical Methods for Learning Latent Structures in Big Data (2016), Zhao, Shiwen

Constant technological advances have caused a data explosion in recent years. Accordingly, modern statistical and machine learning methods must be adapted to deal with complex and heterogeneous data types. This phenomenon is particularly true for analyzing biological data. For example, DNA sequence data can be viewed as categorical variables with each nucleotide taking four different categories. Gene expression data, depending on the quantitative technology, could be continuous numbers or counts. With the advancement of high-throughput technology, the abundance of such data has become unprecedentedly rich. Therefore, efficient statistical approaches are crucial in this big data era.

Previous statistical methods for big data often aim to find low dimensional structures in the observed data. For example, in a factor analysis model a latent Gaussian distributed multivariate vector is assumed. With this assumption a factor model produces a low rank estimation of the covariance of the observed variables. Another example is the latent Dirichlet allocation model for documents, in which the mixture proportions of topics are modeled with a Dirichlet distributed variable. This dissertation proposes several novel extensions to the previous statistical methods that are developed to address challenges in big data. Those novel methods are applied in multiple real world applications, including construction of condition-specific gene co-expression networks, estimating shared topics among newsgroups, analysis of promoter sequences, analysis of political-economics risk data, and estimating population structure from genotype data.
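The latent Dirichlet allocation generative story mentioned above is short enough to sketch directly. The corpus settings below (3 topics, 10-word vocabulary, symmetric Dirichlet parameters) are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical corpus settings: 3 topics over a 10-word vocabulary.
n_topics, vocab = 3, 10
topic_word = rng.dirichlet(np.full(vocab, 0.5), size=n_topics)  # per-topic word dists

# LDA's generative story for one 50-word document:
theta = rng.dirichlet(np.full(n_topics, 0.1))       # mixture proportions of topics
z = rng.choice(n_topics, size=50, p=theta)          # topic assignment per word
words = np.array([rng.choice(vocab, p=topic_word[k]) for k in z])
print(theta, words[:10])
```

Inference then reverses this story, recovering `theta` and `topic_word` from observed words; the small Dirichlet concentration (0.1) makes each document favor a few topics, the same sparseness intuition as in the factor model.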

Item Open Access: On Uncertainty Quantification for Systems of Computer Models (2017), Kyzyurova, Ksenia

Scientific inquiry about natural phenomena and processes is increasingly relying on the use of computer models as simulators of such processes. The challenge of using computer models for scientific investigation is that they are expensive in terms of computational cost and resources. However, the core methodology of fast statistical emulation (approximation) of a computer model overcomes this computational problem.

Complex phenomena and processes are often described not by a single computer model, but by a system of computer models or simulators. Direct emulation of a system of simulators may be infeasible for computational and logistical reasons.

This thesis proposes a statistical framework for fast emulation of systems of computer models and demonstrates its potential for inferential and predictive scientific goals.

The first chapter of the thesis introduces the Gaussian stochastic process (GaSP) emulator of a single simulator and summarizes ideas and findings in the rest of the thesis. The second chapter investigates the possibility of using independent GaSP emulators of computer models for fast construction of emulators of systems of computer models. The resulting approximation to a system of computer models is called the linked emulator. The third chapter discusses the irrelevance of attempting to model multivariate output of a computer model for the purpose of emulation of that model. The linear model of coregionalization (LMC) is used to demonstrate this irrelevance, from both a theoretical perspective and simulation studies. The fourth chapter introduces a framework for calibration of a system of computer models, using its linked emulator. The linked emulator allows for development of independent emulators of submodels on their own separately constructed design spaces, thus leading to effective dimension reduction in the explored parameter space. The fifth chapter addresses the use of some non-Gaussian emulators, in particular censored and truncated GaSP emulators. The censored emulator is constructed to appropriately account for zero-inflated output of a computer model, arising when there are large regions of the input space for which the computer model output is zero. The truncated GaSP accommodates computer model output that is constrained to lie in a certain region. The linked emulator, for systems of computer models whose individual subemulators are either censored or truncated, is also presented. The last chapter concludes with an exposition of further research directions based on the ideas explored in the thesis.
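The linked-emulator idea, composing independent emulators of submodels, can be sketched with a plug-in version. This is a simplified illustration under assumed forms (toy simulators `f` and `g`, a squared-exponential kernel, plug-in of the posterior mean only), not the thesis's linked emulator, which propagates uncertainty through the composition rather than just the mean:

```python
import numpy as np

# Minimal GaSP-style emulator: posterior mean under a squared-exponential kernel.
def kern(a, b, ls=0.3):
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ls) ** 2)

def gp_mean(x_train, y_train, x_new, nugget=1e-8):
    K = kern(x_train, x_train) + nugget * np.eye(len(x_train))
    return kern(x_new, x_train) @ np.linalg.solve(K, y_train)

# Two toy simulators in series: the output of f feeds into g.
f = lambda x: np.sin(2 * np.pi * x)
g = lambda y: y ** 2

xf = np.linspace(0, 1, 15)            # design for the f-emulator
yf = f(xf)
xg = np.linspace(-1, 1, 15)           # separately constructed design for g
yg = g(xg)

x_new = np.array([0.1, 0.37, 0.8])
linked = gp_mean(xg, yg, gp_mean(xf, yf, x_new))   # plug-in linked prediction
truth = g(f(x_new))
print(linked, truth)
```

Note that each emulator is trained on its own design space, which is the dimension-reduction benefit described for the calibration framework in the fourth chapter.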

The methodology developed in this thesis is illustrated by an application to quantification of the hazard from pyroclastic flow from the Soufrière Hills Volcano on the island of Montserrat; a case study on prediction of volcanic ash transport and dispersal from the Eyjafjallajökull volcano, Iceland, on April 14-16, 2010; and calibration of a vapour-liquid equilibrium model, a submodel of the Aspen Plus® chemical process software for design and deployment of amine-based CO₂ capture systems.

Item Open Access: U.S. Fiscal Multipliers (2015), Lusompa, Amaze Basilwa

This paper investigates whether government spending multipliers are time-varying. The multipliers are measured using time-varying parameter (TVP) local projections. This paper uses a simple modification to local projections that corrects for their inherent autocorrelated errors. The results indicate that there is evidence of time variation in government spending multipliers and that the results of previous studies should be seriously questioned. The results also indicate that there is significant time variation in the strength of Blanchard-Perotti and defense-news identified shocks.
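A constant-parameter local projection, the building block behind the TVP version, can be sketched on simulated data. This is a generic illustration (the AR(1) process and horizons are invented, not from the paper): the impulse response at horizon h is the coefficient from regressing y at t+h on the shock at t:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated AR(1): y_t = 0.8 y_{t-1} + e_t, so the true impulse
# response to a unit shock at horizon h is 0.8^h.
T, rho = 20000, 0.8
e = rng.standard_normal(T)
y = np.zeros(T)
for t in range(1, T):
    y[t] = rho * y[t - 1] + e[t]

def local_projection(y, shock, h):
    # Regress y_{t+h} on the shock at time t (with an intercept).
    X = np.column_stack([np.ones(T - h), shock[: T - h]])
    beta = np.linalg.lstsq(X, y[h:], rcond=None)[0]
    return beta[1]

irf = [local_projection(y, e, h) for h in range(5)]
print(irf)   # ≈ [1.0, 0.8, 0.64, 0.512, 0.41]
```

The residuals of this regression are autocorrelated by construction for h > 0, which is exactly the problem the paper's modification to local projections addresses; this sketch shows only the naive point estimates.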