Browsing by Subject "Bayesian"
Item Open Access A Bayesian Approach to Understanding Music Popularity (2015-05-08) Shapiro, Heather
The Billboard Hot 100 has been the main record chart for popular music in the American music industry since its first official release in 1958. Today, this ranking is based upon the frequency with which a song is played on the radio, streamed online, and its sales. Over time, however, the limitations of the chart have become more pronounced, and record labels have tried different strategies to maximize a song’s potential on the chart in order to increase sales and success. This paper intends to analyze metadata and audio analysis features from a random sample of one million popular tracks, dating back to 1922, and assess their potential on the Billboard Hot 100 list. We compare the results of Bayesian Additive Regression Trees (BART) to other decision tree methods for predictive accuracy. Through the use of such trees, we can determine the interaction and importance of different variables over time and their effects on a single’s success on the Billboard chart. With such knowledge, we can assess and identify past music trends, and provide producers with the steps to create the ‘perfect’ commercially successful song, ultimately removing the creative artistry from music making.
Item Open Access A Bayesian Strategy to the 20 Question Game with Applications to Recommender Systems (2017) Suresh, Sunith Raj
In this paper, we develop an algorithm that utilizes a Bayesian strategy to determine a sequence of questions to play the 20 Question game. The algorithm is motivated by an application to active recommender systems. We first develop an algorithm that constructs a sequence of questions where each question inquires only about a single binary feature. We test the performance of the algorithm utilizing simulation studies, and find that it performs relatively well under an informed prior. We modify the algorithm to construct a sequence of questions where each question inquires about 2 binary features via AND conjunction. We test the performance of the modified algorithm
via simulation studies, and find that it does not significantly improve performance.
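The sequential question-selection idea lends itself to a compact illustration. Below is a minimal sketch (not the author's algorithm) of a greedy Bayesian strategy over single binary features: maintain a posterior over candidate items and, at each turn, ask the question whose expected "yes" probability is closest to one half. The item set, features, noise level, and prior are all hypothetical.

```python
import numpy as np

def pick_question(posterior, features):
    """Choose the binary feature whose 'yes' probability is closest to 0.5
    under the current posterior over items (greedy, one-step lookahead)."""
    p_yes = posterior @ features          # P(answer = yes) for each feature
    return int(np.argmin(np.abs(p_yes - 0.5)))

def update_posterior(posterior, features, q, answer, eps=0.05):
    """Bayes update allowing a small probability eps of a noisy answer."""
    like = np.where(features[:, q] == answer, 1.0 - eps, eps)
    post = posterior * like
    return post / post.sum()

# Hypothetical toy setup: 6 items described by 4 binary features, uniform prior.
rng = np.random.default_rng(0)
features = rng.integers(0, 2, size=(6, 4)).astype(float)
posterior = np.full(6, 1.0 / 6)

truth = 2                                  # the item the "oracle" has in mind
for turn in range(3):
    q = pick_question(posterior, features)
    answer = int(features[truth, q])       # noiseless oracle answer
    posterior = update_posterior(posterior, features, q, answer)
    print(f"turn {turn}: asked feature {q}, posterior = {np.round(posterior, 3)}")
```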
Item Open Access Advances in Bayesian Hierarchical Modeling with Tree-based Methods (2020) Mao, Jialiang
Developing flexible tools that apply to datasets with large size and complex structure while providing interpretable outputs is a major goal of modern statistical modeling. A family of models that is especially suitable for this task is the Pólya tree type models. Following a divide-and-conquer strategy, these tree-based methods transform the original task into a series of tasks that are smaller in size and easier to solve, while their nonparametric nature guarantees the modeling flexibility to cope with datasets with a complex structure. In this work, we develop three novel tree-based methods that tackle different challenges in Bayesian hierarchical modeling. Our first two methods are designed specifically for microbiome sequencing data, which consist of high-dimensional counts with a complex, domain-specific covariate structure and exhibit large cross-sample variations. These features limit the performance of generic statistical tools and require special modeling considerations. Both methods inherit the flexibility and computational efficiency of general tree-based methods and directly utilize domain knowledge to help infer the complex dependency structure among different microbiome categories by bringing the phylogenetic tree into the modeling framework. An important task in microbiome research is to compare the composition of the microbial community across groups of subjects. We first propose a model for this classic two-sample problem in the microbiome context by transforming the original problem into a multiple testing problem, with a series of tests defined at the internal nodes of the phylogenetic tree. To improve the power of the test, we use a graphical model to allow information sharing among the tests. A regression-type adjustment is also considered to reduce the chance of false discovery. Next, we introduce a model-based clustering method for microbiome count data with a Dirichlet process mixture setup. The phylogenetic tree is used for constructing the mixture kernels to offer a flexible covariate structure. To improve the ability to detect clusters determined not only by the dominating microbiome categories, a subroutine is introduced in the clustering procedure that selects a subset of internal nodes of the tree which are relevant for clustering. This subroutine is also important in avoiding potential overfitting. Our third contribution proposes a framework for causal inference through Bayesian recursive partitioning that allows joint modeling of the covariate balancing and the potential outcome. With a retrospective perspective, we model the covariates and the outcome conditioning on the treatment assignment status. For the challenging multivariate covariate modeling, we adopt a flexible nonparametric prior that focuses on the relation of the covariate distributions under the two treatment groups, while integrating out other aspects of these distributions that are irrelevant for estimating the causal effect.
Item Open Access Applications and Computation of Stateful Polya Trees (2017) Christensen, Jonathan
Polya trees are a class of nonparametric priors on distributions which are able to model absolutely continuous distributions directly, rather than modeling a discrete distribution over parameters of a mixing kernel to obtain an absolutely continuous distribution. The Polya tree discretizes the state space with a recursive partition, generating a distribution by assigning mass to the child elements at each level of the recursive partition according to a Beta distribution. Stateful Polya trees are an extension of the Polya tree where each set in the recursive partition has one or more discrete state variables associated with it. We can learn the posterior distributions of these state variables along with the posterior of the distribution. State variables may be of interest in their own right, or may be nuisance parameters which we use to achieve more flexible models but wish to integrate out in the posterior. We discuss the development of stateful Polya trees and the Hierarchical Adaptive Polya Tree, which uses state variables to flexibly model the concentration parameter of Polya trees in a hierarchical Bayesian model. We also consider difficulties with the use of marginal likelihoods to determine posterior probabilities of states.
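As a rough illustration of the recursive mass-splitting construction described above, the sketch below draws a random distribution from a finite-depth Polya tree on [0, 1] by splitting each interval's mass with a Beta draw. The depth and concentration schedule are illustrative defaults, not the choices made in the dissertation, and no state variables are included.

```python
import numpy as np

def sample_polya_tree(depth=6, c=1.0, rng=None):
    """Draw bin probabilities for a depth-`depth` Polya tree on [0, 1].

    Each set at level j splits its mass between its two children according to
    a Beta(c*j**2, c*j**2) draw, a common default that keeps deep levels close
    to an even split so the limiting draw is absolutely continuous.
    """
    rng = rng or np.random.default_rng()
    mass = np.array([1.0])                     # mass of the single root set
    for j in range(1, depth + 1):
        a = c * j ** 2
        left = rng.beta(a, a, size=mass.size)  # fraction sent to each left child
        mass = np.column_stack([mass * left, mass * (1 - left)]).ravel()
    return mass                                # 2**depth bin probabilities

probs = sample_polya_tree(depth=6)
print("total mass:", probs.sum())              # ~1.0 up to floating point
print("first few bin probabilities:", np.round(probs[:4], 4))
```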
Item Open Access Bayesian Computation for Variable Selection and Multivariate Forecasting in Dynamic Models (2020) Lavine, Isaac
Challenges arise in time series analysis due to the need for sequential forecasting and updating of model parameters as data is observed. This dissertation presents techniques for efficient Bayesian computation in multivariate time series analysis. Computational scalability is a core focus of this work, and often rests on the decouple-recouple concept, in which multivariate models are decoupled into univariate models for efficient inference and then recoupled to produce joint forecasts. The first section of this dissertation develops novel methods for variable selection in which models are scored and weighted based on specific forecasting and decision goals. In the time series setting, standard marginal likelihoods correspond to 1-step forecast densities, and considering alternate objectives is shown to improve long-term forecast accuracy. Scoring models based on forecast objectives can be computationally intensive, so the model space is reduced by evaluating univariate models separately along each dimension. This enables an efficient search over large, higher-dimensional model spaces. A second area of focus in this dissertation is product demand forecasting, driven by applied considerations in grocery store sales. A novel copula model is developed for multivariate forecasting with Dynamic Generalized Linear Models (DGLMs), with a variational Bayes strategy for inference in latent factor DGLMs. Three applied case studies demonstrate that these techniques increase computational efficiency by several orders of magnitude over comparable multivariate models, without any loss of forecast accuracy. An additional area of interest in product demand forecasting is the effect of holidays and special events. An error correction model is introduced for this context, demonstrating strong predictive performance across a variety of holidays and retail item categories. Finally, a new Python package for Bayesian DGLM analysis, PyBATS, provides a set of tools for user-friendly analysis of univariate and multivariate time series.
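The sequential forecast-then-update loop underlying DGLM analysis can be illustrated with the simplest member of the family, a univariate normal dynamic linear model. The sketch below is a generic Kalman-style recursion on made-up data; it is not the PyBATS API and omits the copula and variational machinery described in the abstract.

```python
import numpy as np

def dlm_step(m, C, y, F=1.0, G=1.0, W=0.1, V=1.0):
    """One forecast/update cycle of a univariate normal DLM.

    m, C : posterior mean and variance of the state from the previous time
    y    : new observation; F, G, W, V are the usual DLM quadruple.
    Returns the one-step forecast mean/variance and the updated state.
    """
    a = G * m                 # prior state mean for time t
    R = G * C * G + W         # prior state variance
    f = F * a                 # one-step forecast mean
    Q = F * R * F + V         # one-step forecast variance
    A = R * F / Q             # adaptive (Kalman) gain
    m_new = a + A * (y - f)   # posterior state mean
    C_new = R - A * Q * A     # posterior state variance
    return f, Q, m_new, C_new

# Hypothetical demand series: filter it and print one-step forecasts.
y_series = [10.2, 11.0, 9.7, 12.3, 11.8]
m, C = 10.0, 5.0              # vague-ish initial prior
for t, y in enumerate(y_series):
    f, Q, m, C = dlm_step(m, C, y)
    print(f"t={t}: forecast {f:.2f} (var {Q:.2f}), filtered mean {m:.2f}")
```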
Item Open Access Bayesian Inference in Large-scale Problems (2016) Johndrow, James Edward
Many modern applications fall into the category of "large-scale" statistical problems, in which both the number of observations n and the number of features or parameters p may be large. Many existing methods focus on point estimation, despite the continued relevance of uncertainty quantification in the sciences, where the number of parameters to estimate often exceeds the sample size even with the huge increases in the value of n typically seen in many fields. Thus, the tendency in some areas of industry to dispense with traditional statistical analysis on the basis that "n=all" is of little relevance outside of certain narrow applications. The main result of the Big Data revolution in most fields has instead been to make computation much harder without reducing the importance of uncertainty quantification. Bayesian methods excel at uncertainty quantification, but often scale poorly relative to alternatives. This conflict between the statistical advantages of Bayesian procedures and their substantial computational disadvantages is perhaps the greatest challenge facing modern Bayesian statistics, and is the primary motivation for the work presented here.
Two general strategies for scaling Bayesian inference are considered. The first is the development of methods that lend themselves to faster computation, and the second is design and characterization of computational algorithms that scale better in n or p. In the first instance, the focus is on joint inference outside of the standard problem of multivariate continuous data that has been a major focus of previous theoretical work in this area. In the second area, we pursue strategies for improving the speed of Markov chain Monte Carlo algorithms, and characterizing their performance in large-scale settings. Throughout, the focus is on rigorous theoretical evaluation combined with empirical demonstrations of performance and concordance with the theory.
One topic we consider is modeling the joint distribution of multivariate categorical data, often summarized in a contingency table. Contingency table analysis routinely relies on log-linear models, with latent structure analysis providing a common alternative. Latent structure models lead to a reduced rank tensor factorization of the probability mass function for multivariate categorical data, while log-linear models achieve dimensionality reduction through sparsity. Little is known about the relationship between these notions of dimensionality reduction in the two paradigms. In Chapter 2, we derive several results relating the support of a log-linear model to nonnegative ranks of the associated probability tensor. Motivated by these findings, we propose a new collapsed Tucker class of tensor decompositions, which bridge existing PARAFAC and Tucker decompositions, providing a more flexible framework for parsimoniously characterizing multivariate categorical data. Taking a Bayesian approach to inference, we illustrate empirical advantages of the new decompositions.
Latent class models for the joint distribution of multivariate categorical data, such as the PARAFAC decomposition, play an important role in the analysis of population structure. In this context, the number of latent classes is interpreted as the number of genetically distinct subpopulations of an organism, an important factor in the analysis of evolutionary processes and conservation status. Existing methods focus on point estimates of the number of subpopulations, and lack robust uncertainty quantification. Moreover, whether the number of latent classes in these models is even an identified parameter is an open question. In Chapter 3, we show that when the model is properly specified, the correct number of subpopulations can be recovered almost surely. We then propose an alternative method for estimating the number of latent subpopulations that provides good quantification of uncertainty, and provide a simple procedure for verifying that the proposed method is consistent for the number of subpopulations. The performance of the model in estimating the number of subpopulations and other common population structure inference problems is assessed in simulations and a real data application.
In contingency table analysis, sparse data is frequently encountered for even modest numbers of variables, resulting in non-existence of maximum likelihood estimates. A common solution is to obtain regularized estimates of the parameters of a log-linear model. Bayesian methods provide a coherent approach to regularization, but are often computationally intensive. Conjugate priors ease computational demands, but the conjugate Diaconis--Ylvisaker priors for the parameters of log-linear models do not give rise to closed form credible regions, complicating posterior inference. In Chapter 4 we derive the optimal Gaussian approximation to the posterior for log-linear models with Diaconis--Ylvisaker priors, and provide convergence rate and finite-sample bounds for the Kullback-Leibler divergence between the exact posterior and the optimal Gaussian approximation. We demonstrate empirically in simulations and a real data application that the approximation is highly accurate, even in relatively small samples. The proposed approximation provides a computationally scalable and principled approach to regularized estimation and approximate Bayesian inference for log-linear models.
Another challenging and somewhat non-standard joint modeling problem is inference on tail dependence in stochastic processes. In applications where extreme dependence is of interest, data are almost always time-indexed. Existing methods for inference and modeling in this setting often cluster extreme events or choose window sizes with the goal of preserving temporal information. In Chapter 5, we propose an alternative paradigm for inference on tail dependence in stochastic processes with arbitrary temporal dependence structure in the extremes, based on the idea that the information on strength of tail dependence and the temporal structure in this dependence are both encoded in waiting times between exceedances of high thresholds. We construct a class of time-indexed stochastic processes with tail dependence obtained by endowing the support points in de Haan's spectral representation of max-stable processes with velocities and lifetimes. We extend Smith's model to these max-stable velocity processes and obtain the distribution of waiting times between extreme events at multiple locations. Motivated by this result, a new definition of tail dependence is proposed that is a function of the distribution of waiting times between threshold exceedances, and an inferential framework is constructed for estimating the strength of extremal dependence and quantifying uncertainty in this paradigm. The method is applied to climatological, financial, and electrophysiology data.
The remainder of this thesis focuses on posterior computation by Markov chain Monte Carlo. The Markov Chain Monte Carlo method is the dominant paradigm for posterior computation in Bayesian analysis. It has long been common to control computation time by making approximations to the Markov transition kernel. Comparatively little attention has been paid to convergence and estimation error in these approximating Markov Chains. In Chapter 6, we propose a framework for assessing when to use approximations in MCMC algorithms, and how much error in the transition kernel should be tolerated to obtain optimal estimation performance with respect to a specified loss function and computational budget. The results require only ergodicity of the exact kernel and control of the kernel approximation accuracy. The theoretical framework is applied to approximations based on random subsets of data, low-rank approximations of Gaussian processes, and a novel approximating Markov chain for discrete mixture models.
Data augmentation Gibbs samplers are arguably the most popular class of algorithms for approximately sampling from the posterior distribution for the parameters of generalized linear models. The truncated Normal and Polya-Gamma data augmentation samplers are standard examples for probit and logit links, respectively. Motivated by an important problem in quantitative advertising, in Chapter 7 we consider the application of these algorithms to modeling rare events. We show that when the sample size is large but the observed number of successes is small, these data augmentation samplers mix very slowly, with a spectral gap that converges to zero at a rate at least proportional to the reciprocal of the square root of the sample size, up to a log factor. In simulation studies, moderate sample sizes result in high autocorrelations and small effective sample sizes. Similar empirical results are observed for related data augmentation samplers for multinomial logit and probit models. When applied to a real quantitative advertising dataset, the data augmentation samplers mix very poorly. Conversely, Hamiltonian Monte Carlo and a type of independence chain Metropolis algorithm show good mixing on the same dataset.
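To make the truncated-normal data augmentation scheme concrete, here is a minimal Albert-Chib-style Gibbs sampler for probit regression on simulated data. It is a generic textbook sketch with a vague normal prior and no rare-event tuning, included only to show the augmentation step whose mixing behaviour the chapter analyzes; the simulated design and prior variance are arbitrary.

```python
import numpy as np
from scipy.stats import truncnorm

def probit_gibbs(X, y, n_iter=2000, prior_var=100.0, seed=0):
    """Albert-Chib data augmentation Gibbs sampler for probit regression."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    B0_inv = np.eye(p) / prior_var             # prior precision for beta
    V = np.linalg.inv(B0_inv + X.T @ X)        # posterior covariance (fixed)
    beta = np.zeros(p)
    draws = np.empty((n_iter, p))
    for it in range(n_iter):
        # 1) draw latent z_i ~ N(x_i'beta, 1) truncated to match y_i
        mu = X @ beta
        lower = np.where(y == 1, -mu, -np.inf)  # z > 0 when y = 1
        upper = np.where(y == 1, np.inf, -mu)   # z < 0 when y = 0
        z = mu + truncnorm.rvs(lower, upper, size=n)
        # 2) draw beta | z from its Gaussian full conditional
        m = V @ (X.T @ z)
        beta = rng.multivariate_normal(m, V)
        draws[it] = beta
    return draws

# Simulated example with a modest success rate.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
y = (X @ np.array([-1.0, 0.8]) + rng.normal(size=500) > 0).astype(int)
print(probit_gibbs(X, y, n_iter=500).mean(axis=0))
```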
Item Open Access Bayesian Learning with Dependency Structures via Latent Factors, Mixtures, and Copulas (2016) Han, Shaobo
Bayesian methods offer a flexible and convenient probabilistic learning framework to extract interpretable knowledge from complex and structured data. Such methods can characterize dependencies among multiple levels of hidden variables and share statistical strength across heterogeneous sources. In the first part of this dissertation, we develop two dependent variational inference methods for full posterior approximation in non-conjugate Bayesian models through hierarchical mixture- and copula-based variational proposals, respectively. The proposed methods move beyond the widely used factorized approximation to the posterior and provide generic applicability to a broad class of probabilistic models with minimal model-specific derivations. In the second part of this dissertation, we design probabilistic graphical models to accommodate multimodal data, describe dynamical behaviors, and account for task heterogeneity. In particular, the sparse latent factor model is able to reveal common low-dimensional structures from high-dimensional data. We demonstrate the effectiveness of the proposed statistical learning methods on both synthetic and real-world data.
Item Open Access Bayesian Multivariate Count Models for the Analysis of Microbiome Studies (2019) Silverman, Justin David
Advances in high-throughput DNA sequencing allow for rapid and affordable surveys of thousands of bacterial taxa across thousands of samples. The exploding availability of sequencing data has poised microbiota research to advance our understanding of fields as diverse as ecology, evolution, medicine, and agriculture. Yet, while microbiota data is now ubiquitous, methods for the analysis of such data remain underdeveloped. This gap reflects the challenge of analyzing sparse, high-dimensional count data that contains compositional (relative abundance) information. To address these challenges, this dissertation introduces a number of tools for Bayesian inference applied to microbiome data. A central theme throughout this work is the use of multinomial logistic-normal models, which are found to concisely address these challenges. In particular, the connection between the logistic-normal distribution and the Aitchison geometry of the simplex is commonly used to develop interpretable tools for the analysis of microbiome data.
The structure of this dissertation is as follows. Chapter 1 introduces key challenges in the analysis of microbiome data. Chapter 2 introduces a novel log-ratio transform between the simplex and Real space to enable the development of statistical tools for compositional data with phylogenetic structure. Chapter 3 introduces a multinomial logistic-normal generalized dynamic linear modelling framework for analysis of microbiome time-series data. Chapter 4 explores the analysis of zero values in sequence count data from a stochastic process perspective and demonstrates that zero-inflated models often produce counter-intuitive results in this regime. Finally, Chapter 5 introduces the theory of Marginally Latent Matrix-T Processes as a means of developing efficient, accurate inference for a large class of multinomial logistic-normal models, including linear regression, non-linear regression, and dynamic linear models. Notably, the inference schemes developed in Chapter 5 are found to often be orders of magnitude faster than Hamiltonian Monte Carlo without sacrificing accuracy in point estimation or uncertainty quantification.
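The log-ratio transforms that underpin this approach are easy to state in code. The sketch below applies the centered log-ratio (clr) transform to a vector of relative abundances and inverts it with a softmax; it is a generic compositional-data illustration, not the phylogenetically structured transform introduced in Chapter 2, and the abundances are made up.

```python
import numpy as np

def clr(p):
    """Centered log-ratio transform: simplex -> R^D (coordinates sum to zero)."""
    logp = np.log(p)
    return logp - logp.mean(axis=-1, keepdims=True)

def clr_inv(x):
    """Inverse clr (softmax): R^D -> simplex."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # stabilized
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical relative abundances for 4 taxa in one sample.
p = np.array([0.50, 0.30, 0.15, 0.05])
x = clr(p)
print("clr coordinates:", np.round(x, 3), "sum:", round(x.sum(), 10))
print("round trip:", np.round(clr_inv(x), 3))
```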
Item Open Access Bayesian Semi-parametric Factor Models (2012) Bhattacharya, Anirban
Identifying a lower-dimensional latent space for representation of high-dimensional observations is of significant importance in numerous biomedical and machine learning applications. In many such applications, it is now routine to collect data where the dimensionality of the outcomes is comparable to or even larger than the number of available observations. Motivated in particular by the problem of predicting the risk of impending diseases from massive gene expression and single nucleotide polymorphism profiles, this dissertation focuses on building parsimonious models and computational schemes for high-dimensional continuous and unordered categorical data, while also studying theoretical properties of the proposed methods. Sparse factor modeling is fast becoming a standard tool for parsimonious modeling of such massive dimensional data, and the content of this thesis is specifically directed towards methodological and theoretical developments in Bayesian sparse factor models.
The first three chapters of the thesis study sparse factor models for high-dimensional continuous data. A class of shrinkage priors on factor loadings is introduced with attractive computational properties, with operating characteristics explored through a number of simulated and real data examples. In spite of the methodological advances over the past decade, theoretical justifications in high-dimensional factor models are scarce in the Bayesian literature. Part of the dissertation focuses on exploring estimation of high-dimensional covariance matrices using a factor model and studying the rate of posterior contraction as both the sample size and dimensionality increase.
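A quick way to see what a sparse factor model buys for covariance estimation is to simulate from one: with a p x k loadings matrix that is mostly zero, the implied covariance Lambda Lambda' + diag(sigma^2) has on the order of p*k + p free parameters rather than p(p+1)/2. The sketch below uses arbitrary dimensions and sparsity, purely for illustration and unrelated to the priors developed in the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
p, k, n = 200, 5, 100                       # many variables, few factors, few samples

# Sparse loadings: most entries exactly zero, a few drawn from a normal.
Lambda = rng.normal(size=(p, k)) * (rng.random((p, k)) < 0.1)
sigma2 = rng.uniform(0.5, 1.5, size=p)      # idiosyncratic noise variances

# Implied low-rank-plus-diagonal covariance and data simulated from the model.
cov = Lambda @ Lambda.T + np.diag(sigma2)
eta = rng.normal(size=(n, k))               # latent factor scores
Y = eta @ Lambda.T + rng.normal(size=(n, p)) * np.sqrt(sigma2)

print("free parameters (factor model):", p * k + p)
print("free parameters (unstructured covariance):", p * (p + 1) // 2)
print("sample vs model variance of variable 0:",
      round(Y[:, 0].var(), 2), round(cov[0, 0], 2))
```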
To relax the usual assumption of a linear relationship among the latent and observed variables in a standard factor model, extensions to a non-linear latent factor model are also considered.
Although Gaussian latent factor models are routinely used for modeling dependence in continuous, binary, and ordered categorical data, they lead to challenging computation and complex modeling structures for unordered categorical variables. As an alternative, a novel class of simplex factor models for massive-dimensional and enormously sparse contingency table data is proposed in the second part of the thesis. An efficient MCMC scheme is developed for posterior computation and the methods are applied to modeling dependence in nucleotide sequences and prediction from high-dimensional categorical features. Building on a connection between the proposed model and sparse tensor decompositions, we propose new classes of nonparametric Bayesian models for testing associations between a massive-dimensional vector of genetic markers and a phenotypical outcome.
Item Open Access Bayesian Structural Phylogenetics (2013) Challis, Christopher
This thesis concerns the use of protein structure to improve phylogenetic inference. There has been growing interest in phylogenetics as the number of available DNA and protein sequences continues to grow rapidly and demand from other scientific fields increases. It is now well understood that phylogenies should be inferred jointly with alignment through use of stochastic evolutionary models. It has not been possible, however, to incorporate protein structure in this framework. Protein structure is more strongly conserved than sequence over long distances, so an important source of information, particularly for alignment, has been left out of analyses.
I present a stochastic process model for the joint evolution of protein primary and tertiary structure, suitable for use in alignment and estimation of phylogeny. Indels arise from a classic Links model and mutations follow a standard substitution matrix, while backbone atoms diffuse in three-dimensional space according to an Ornstein-Uhlenbeck process. The model allows for simultaneous estimation of evolutionary distances, indel rates, structural drift rates, and alignments, while fully accounting for uncertainty. The inclusion of structural information enables pairwise evolutionary distance estimation on time scales not previously attainable with sequence evolution models. Ideally, inference should not be performed in a pairwise fashion between proteins, but in a fully Bayesian setting simultaneously estimating the phylogenetic tree, alignment, and model parameters. I extend the initial pairwise model to this framework and explore model variants which improve agreement between sequence and structure information. The model also allows for estimation of heterogeneous rates of structural evolution throughout the tree, identifying groups of proteins structurally evolving at different speeds. In order to explore the posterior over topologies by Markov chain Monte Carlo sampling, I also introduce novel topology + alignment proposals which greatly improve mixing of the underlying Markov chain. I show that the inclusion of structural information reduces both alignment and topology uncertainty. The software is available as a plugin to the package StatAlign.
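The Ornstein-Uhlenbeck diffusion used for the backbone coordinates has an exact discrete-time Gaussian transition, which makes it easy to simulate. The sketch below generates one coordinate drifting toward its mean over an evolutionary distance t, with made-up parameter values unrelated to those estimated in the thesis.

```python
import numpy as np

def ou_path(x0, mu, theta, sigma, t_total, n_steps, rng=None):
    """Simulate an OU process dX = theta*(mu - X) dt + sigma dW using the
    exact Gaussian transition over each step (no Euler discretization error)."""
    rng = rng or np.random.default_rng()
    dt = t_total / n_steps
    decay = np.exp(-theta * dt)
    sd = np.sqrt(sigma**2 / (2.0 * theta) * (1.0 - decay**2))
    x = np.empty(n_steps + 1)
    x[0] = x0
    for i in range(n_steps):
        x[i + 1] = mu + (x[i] - mu) * decay + sd * rng.normal()
    return x

# Illustrative values: a coordinate starting 3 units away from its mean position.
path = ou_path(x0=3.0, mu=0.0, theta=0.5, sigma=1.0, t_total=10.0, n_steps=100)
print("start:", path[0], "end:", round(path[-1], 3))
```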
Finally, I also examine limits on statistical inference of phylogeny through sequence information models. These limits arise due to the 'cutoff phenomenon', a term from probability describing processes which remain far from their equilibrium distribution for some period of time before swiftly transitioning to stationarity. Evolutionary sequence models all exhibit a cutoff; I show how to find the cutoff for specific models and sequences and relate the cutoff explicitly to increased uncertainty in inference of evolutionary distances. I give theoretical results for symmetric models, and demonstrate with simulations that these results apply to more realistic and widespread models as well. This analysis also highlights several drawbacks to common default priors for phylogenetic analysis, and I suggest a more useful class of priors.
Item Open Access Bayesian Techniques for Adaptive Acoustic Surveillance (2010) Morton, Kenneth D
Automated acoustic sensing systems are required to detect, classify and localize acoustic signals in real-time. Despite the fact that humans are capable of performing acoustic sensing tasks with ease in a variety of situations, the performance of current automated acoustic sensing algorithms is limited by seemingly benign changes in environmental or operating conditions. In this work, a framework for acoustic surveillance that is capable of accounting for changing environmental and operational conditions is developed and analyzed. The algorithms employed in this work utilize non-stationary and nonparametric Bayesian inference techniques to allow the resulting framework to adapt to varying background signals and allow the system to characterize new signals of interest when additional information is available. The performance of each of the two stages of the framework is compared to existing techniques and superior performance of the proposed methodology is demonstrated. The algorithms developed operate on the time-domain acoustic signals in a nonparametric manner, thus enabling them to operate on other types of time-series data without the need to perform application-specific tuning. This is demonstrated in this work as the developed models are successfully applied, without alteration, to landmine signatures resulting from ground penetrating radar data. The nonparametric statistical models developed in this work for the characterization of acoustic signals may ultimately be useful not only in acoustic surveillance but also in other topics within acoustic sensing.
Item Open Access Continuous-Time Models of Arrival Times and Optimization Methods for Variable Selection (2018) Lindon, Michael Scott
This thesis naturally divides itself into two sections. The first two chapters concern the development of Bayesian semi-parametric models for arrival times. Chapter 2 considers Bayesian inference for a Gaussian process modulated temporal inhomogeneous Poisson point process, made challenging by an intractable likelihood. The intractable likelihood is circumvented by two novel data augmentation strategies which result in Gaussian measurements of the Gaussian process, connecting the model with a larger literature on modelling time-dependent functions, from Bayesian non-parametric regression to time series. A scalable state-space representation of the Matern Gaussian process in one dimension is used to provide access to linear-time filtering algorithms for performing inference. An MCMC algorithm based on Gibbs sampling with slice-sampling steps is provided and illustrated on simulated and real datasets. The MCMC algorithm exhibits excellent mixing and scalability.
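For intuition about the model being fit, the sketch below simulates an inhomogeneous Poisson process on [0, T] by thinning a homogeneous one. The sinusoidal intensity is a stand-in for the (transformed) Gaussian process, and none of the data-augmentation or state-space machinery from the chapter is shown.

```python
import numpy as np

def simulate_inhomogeneous_poisson(intensity, t_max, lambda_max, rng=None):
    """Lewis-Shedler thinning: keep each homogeneous-process point t with
    probability intensity(t) / lambda_max."""
    rng = rng or np.random.default_rng()
    n_candidate = rng.poisson(lambda_max * t_max)
    candidates = np.sort(rng.uniform(0.0, t_max, size=n_candidate))
    keep = rng.uniform(size=n_candidate) < intensity(candidates) / lambda_max
    return candidates[keep]

# Stand-in intensity: smooth and bounded above by lambda_max = 12 events per unit time.
intensity = lambda t: 6.0 + 5.0 * np.sin(2 * np.pi * t / 10.0)
arrivals = simulate_inhomogeneous_poisson(intensity, t_max=50.0, lambda_max=12.0)
print(f"{arrivals.size} arrivals; first few: {np.round(arrivals[:5], 2)}")
```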
Chapter 3 builds on the previous model to detect specific signals in temporal point patterns arising in neuroscience. The firing of a neuron over time in response to an external stimulus generates a temporal point pattern or "spike train". Of special interest is how neurons encode information from dual simultaneous external stimuli. Among many hypotheses is the presence of multiplexing: interleaving periods of firing as the neuron would for each individual stimulus in isolation. Statistical models are developed to quantify evidence for a variety of experimental hypotheses. Each experimental hypothesis translates to a particular form of intensity function for the dual stimuli trials. The dual stimuli intensity is modelled as a dynamic superposition of single stimulus intensities, defined by a time-dependent weight function that is modelled non-parametrically as a transformed Gaussian process. Experiments on simulated data demonstrate that the model is able to learn the weight function very well, but learns other model parameters, which have meaningful physical interpretations, less well.
Chapters 4 and 5 concern mathematical optimization and theoretical properties of Bayesian models for variable selection. Such optimizations are challenging due to non-convexity, non-smoothness and discontinuity of the objective. Chapter 4 presents advances in continuous optimization algorithms based on relating mathematical and statistical approaches defined in connection with several iterative algorithms for penalized linear regression. I demonstrate the equivalence of parameter mappings using EM under several data augmentation strategies - location-mixture representations, orthogonal data augmentation and LQ design matrix decompositions. I show that these model-based approaches are equivalent to algorithmic derivation via proximal gradient methods. This provides new perspectives on model-based and algorithmic approaches, connects across several research themes in optimization and statistics, and provides access, beyond EM, to relevant theory from the proximal gradient and convex analysis literatures.
Chapter 5 presents a modern and technologically up-to-date approach to discrete optimization for variable selection models through their formulation as mixed integer programming models. Mixed integer quadratic and quadratically constrained programs are developed for the point-mass-Laplace and g-prior. Combined with warm-starts and optimality-based bounds tightening procedures provided by the heuristics of the previous chapter, the MIQP model developed for the point-mass-Laplace prior converges to global optimality in a matter of seconds for moderately sized real datasets. The obtained estimator is demonstrated to possess superior predictive performance over that obtained by cross-validated lasso in a number of real datasets. The MIQCP model for the g-prior struggles to match the performance of the former and highlights the fact that the performance of the mixed integer solver depends critically on the ability of the prior to rapidly concentrate posterior mass on good models.
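The flavor of the mixed integer formulations in Chapter 5 can be conveyed with a standard big-M best-subset program, which is related to, but simpler than, the point-mass-Laplace and g-prior formulations developed in the chapter. The sketch below uses cvxpy with hypothetical data; solving it requires a mixed-integer-capable solver to be installed.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, p, k = 60, 12, 3                         # observations, candidate predictors, sparsity
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:k] = [2.0, -1.5, 1.0]
y = X @ beta_true + 0.5 * rng.normal(size=n)

big_m = 10.0                                # crude bound on |beta_j|; tighter is better
beta = cp.Variable(p)
z = cp.Variable(p, boolean=True)            # z_j = 1 if variable j enters the model

objective = cp.Minimize(cp.sum_squares(y - X @ beta))
constraints = [beta <= big_m * z,           # beta_j forced to 0 when z_j = 0
               beta >= -big_m * z,
               cp.sum(z) <= k]              # at most k active variables

prob = cp.Problem(objective, constraints)
prob.solve()                                # needs a MIQP-capable solver installed
print("selected variables:", np.where(z.value > 0.5)[0])
print("estimates:", np.round(beta.value, 2))
```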
Item Open Access Data-driven Analysis of Heavy Quark Transport in Ultra-relativistic Heavy-ion Collisions (2019) Xu, Yingru
Heavy flavor observables provide valuable information on the properties of the hot and dense Quark-Gluon Plasma (QGP) created in ultra-relativistic heavy-ion collisions.
Previous studies have made significant progress regarding the heavy quark in-medium interaction, energy loss and collective behaviors. Various theoretical models have been developed to describe the evolution of heavy quarks in heavy-ion collisions, but they show limited performance, as they struggle to simultaneously describe all the experimental data.
In this thesis, I present a state-of-the-art Bayesian model-to-data analysis to calibrate a heavy quark evolution model on experimental data from different collision systems and different energies: the heavy quark evolution model incorporates an improved Langevin dynamics for heavy quarks with an event-by-event viscous hydrodynamical model for the expanding QGP medium, and considers both heavy quark collisional and radiative energy loss. By applying the Bayesian analysis to such a modularized framework, the heavy quark evolution model is able to describe the heavy flavor observables in multiple collision systems and make predictions of unseen observables. In addition, the estimated heavy quark diffusion coefficient shows a strong positive temperature dependence and strong interaction around the critical temperature.
Finally, by comparing the transport coefficients estimated by various theoretical approaches, I have quantitatively evaluated the contribution from different sources of deviation, which can provide a reference for the theoretical uncertainties regarding the heavy quark transport coefficients.
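Bayesian model-to-data calibration of an expensive simulator is typically done by training a fast surrogate on a handful of model runs and then running MCMC against the surrogate. The sketch below shows that pattern in one dimension with a Gaussian process emulator and a random-walk Metropolis sampler; the "simulator", data, and prior are synthetic placeholders, not the heavy quark evolution model or the analysis in this dissertation.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)

def simulator(theta):                      # stand-in for an expensive physics model
    return theta + np.sin(theta)

# 1) Train an emulator on a small design of simulator runs.
design = np.linspace(0.0, 5.0, 8).reshape(-1, 1)
runs = simulator(design).ravel()
emulator = GaussianProcessRegressor(RBF(1.0) + WhiteKernel(1e-6)).fit(design, runs)

# 2) Synthetic "experimental" observation of the same quantity.
theta_true, obs_sigma = 2.2, 0.1
y_obs = simulator(theta_true) + rng.normal(0.0, obs_sigma)

def log_post(theta):
    if not 0.0 <= theta <= 5.0:            # flat prior on [0, 5]
        return -np.inf
    pred, pred_sd = emulator.predict(np.array([[theta]]), return_std=True)
    var = pred_sd[0] ** 2 + obs_sigma ** 2 # emulator + measurement uncertainty
    return -0.5 * (y_obs - pred[0]) ** 2 / var - 0.5 * np.log(var)

# 3) Random-walk Metropolis on the calibration parameter.
theta, lp = 1.0, log_post(1.0)
samples = []
for _ in range(4000):
    prop = theta + 0.3 * rng.normal()
    lp_prop = log_post(prop)
    if np.log(rng.uniform()) < lp_prop - lp:
        theta, lp = prop, lp_prop
    samples.append(theta)
print("posterior mean:", round(np.mean(samples[1000:]), 2), "truth:", theta_true)
```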
Item Open Access Dependent Hierarchical Bayesian Models for Joint Analysis of Social Networks and Associated Text (2012) Wang, Eric Xun
This thesis presents spatially and temporally dependent hierarchical Bayesian models for the analysis of social networks and associated textual data. Social network analysis has received significant recent attention and has been applied to fields as varied as analysis of Supreme Court votes, Congressional roll call data, and inferring links between authors of scientific papers. In many traditional social network analysis models, temporal and spatial dependencies are not considered due to computational difficulties, even though such dependencies often play a significant role in the underlying generative process of the observed social network data.
Thus motivated, this thesis presents four new models that consider spatial and/or temporal dependencies and (when available) the associated text. The first is a time-dependent (dynamic) relational topic model that models nodes by their relevant documents and uses a probit regression construction to map topic overlap between nodes to a link. The second is a factor model with dynamic random effects that is used to analyze the voting patterns of the United States Supreme Court. The last two models present the primary contribution of this thesis: two spatially and temporally dependent models that jointly analyze legislative roll call data and their associated legislative text, and that introduce a new paradigm for social network factor analysis, namely being able to predict new columns (or rows) of matrices from the text. The first uses a nonparametric joint clustering approach to link the factor and topic models while the second uses a text regression construction. Finally, two other models on analysis of and tracking in video are also presented and discussed.
Item Open Access Development and Calibration of Reaction Models for Multilayered Nanocomposites (2015) Vohra, Manav
This dissertation focuses on the development and calibration of reaction models for multilayered nanocomposites. The nanocomposites comprise sputter-deposited alternating layers of distinct metallic elements. Specifically, we focus on the equimolar Ni-Al and Zr-Al multilayered systems. Computational models are developed to capture the transient reaction phenomena as well as understand the dependence of reaction properties on the microstructure, composition and geometry of the multilayers. Together with the available experimental data, simulations are used to calibrate the models and enhance the accuracy of their predictions.
Recent modeling efforts for the Ni-Al system have investigated the nature of self-propagating reactions in the multilayers. Model fidelity was enhanced by incorporating melting effects due to aluminum [Besnoin et al. (2002)]. Salloum and Knio formulated a reduced model to mitigate computational costs associated with multi-dimensional reaction simulations [Salloum and Knio (2010a)]. However, existing formulations relied on a single Arrhenius correlation for diffusivity, estimated for the self-propagating reactions, and cannot be used to quantify mixing rates at lower temperatures with reasonable accuracy [Fritz (2011)]. We thus develop a thermal model for a multilayer stack comprising a reactive Ni-Al bilayer (nanocalorimeter) and exploit temperature evolution measurements to calibrate the diffusion parameters associated with solid state mixing (720 K - 860 K) in the bilayer.
The equimolar Zr-Al multilayered system, when reacted aerobically, is shown to exhibit slow aerobic oxidation of zirconium (in the intermetallic), sustained for about 2-10 seconds after completion of the formation reaction. In a collaborative effort, we aim to exploit the sustained heat release for bio-agent defeat applications. A simplified computational model is developed to capture the extended reaction regime characterized by oxidation of Zr-Al multilayers. Simulations provide insight into the growth phenomenon for the zirconia layer during the oxidation process. It is observed that the growth of zirconia is predominantly governed by surface reaction. However, once the layer thickens, the growth is controlled by the diffusion of oxygen in zirconia.
A computational model is developed for formation reactions in Zr-Al multilayers. We estimate Arrhenius diffusivity correlations for a low temperature mixing regime characterized by homogeneous ignition in the multilayers, and a high temperature mixing regime characterized by self-propagating reactions in the multilayers. Experimental measurements for temperature and reaction velocity are used for this purpose. Diffusivity estimates for the two regimes are first inferred using regression analysis and full posterior distributions are then estimated for the diffusion parameters using Bayesian statistical approaches. A tight bound on posteriors is observed in the ignition regime whereas estimates for the self-propagating regime exhibit large levels of uncertainty. We further discuss a framework for optimal design of experiments to assess and optimize the utility of a set of experimental measurements for application to reaction models.
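The Arrhenius correlations referred to above take the form D(T) = D0 * exp(-Ea / (R T)), so a first-pass estimate of (D0, Ea) is an ordinary linear regression of ln D on 1/T. The sketch below does exactly that on made-up diffusivity data, standing in for the regression step that precedes the full Bayesian posterior estimation described in the abstract.

```python
import numpy as np

R = 8.314  # J / (mol K)

# Hypothetical (temperature, diffusivity) pairs, e.g. from ignition experiments.
T = np.array([720.0, 760.0, 800.0, 830.0, 860.0])           # K
D = np.array([2.1e-19, 9.5e-19, 3.8e-18, 9.9e-18, 2.6e-17])  # m^2/s (made up)

# ln D = ln D0 - (Ea/R) * (1/T): fit a straight line in (1/T, ln D).
slope, intercept = np.polyfit(1.0 / T, np.log(D), 1)
Ea = -slope * R            # activation energy, J/mol
D0 = np.exp(intercept)     # pre-exponential factor, m^2/s

print(f"Ea ~ {Ea / 1000:.0f} kJ/mol")
print(f"D0 ~ {D0:.2e} m^2/s")
print("fitted D(800 K):", f"{D0 * np.exp(-Ea / (R * 800.0)):.2e}")
```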
Item Open Access Ecosystem Response to a Changing Climate: Vulnerability, Impacts and Monitoring (2017) Seyednasrollah, Bijan
Rising temperatures with increased drought pose three challenges for management of future biodiversity. First, are the species expected to be vulnerable concentrated in specific regions and habitats? Second, are the impacts of drought and warming varying across regions? Third, could recent advances in remote sensing techniques help us in monitoring the impacts in real-time? This dissertation is an effort to address the above questions in three chapters.
First, I used foliar chemistry as a proxy for drought vulnerability. I used soil and moisture gradients to quantify habitat variation that could be critical for alleviating drought. I used a large dataset of forest plots covering the eastern United States to understand how community-weighted mean foliar nitrogen and phosphorus vary across climate and soil gradients. I exploited trends in these variables between species, traits, and habitats to evaluate sensitivity. Critical to our approach is the capacity to jointly model trait responses. Our data showed that nutrient-demanding species strongly respond to environmental gradients. I identified a wide range of sites across low to high latitudes threatened by drought. The sensitivity of species to high temperatures is largely explained by soil variations. Drought vulnerability of nutrient- and moisture-demanding species could be amplified depending on local soil and moisture gradients. Although local soil moisture may dampen drought-induced stress for species with large leaves and high water use, nutrient-demanding species remain vulnerable in wet regions during droughts. Phosphorus-demanding species adapted to dry sites are drought resilient compared to communities in wet sites. This research is consistent with studies that support a decline of nutrient-demanding species with increasing temperature and decreasing moisture. I also detected strong soil effects on shaping community-weighted traits across a large geographical and environmental range. Our data showed that soil effects on controlling foliar traits strongly vary across different climates. The findings are critical for conservation and maintaining biodiversity.
Next, I used space-borne remotely sensed vegetation indices to monitor the process of leaf development across climate gradients and ecoregions in the southeastern United States. A hierarchical state-space Bayesian model was developed to quantify how air temperature, drought severity, and canopy thermal stress contribute to changes in leaf opening from mountainous to coastal regions. I synthesized daily field climate data with daily remotely sensed vegetation indices and canopy surface temperature during the spring green-up season. The study was focused on observation of leaf phenology at 59 sites in the southeastern United States between 2001 and 2012. Our results suggest strong interaction effects between ecosystem properties and climate variables across ecoregions. The findings showed that despite the much faster spring green-up in the mountains, coastal forests express a larger sensitivity to inter-annual anomaly in temperature than mountain sites. In spite of the decreasing trend in sensitivity to warming with temperature in all regions, there is an ecosystem interaction: deciduous-dominated forests are less sensitive to warming than are those with few deciduous trees, possibly due to the presence of developed leaves in evergreen species throughout the season. The findings revealed that mountainous forests are more susceptible to intensifying drought and moisture deficit, while coastal areas are relatively resilient. I found that increasing canopy thermal stress, defined as the canopy-air temperature difference, slows leaf development following a dry year and accelerates it after a wet year.
Finally, I demonstrate how space-borne canopy “thermal stress”, i.e. the surface-air temperature difference, could be used as a surrogate for drought-induced stress to estimate forest transpiration. Using physics-based relationships that accommodate uncertainties, I showed how changes in canopy water flux may be reflected in the surface energy balance and in remotely sensed thermal stress. Validating against field measurements of canopy transpiration in the southeastern US, I quantified the sensitivity of transpiration to thermal stress under a range of atmospheric and climate conditions. I found that a 1 mm change in daily transpiration may cause 3 to 4 °C of thermal stress, depending on site conditions. The cooling effect is large when solar radiation is high or wind speed is low. The effect has the highest control on water use during warm and dry seasons, when monitoring drought is essential. I applied our model to available satellite and meteorological data to detect patterns of drought. Using only air and surface temperatures, I predicted anomalies in water use across the contiguous United States over the past 15 years, and then compared them with anomalies in soil water content and conventional drought indices. Our simple model showed reliable accuracy in comparison to state-of-the-art general circulation models. The technique can be used on varying time scales to monitor surface water use and drought at large scales.
Item Open Access Essays on Macroeconomics in Mixed Frequency Estimations (2011) Kim, Tae Bong
This dissertation asks whether frequency misspecification of a New Keynesian model results in temporal aggregation bias of the Calvo parameter. First, when a New Keynesian model is estimated at a quarterly frequency while the true data generating process is the same model at a monthly frequency, the Calvo parameter is upward biased and hence implies longer average price duration. This suggests that estimating a New Keynesian model at a monthly frequency may yield different results. However, due to the mixed frequency of macro time series datasets, recorded at quarterly and monthly intervals, an estimation methodology is not straightforward. To accommodate mixed frequency datasets, this paper proposes a data augmentation method borrowed from the Bayesian estimation literature, extending the MCMC algorithm with "Rao-Blackwellization" of the posterior density. Compared to two alternative estimation methods in the context of Bayesian estimation of DSGE models, this augmentation method delivers lower root mean squared errors for parameters of interest in the New Keynesian model. Lastly, a medium scale New Keynesian model is brought to the actual data, and the benchmark estimation, i.e. the data augmentation method, finds that the average price duration implied by the monthly model is 5 months while that implied by the quarterly model is 20.7 months.
Item Open Access Incorporating Photogrammetric Uncertainty in UAS-based Morphometric Measurements of Baleen Whales (2021) Bierlich, Kevin Charles
Increasingly, drone-based photogrammetry has been used to measure size and body condition changes in marine megafauna. A broad range of platforms, sensors, and altimeters are being applied for these purposes, but there is no unified way to predict photogrammetric uncertainty across this methodological spectrum. As such, it is difficult to make robust comparisons across studies, disrupting collaborations amongst researchers using platforms with varying levels of measurement accuracy.
In this dissertation, I evaluate the major drivers of photogrammetric error and develop a framework to easily quantify and incorporate uncertainty associated with different UAS platforms. To do this, I take an experimental approach to train a Bayesian statistical model using a known-sized object floating at the water’s surface to quantify how measurement error scales with altitude for several different drones equipped with different cameras, focal length lenses, and altimeters. I then use the fitted model to predict the length distributions of unknown-sized humpback whales and assess how predicted uncertainty can affect quantities derived from photogrammetric measurements, such as the age class of an animal (Chapter 1). I also use the fitted model to predict body condition of blue whales, humpback whales, and Antarctic minke whales, providing the first comparison of how uncertainty scales across commonly used 1-, 2-, and 3-dimensional (1D, 2D, and 3D, respectively) body condition measurements (Chapter 2). This statistical framework jointly estimates errors from altitude and length measurements and accounts for altitudes measured with both barometers and laser altimeters while incorporating errors specific to each. This Bayesian statistical model outputs a posterior predictive distribution of measurement uncertainty around length and body condition measurements and allows for the construction of highest posterior density intervals to define measurement uncertainty, which allows one to make probabilistic statements and stronger inferences pertaining to morphometric features critical for understanding life history patterns and potential impacts from anthropogenically altered habitats. From these studies, I find that altimeters can greatly influence measurement predictions, with measurements based on a barometer being larger and carrying greater uncertainty than those based on a laser altimeter, which can influence age classifications. I also find that while the different body condition measurements are highly correlated with one another, uncertainty does not scale linearly across 1D, 2D, and 3D body condition measurements, with 2D and 3D uncertainty increasing by a factor of 1.44 and 2.14, respectively, compared to 1D measurements. I find that body area index (BAI) accounts for potential variation along the body for each species and was the most precise body condition measurement.
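The basic photogrammetric relationship behind these measurements is that one image pixel spans (altitude x sensor width) / (focal length x image width) meters on the ground, so a length measured in pixels scales linearly with altitude and altitude error propagates directly into length error. The snippet below illustrates that relationship with a crude Monte Carlo propagation of altimeter noise; the camera parameters and error magnitudes are placeholders, not the fitted values from the dissertation.

```python
import numpy as np

def pixel_length_to_meters(n_pixels, altitude_m, focal_mm=35.0,
                           sensor_width_mm=13.2, image_width_px=5472):
    """Convert a measured length in pixels to meters via the ground sampling
    distance (GSD). Camera parameters here are placeholders."""
    gsd = (altitude_m * sensor_width_mm) / (focal_mm * image_width_px)  # m per pixel
    return n_pixels * gsd

# A whale spanning 3000 pixels photographed from a nominal 40 m.
print("point estimate:", round(pixel_length_to_meters(3000, 40.0), 2), "m")

# Crude uncertainty propagation: noisy altimeter readings -> length distribution.
rng = np.random.default_rng(0)
alt_draws = rng.normal(40.0, 1.5, size=10_000)   # barometer-like noise (made up)
lengths = pixel_length_to_meters(3000, alt_draws)
lo, hi = np.percentile(lengths, [2.5, 97.5])
print(f"95% interval under altitude noise: [{lo:.2f}, {hi:.2f}] m")
```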
I then use the model to incorporate uncertainty associated with different drone platforms to measure how body condition (as BAI) changes over the course of the foraging season for humpback whales along the Western Antarctic Peninsula (Chapter 3). I find that BAI increases curvilinearly for each reproductive class, with rapid increases in body condition early in the season compared to later in the season. Lactating females had the lowest BAI, reflecting the high energetic costs of reproduction, whereas mature whales had the largest BAI, reflecting their high energy stores for financing the costs of reproduction on the breeding grounds. Calves also increased BAI, as opposed to strictly increasing length, while immature whales may increase their BAI and commence an early migration by mid-season. These results set a baseline for monitoring this healthy population in the future as it faces potential impacts from climate change and anthropogenic stresses. This dissertation concludes with a best practices guide for minimizing, quantifying, and incorporating uncertainty associated with photogrammetry data. This work provides novel insights into how to obtain more accurate morphological measurements to help increase our understanding of how animals perform and function in their environment, as well as better track the health of populations over time and space.
Item Open Access MCMC Sampling Geospatial Partitions for Linear Models (2021) Wyse, Evan T
Geospatial statistical approaches must frequently confront the problem of correctly partitioning a group of geographical sub-units, such as counties, states, or precincts, into larger blocks which share information. Since the space of potential partitions is quite large, sophisticated approaches are required, particularly when this partitioning interacts with other parts of a larger model, as is frequent with Bayesian inference. Authors such as Balocchi et al. (2021) provide stochastic search algorithms with certain theoretical guarantees about this partition in the context of Bayesian model averaging. We borrow tools from Herschlag et al. (2020) to examine a potential approach to sampling these clusters efficiently using a Markov Chain Monte Carlo (MCMC) approach.
Item Open Access Measuring Baseball Defensive Value Using Statcast Data (2017) Jordan, Drew
Multiple methods of measuring the defensive value of baseball players have been developed. These methods commonly rely on human batted-ball charters, which inherently introduces the possibility of measurement error and a lack of objectivity to these metrics. Using newly available Statcast data, we construct a new metric, SAFE 2.0, that utilizes Bayesian hierarchical logistic regression to calculate the probability that a given batted ball will be caught by a fielder. We use kernel density estimation to approximate the relative frequency of each batted ball in our data. We also incorporate the run consequence of each batted ball. Combining the catch probability, the relative frequency, and the run consequence of batted balls over a grid, we arrive at our new metric, SAFE 2.0. We apply our method to all batted balls hit to centerfield in the 2016 Major League Baseball season, and rank all centerfielders according to their relative performance for the 2016 season as measured by SAFE 2.0. We then compare these rankings to the rankings of the most commonly used measure of defensive value, Ultimate Zone Rating.
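The construction described above multiplies three ingredients on a grid of batted-ball locations: the fielder's catch probability, the relative frequency of balls hit to that location, and the run consequence of a missed catch. A toy version of that aggregation is sketched below with synthetic data; the logistic catch model, KDE bandwidth, and run value are placeholders rather than the fitted SAFE 2.0 components.

```python
import numpy as np
from scipy.stats import gaussian_kde
from scipy.special import expit

rng = np.random.default_rng(0)

# Synthetic batted-ball landing spots (distance from fielder, hang time).
dist = rng.gamma(shape=2.0, scale=10.0, size=2000)          # feet
hang = rng.uniform(2.0, 6.0, size=2000)                     # seconds
freq = gaussian_kde(np.vstack([dist, hang]))                # relative frequency surface

def catch_prob(dist, hang, skill=0.0):
    """Placeholder logistic catch model: harder with distance, easier with hang time.
    `skill` shifts the curve for a particular fielder."""
    return expit(3.0 - 0.12 * dist + 0.6 * hang + skill)

run_value = 0.75                                            # runs saved per extra catch (made up)

# Evaluate expected runs saved relative to an average fielder over a grid.
d_grid, h_grid = np.meshgrid(np.linspace(0, 80, 81), np.linspace(2, 6, 41))
pts = np.vstack([d_grid.ravel(), h_grid.ravel()])
density = freq(pts)
density /= density.sum()                                    # normalize over the grid
extra = catch_prob(pts[0], pts[1], skill=0.4) - catch_prob(pts[0], pts[1], skill=0.0)
safe_like_score = np.sum(density * extra) * run_value
print("toy SAFE-style score (runs above average per batted ball):",
      round(safe_like_score, 4))
```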