Browsing by Subject "Bayesian statistics"
Item Open Access Advancements in Probabilistic Machine Learning and Causal Inference for Personalized Medicine (2019) Lorenzi, Elizabeth Catherine
In this dissertation, we present four novel contributions to the field of statistics with the shared goal of personalizing medicine to individual patients. These methods are developed to directly address problems in health care through two subfields of statistics: probabilistic machine learning and causal inference. The projects include improving predictions of adverse events after surgery and learning the effectiveness of treatments for specific subgroups and for individuals. We begin the dissertation in Chapter 1 with a discussion of personalized medicine, the use of electronic health record (EHR) data, and a brief discussion of learning heterogeneous treatment effects. In Chapter 2, we present a novel algorithm, Predictive Hierarchical Clustering (PHC), for agglomerative hierarchical clustering of Current Procedural Terminology (CPT) codes. PHC clusters subgroups found within the data, rather than individual observations, such that the discovered clusters yield optimal performance of a classification model, specifically for predicting surgical complications. In Chapter 3, we develop a hierarchical infinite latent factor model (HIFM) to appropriately account for the covariance structure across subpopulations in data. We propose a novel Hierarchical Dirichlet Process shrinkage prior on the loadings matrix that flexibly captures the underlying structure of the data across subpopulations while sharing information to improve inference and prediction. We apply this work to the problem of predicting surgical complications using electronic health record data for geriatric patients at Duke University Health System (DUHS). The last chapters of the dissertation address personalized medicine from a causal perspective, where the goal is to understand how interventions affect individuals, not whole populations. In Chapter 4, we address heterogeneous treatment effects across subgroups, where guidance for observational comparisons within subgroups is lacking, as is a connection to classic design principles for propensity score (PS) analyses. We address these shortcomings by proposing a novel propensity score method for subgroup analysis (SGA) that seeks to balance existing strategies in an automatic and efficient way. Using overlap weights, we prove that an over-specified propensity model including interactions between subgroups and all covariates results in exact covariate balance within subgroups. This is paired with variable selection approaches to adjust for a possibly overspecified propensity score model. Finally, Chapter 5 discusses our final contribution, a longitudinal matching algorithm that aims to predict individual treatment effects of a medication change for diabetes patients. This project develops a novel and generalizable causal inference framework for learning heterogeneous treatment effects from EHR data. The key methodological innovation is to cast the sparse and irregularly spaced EHR time series into a functional data analysis framework in the design stage to adjust for confounding that changes over time. We conclude the dissertation and discuss future work in Chapter 6, outlining many directions for continued research on these topics.
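A note for reference: the overlap weights behind the Chapter 4 balance result take a standard form (notation here is assumed, not quoted from the dissertation). With binary treatment Z_i and propensity score e(x),

    \[ w_i = Z_i\,\{1 - e(x_i)\} + (1 - Z_i)\,e(x_i), \qquad e(x) = \Pr(Z = 1 \mid X = x). \]

When e(x) is estimated by maximum-likelihood logistic regression, the overlap-weighted means of every covariate in the propensity model coincide exactly across treatment groups; the dissertation's subgroup result extends this exact-balance property to within-subgroup comparisons by including subgroup-by-covariate interactions in the propensity model.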
Item Open Access Advances in Bayesian Modeling of Protein Structure Evolution (2018) Larson, Gary
This thesis contributes to a statistical modeling framework for protein sequence and structure evolution. An existing Bayesian model for protein structure evolution is extended in two unique ways. Each of these model extensions addresses an important limitation which has not yet been satisfactorily addressed in the wider literature. These extensions are followed by work regarding inherent statistical bias in models for sequence evolution.
Most available models for protein structure evolution do not model interdependence between the backbone sites of the protein, yet the assumption that the sites evolve independently is known to be false. I argue that ignoring such dependence leads to biased estimation of evolutionary distance between proteins. To mitigate this bias, I express an existing Bayesian model in a generalized form and introduce site-dependence via the generalized model. In the process, I show that the effect of protein structure information on the measure of evolutionary distance can be suppressed by the model formulation, and I further modify the model to help mitigate this problem. In addition to the statistical model itself, I provide computational details and computer code. I modify a well-known bioinformatics algorithm in order to preserve efficient computation under this model. The modified algorithm can be easily understood and used by practitioners familiar with the original algorithm. My approach to modeling dependence is computationally tractable and interpretable with little additional computational burden over the model on which it is based.
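The independence assumption at issue is the usual factorization of the likelihood over backbone sites; schematically (notation mine, not the thesis's),

    \[ p(x_1, \dots, x_n \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta), \]

whereas the generalized model introduces dependence between sites so that the joint likelihood no longer factorizes site by site.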
The second model expansion allows for evolutionary inference on protein pairs having structural discrepancies attributable to backbone flexion. Thus, the model expansion exposes flexible protein structures to the capabilities of Bayesian protein structure alignment and phylogenetics. Unlike most of the few existing methods that deal with flexible protein structures, our Bayesian flexible alignment model requires no prior knowledge of the presence or absence of flexion points in the protein structure, and uncertainty measures are available for the alignment and other parameters of interest. The model can detect subtle flexion while not overfitting non-flexible protein pairs, and is demonstrated to improve phylogenetic inference in a simulated data setting and in a difficult-to-align set of proteins. The flexible model is a unique addition to the small but growing set of tools available for analysis of flexible protein structure. The ability to perform inference on flexible proteins in a Bayesian framework is likely to be of immediate interest to the structural phylogenetics community.
Finally, I present work related to the study of bias in site-independent models for sequence evolution. In the case of binary sequences, I discuss strategies for theoretical proof of bias and provide various details to that end, including detailing efforts undertaken to produce a site-dependent sequence model with similar properties to the site-dependent structural model introduced in an earlier chapter. I highlight the challenges of theoretical proof for this bias and include miscellaneous related work of general interest to researchers studying dependent sequence models.
Item Open Access Bayesian Adjustment for Multiplicity (2009) Scott, James Gordon
This thesis is about Bayesian approaches for handling multiplicity. It considers three main kinds of multiple-testing scenarios: tests of exchangeable experimental units, tests for variable inclusion in linear regression models, and tests for conditional independence in jointly normal vectors. Multiplicity adjustment in these three areas will be seen to have many common structural features. Though the modeling approach throughout is Bayesian, frequentist reasoning regarding error rates will often be employed.
Chapter 1 frames the issues in the context of historical debates about Bayesian multiplicity adjustment. Chapter 2 confronts the problem of large-scale screening of functional data, where control over Type-I error rates is a crucial issue. Chapter 3 develops new theory for comparing Bayes and empirical-Bayes approaches for multiplicity correction in regression variable selection. Chapters 4 and 5 describe new theoretical and computational tools for Gaussian graphical-model selection, where multiplicity arises in performing many simultaneous tests of pairwise conditional independence. Chapter 6 introduces a new approach to sparse-signal modeling based upon local shrinkage rules. Here the focus is not on multiplicity per se, but rather on using ideas from Bayesian multiple-testing models to motivate a new class of multivariate scale-mixture priors. Finally, Chapter 7 describes some directions for future study, many of which are the subjects of my current research agenda.
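As background for Chapter 6, sparse-signal models of this kind are conventionally written either in two-groups form or as a global-local scale mixture (standard forms, with notation assumed here rather than drawn from the thesis):

    \[ \theta_i \sim \pi_0\,\delta_0 + (1 - \pi_0)\,g(\theta_i) \qquad \text{or} \qquad \theta_i \mid \lambda_i, \tau \sim N(0, \lambda_i^2 \tau^2), \quad \lambda_i \sim \pi(\lambda), \]

where \delta_0 is a point mass at zero; the local scales \lambda_i are what give rise to the local shrinkage rules mentioned above.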
Item Open Access Bayesian Analysis and Computational Methods for Dynamic Modeling (2009) Niemi, Jarad
Dynamic models, also termed state space models, comprise an extremely rich model class for time series analysis. This dissertation focuses on building state space models for a variety of contexts and computationally efficient methods for Bayesian inference for simultaneous estimation of latent states and unknown fixed parameters.
Chapter 1 introduces state space models and methods of inference in these models. Chapter 2 describes a novel method for jointly sampling the entire latent state vector in a nonlinear Gaussian state space model using a computationally efficient adaptive mixture modeling procedure. This method is embedded in an overall Markov chain Monte Carlo algorithm for estimating fixed parameters as well as states. In Chapter 3 the method of the previous chapter is implemented in a few illustrative nonlinear models and compared to standard existing methods. This chapter also examines the effect of the number of mixture components, as well as the length of the time series, on the efficiency of the method. I then turn to a biological application in Chapter 4. I discuss modeling choices as well as the derivation of the state space model to be used in this application. Parameter and state estimation are analyzed in these models for both simulated and real data. Chapter 5 extends the methodology introduced in Chapter 2 from nonlinear Gaussian models to general state space models. The method is then applied to a financial stochastic volatility model on US $/British £ exchange rates. Bayesian inference in the previous chapters is accomplished through Markov chain Monte Carlo, which is suitable for batch analyses but computationally limiting in sequential analysis. Chapter 6 introduces sequential Monte Carlo. It discusses two methods currently available for simultaneous sequential estimation of latent states and fixed parameters and then introduces a novel algorithm that reduces the key, limiting degeneracy issue while being usable in a wide model class. Chapter 7 implements the novel algorithm in a disease surveillance context, modeling influenza epidemics. Finally, Chapter 8 suggests areas for future work in both modeling and Bayesian inference. Several appendices provide detailed technical support material as well as relevant related work.
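For orientation, a state space model in the generality treated here pairs an observation density with a latent-state evolution density (standard notation, assumed rather than quoted):

    \[ y_t \mid x_t \sim p(y_t \mid x_t, \theta), \qquad x_t \mid x_{t-1} \sim p(x_t \mid x_{t-1}, \theta), \]

with the nonlinear Gaussian case of Chapters 2-3 corresponding to y_t = f(x_t) + \epsilon_t and x_t = g(x_{t-1}) + \eta_t with Gaussian errors.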
Item Open Access Bayesian and Information-Theoretic Learning of High Dimensional Data (2012) Chen, Minhua
The concept of sparseness is harnessed to learn a low dimensional representation of high dimensional data. This sparseness assumption is exploited in multiple ways. In the Bayesian Elastic Net, a small number of correlated features are identified for the response variable. In the sparse Factor Analysis for biomarker trajectories, the high dimensional gene expression data is reduced to a small number of latent factors, each with a prototypical dynamic trajectory. In the Bayesian Graphical LASSO, the inverse covariance matrix of the data distribution is assumed to be sparse, inducing a sparsely connected Gaussian graph. In the nonparametric Mixture of Factor Analyzers, the covariance matrices in the Gaussian Mixture Model are forced to be low-rank, which is closely related to the concept of block sparsity.
Finally, in the information-theoretic projection design, a linear projection matrix is explicitly sought for information-preserving dimensionality reduction. All the methods mentioned above prove effective in learning both simulated and real high dimensional datasets.
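As a reference point for the Bayesian Graphical LASSO component, the standard formulation (notation assumed here) models the data as Gaussian with a sparse precision matrix, placing Laplace priors on its off-diagonal entries:

    \[ x \sim N(0, \Omega^{-1}), \qquad p(\omega_{jk}) \propto \exp(-\lambda\,|\omega_{jk}|) \ \text{for } j < k, \qquad \Omega \succ 0, \]

so that \omega_{jk} = 0 corresponds to conditional independence of x_j and x_k given the remaining variables, i.e., a missing edge in the Gaussian graph.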
Item Open Access Bayesian Emulation for Sequential Modeling, Inference and Decision Analysis (2016) Irie, Kaoru
Advances in three related areas, state-space modeling, sequential Bayesian learning, and decision analysis, are addressed, with attention to the statistical challenges of scalability and associated dynamic sparsity. The key theme that ties the three areas together is Bayesian model emulation: solving challenging analysis/computational problems using creative model emulators. This idea defines theoretical and applied advances in non-linear, non-Gaussian state-space modeling, dynamic sparsity, decision analysis and statistical computation, across linked contexts of multivariate time series and dynamic network studies. Examples and applications in financial time series and portfolio analysis, macroeconomics and internet studies from computational advertising demonstrate the utility of the core methodological innovations.
Chapter 1 summarizes the three areas/problems and the key idea of emulation in those areas. Chapter 2 discusses the sequential analysis of latent threshold models, with use of emulating models that allow for analytical filtering to enhance the efficiency of posterior sampling. Chapter 3 examines the emulator model in decision analysis, or the synthetic model, that is equivalent to the loss function in the original minimization problem, and shows its performance in the context of sequential portfolio optimization. Chapter 4 describes a method for modeling streaming count data observed on a large network that relies on emulating the whole, dependent network model by independent, conjugate sub-models customized to each set of flows. Chapter 5 reviews these advances and makes concluding remarks.
Item Open Access Bayesian Hierarchical Models for Model Choice (2013) Li, Yingbo
With the development of modern data collection approaches, researchers may collect hundreds to millions of variables, yet may not need to utilize all explanatory variables available in predictive models. Hence, choosing models that consist of a subset of variables often becomes a crucial step. In linear regression, variable selection not only reduces model complexity, but also prevents over-fitting. From a Bayesian perspective, prior specification of model parameters plays an important role in model selection as well as parameter estimation, and often prevents over-fitting through shrinkage and model averaging.
We develop two novel hierarchical priors for selection and model averaging, for Generalized Linear Models (GLMs) and normal linear regression, respectively. They can be considered "spike-and-slab" prior distributions or, more appropriately, "spike-and-bell" distributions. Under these priors we achieve dimension reduction, since their point masses at zero allow predictors to be excluded with positive posterior probability. In addition, these hierarchical priors have heavy tails to provide robustness when MLEs are far from zero.
Zellner's g-prior is widely used in linear models. It preserves the correlation structure among predictors in its prior covariance, and it yields closed-form marginal likelihoods, which leads to huge computational savings by avoiding sampling in the parameter space. Mixtures of g-priors avoid fixing g in advance, and can resolve consistency problems that arise with fixed g. For GLMs, we show that the mixture of g-priors using a Compound Confluent Hypergeometric distribution unifies existing choices in the literature and maintains their good properties, such as tractable (approximate) marginal likelihoods and asymptotic consistency for model selection and parameter estimation under specific values of the hyperparameters.
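For reference, Zellner's g-prior and its mixture extensions take the standard form (notation assumed):

    \[ \beta \mid g, \sigma^2 \sim N\big(0,\; g\,\sigma^2 (X^\top X)^{-1}\big), \qquad g \sim \pi(g), \]

where fixing g recovers the original prior, and placing a prior \pi(g) on g, as in the Compound Confluent Hypergeometric construction above, yields the mixtures of g-priors.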
While the g-prior is invariant under rotation within a model, a potential problem with the g-prior is that it inherits the instability of ordinary least squares (OLS) estimates when predictors are highly correlated. We build a hierarchical prior based on scale mixtures of independent normals, which incorporates invariance under rotations within models, like ridge regression and the g-prior, but has heavy tails like the Zellner-Siow Cauchy prior. We find this method outperforms the gold-standard mixture of g-priors and other methods in the case of highly correlated predictors in Gaussian linear models. We incorporate a non-parametric structure, the Dirichlet Process (DP), as a hyperprior to allow more flexibility and adaptivity to the data.
Item Open Access Bayesian meta-analysis models for heterogeneous genomics data (2013) Zheng, Lingling
The accumulation of high-throughput data from vast sources has drawn much attention to developing methods for extracting meaningful information out of the massive data. Further interesting questions arise in how to combine disparate sources of information, which goes beyond modeling sparsity and dimension reduction. This dissertation focuses on innovations in the area of heterogeneous data integration.
Chapter 1 contextualizes this dissertation by introducing different aspects of meta-analysis and model frameworks for high-dimensional genomic data.
Chapter 2 introduces a novel technique, a joint Bayesian sparse factor analysis model, to vertically integrate multi-dimensional genomic data from different platforms.
Chapter 3 extends the above model to a nonparametric Bayes formulation that directly infers the number of factors from a model-based approach.
Chapter 4, on the other hand, deals with horizontal integration of diverse gene expression data; the model infers pathway activities across various experimental conditions.
All the methods mentioned above are demonstrated in both simulation studies and real data applications in Chapters 2-4.
Finally, Chapter 5 summarizes the dissertation and discusses future directions.
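A generic sketch of the kind of joint sparse factor model used for vertical integration in Chapter 2 (this schematic and its notation are mine, not quoted from the dissertation): data from each platform s load on a shared set of latent factors,

    \[ x_i^{(s)} = \Lambda^{(s)} f_i + \epsilon_i^{(s)}, \qquad f_i \sim N(0, I), \qquad \epsilon_i^{(s)} \sim N(0, \Sigma^{(s)}), \]

with sparsity priors on the loadings \Lambda^{(s)}; the nonparametric extension of Chapter 3 then infers the number of factors rather than fixing it in advance.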
Item Open Access Bayesian Mixture Modeling Approaches for Intermediate Variables and Causal Inference (2010) Schwartz, Scott Lee
This thesis examines causal inference topics involving intermediate variables, and uses Bayesian methodologies to advance analysis capabilities in these areas. First, joint modeling of outcome variables with intermediate variables is considered in the context of birthweight and censored gestational age analyses. The proposed methodology provides improved inference capabilities for birthweight and gestational age, avoids post-treatment selection bias problems associated with conditional-on-gestational-age analyses, and appropriately assesses the uncertainty associated with censored gestational age. Second, principal stratification methodology for settings where causal inference analysis requires appropriate adjustment of intermediate variables is extended to observational settings with binary treatments and binary intermediate variables. This is done by uncovering the structural pathways of unmeasured confounding affecting principal stratification analysis and directly incorporating them into a model-based sensitivity analysis methodology. Demonstration focuses on a study of the efficacy of influenza vaccination in elderly populations. Third, the flexibility, interpretability, and capability of principal stratification analyses for continuous intermediate variables are improved by replacing the current fully parametric methodologies with semiparametric Bayesian alternatives. This is one of the first uses of nonparametric techniques in causal inference analysis, and it opens a connection between the two fields. Demonstration focuses on two studies: one involving a cholesterol-reduction drug, and one examining the effect of physical activity on cardiovascular disease as it relates to body mass index.
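For reference, with binary treatment Z and binary intermediate variable D, principal stratification (standard notation assumed) classifies units by their joint potential intermediate outcomes and targets effects within strata:

    \[ S_i = \big(D_i(0), D_i(1)\big) \in \{(0,0), (0,1), (1,0), (1,1)\}, \qquad \mathrm{PCE}(s) = E\big[Y_i(1) - Y_i(0) \mid S_i = s\big]. \]

Because S_i is never fully observed, the sensitivity analysis and semiparametric extensions described above are what make such estimands usable in observational settings.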
Item Open Access Bayesian Statistical Analysis in Coastal Eutrophication Models: Challenges and Solutions (2014) Nojavan Asghari, Farnaz
Estuaries interfacing with the land, atmosphere, and open ocean can be influenced in a variety of ways by anthropogenic activities. Centuries of overexploitation, habitat transformation, and pollution have degraded estuarine ecological health. Key concerns of the public and of environmental managers of estuaries include water quality, particularly the enrichment of nutrients, increased chlorophyll a concentrations, increased hypoxia/anoxia, and increased Harmful Algal Blooms (HABs). One reason for the increased nitrogen loading over the past two decades is the proliferation of concentrated animal feeding operations (CAFOs) in coastal areas. This dissertation documents a study of estuarine eutrophication modeling, including modeling of a major source of nitrogen in the watershed, the use of Bayesian Networks (BNs) for modeling eutrophication dynamics in an estuary, documentation of potential problems in using BNs, and a continuous BN model that addresses those problems.
Environmental models have emerged as great tools to transform data into useful information for managers and policy makers. Environmental models contain uncertainty due to natural ecosystem variability, limits of current knowledge of environmental processes, modeling structure, computational restrictions, and problems with data/observations due to measurement error or missingness. Many methodologies capable of quantifying uncertainty have been developed in the scientific literature. Examples of such methods are BNs, which utilize conditional probability tables to describe the relationships among variables. This doctoral dissertation demonstrates how BNs, as probabilistic models, can be used to model eutrophication in estuarine ecosystems and to explore the effects of plausible future climatic and nutrient pollution management scenarios on water quality indicators. The results show interactions among various predictors and their impact on ecosystem health. The synergistic effects between nutrient concentrations and climate variability suggest caution in planning future management actions.
BNs have several distinct strengths, such as the ability to update knowledge based on Bayes' theorem, modularity, accommodation of various knowledge sources and data types, suitability to both data-rich and data-poor systems, and incorporation of uncertainty. Further, BNs' graphical representation facilitates communicating models and results with environmental managers and decision-makers. However, BNs have certain drawbacks as well. For example, they can handle continuous variables only under severe restrictions: (1) each continuous variable must be assigned a (linear) conditional normal distribution, and (2) no discrete variable may have continuous parents. The solution, thus far, to address this constraint has been discretizing variables. I designed an experiment to evaluate and compare the impact of common discretization methods on BNs. The results indicate that the choice of discretization method severely impacts the model results; however, I was unable to provide any criteria for selecting an optimal discretization method.
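To make the restriction concrete: a Bayesian network factorizes the joint distribution over its nodes, and the conditional linear Gaussian constraint described above requires each continuous node to be normal with mean linear in its parents (standard notation assumed):

    \[ p(x_1, \dots, x_n) = \prod_{i=1}^{n} p\big(x_i \mid \mathrm{pa}(x_i)\big), \qquad x_i \mid \mathrm{pa}(x_i) \sim N\big(\beta_{i0} + \beta_i^\top \mathrm{pa}(x_i),\; \sigma_i^2\big). \]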
Finally, I propose a continuous variable Bayesian Network methodology and demonstrate its application for water quality modeling in estuarine ecosystems. The proposed method retains the advantageous characteristics of BNs while avoiding the drawbacks of discretization, by specifying the relationships among the nodes using statistical and conditional probability models. The Bayesian nature of the proposed model enables prompt investigation of observed patterns as new conditions unfold. The network structure represents the underlying ecological processes and provides a basis for science communication. I demonstrate model development and temporal updating using the New River Estuary, NC data set and spatial updating using the Neuse River Estuary, NC data set.
Item Open Access Bayesian Statistical Models of Cell-Cycle Progression at Single-Cell and Population Levels (2014) Mayhew, Michael Benjamin
Cell division is a biological process fundamental to all life. One aspect of the process that is still under investigation is whether or not cells in a lineage are correlated in their cell-cycle progression. Data on cell-cycle progression is typically acquired either in lineages of single cells or in synchronized cell populations, and each source of data offers complementary information on cell division. To formally assess dependence in cell-cycle progression, I develop a hierarchical statistical model of single-cell measurements and extend a previously proposed model of population cell division in the budding yeast, Saccharomyces cerevisiae. Both models capture correlation and cell-to-cell heterogeneity in cell-cycle progression, and parameter inference is carried out in a fully Bayesian manner. The single-cell model is fit to three published time-lapse microscopy datasets and the population-based model is fit to simulated data for which the true model is known. Based on posterior inferences and formal model comparisons, the single-cell analysis demonstrates that budding yeast mother and daughter cells do not appear to correlate in their cell-cycle progression in two of the three experimental settings. In contrast, mother cells grown in a less preferred sugar source, glycerol/ethanol, did correlate in their rate of cell division in two successive cell cycles. Population model fitting to simulated data suggested that, under typical synchrony experimental conditions, population-based measurements of the cell-cycle were not informative for correlation in cell-cycle progression or heterogeneity in daughter-specific G1 phase progression.
Item Open Access Clustering Multiple Related Datasets with a Hierarchical Dirichlet Process (2011) de Oliveira Sales, Ana Paula
I consider the problem of clustering multiple related groups of data. My approach entails mixture models in the context of hierarchical Dirichlet processes, focusing on their ability to perform inference on the unknown number of components in the mixture, as well as to facilitate the sharing of information and borrowing of strength across the various data groups. Here, I build upon the hierarchical Dirichlet process model proposed by Müller et al. (2004), revising some relevant aspects of the model, as well as improving the MCMC sampler's convergence by combining local Gibbs sampler moves with global Metropolis-Hastings split-merge moves. I demonstrate the strengths of my model by employing it to cluster both synthetic and real datasets.
Item Open Access Computational Systems Biology of Saccharomyces cerevisiae Cell Growth and Division (2014) Mayhew, Michael Benjamin
Cell division and growth are complex processes fundamental to all living organisms. In the budding yeast, Saccharomyces cerevisiae, these two processes are known to be coordinated with one another as a cell's mass must roughly double before division. Moreover, cell-cycle progression is dependent on cell size with smaller cells at birth generally taking more time in the cell cycle. This dependence is a signature of size control. Systems biology is an emerging field that emphasizes connections or dependencies between biological entities and processes over the characteristics of individual entities. Statistical models provide a quantitative framework for describing and analyzing these dependencies. In this dissertation, I take a statistical systems biology approach to study cell division and growth and the dependencies within and between these two processes, drawing on observations from richly informative microscope images and time-lapse movies. I review the current state of knowledge on these processes, highlighting key results and open questions from the biological literature. I then discuss my development of machine learning and statistical approaches to extract cell-cycle information from microscope images and to better characterize the cell-cycle progression of populations of cells. In addition, I analyze single cells to uncover correlation in cell-cycle progression, evaluate potential models of dependence between growth and division, and revisit classical assertions about budding yeast size control. This dissertation presents a unique perspective and approach towards comprehensive characterization of the coordination between growth and division.
Item Open Access Dynamic modeling and Bayesian predictive synthesis (2017) McAlinn, Kenichiro
This dissertation discusses model and forecast comparison, calibration, and combination from a foundational perspective. For nearly five decades, the field of forecast combination has grown exponentially, propelled by its practicality and effectiveness in important real-world problems concerning forecasting, uncertainty, and decisions. Ample research, theoretical and empirical, into new methods and justifications has been produced. However, its foundations, the philosophical and theoretical underpinnings on which methods and strategies are built, have gone unexplored in recent literature. Bayesian predictive synthesis (BPS) defines a coherent theoretical basis for combining multiple forecast densities, whether from models, individuals, or other sources, and generalizes existing forecast pooling and Bayesian model mixing methods. By understanding the underlying foundation that defines the combination of forecasts, multiple extensions are revealed, resulting in significant advances in the understanding and efficacy of these methods for decision making in multiple fields.
The extensions discussed in this dissertation take BPS into the temporal domain. Many important decision problems involve time series, including policy decisions in macroeconomics and investment decisions in finance, where decisions are sequentially updated over time. Time series extensions of BPS are implicit dynamic latent factor models, allowing adaptation to time-varying biases, mis-calibration, and dependencies among models or forecasters. Multiple studies using different data and different decision problems are presented, demonstrating the effectiveness of dynamic BPS in terms of forecast accuracy and improved decision making, and highlighting the unique insight it provides.
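One standard statement of the BPS representation (notation assumed here): given forecast densities h_1, ..., h_J from J models or forecasters, the decision maker's synthesized predictive density takes the form

    \[ p(y \mid \mathcal{H}) = \int \alpha(y \mid \mathbf{x}) \prod_{j=1}^{J} h_j(x_j)\, d\mathbf{x}, \]

where \alpha(y | x) is the synthesis function; the dynamic extensions discussed here amount to modeling \alpha as an implicit dynamic latent factor regression evolving over time.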
Item Open Access Employing Neural Language Models and A Bayesian Hierarchical Framework for Classification and Engagement Analysis of Misinformation on Social Media (2022-04) List, Abbey
While social media can be an effective tool for maintaining personal relationships and making global connections, it has become a powerful force in the damaging spread of misinformation, especially during universally difficult and taxing events such as the COVID-19 pandemic. In this study, we collected a sample of Tweets related to COVID-19 from Twitter accounts of influential political media commentators and news organizations, assigning labels of misinformation, misleading, or legitimate to each Tweet. We constructed a Bayesian hierarchical negative binomial regression model to analyze any associations between Tweet engagement and misleading status while controlling for factors such as political lean, lexical diversity, and Retweet status. We found evidence that engagement had a positive association with misleading status and text readability, as well as a negative association with Retweets. We also employed a DeBERTa neural language classification model to predict the presence of misinformative or misleading content in Tweets, and we experimented with external datasets, multitask fine-tuning, backtranslation, and weighted loss to achieve accuracy of 0.683 and a macro F1-score of 0.593. We then examined DeBERTa explainability through word attributions with integrated gradients and found that tokens with the highest influence on model predictions often possessed connotations or context that was understandably related to the predicted label. The results of this study indicate that misleading status, Retweet status, and linguistic features may hold associations with overall Tweet engagement, and the DeBERTa model represents a potentially useful tool that can examine Tweet text alone, without an external knowledge base, and determine whether misinformation is present.
Item Open Access Exploiting Big Data in Logistics Risk Assessment via Bayesian Nonparametrics (2014) Shang, Yan
In cargo logistics, a key performance measure is transport risk, defined as the deviation of the actual arrival time from the planned arrival time. Neither earliness nor tardiness is desirable for the customer and freight forwarder. In this paper, we investigate ways to assess and forecast transport risks using a half-year of air cargo data, provided by a leading forwarder on 1336 routes served by 20 airlines. Interestingly, our preliminary data analysis shows a strong multimodal feature in the transport risks, driven by unobserved events, such as cargo missing flights. To accommodate this feature, we introduce a Bayesian nonparametric model -- the probit stick-breaking process (PSBP) mixture model -- for flexible estimation of the conditional (i.e., state-dependent) density function of transport risk. We demonstrate that using simpler methods, such as OLS linear regression, can lead to misleading inferences. Our model provides a tool for the forwarder to offer customized price and service quotes. It can also generate baseline airline performance to enable fair supplier evaluation. Furthermore, the method allows us to separate recurrent risks from disruption risks. This is important, because hedging strategies for these two kinds of risks are often drastically different.
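For reference, the probit stick-breaking process mixture used here for conditional density estimation has the general form (notation assumed):

    \[ f(y \mid x) = \sum_{h=1}^{\infty} \pi_h(x)\, K(y; \theta_h), \qquad \pi_h(x) = \Phi\big(\alpha_h(x)\big) \prod_{l < h} \big[1 - \Phi\big(\alpha_l(x)\big)\big], \]

where \Phi is the standard normal CDF. Because the mixture weights \pi_h(x) vary with covariates x, the conditional density of transport risk can change shape across states, which is what lets the model capture the multimodality noted above.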
Item Open Access Finite Sample Bounds and Path Selection for Sequential Monte Carlo (2018) Marion, Joseph
Sequential Monte Carlo (SMC) samplers have received attention as an alternative to Markov chain Monte Carlo for Bayesian inference problems due to their strong empirical performance on difficult multimodal problems, natural synergy with parallel computing environments, and accuracy when estimating ratios of normalizing constants. However, while these properties have been demonstrated empirically, the extent of these advantages remains unexplored theoretically. Typical convergence results for SMC are limited to root-N asymptotics; they obscure the relationship between the algorithmic factors (weights, Markov kernels, target distribution) and the error of the resulting estimator. This limitation makes it difficult to compare SMC to other estimation methods and challenging to design efficient SMC algorithms from a theoretical perspective.
In this thesis, we give conditions under which SMC provides a randomized approximation scheme, showing how to choose the number of particles and Markov kernel transitions at each SMC step in order to ensure an accurate approximation with bounded error. These conditions rely on the sequence of SMC interpolating distributions and the warm mixing times of the Markov kernels, explicitly relating the algorithmic choices to the error of the SMC estimate. This allows us to provide finite-sample complexity bounds for SMC in a variety of settings, including finite state-spaces, product spaces, and log-concave target distributions.
A key advantage of this approach is that the bounds provide insight into the selection of efficient sequences of SMC distributions. When the target distribution is spherical Gaussian or log-concave, we show that judicious selection of interpolating distributions results in an SMC algorithm with a smaller complexity bound than MCMC. These results are used to motivate the use of a well-known SMC algorithm that adaptively chooses interpolating distributions. We provide conditions under which the adaptive algorithm gives a randomized approximation scheme, providing theoretical validation for the automatic selection of SMC distributions.
Selecting efficient sequences of distributions is a problem that also arises in the estimation of normalizing constants using path sampling. In the final chapter of this thesis, we develop automatic methods for choosing sequences of distributions that provide low-variance path sampling estimators. These approaches are motivated by properties of the theoretically optimal, lowest-variance path, which is given by the geodesic of the Riemann manifold associated with the path sampling family. For one-dimensional paths we provide a greedy approach to step size selection that has good empirical performance. For multidimensional paths, we present an approach using Gaussian process emulation that efficiently finds low-variance paths in this more complicated setting.
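A minimal sketch of the tempered-SMC scheme underlying these chapters, in plain numpy; the function names, the linear temperature ladder, and the Gaussian toy example are illustrative assumptions, not code from the thesis:

    import numpy as np

    def smc_tempered(prior_sample, log_prior, log_lik, n_particles=1000,
                     n_steps=20, rw_scale=0.5, seed=0):
        """Tempered SMC: move particles from the prior (beta = 0) to the
        posterior (beta = 1) through targets pi_beta(x) ~ prior(x) * lik(x)^beta,
        alternating reweighting, resampling, and a Metropolis rejuvenation move."""
        rng = np.random.default_rng(seed)
        betas = np.linspace(0.0, 1.0, n_steps + 1)
        x = prior_sample(n_particles, rng)           # initial particles from the prior
        log_z = 0.0                                  # running log normalizing-constant estimate
        for beta_prev, beta in zip(betas[:-1], betas[1:]):
            log_w = (beta - beta_prev) * log_lik(x)  # incremental importance weights
            m = log_w.max()
            log_z += m + np.log(np.mean(np.exp(log_w - m)))
            w = np.exp(log_w - m)
            w /= w.sum()
            idx = rng.choice(n_particles, size=n_particles, p=w)  # multinomial resampling
            x = x[idx]
            # one random-walk Metropolis move targeting pi_beta
            prop = x + rw_scale * rng.standard_normal(x.shape)
            log_acc = (log_prior(prop) + beta * log_lik(prop)
                       - log_prior(x) - beta * log_lik(x))
            accept = np.log(rng.uniform(size=n_particles)) < log_acc
            x = np.where(accept, prop, x)
        return x, log_z

    # Toy example: N(0, 1) prior, N(2, 0.5^2) likelihood for a scalar parameter.
    particles, log_z = smc_tempered(
        prior_sample=lambda n, rng: rng.standard_normal(n),
        log_prior=lambda x: -0.5 * x**2,
        log_lik=lambda x: -0.5 * ((x - 2.0) / 0.5) ** 2,
    )
    print(particles.mean(), log_z)

The interpolating sequence here is pi_beta(x) proportional to prior(x) * lik(x)^beta; the thesis's contribution concerns how the choice of this sequence and of the Markov kernels controls the error of the resulting estimates.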
Item Open Access General and Efficient Bayesian Computation through Hamiltonian Monte Carlo Extensions (2017) Nishimura, Akihiko
Hamiltonian Monte Carlo (HMC) is a state-of-the-art sampling algorithm for Bayesian computation. The popular probabilistic programming languages Stan and PyMC rely on HMC's generality and efficiency to provide automatic Bayesian inference platforms for practitioners. Despite its widespread use and numerous success stories, HMC has several well-known pitfalls. This thesis presents extensions of HMC that overcome its two most prominent weaknesses: inability to handle discrete parameters and slow mixing on multi-modal target distributions.
Discontinuous HMC (DHMC), presented in Chapter 2, extends HMC to discontinuous target distributions, and hence to discrete parameter distributions through embedding them into continuous spaces, using an idea of event-driven Monte Carlo from the computational physics literature. DHMC is guaranteed to outperform a Metropolis-within-Gibbs algorithm since, as it turns out, the two algorithms coincide under a specific (and sub-optimal) implementation of DHMC. The theoretical justification of DHMC extends an existing theory of non-smooth Hamiltonian mechanics and of measure-valued differential inclusions.
Geometrically tempered HMC (GTHMC), presented in Chapter 3, improves HMC's performance on multi-modal target distributions. The efficiency improvement is achieved through differential geometric techniques, relating a target distribution to another distribution with less severe multi-modality. We establish a geometric theory behind Riemannian manifold HMC to motivate our geometric tempering methods. We then develop an explicit variable-stepsize reversible integrator for simulating Hamiltonian dynamics to overcome a stability issue of the usual Störmer-Verlet integrator. The integrator is of independent interest, being the first of its kind designed specifically for HMC variants.
In addition to the two extensions described above, Chapter 4 describes a variable trajectory length algorithm that generalizes the acceptance and rejection procedure of HMC, and in fact of any reversible dynamics-based sampler, to allow for more flexible choices of trajectory lengths. The algorithm in particular enables an effective application of a variable-stepsize integrator to HMC extensions, including GTHMC. The algorithm is widely applicable and provides a recipe for constructing valid dynamics-based samplers beyond the known HMC variants. Chapter 5 concludes the thesis with a simple and practical algorithm to improve the computational efficiency of HMC and related algorithms over their traditional implementations.
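For readers new to the baseline algorithm these chapters extend, a minimal numpy sketch of one HMC transition with the usual Störmer-Verlet (leapfrog) integrator; the names and the toy target are illustrative assumptions, not code from the thesis:

    import numpy as np

    def hmc_step(x, log_post, grad_log_post, step_size=0.2, n_leapfrog=20, rng=None):
        """One HMC transition: draw a Gaussian momentum, simulate Hamiltonian
        dynamics with the leapfrog integrator, then accept/reject to correct
        for the discretization error."""
        rng = np.random.default_rng() if rng is None else rng
        p = rng.standard_normal(x.shape)                  # momentum ~ N(0, I)
        x_new, p_new = x.copy(), p.copy()
        p_new += 0.5 * step_size * grad_log_post(x_new)   # initial half step (momentum)
        for _ in range(n_leapfrog - 1):
            x_new += step_size * p_new                    # full step (position)
            p_new += step_size * grad_log_post(x_new)     # full step (momentum)
        x_new += step_size * p_new
        p_new += 0.5 * step_size * grad_log_post(x_new)   # final half step (momentum)
        # Metropolis correction on the change in total energy
        log_acc = (log_post(x_new) - 0.5 * p_new @ p_new) - (log_post(x) - 0.5 * p @ p)
        return x_new if np.log(rng.uniform()) < log_acc else x

    # Toy example: sample a two-dimensional standard normal.
    rng = np.random.default_rng(0)
    x = np.zeros(2)
    draws = np.empty((2000, 2))
    for i in range(2000):
        x = hmc_step(x, lambda z: -0.5 * z @ z, lambda z: -z, rng=rng)
        draws[i] = x
    print(draws.mean(axis=0), draws.std(axis=0))

GTHMC and the variable trajectory length scheme above modify, respectively, the target geometry and the acceptance procedure of this basic transition.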
Item Open Access Interfaces between Bayesian and Frequentist Multiple Testing (2015) Chang, Shih-Han
This thesis investigates frequentist properties of Bayesian multiple testing procedures in a variety of scenarios and depicts the asymptotic behaviors of Bayesian methods. Both Bayesian and frequentist approaches to multiplicity control are studied and compared, with special focus on understanding the multiplicity control behavior in situations of dependence between test statistics.
Chapter 2 examines a problem of testing mutually exclusive hypotheses with dependent data. The Bayesian approach is shown to have excellent frequentist properties and is argued to be the most effective way of obtaining frequentist multiplicity control without sacrificing power. Chapter 3 further generalizes the model such that multiple signals are acceptable, and depicts the asymptotic behavior of false positive rates and the expected number of false positives. Chapter 4 considers the problem of dealing with a sequence of different trials concerning some medical or scientific issue, and discusses the possibilities for multiplicity control of the sequence. Chapter 5 addresses issues and efforts in reconciling frequentist and Bayesian approaches in sequential endpoint testing. We consider the conditional frequentist approach in sequential endpoint testing and show several examples in which Bayesian and frequentist methodologies cannot be made to match.
Item Open Access Modeling Heterogeneity With Bayesian Additive Regression Trees (2023) Orlandi, Vittorio
This work focuses on using Bayesian Additive Regression Trees (BART), a flexible and computationally efficient regression method, to model heterogeneity in data. In particular, we focus on the closely related tasks of hierarchical modeling, latent variable modeling, and density regression. We begin by introducing BART in Chapter 2, presenting the prior, various extensions, and an in-depth case study using BART to analyze the impact of ABO-incompatible cardiac transplant on infants. Chapter 3 describes a methodological contribution, in which we use BART to model data structured within known groups by allowing for group-specific forests, each of which is updated using only units corresponding to that group. We further introduce an intercept forest common to all units and a hierarchical prior across the leaf variances in order to allow for sharing of information. We find that such an approach yields more parsimonious models than other BART-based approaches in the literature, which in turn translates to better out-of-sample accuracy, at virtually no added computational cost. In Chapter 4, we consider models involving latent variables within BART. The original motivation is to extend the known-group approach of Chapter 3 to a setting where group information is unavailable. However, this idea lends itself well to many different analyses, including those involving continuous omitted or latent variables. Another application is a generalization of a BART-based approach to sensitivity analysis, in which we allow the unobserved confounder to flexibly influence the outcome. The latent variable framework we consider is computationally efficient, can help BART model data much more accurately than when restricted to observed covariates alone, and is widely applicable to many different settings. In Chapter 5, we study one such application in great detail: using BART for density regression. By integrating out the latent variable in our model, we can model conditional densities in a way that outperforms a variety of other approaches on simulated tasks, and it also allows us to bound the posterior concentration rate. We hope that the tools we develop in this work are useful to practitioners seeking to model heterogeneity in their data and also serve as a foundation for future methodological advances.
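For reference, BART's sum-of-trees form (standard notation, assumed here rather than quoted from the dissertation) writes the regression function as an ensemble of trees with regularizing priors on tree structures and leaf values:

    \[ y_i = \sum_{j=1}^{m} g\big(x_i;\, T_j, M_j\big) + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma^2), \]

where T_j is the j-th tree's structure and M_j its leaf parameters. The Chapter 3 model replaces the single forest with group-specific forests plus a shared intercept forest, and the latent variable work of Chapters 4-5 augments x_i with latent coordinates.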