Browsing by Subject "Markov chain Monte Carlo"
Item Open Access Advances in Bayesian Modeling of Protein Structure Evolution (2018) Larson, Gary

This thesis contributes to a statistical modeling framework for protein sequence and structure evolution. An existing Bayesian model for protein structure evolution is extended in two unique ways. Each extension addresses an important limitation that has not yet been satisfactorily resolved in the wider literature. These extensions are followed by work regarding inherent statistical bias in models for sequence evolution.
Most available models for protein structure evolution do not model interdependence between the backbone sites of the protein, yet the assumption that the sites evolve independently is known to be false. I argue that ignoring such dependence leads to biased estimation of evolutionary distance between proteins. To mitigate this bias, I express an existing Bayesian model in a generalized form and introduce site-dependence via the generalized model. In the process, I show that the effect of protein structure information on the measure of evolutionary distance can be suppressed by the model formulation, and I further modify the model to help mitigate this problem. In addition to the statistical model itself, I provide computational details and computer code. I modify a well-known bioinformatics algorithm in order to preserve efficient computation under this model. The modified algorithm can be easily understood and used by practitioners familiar with the original algorithm. My approach to modeling dependence is computationally tractable and interpretable with little additional computational burden over the model on which it is based.
The second model expansion allows for evolutionary inference on protein pairs having structural discrepancies attributable to backbone flexion. Thus, the model expansion exposes flexible protein structures to the capabilities of Bayesian protein structure alignment and phylogenetics. Unlike most of the few existing methods that deal with flexible protein structures, our Bayesian flexible alignment model requires no prior knowledge of the presence or absence of flexion points in the protein structure, and uncertainty measures are available for the alignment and other parameters of interest. The model can detect subtle flexion while not overfitting non-flexible protein pairs, and is demonstrated to improve phylogenetic inference in a simulated data setting and in a difficult-to-align set of proteins. The flexible model is a unique addition to the small but growing set of tools available for analysis of flexible protein structure. The ability to perform inference on flexible proteins in a Bayesian framework is likely to be of immediate interest to the structural phylogenetics community.
Finally, I present work related to the study of bias in site-independent models for sequence evolution. In the case of binary sequences, I discuss strategies for theoretical proof of bias and provide various details to that end, including detailing efforts undertaken to produce a site-dependent sequence model with similar properties to the site-dependent structural model introduced in an earlier chapter. I highlight the challenges of theoretical proof for this bias and include miscellaneous related work of general interest to researchers studying dependent sequence models.
Item Open Access Advances in Log-concave Sampling and Domain Adaptation (2023) Wu, Keru

In the vast field of machine learning, the study of distributions and the innovation of algorithms deriving from them serve as foundational pursuits. This dissertation delves into two critical topics tied to distributions: log-concave sampling and domain adaptation. Our research encompasses both theoretical and methodological developments in these fields, supported by thorough numerical experiments.
Beginning with log-concave sampling, we establish the minimax mixing time of the Metropolis-adjusted Langevin algorithm (MALA) for sampling from a log-smooth and strongly log-concave distribution. Our proof is divided into two main components: an upper bound and a lower bound. First, for a d-dimensional log-concave density with condition number $\kappa$, we show that MALA with a warm start mixes in $\tilde O(\kappa\sqrt{d})$ iterations up to logarithmic factors. Second, we prove a spectral-gap-based mixing time lower bound for reversible MCMC algorithms on general state spaces. We apply this lower bound result to construct a hard distribution for which MALA requires at least $\tilde\Omega(\kappa\sqrt{d})$ steps to mix. The lower bound aligns with our upper bound in terms of condition number and dimension, thereby demonstrating the minimax mixing time of MALA for log-concave sampling.
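For readers unfamiliar with MALA, a minimal one-dimensional sketch may help fix ideas. This is an illustrative toy, targeting a standard normal with a hand-picked step size, and not the d-dimensional strongly log-concave setting analyzed in the thesis:

```python
import math
import random

def mala_chain(grad_log_p, log_p, x0, step, n_iter, rng):
    """Metropolis-adjusted Langevin: a Langevin proposal plus a Metropolis-Hastings correction."""
    x = x0
    samples = []
    for _ in range(n_iter):
        # Langevin proposal: drift along the gradient, then add Gaussian noise
        mean_fwd = x + step * grad_log_p(x)
        y = rng.gauss(mean_fwd, math.sqrt(2.0 * step))
        mean_bwd = y + step * grad_log_p(y)
        # Log proposal densities q(y|x) and q(x|y), dropping the shared normalizer
        log_q_fwd = -((y - mean_fwd) ** 2) / (4.0 * step)
        log_q_bwd = -((x - mean_bwd) ** 2) / (4.0 * step)
        log_alpha = log_p(y) + log_q_bwd - log_p(x) - log_q_fwd
        if math.log(rng.random()) < log_alpha:
            x = y
        samples.append(x)
    return samples

rng = random.Random(0)
# Target: standard normal, log p(x) = -x^2/2 up to a constant
draws = mala_chain(lambda x: -x, lambda x: -x * x / 2.0, 3.0, 0.5, 5000, rng)
```

The Metropolis correction is what distinguishes MALA from the unadjusted Langevin algorithm: it removes the discretization bias at the cost of occasional rejections.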
Shifting focus to Domain Adaptation (DA), we tackle the statistical learning problem where the distribution of the source data used to train a model differs from that of the target data used to evaluate the model. Our work is grounded in the assumption of the presence of conditionally invariant components (CICs): feature representations that are relevant for prediction and remain conditionally invariant across the source and target distributions. We demonstrate three prominent roles of CICs in providing target risk guarantees in DA. First, we propose a new algorithm based on CICs, importance-weighted conditional invariant penalty (IW-CIP), which has target risk guarantees beyond simple settings such as covariate shift and label shift. Second, we show that CICs help identify large discrepancies between source and target risks of other DA algorithms. Finally, we demonstrate that incorporating CICs into the domain invariant projection (DIP) algorithm can address its failure scenario caused by label-flipping features.
Through rigorous analysis, we advance insights into log-concave sampling and domain adaptation. Our exploration underscores the importance of probabilistic understanding in designing algorithms tailored to intricate data distributions encountered in machine learning.
Item Open Access Approximate Bayesian Computation for Complex Dynamic Systems (2013) Bonassi, Fernando Vieira

This thesis focuses on the development of ABC methods for statistical modeling in complex dynamic systems. Motivated by real applications in biology, I propose computational strategies for Bayesian inference in contexts where standard Monte Carlo methods cannot be directly applied due to the high complexity of the dynamic model and/or data limitations.
Chapter 2 focuses on stochastic bionetwork models applied to data generated from the marginal distribution of a few network nodes at snapshots in time. I present a Bayesian computational strategy, coupled with an approach to summarizing and numerically characterizing biological phenotypes that are represented in terms of the resulting sample distributions of cellular markers. ABC and mixture modeling are used to define the approach to linking mechanistic mathematical models of network dynamics to snapshot data, using a toggle switch example integrating simulated and real data as context.
Chapter 3 focuses on the application of the methodology presented in Chapter 2 to the Myc/Rb/E2F network. This network involves a relatively high number of parameters and stochastic equations in the model specification and, thus, is substantially more complex than the toggle switch example. The analysis of the Myc/Rb/E2F network is performed with simulated and real data. I demonstrate that the proposed method can indicate which parameters can be learned about using the marginal data.
In Chapter 4, I present an ABC SMC method that uses data-based adaptive weights. This easily implemented and computationally trivial extension of ABC SMC can substantially improve acceptance rates. This is demonstrated through a series of examples with simulated and real data, including the toggle switch example. Theoretical justification is also provided to explain why this method is expected to improve the effectiveness of ABC SMC.
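As background, the core idea that ABC SMC builds on can be sketched in a few lines. This toy rejection sampler is an illustrative example, not the adaptive-weights algorithm of Chapter 4: it accepts prior draws whose simulated summaries land within a tolerance of the observed data.

```python
import random

def abc_rejection(observed, simulate, prior_sample, distance, eps, n_accept, rng):
    """Basic ABC rejection: keep prior draws whose simulated data fall within eps of the observed summary."""
    accepted = []
    while len(accepted) < n_accept:
        theta = prior_sample(rng)
        if distance(simulate(theta, rng), observed) < eps:
            accepted.append(theta)
    return accepted

rng = random.Random(1)
# Toy problem: infer the mean of a Gaussian from an observed sample mean of 2.0
obs_mean = 2.0
sim = lambda th, r: sum(r.gauss(th, 1.0) for _ in range(30)) / 30
posterior = abc_rejection(obs_mean, sim, lambda r: r.uniform(-5.0, 5.0),
                          lambda a, b: abs(a - b), 0.2, 200, rng)
```

ABC SMC improves on this scheme by propagating a weighted particle population through a decreasing sequence of tolerances rather than drawing from the prior each time; the accepted sample approximates the posterior given that the summary is within eps of the data.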
In Chapter 5, I present an integrated Bayesian computational strategy for fitting complex dynamic models to sparse time-series data. This is applied to experimental data from an immunization response study with Indian Rhesus macaques. The computational strategy consists of two stages: first, MCMC is implemented based on simplified sampling steps, and then, the resulting approximate output is used to generate a proposal distribution for the parameters that results in an efficient ABC procedure. The incorporation of ABC as a correction tool improves the model fit, as is demonstrated through predictive posterior analysis on the data sets of the study.
Chapter 6 presents additional discussion and comments on potential future research directions.
Item Open Access Bayesian Analysis and Computational Methods for Dynamic Modeling (2009) Niemi, Jarad

Dynamic models, also termed state space models, comprise an extremely rich model class for time series analysis. This dissertation focuses on building state space models for a variety of contexts and computationally efficient methods for Bayesian inference for simultaneous estimation of latent states and unknown fixed parameters.
Chapter 1 introduces state space models and methods of inference in these models. Chapter 2 describes a novel method for jointly sampling the entire latent state vector in a nonlinear Gaussian state space model using a computationally efficient adaptive mixture modeling procedure. This method is embedded in an overall Markov chain Monte Carlo algorithm for estimating fixed parameters as well as states. In Chapter 3 the method of the previous chapter is implemented in a few illustrative
nonlinear models and compared to standard existing methods. This chapter also looks at the effect of the number of mixture components as well as the length of the time series on the efficiency of the method. I then turn to a biological application in Chapter 4. I discuss modeling choices as well as derivation of the state space model to be used in this application. Parameter and state estimation are analyzed in these models for both simulated and real data. Chapter 5 extends the methodology introduced in Chapter 2 from nonlinear Gaussian models to general state space models. The method is then applied to a financial
stochastic volatility model on US $ - British £ exchange rates. Bayesian inference in the previous chapter is accomplished through Markov chain Monte Carlo which is suitable for batch analyses, but computationally limiting in sequential analysis. Chapter 6 introduces sequential Monte Carlo. It discusses two methods currently available for simultaneous sequential estimation of latent states and fixed parameters and then introduces a novel algorithm that reduces the key, limiting degeneracy issue while being usable in a wide model class. Chapter 7 implements the novel algorithm in a disease surveillance context modeling influenza epidemics. Finally, Chapter 8 suggests areas for future work in both modeling and Bayesian inference. Several appendices provide detailed technical support material as well as relevant related work.
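To make the degeneracy issue concrete, here is a minimal bootstrap particle filter for a linear Gaussian state space model. This is an illustrative sketch with made-up parameters, not the novel algorithm of the thesis; without the resampling step, the weights of most particles would collapse toward zero over time.

```python
import math
import random

def bootstrap_filter(obs, n_particles, rng, phi=0.9, q=1.0, r=1.0):
    """Bootstrap particle filter for x_t = phi * x_{t-1} + N(0, q), y_t = x_t + N(0, r)."""
    particles = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]
    filtered_means = []
    for y in obs:
        # Propagate each particle through the state equation
        particles = [phi * x + rng.gauss(0.0, math.sqrt(q)) for x in particles]
        # Weight by the observation likelihood (unnormalized Gaussian density)
        weights = [math.exp(-(y - x) ** 2 / (2.0 * r)) for x in particles]
        total = sum(weights)
        filtered_means.append(sum(w * x for w, x in zip(weights, particles)) / total)
        # Multinomial resampling to combat weight degeneracy
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return filtered_means

# Simulate data from the same model, then filter it
sim_rng = random.Random(2)
x, obs = 0.0, []
for _ in range(50):
    x = 0.9 * x + sim_rng.gauss(0.0, 1.0)
    obs.append(x + sim_rng.gauss(0.0, 1.0))
means = bootstrap_filter(obs, 500, random.Random(3))
```

Resampling fights weight collapse for the states, but it is exactly what makes fixed parameters degenerate: a static parameter carried along by each particle is never rejuvenated, which is the issue the thesis's novel algorithm addresses.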
Item Open Access Bayesian Inference in Large-scale Problems (2016) Johndrow, James Edward

Many modern applications fall into the category of "large-scale" statistical problems, in which both the number of observations n and the number of features or parameters p may be large. Many existing methods focus on point estimation, despite the continued relevance of uncertainty quantification in the sciences, where the number of parameters to estimate often exceeds the sample size even as the value of n grows enormously in many fields. The tendency in some areas of industry to dispense with traditional statistical analysis on the basis that "n = all" is thus of little relevance outside of certain narrow applications. The main result of the Big Data revolution in most fields has instead been to make computation much harder without reducing the importance of uncertainty quantification. Bayesian methods excel at uncertainty quantification but often scale poorly relative to alternatives. This conflict between the statistical advantages of Bayesian procedures and their substantial computational disadvantages is perhaps the greatest challenge facing modern Bayesian statistics, and it is the primary motivation for the work presented here.
Two general strategies for scaling Bayesian inference are considered. The first is the development of methods that lend themselves to faster computation, and the second is the design and characterization of computational algorithms that scale better in n or p. In the first instance, the focus is on joint inference outside of the standard problem of multivariate continuous data that has been a major focus of previous theoretical work in this area. In the second, we pursue strategies for improving the speed of Markov chain Monte Carlo algorithms and characterizing their performance in large-scale settings. Throughout, the focus is on rigorous theoretical evaluation combined with empirical demonstrations of performance and concordance with the theory.
One topic we consider is modeling the joint distribution of multivariate categorical data, often summarized in a contingency table. Contingency table analysis routinely relies on log-linear models, with latent structure analysis providing a common alternative. Latent structure models lead to a reduced rank tensor factorization of the probability mass function for multivariate categorical data, while log-linear models achieve dimensionality reduction through sparsity. Little is known about the relationship between these notions of dimensionality reduction in the two paradigms. In Chapter 2, we derive several results relating the support of a log-linear model to nonnegative ranks of the associated probability tensor. Motivated by these findings, we propose a new collapsed Tucker class of tensor decompositions, which bridge existing PARAFAC and Tucker decompositions, providing a more flexible framework for parsimoniously characterizing multivariate categorical data. Taking a Bayesian approach to inference, we illustrate empirical advantages of the new decompositions.
Latent class models for the joint distribution of multivariate categorical data, such as the PARAFAC decomposition, play an important role in the analysis of population structure. In this context, the number of latent classes is interpreted as the number of genetically distinct subpopulations of an organism, an important factor in the analysis of evolutionary processes and conservation status. Existing methods focus on point estimates of the number of subpopulations, and lack robust uncertainty quantification. Moreover, whether the number of latent classes in these models is even an identified parameter is an open question. In Chapter 3, we show that when the model is properly specified, the correct number of subpopulations can be recovered almost surely. We then propose an alternative method for estimating the number of latent subpopulations that provides good quantification of uncertainty, and provide a simple procedure for verifying that the proposed method is consistent for the number of subpopulations. The performance of the model in estimating the number of subpopulations and other common population structure inference problems is assessed in simulations and a real data application.
In contingency table analysis, sparse data is frequently encountered for even modest numbers of variables, resulting in non-existence of maximum likelihood estimates. A common solution is to obtain regularized estimates of the parameters of a log-linear model. Bayesian methods provide a coherent approach to regularization, but are often computationally intensive. Conjugate priors ease computational demands, but the conjugate Diaconis--Ylvisaker priors for the parameters of log-linear models do not give rise to closed form credible regions, complicating posterior inference. In Chapter 4 we derive the optimal Gaussian approximation to the posterior for log-linear models with Diaconis--Ylvisaker priors, and provide convergence rate and finite-sample bounds for the Kullback-Leibler divergence between the exact posterior and the optimal Gaussian approximation. We demonstrate empirically in simulations and a real data application that the approximation is highly accurate, even in relatively small samples. The proposed approximation provides a computationally scalable and principled approach to regularized estimation and approximate Bayesian inference for log-linear models.
Another challenging and somewhat non-standard joint modeling problem is inference on tail dependence in stochastic processes. In applications where extreme dependence is of interest, data are almost always time-indexed. Existing methods for inference and modeling in this setting often cluster extreme events or choose window sizes with the goal of preserving temporal information. In Chapter 5, we propose an alternative paradigm for inference on tail dependence in stochastic processes with arbitrary temporal dependence structure in the extremes, based on the idea that the information on strength of tail dependence and the temporal structure in this dependence are both encoded in waiting times between exceedances of high thresholds. We construct a class of time-indexed stochastic processes with tail dependence obtained by endowing the support points in de Haan's spectral representation of max-stable processes with velocities and lifetimes. We extend Smith's model to these max-stable velocity processes and obtain the distribution of waiting times between extreme events at multiple locations. Motivated by this result, a new definition of tail dependence is proposed that is a function of the distribution of waiting times between threshold exceedances, and an inferential framework is constructed for estimating the strength of extremal dependence and quantifying uncertainty in this paradigm. The method is applied to climatological, financial, and electrophysiology data.
The remainder of this thesis focuses on posterior computation by Markov chain Monte Carlo (MCMC), the dominant paradigm for posterior computation in Bayesian analysis. It has long been common to control computation time by making approximations to the Markov transition kernel. Comparatively little attention has been paid to convergence and estimation error in these approximating Markov chains. In Chapter 6, we propose a framework for assessing when to use approximations in MCMC algorithms, and how much error in the transition kernel should be tolerated to obtain optimal estimation performance with respect to a specified loss function and computational budget. The results require only ergodicity of the exact kernel and control of the kernel approximation accuracy. The theoretical framework is applied to approximations based on random subsets of data, low-rank approximations of Gaussian processes, and a novel approximating Markov chain for discrete mixture models.
Data augmentation Gibbs samplers are arguably the most popular class of algorithm for approximately sampling from the posterior distribution for the parameters of generalized linear models. The truncated Normal and Polya-Gamma data augmentation samplers are standard examples for probit and logit links, respectively. Motivated by an important problem in quantitative advertising, in Chapter 7 we consider the application of these algorithms to modeling rare events. We show that when the sample size is large but the observed number of successes is small, these data augmentation samplers mix very slowly, with a spectral gap that converges to zero at a rate at least proportional to the reciprocal of the square root of the sample size up to a log factor. In simulation studies, moderate sample sizes result in high autocorrelations and small effective sample sizes. Similar empirical results are observed for related data augmentation samplers for multinomial logit and probit models. When applied to a real quantitative advertising dataset, the data augmentation samplers mix very poorly. Conversely, Hamiltonian Monte Carlo and a type of independence chain Metropolis algorithm show good mixing on the same dataset.
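The truncated-normal augmentation for probit regression is short enough to sketch. This intercept-only toy uses hypothetical helper functions, not the authors' code, and shows the two-step augment/update structure of the Albert-Chib sampler:

```python
import random
from statistics import NormalDist

N01 = NormalDist()

def sample_truncnorm_above(mu, lower, rng):
    """Draw z ~ Normal(mu, 1) conditioned on z > lower via inverse-CDF sampling."""
    lo = N01.cdf(lower - mu)
    u = lo + rng.random() * (1.0 - lo)
    return mu + N01.inv_cdf(u)

def probit_gibbs(y, n_iter, rng, tau2=100.0):
    """Albert-Chib sampler for an intercept-only probit model y_i ~ Bernoulli(Phi(beta)),
    with a Normal(0, tau2) prior on beta."""
    n = len(y)
    beta = 0.0
    draws = []
    for _ in range(n_iter):
        # Augment: z_i ~ N(beta, 1) truncated to z_i > 0 if y_i = 1, z_i < 0 if y_i = 0
        z = [sample_truncnorm_above(beta, 0.0, rng) if yi == 1
             else -sample_truncnorm_above(-beta, 0.0, rng) for yi in y]
        # Conjugate Gaussian update of beta given the latent z
        post_var = 1.0 / (n + 1.0 / tau2)
        beta = rng.gauss(post_var * sum(z), post_var ** 0.5)
        draws.append(beta)
    return draws

rng = random.Random(4)
y = [1] * 70 + [0] * 30          # a balanced dataset, where mixing is good
draws = probit_gibbs(y, 400, rng)
```

The slow-mixing regime described in Chapter 7 arises when the number of successes in y is tiny relative to n; in that imbalanced case the conditional updates of beta and z become highly autocorrelated.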
Item Open Access Bayesian Modeling for Identifying Selection in B cell Maturation (2023) Tang, Tengjie

This thesis focuses on modeling the selection effects on B cell antibody mutations to identify amino acids under strong selection. Site-wise selection coefficients are parameterized by the fitnesses of amino acids. First, we conduct simulation studies to evaluate the accuracy of the Monte Carlo p-value approach for identifying selection for specific amino acid/location combinations. Then, we adopt Bayesian methods to infer location-specific fitness parameters for each amino acid. In particular, we propose the use of a spike-and-slab prior and implement Markov chain Monte Carlo (MCMC) algorithms for posterior sampling. Further simulation studies are conducted to evaluate the performance of the proposed Bayesian methods in inferring fitness parameters and identifying strong selection. The results demonstrate the reliable inference and detection performance of the proposed Bayesian methods. Finally, an example using real antibody sequences is provided. This work can help identify important early mutations in B cell antibodies, which is crucial for developing an effective HIV vaccine.
Item Open Access Computational Challenges to Bayesian Density Discontinuity Regression (2022) Zheng, Haoliang

Many policies subject an underlying continuous variable to an artificial cutoff. Agents may regulate the magnitude of the variable to stay on the preferred side of a known cutoff, which results in a jump discontinuity in the distribution of the variable at the cutoff value. In this paper, we present a statistical method to estimate the presence and magnitude of such jump discontinuities as functions of measured covariates.
For the posterior computation of our model, we use a Gibbs sampling scheme as the overall structure. For each iteration, we have two layers of data augmentation. We first adopt the rejection history strategy to remove the intractable integral and then generate Pólya-Gamma latent variables to ease the computation. We implement algorithms including adaptive Metropolis, ensemble MCMC, and independent Metropolis for each parameter within the Gibbs sampler. We justify our method based on the simulation study.
As for the real data, we focus on a study of corporate proposal voting, where we encounter several computational challenges. We discuss the multimodality issue from two aspects. In an effort to solve this problem, we borrow the idea from parallel tempering. We build an adaptive parallel tempered version of our sampler. The result shows that introducing the tempering method indeed improves the performance of our original sampler.
Item Open Access Data augmentation for models based on rejection sampling (Biometrika, 2016-06) Rao, Vinayak; Lin, Lizhen; Dunson, David B

We present a data augmentation scheme to perform Markov chain Monte Carlo inference for models where data generation involves a rejection sampling algorithm. Our idea is a simple scheme to instantiate the rejected proposals preceding each data point. The resulting joint probability over observed and rejected variables can be much simpler than the marginal distribution over the observed variables, which often involves intractable integrals. We consider three problems: modelling flow-cytometry measurements subject to truncation; the Bayesian analysis of the matrix Langevin distribution on the Stiefel manifold; and Bayesian inference for a nonparametric Gaussian process density model. The latter two are instances of doubly-intractable Markov chain Monte Carlo problems, where evaluating the likelihood is intractable. Our experiments demonstrate superior performance over state-of-the-art sampling algorithms for such problems.

Item Open Access Finite population estimators in stochastic search variable selection (Biometrika, 2012-12) Clyde, MA; Ghosh, J

Item Open Access General and Efficient Bayesian Computation through Hamiltonian Monte Carlo Extensions (2017) Nishimura, Akihiko

Hamiltonian Monte Carlo (HMC) is a state-of-the-art sampling algorithm for Bayesian computation. Popular probabilistic programming languages Stan and PyMC rely on HMC's generality and efficiency to provide automatic Bayesian inference platforms for practitioners. Despite its widespread use and numerous success stories, HMC has several well-known pitfalls. This thesis presents extensions of HMC that overcome its two most prominent weaknesses: the inability to handle discrete parameters and slow mixing on multi-modal target distributions.
Discontinuous HMC (DHMC), presented in Chapter 2, extends HMC to discontinuous target distributions, and hence to discrete parameter distributions by embedding them into continuous spaces, using an idea of event-driven Monte Carlo from the computational physics literature. DHMC is guaranteed to outperform a Metropolis-within-Gibbs algorithm since, as it turns out, the two algorithms coincide under a specific (and sub-optimal) implementation of DHMC. The theoretical justification of DHMC extends an existing theory of non-smooth Hamiltonian mechanics and of measure-valued differential inclusions.
Geometrically tempered HMC (GTHMC), presented in Chapter 3, improves HMC's performance on multi-modal target distributions. The efficiency improvement is achieved through differential geometric techniques, relating a target distribution to
another distribution with less severe multi-modality. We establish a geometric theory behind Riemannian manifold HMC to motivate our geometric tempering methods. We then develop an explicit variable stepsize reversible integrator for simulating
Hamiltonian dynamics to overcome a stability issue of the usual Störmer-Verlet integrator. The integrator is of independent interest, being the first of its kind designed specifically for HMC variants.
In addition to the two extensions described above, Chapter 4 describes a variable trajectory length algorithm that generalizes the acceptance and rejection procedure of HMC (and, in fact, of any reversible dynamics based sampler) to allow for more flexible choices of trajectory lengths. The algorithm in particular enables an effective application of a variable stepsize integrator to HMC extensions, including GTHMC. The algorithm is widely applicable and provides a recipe for constructing valid dynamics based samplers beyond the known HMC variants. Chapter 5 concludes the thesis with a simple and practical algorithm to improve computational efficiencies of HMC and related algorithms over their traditional implementations.
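The basic machinery these extensions build on, leapfrog (Störmer-Verlet) integration followed by a Metropolis accept/reject on the energy error, fits in a few lines. This one-dimensional sketch is illustrative only and is not DHMC or GTHMC:

```python
import math
import random

def leapfrog(q, p, grad_U, step, n_steps):
    """Stormer-Verlet (leapfrog) integration of Hamiltonian dynamics H(q, p) = U(q) + p^2 / 2."""
    p -= 0.5 * step * grad_U(q)              # initial half step for momentum
    for _ in range(n_steps - 1):
        q += step * p                        # full step for position
        p -= step * grad_U(q)                # full step for momentum
    q += step * p
    p -= 0.5 * step * grad_U(q)              # final half step for momentum
    return q, p

def hmc_step(q, log_p, grad_log_p, step, n_steps, rng):
    """One HMC update: draw a fresh momentum, simulate the dynamics,
    then accept or reject based on the change in total energy."""
    p0 = rng.gauss(0.0, 1.0)
    q_new, p_new = leapfrog(q, p0, lambda x: -grad_log_p(x), step, n_steps)
    h_old = -log_p(q) + 0.5 * p0 * p0
    h_new = -log_p(q_new) + 0.5 * p_new * p_new
    return q_new if math.log(rng.random()) < h_old - h_new else q

rng = random.Random(5)
q, draws = 2.0, []
for _ in range(3000):
    # Target: standard normal, log p(q) = -q^2/2
    q = hmc_step(q, lambda x: -x * x / 2.0, lambda x: -x, 0.2, 10, rng)
    draws.append(q)
```

Because leapfrog is reversible and volume-preserving, the accept/reject step needs only the energy difference; the variable stepsize integrator of Chapter 3 is designed to retain exactly these properties while avoiding the fixed-step stability limit.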
Item Open Access Joint modeling of multiple repeated measures and survival data using multidimensional latent trait linear mixed model (Statistical Methods in Medical Research, 2018-10-11) Wang, Jue; Luo, Sheng

Impairment caused by Amyotrophic lateral sclerosis (ALS) is multidimensional (e.g. bulbar, fine motor, gross motor) and progressive. Its multidimensional nature precludes a single outcome to measure disease progression. Clinical trials of ALS use multiple longitudinal outcomes to assess the treatment effects on overall improvement. A terminal event such as death or dropout can stop the follow-up process. Moreover, the time to the terminal event may be dependent on the multivariate longitudinal measurements. In this article, we develop a joint model consisting of a multidimensional latent trait linear mixed model (MLTLMM) for the multiple longitudinal outcomes, and a proportional hazards model with piecewise constant baseline hazard for the event time data. Shared random effects are used to link the two models together. The model inference is conducted in a Bayesian framework via Markov chain Monte Carlo simulation implemented in the Stan language. Our proposed model is evaluated by simulation studies and is applied to the Ceftriaxone study, a motivating clinical trial assessing the effect of ceftriaxone on ALS patients.

Item Open Access Metropolized Forest Recombination for Monte Carlo Sampling of Graph Partitions (SIAM Journal on Applied Mathematics, 2023-08-01) Autry, E; Carter, D; Herschlag, GJ; Hunter, Z; Mattingly, JC

We develop a new Markov chain on graph partitions that makes relatively global moves yet is computationally feasible to be used as the proposal in the Metropolis-Hastings method. Our resulting algorithm is able to sample from a specified measure on partitions or spanning forests.
Being able to sample from a specified measure is a requirement of what we consider the gold standard in quantifying the extent to which a particular map is a gerrymander. Our proposal chain modifies the recently developed method called recombination (ReCom), which draws spanning trees on joined partitions and then randomly cuts them to repartition. We improve the computational efficiency by augmenting the state space from partitions to spanning forests. The extra information accelerates the computation of the forward and backward proposal probabilities, which are required for the Metropolis-Hastings algorithm. We demonstrate this method by sampling redistricting plans under several measures of interest and find promising convergence results on several key observables. We also explore some limitations in the measures that are efficient to sample from and investigate the feasibility of using parallel tempering to extend this space of measures.

Item Open Access Monitoring and Improving Markov Chain Monte Carlo Convergence by Partitioning (2015) VanDerwerken, Douglas

Since Bayes' Theorem was first published in 1762, many have argued for the Bayesian paradigm on purely philosophical grounds. For much of this time, however, practical implementation of Bayesian methods was limited to a relatively small class of "conjugate" or otherwise computationally tractable problems. With the development of Markov chain Monte Carlo (MCMC) and improvements in computers over the last few decades, the number of problems amenable to Bayesian analysis has increased dramatically. The ensuing spread of Bayesian modeling has led to new computational challenges as models become more complex and higher-dimensional, and both parameter sets and data sets become orders of magnitude larger. This dissertation introduces methodological improvements to deal with these challenges.
These include methods for enhanced convergence assessment, for parallelization of MCMC, for estimation of the convergence rate, and for estimation of normalizing constants. A recurring theme across these methods is the utilization of one or more chain-dependent partitions of the state space.
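As one concrete example of convergence assessment, the standard split-R-hat diagnostic (a common baseline, not the partition-based methods of this dissertation) can be computed in a few lines:

```python
import random
from statistics import mean, variance

def split_rhat(chains):
    """Split-R-hat diagnostic: compares between-chain and within-chain variance.
    Values near 1 suggest the chains agree; values well above 1 signal non-convergence."""
    halves = []
    for c in chains:
        m = len(c) // 2
        halves += [c[:m], c[m:2 * m]]       # split each chain to also detect within-chain drift
    n = len(halves[0])
    chain_means = [mean(h) for h in halves]
    W = mean(variance(h) for h in halves)   # within-chain variance
    B = n * variance(chain_means)           # between-chain variance
    var_plus = (n - 1) / n * W + B / n      # pooled marginal variance estimate
    return (var_plus / W) ** 0.5

rng = random.Random(6)
mixed = [[rng.gauss(0.0, 1.0) for _ in range(500)] for _ in range(2)]   # same target: R-hat near 1
stuck = [mixed[0], [rng.gauss(3.0, 1.0) for _ in range(500)]]           # shifted chain: R-hat large
```

Diagnostics of this kind compare chains only through their first two moments; the partition-based approach of this dissertation instead compares the chains' occupancy of regions of the state space, which can detect disagreement that means and variances miss.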
Item Open Access Random Orthogonal Matrices with Applications in Statistics (2019) Jauch, Michael

This dissertation focuses on random orthogonal matrices with applications in statistics. While Bayesian inference for statistical models with orthogonal matrix parameters is a recurring theme, several of the results on random orthogonal matrices may be of interest to those in the broader probability and random matrix theory communities. In Chapter 2, we parametrize the Stiefel and Grassmann manifolds, represented as subsets of orthogonal matrices, in terms of Euclidean parameters using the Cayley transform and then derive Jacobian terms for change of variables formulas. This allows for Markov chain Monte Carlo simulation from probability distributions defined on the Stiefel and Grassmann manifolds. We also establish an asymptotic independent normal approximation for the distribution of the Euclidean parameters corresponding to the uniform distribution on the Stiefel manifold. In Chapter 3, we present polar expansion, a general approach to Monte Carlo simulation from probability distributions on the Stiefel manifold. When combined with modern Markov chain Monte Carlo software, polar expansion allows for routine and flexible posterior inference in models with orthogonal matrix parameters. Chapter 4 addresses prior distributions for structured orthogonal matrices. We introduce an approach to constructing prior distributions for structured orthogonal matrices which leads to tractable posterior simulation via polar expansion. We state two main results which support our approach and offer a new perspective on approximating the entries of random orthogonal matrices.
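For intuition about the Cayley transform used in Chapter 2: it maps a skew-symmetric matrix A to the orthogonal matrix Q = (I - A)(I + A)^{-1}, giving a Euclidean parametrization of (part of) the orthogonal group. A minimal 2x2 sketch (an illustrative helper, not the authors' code):

```python
def cayley_2x2(a):
    """Cayley transform Q = (I - A)(I + A)^{-1} of the skew-symmetric
    matrix A = [[0, a], [-a, 0]]; the result is a 2x2 rotation matrix."""
    det = 1.0 + a * a                                    # det(I + A)
    inv = [[1.0 / det, -a / det], [a / det, 1.0 / det]]  # (I + A)^{-1}, closed form in 2x2
    i_minus_a = [[1.0, -a], [a, 1.0]]                    # I - A
    # Matrix product (I - A)(I + A)^{-1}
    return [[sum(i_minus_a[r][k] * inv[k][c] for k in range(2)) for c in range(2)]
            for r in range(2)]

Q = cayley_2x2(0.5)   # a planar rotation with columns of unit length
```

Because the map is a smooth bijection onto the orthogonal matrices without eigenvalue -1, an MCMC sampler can move in the unconstrained entries of A and map back, which is what makes the Jacobian terms derived in Chapter 2 useful.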
Item Open Access Stochastic Latent Domain Approaches to the Recovery and Prediction of High Dimensional Missing Data (2023) Cannella, Christopher Brian
This work presents novel techniques for approaching missing data using generative models. The main focus of these techniques is on leveraging the latent spaces of generative models, both to improve inference performance and to overcome many of the architectural challenges missing data poses for current generative models. This work includes both model-agnostic methodologies, applicable regardless of architecture, and model-specific techniques.
The first half of this work is dedicated to model-agnostic techniques. Here, we present our Linearized-Marginal Restricted Boltzmann Machine (LM-RBM), a method for directly approximating the conditional and marginal distributions of RBMs used to infer missing data. We also present our Semi-Empirical Ab Initio objective functions for Markov chain Monte Carlo (MCMC) proposal optimization: objective functions drawn from a restricted functional class and fit to recover analytically known optimal proposals. These objective functions are shown to avoid failures exhibited by current objectives for MCMC proposal optimization with highly expressive neural proposals, enabling more confident optimization of deep generative architectures for MCMC techniques.
The second half of this work is dedicated to techniques applicable to specific generative architectures. We present Projected-Latent Markov Chain Monte Carlo (PL-MCMC), a technique for performing asymptotically exact conditional inference of missing data using normalizing flows. We evaluate the performance of PL-MCMC on tasks of training from and inferring missing data. We also present our Perceiver Attentional Copula for Time Series (PrACTiS), which uses attention with learned latent vectors to significantly improve the computational efficiency of attention-based modeling, in light of the additional challenges that time series data pose for missing data inference.
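As background for MCMC-based missing-data inference generally (this is not the PL-MCMC algorithm, which operates in the latent space of a normalizing flow), conditioning on observed coordinates and sampling the missing ones can be sketched with a random-walk Metropolis sampler targeting p(x_miss | x_obs). The toy bivariate-normal target and all names below are assumptions for illustration.

```python
import numpy as np

def mh_impute(x_obs, logp_joint, n_steps=5000, step=0.5, rng=None):
    """Random-walk Metropolis targeting p(x_miss | x_obs), a generic
    MCMC-imputation sketch.  logp_joint(x_miss, x_obs) returns the joint
    log density up to a constant; conditioning on x_obs only shifts that
    constant, so the same function drives the conditional sampler."""
    if rng is None:
        rng = np.random.default_rng()
    x, samples = 0.0, []
    lp = logp_joint(x, x_obs)
    for _ in range(n_steps):
        prop = x + step * rng.normal()           # symmetric proposal
        lp_prop = logp_joint(prop, x_obs)
        if np.log(rng.uniform()) < lp_prop - lp:  # Metropolis accept
            x, lp = prop, lp_prop
        samples.append(x)
    return np.array(samples)

# Toy target: (x_miss, x_obs) standard bivariate normal with correlation
# 0.8, so p(x_miss | x_obs = 1) is N(0.8, 0.36).
def logp(xm, xo, rho=0.8):
    return -0.5 * (xm * xm - 2 * rho * xm * xo + xo * xo) / (1 - rho * rho)

draws = mh_impute(1.0, logp, rng=np.random.default_rng(2))
```

The point of flow-based approaches like PL-MCMC is to make this kind of conditional sampling feasible when the joint density is only available through a deep generative model rather than in closed form.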
Item Open Access Stochastic Study of Gerrymandering (2015-05-06) Vaughn, Christy
In the 2012 election for the US House of Representatives, only four of North Carolina's thirteen congressional districts elected a Democrat, despite a majority Democratic vote statewide. This raises the question of whether gerrymandering, the process of drawing districts to favor a political party, was employed. This study explores election outcomes under different choices of district boundaries. We represent North Carolina as a graph of voting tabulation districts. A districting is a division of this graph into thirteen connected subgraphs. We define a probability distribution on districtings that favors more compact districts with close to equal population in each district. To sample from this distribution, we employ the Metropolis-Hastings variant of Markov chain Monte Carlo. After sampling, election data from the 2012 US House of Representatives election is used to determine how many representatives each party would have won under the different districtings. Across our randomly drawn districtings, an average of 6.8 Democratic representatives are elected, and none of the districtings elects as few as four Democratic representatives, as occurred in the 2012 election.
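The Metropolis-Hastings accept/reject step used to sample districtings can be illustrated on a drastically simplified version of the problem. This sketch scores only population balance and ignores contiguity and compactness, so it is a caricature of the actual chain; all names and the choice of energy are illustrative assumptions.

```python
import numpy as np

def energy(labels, pops, n_districts):
    """Toy score: squared deviation of each district's population from
    the equal-population target (lower is better)."""
    target = pops.sum() / n_districts
    dist_pops = np.bincount(labels, weights=pops, minlength=n_districts)
    return ((dist_pops - target) ** 2).sum()

def mh_districting(pops, n_districts=2, n_steps=2000, beta=2.0, rng=None):
    """Metropolis sampler over district assignments of units, targeting
    the distribution proportional to exp(-beta * energy).  The proposal
    reassigns one unit uniformly at random (symmetric, so the ratio of
    proposal probabilities cancels in the acceptance step)."""
    if rng is None:
        rng = np.random.default_rng()
    labels = rng.integers(n_districts, size=len(pops))
    e = energy(labels, pops, n_districts)
    for _ in range(n_steps):
        i = rng.integers(len(pops))
        old = labels[i]
        labels[i] = rng.integers(n_districts)
        e_new = energy(labels, pops, n_districts)
        if np.log(rng.uniform()) < beta * (e - e_new):  # accept
            e = e_new
        else:                                           # reject: revert
            labels[i] = old
    return labels, e

# Ten equal-population units split into two districts: the chain should
# settle near balanced (5/5) assignments, which have energy 0.
pops = np.ones(10)
labels, e = mh_districting(pops, rng=np.random.default_rng(3))
```

The real chain differs in the state space (connected subgraphs of the voting-tabulation-district graph) and the energy (compactness plus population balance), but the accept/reject logic is the same.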