Browsing by Author "Schmidler, Scott C"
Results Per Page
Sort Options
Item Open Access Advances in Bayesian Modeling of Protein Structure Evolution(2018) Larson, GaryThis thesis contributes to a statistical modeling framework for protein sequence and structure evolution. An existing Bayesian model for protein structure evolution is extended in two unique ways. Each of these model extensions addresses an important limitation which has not yet been satisfactorily addressed in the wider literature. These extensions are followed by work regarding inherent statistical bias in models for sequence evolution.
Most available models for protein structure evolution do not model interdependence between the backbone sites of the protein, yet the assumption that the sites evolve independently is known to be false. I argue that ignoring such dependence leads to biased estimation of evolutionary distance between proteins. To mitigate this bias, I express an existing Bayesian model in a generalized form and introduce site-dependence via the generalized model. In the process, I show that the effect of protein structure information on the measure of evolutionary distance can be suppressed by the model formulation, and I further modify the model to help mitigate this problem. In addition to the statistical model itself, I provide computational details and computer code. I modify a well-known bioinformatics algorithm in order to preserve efficient computation under this model. The modified algorithm can be easily understood and used by practitioners familiar with the original algorithm. My approach to modeling dependence is computationally tractable and interpretable with little additional computational burden over the model on which it is based.
The second model expansion allows for evolutionary inference on protein pairs having structural discrepancies attributable to backbone flexion. Thus, the model expansion exposes flexible protein structures to the capabilities of Bayesian protein structure alignment and phylogenetics. Unlike most of the few existing methods that deal with flexible protein structures, our Bayesian flexible alignment model requires no prior knowledge of the presence or absence of flexion points in the protein structure, and uncertainty measures are available for the alignment and other parameters of interest. The model can detect subtle flexion while not overfitting non-flexible protein pairs, and is demonstrated to improve phylogenetic inference in a simulated data setting and in a difficult-to-align set of proteins. The flexible model is a unique addition to the small but growing set of tools available for analysis of flexible protein structure. The ability to perform inference on flexible proteins in a Bayesian framework is likely to be of immediate interest to the structural phylogenetics community.
Finally, I present work related to the study of bias in site-independent models for sequence evolution. In the case of binary sequences, I discuss strategies for theoretical proof of bias and provide various details to that end, including detailing efforts undertaken to produce a site-dependent sequence model with similar properties to the site-dependent structural model introduced in an earlier chapter. I highlight the challenges of theoretical proof for this bias and include miscellaneous related work of general interest to researchers studying dependent sequence models.
Item Open Access Bayesian Modeling and Adaptive Monte Carlo with Geophysics Applications(2013) Wang, JianyuThe first part of the thesis focuses on the development of Bayesian modeling motivated by geophysics applications. In Chapter 2, we model the frequency of pyroclastic flows collected from the Soufriere Hills volcano. Multiple change points within the dataset reveal several limitations of existing methods in literature. We propose Bayesian hierarchical models (BBH) by introducing an extra level of hierarchy with hyper parameters, adding a penalty term to constrain close consecutive rates, and using a mixture prior distribution to more accurately match certain circumstances in reality. We end the chapter with a description of the prediction procedure, which is the biggest advantage of the BBH in comparison with other existing methods. In Chapter 3, we develop new statistical techniques to model and relate three complex processes and datasets: the process of extrusion of magma into the lava dome, the growth of the dome as measured by its height, and the rockfalls as an indication of the dome's instability. First, we study the dynamic Negative Binomial branching process and use it to model the rockfalls. Moreover, a generalized regression model is proposed to regress daily rockfall numbers on the extrusion rate and dome height. Furthermore, we solve an inverse problem from the regression model and predict extrusion rate based on rockfalls and dome height.
The other focus of the thesis is adaptive Markov chain Monte Carlo (MCMC) method. In Chapter 4, we improve upon the Wang-Landau (WL) algorithm. The WL algorithm is an adaptive sampling scheme that modifies the target distribution to enable the chain to visit low-density regions of the state space. However, the approach relies heavily on a partition of the state space that is left to the user to specify. As a result, the implementation and the use of the algorithm are time-consuming and less automatic. We propose an automatic, adaptive partitioning scheme which continually refines the initial partition as needed during sampling. We show that this overcomes the limitations of the input user-specified partition, making the algorithm significantly more automatic and user-friendly while also making the performance dramatically more reliable and robust. In Chapter 5, we consider the convergence and autocorrelation aspects of MCMC. We propose an Exploration/Exploitation (XX) approach to constructing adaptive MCMC algorithms, which combines adaptation schemes of distinct types. The exploration piece uses adaptation strategies aiming at exploring new regions of the target distribution and thus improving the rate of convergence to equilibrium. The exploitation piece involves an adaptation component which decreases autocorrelation for sampling among regions already discovered. We demonstrate that the combined XX algorithm significantly outperforms either original algorithm on difficult multimodal sampling problems.
Item Open Access Bayesian Modeling for Identifying Selection in B cell Maturation(2023) Tang, TengjieThis thesis focuses on modeling the selection effects on B cell antibody mutations to identify amino acids under strong selection. Site-wise selection coefficients are parameterized by the fitnesses of amino acids. First, we conduct simulation studies to evaluate the accuracy of the Monte Carlo p-value approach for identifying selection for specific amino acid/location combinations. Then, we adopt Bayesian methods to infer location-specific fitness parameters for each amino acid. In particular, we propose the use of a spike-and-slab prior and implement Markov chain Monte Carlo (MCMC) algorithms for posterior sampling. Further simulation studies are conducted to evaluate the performance of the proposed Bayesian methods in inferring fitness parameters and identifying strong selection. The results demonstrate the reliable inference and detection performance of the proposed Bayesian methods. Finally, an example using real antibody sequences is provided. This work can help identify important early mutations in B cell antibodies, which is crucial for developing an effective HIV vaccine.
Item Open Access Bayesian Models for Relating Gene Expression and Morphological Shape Variation in Sea Urchin Larvae(2012) Runcie, Daniel EA general goal of biology is to understand how two or more sets of traits in an organism are related - for example, disease state and genetics, physiology and behavior, or phenotypic variation and gene function. Many of the early advancements in statistical analysis dealt with relating measured traits when one could be represented as a single number. However, many traits are inherently multi-dimensional, and technologies are advancing for rapidly measuring many types of such highly complex traits. Making efficient use of these new, larger datasets requires new statistical models for to biological inference. In this thesis, I develop a method for relating two very different types of traits in sea urchin larvae: morphological shape, and developmental gene expression. In particular, I develop an approach for regression modeling using shape as a response variable. I use this method to address the question of whether variation in the expression of regulatory genes during development predicts later morphological variation in the larvae. I propose a hierarchical random effects factor regression model with shape as a response variable for relating morphology and gene expression when the individuals in each dataset are related, but not identical. I fit an approximation to the general model by breaking it into three discrete steps. I find that gene expression can explain ~25% of mean symmetric form variation among cultures of related larvae, and identify several groups of related genes that are correlated with aspects of morphological variation.
Item Open Access Bayesian Structural Phylogenetics(2013) Challis, ChristopherThis thesis concerns the use of protein structure to improve phylogenetic inference. There has been growing interest in phylogenetics as the number of available DNA and protein sequences continues to grow rapidly and demand from other scientific fields increases. It is now well understood that phylogenies should be inferred jointly with alignment through use of stochastic evolutionary models. It has not been possible, however, to incorporate protein structure in this framework. Protein structure is more strongly conserved than sequence over long distances, so an important source of information, particularly for alignment, has been left out of analyses.
I present a stochastic process model for the joint evolution of protein primary and tertiary structure, suitable for use in alignment and estimation of phylogeny. Indels arise from a classic Links model and mutations follow a standard substitution matrix, while backbone atoms diffuse in three-dimensional space according to an Ornstein-Uhlenbeck process. The model allows for simultaneous estimation of evolutionary distances, indel rates, structural drift rates, and alignments, while fully accounting for uncertainty. The inclusion of structural information enables pairwise evolutionary distance estimation on time scales not previously attainable with sequence evolution models. Ideally inference should not be performed in a pairwise fashion between proteins, but in a fully Bayesian setting simultaneously estimating the phylogenetic tree, alignment, and model parameters. I extend the initial pairwise model to this framework and explore model variants which improve agreement between sequence and structure information. The model also allows for estimation of heterogeneous rates of structural evolution throughout the tree, identifying groups of proteins structurally evolving at different speeds. In order to explore the posterior over topologies by Markov chain Monte Carlo sampling, I also introduce novel topology + alignment proposals which greatly improve mixing of the underlying Markov chain. I show that the inclusion of structural information reduces both alignment and topology uncertainty. The software is available as plugin to the package StatAlign.
Finally, I also examine limits on statistical inference of phylogeny through sequence information models. These limits arise due to the `cutoff phenomenon,' a term from probability which describes processes which remain far from their equilibrium distribution for some period of time before swiftly transitioning to stationarity. Evolutionary sequence models all exhibit a cutoff; I show how to find the cutoff for specific models and sequences and relate the cutoff explicitly to increased uncertainty in inference of evolutionary distances. I give theoretical results for symmetric models, and demonstrate with simulations that these results apply to more realistic and widespread models as well. This analysis also highlights several drawbacks to common default priors for phylogenetic analysis, I and suggest a more useful class of priors.
Item Open Access Conditions for Rapid and Torpid Mixing of Parallel and Simulated Tempering on Multimodal Distributions(2007-09-14) Woodard, Dawn BanisterStochastic sampling methods are ubiquitous in statistical mechanics, Bayesian statistics, and theoretical computer science. However, when the distribution that is being sampled is multimodal, many of these techniques converge slowly, so that a great deal of computing time is necessary to obtain reliable answers. Parallel and simulated tempering are sampling methods that are designed to converge quickly even for multimodal distributions. In this thesis, we assess the extent to which this goal is acheived.We give conditions under which a Markov chain constructed via parallel or simulated tempering is guaranteed to be rapidly mixing, meaning that it converges quickly. These conditions are applicable to a wide range of multimodal distributions arising in Bayesian statistical inference and statistical mechanics. We provide lower bounds on the spectral gaps of parallel and simulated tempering. These bounds imply a single set of sufficient conditions for rapid mixing of both techniques. A direct consequence of our results is rapid mixing of parallel and simulated tempering for several normal mixture models in R^M as M increases, and for the mean-field Ising model.We also obtain upper bounds on the convergence rates of parallel and simulated tempering, yielding a single set of sufficient conditions for torpid mixing of both techniques. These conditions imply torpid mixing of parallel and simulated tempering on a normal mixture model with unequal covariances in $\R^M$ as $M$ increases and on the mean-field Potts model with $q \geq 3$, regardless of the number and choice of temperatures, as well as on the mean-field Ising model if an insufficient (fixed) set of temperatures is used. The latter result is in contrast to the rapid mixing of parallel and simulated tempering on the mean-field Ising model with a linearly increasing set of temperatures.Item Open Access Efficient Enumeration and Visualization of Helix-coil Ensembles.(bioRxiv, 2023-09-17) Schmidler, Scott C; Hughes, Roy Gene; Oas, Terrence G; Zhao, ShiwenHelix-coil models are routinely used to interpret CD data of helical peptides or predict the helicity of naturally-occurring and designed polypeptides. However, a helix-coil model contains significantly more information than mean helicity alone, as it defines the entire ensemble - the equilibrium population of every possible helix-coil configuration - for a given sequence. Many desirable quantities of this ensemble are either not obtained as ensemble averages, or are not available using standard helicity-averaging calculations. Enumeration of the entire ensemble can allow calculation of a wider set of ensemble properties, but the exponential size of the configuration space typically renders this intractable. We present an algorithm that efficiently approximates the helix-coil ensemble to arbitrary accuracy, by sequentially generating a list of the M highest populated configurations in descending order of population. Truncating this list of (configuration, population) pairs at a desired accuracy provides an approximating sub-ensemble. We demonstrate several uses of this approach for providing insight into helix-coil ensembles and folding mechanisms, including landscape visualization.Item Open Access Finite Sample Bounds and Path Selection for Sequential Monte Carlo(2018) Marion, JosephSequential Monte Carlo (SMC) samplers have received attention as an alternative to Markov chain Monte Carlo for Bayesian inference problems due to their strong empirical performance on difficult multimodal problems, natural synergy with parallel computing environments, and accuracy when estimating ratios of normalizing constants. However, while these properties have been demonstrated empirically, the extent of these advantages remain unexplored theoretically. Typical convergence results for SMC are limited to root N results; they obscure the relationship between the algorithmic factors (weights, Markov kernels, target distribution) and the error of the resulting estimator. This limitation makes it difficult to compare SMC to other estimation methods and challenging to design efficient SMC algorithms from a theoretical perspective.
In this thesis, we provide conditions under which SMC provides a randomized approximation scheme, showing how to choose the number of of particles and Markov kernel transitions at each SMC step in order to ensure an accurate approximation with bounded error. These conditions rely on the sequence of SMC interpolating distributions and the warm mixing times of the Markov kernels, explicitly relating the algorithmic choices to the error of the SMC estimate. This allows us to provide finite-sample complexity bounds for SMC in a variety of settings, including finite state-spaces, product spaces, and log-concave target distributions.
A key advantage of this approach is that the bounds provide insight into the selection of efficient sequences of SMC distributions. When the target distribution is spherical Gaussian or log-concave, we show that judicious selection of interpolating distributions results in an SMC algorithm with a smaller complexity bound than MCMC. These results are used to motivate the use of a well known SMC algorithm that adaptively chooses interpolating distributions. We provide conditions under which the adaptive algorithm gives a randomized approximation scheme, providing theoretical validation for the automatic selection of SMC distributions.
Selecting efficient sequences of distributions is a problem that also arises in the estimation of normalizing constants using path sampling. In the final chapter of this thesis, we develop automatic methods for choosing sequences of distributions that provide low-variance path sampling estimators. These approaches are motived by properties of the theoretically optimal, lowest-variance path, which is given by the geodesic of the Riemann manifold associated with the path sampling family. For one dimensional paths we provide a greedy approach to step size selection that has good empirical performance. For multidimensional paths, we present an approach using Gaussian process emulation that efficiently finds low variance paths in this more complicated setting.
Item Open Access Geometric Ergodicity of Two–dimensional Hamiltonian systems with a Lennard–Jones–like Repulsive Potential(arXiv preprint arXiv:1104.3842, 2011) Cooke, Ben; Mattingly, Jonathan C; McKinley, Scott A; Schmidler, Scott CItem Open Access Intergenic and genic sequence lengths have opposite relationships with respect to gene expression.(PLoS One, 2008) Colinas, Juliette; Schmidler, Scott C; Bohrer, Gil; Iordanov, Borislav; Benfey, Philip NEukaryotic genomes are mostly composed of noncoding DNA whose role is still poorly understood. Studies in several organisms have shown correlations between the length of the intergenic and genic sequences of a gene and the expression of its corresponding mRNA transcript. Some studies have found a positive relationship between intergenic sequence length and expression diversity between tissues, and concluded that genes under greater regulatory control require more regulatory information in their intergenic sequences. Other reports found a negative relationship between expression level and gene length and the interpretation was that there is selection pressure for highly expressed genes to remain small. However, a correlation between gene sequence length and expression diversity, opposite to that observed for intergenic sequences, has also been reported, and to date there is no testable explanation for this observation. To shed light on these varied and sometimes conflicting results, we performed a thorough study of the relationships between sequence length and gene expression using cell-type (tissue) specific microarray data in Arabidopsis thaliana. We measured median gene expression across tissues (expression level), expression variability between tissues (expression pattern uniformity), and expression variability between replicates (expression noise). We found that intergenic (upstream and downstream) and genic (coding and noncoding) sequences have generally opposite relationships with respect to expression, whether it is tissue variability, median, or expression noise. To explain these results we propose a model, in which the lengths of the intergenic and genic sequences have opposite effects on the ability of the transcribed region of the gene to be epigenetically regulated for differential expression. These findings could shed light on the role and influence of noncoding sequences on gene expression.Item Open Access Monitoring and Improving Markov Chain Monte Carlo Convergence by Partitioning(2015) VanDerwerken, DouglasSince Bayes' Theorem was first published in 1762, many have argued for the Bayesian paradigm on purely philosophical grounds. For much of this time, however, practical implementation of Bayesian methods was limited to a relatively small class of "conjugate" or otherwise computationally tractable problems. With the development of Markov chain Monte Carlo (MCMC) and improvements in computers over the last few decades, the number of problems amenable to Bayesian analysis has increased dramatically. The ensuing spread of Bayesian modeling has led to new computational challenges as models become more complex and higher-dimensional, and both parameter sets and data sets become orders of magnitude larger. This dissertation introduces methodological improvements to deal with these challenges. These include methods for enhanced convergence assessment, for parallelization of MCMC, for estimation of the convergence rate, and for estimation of normalizing constants. A recurring theme across these methods is the utilization of one or more chain-dependent partitions of the state space.
Item Open Access Path Optimization in Free Energy Calculations(2016) Muraglia, RyanFree energy calculations are a computational method for determining thermodynamic quantities, such as free energies of binding, via simulation.
Currently, due to computational and algorithmic limitations, free energy calculations are limited in scope.
In this work, we propose two methods for improving the efficiency of free energy calculations.
First, we expand the state space of alchemical intermediates, and show that this expansion enables us to calculate free energies along lower variance paths.
We use Q-learning, a reinforcement learning technique, to discover and optimize paths at low computational cost.
Second, we reduce the cost of sampling along a given path by using sequential Monte Carlo samplers.
We develop a new free energy estimator, pCrooks (pairwise Crooks), a variant on the Crooks fluctuation theorem (CFT), which enables decomposition of the variance of the free energy estimate for discrete paths, while retaining beneficial characteristics of CFT.
Combining these two advancements, we show that for some test models, optimal expanded-space paths have a nearly 80% reduction in variance relative to the standard path.
Additionally, our free energy estimator converges at a more consistent rate and on average 1.8 times faster when we enable path searching, even when the cost of path discovery and refinement is considered.
Item Open Access Theory and Practice in Replica-Exchange Molecular Dynamics Simulation(2008-11-26) Cooke, BenjaminWe study the comparison of computational simulations of biomolecules to experimental data. We study the convergence of these simulations to equilibrium and determine measures of variance of the data using statistical methods. We run replica-exchange molecular dynamics (REMD) simulations of eight helical peptides and compare the simulation helicity to the experimentally measured helicity of the peptides. We use one-way sensitivity analysis to determine which parameter changes have a large effect on helicity measurements and use Bayesian updating for a parameter of the AMBER potential. We then consider the theoretical convergence behavior of the REMD algorithm itself by evaluating the properties of the isothermal numerical integrators used in the underlying MD. The underlying constant-temperature integrators explored in this thesis represent a majority of the deterministic isothermal methods used with REMD simulations and we show that these methods either fail to be measure-invariant or are not ergodic. For each of the non-ergodic integrators we show that REMD fails to be ergodic when run with the integrator. We give computational results from examples to demonstrate the practical implications of non-ergodicity and describe hybrid Monte Carlo, a method that leads to ergodicity. Finally, we consider the use of stochastic Langevin dynamics to simulate isothermal MD. We show geometric ergodicity of the Langevin diffusion over a simplified system with the eventual goal of determining geometric ergodicity for Langevin dynamics over the full AMBER potential.