# Browsing by Author "Schmidler, Scott C"

Item Open Access Advances in Bayesian Modeling of Protein Structure Evolution (2018) Larson, Gary

This thesis contributes to a statistical modeling framework for protein sequence and structure evolution. An existing Bayesian model for protein structure evolution is extended in two unique ways. Each of these model extensions addresses an important limitation which has not yet been satisfactorily addressed in the wider literature. These extensions are followed by work regarding inherent statistical bias in models for sequence evolution.

Most available models for protein structure evolution do not model interdependence between the backbone sites of the protein, yet the assumption that the sites evolve independently is known to be false. I argue that ignoring such dependence leads to biased estimation of evolutionary distance between proteins. To mitigate this bias, I express an existing Bayesian model in a generalized form and introduce site-dependence via the generalized model. In the process, I show that the effect of protein structure information on the measure of evolutionary distance can be suppressed by the model formulation, and I further modify the model to help mitigate this problem. In addition to the statistical model itself, I provide computational details and computer code. I modify a well-known bioinformatics algorithm in order to preserve efficient computation under this model. The modified algorithm can be easily understood and used by practitioners familiar with the original algorithm. My approach to modeling dependence is computationally tractable and interpretable with little additional computational burden over the model on which it is based.

The second model expansion allows for evolutionary inference on protein pairs having structural discrepancies attributable to backbone flexion. Thus, the model expansion exposes flexible protein structures to the capabilities of Bayesian protein structure alignment and phylogenetics. Unlike most of the few existing methods that deal with flexible protein structures, our Bayesian flexible alignment model requires no prior knowledge of the presence or absence of flexion points in the protein structure, and uncertainty measures are available for the alignment and other parameters of interest. The model can detect subtle flexion while not overfitting non-flexible protein pairs, and is demonstrated to improve phylogenetic inference in a simulated data setting and in a difficult-to-align set of proteins. The flexible model is a unique addition to the small but growing set of tools available for analysis of flexible protein structure. The ability to perform inference on flexible proteins in a Bayesian framework is likely to be of immediate interest to the structural phylogenetics community.

Finally, I present work related to the study of bias in site-independent models for sequence evolution. In the case of binary sequences, I discuss strategies for theoretical proof of bias and provide various details to that end, including detailing efforts undertaken to produce a site-dependent sequence model with similar properties to the site-dependent structural model introduced in an earlier chapter. I highlight the challenges of theoretical proof for this bias and include miscellaneous related work of general interest to researchers studying dependent sequence models.

Item Open Access Bayesian Modeling and Adaptive Monte Carlo with Geophysics Applications (2013) Wang, Jianyu

The first part of the thesis focuses on the development of Bayesian modeling motivated by geophysics applications. In Chapter 2, we model the frequency of pyroclastic flows collected from the Soufriere Hills volcano. Multiple change points within the dataset reveal several limitations of existing methods in the literature. We propose Bayesian hierarchical models (BBH) by introducing an extra level of hierarchy with hyperparameters, adding a penalty term to constrain close consecutive rates, and using a mixture prior distribution to more accurately match certain circumstances in reality. We end the chapter with a description of the prediction procedure, which is the biggest advantage of the BBH in comparison with other existing methods. In Chapter 3, we develop new statistical techniques to model and relate three complex processes and datasets: the process of extrusion of magma into the lava dome, the growth of the dome as measured by its height, and the rockfalls as an indication of the dome's instability. First, we study the dynamic Negative Binomial branching process and use it to model the rockfalls. Moreover, a generalized regression model is proposed to regress daily rockfall numbers on the extrusion rate and dome height. Furthermore, we solve an inverse problem from the regression model and predict extrusion rate based on rockfalls and dome height.

The other focus of the thesis is adaptive Markov chain Monte Carlo (MCMC) methods. In Chapter 4, we improve upon the Wang-Landau (WL) algorithm. The WL algorithm is an adaptive sampling scheme that modifies the target distribution to enable the chain to visit low-density regions of the state space. However, the approach relies heavily on a partition of the state space that is left to the user to specify. As a result, the implementation and the use of the algorithm are time-consuming and less automatic. We propose an automatic, adaptive partitioning scheme which continually refines the initial partition as needed during sampling. We show that this overcomes the limitations of the input user-specified partition, making the algorithm significantly more automatic and user-friendly while also making the performance dramatically more reliable and robust. In Chapter 5, we consider the convergence and autocorrelation aspects of MCMC. We propose an Exploration/Exploitation (XX) approach to constructing adaptive MCMC algorithms, which combines adaptation schemes of distinct types. The exploration piece uses adaptation strategies aiming at exploring new regions of the target distribution and thus improving the rate of convergence to equilibrium. The exploitation piece involves an adaptation component which decreases autocorrelation for sampling among regions already discovered. We demonstrate that the combined XX algorithm significantly outperforms either original algorithm on difficult multimodal sampling problems.
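To make the role of the user-specified partition concrete, here is a minimal sketch of a Wang-Landau style sampler on a fixed partition of the real line. It is not the thesis's adaptive partitioning scheme, which the abstract does not specify. The bimodal target, the cut points in `edges`, the random-walk proposal, and the function names are all assumptions chosen for illustration, and the sketch uses a deterministic decaying step size (`t ** -0.6`) in place of the flat-histogram criterion of the original algorithm.

```python
import bisect
import math
import random


def log_target(x):
    """Unnormalized log-density of a hypothetical bimodal target (modes at -4 and 4)."""
    a = -0.5 * (x + 4.0) ** 2
    b = -0.5 * (x - 4.0) ** 2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def wang_landau(n_iter, edges, step=1.0, seed=0):
    """Wang-Landau style sampler on a fixed partition.

    edges: sorted cut points; they define len(edges) + 1 bins on the real line.
    Returns the final log-weights per bin and the visit count per bin.
    """
    rng = random.Random(seed)
    n_bins = len(edges) + 1
    log_theta = [0.0] * n_bins   # adaptive log-weight of each bin
    visits = [0] * n_bins
    x = -4.0                     # start in the left mode
    bx = bisect.bisect_right(edges, x)
    for t in range(1, n_iter + 1):
        # Random-walk proposal.
        y = x + rng.gauss(0.0, step)
        by = bisect.bisect_right(edges, y)
        # Metropolis ratio for the biased target pi(x) / theta[bin(x)].
        log_ratio = (log_target(y) - log_theta[by]) - (log_target(x) - log_theta[bx])
        if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
            x, bx = y, by
        # Penalize the bin currently occupied so the chain is pushed elsewhere.
        log_theta[bx] += t ** -0.6
        visits[bx] += 1
    return log_theta, visits
```

Calling `wang_landau(20000, [-2.0, 0.0, 2.0])` runs the sampler on four bins. Because the bias is constant within a bin, a coarse partition leaves the target unflattened inside each bin, which is the kind of sensitivity to the partition that the abstract describes.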

Item Open Access Bayesian Modeling for Identifying Selection in B cell Maturation (2023) Tang, Tengjie

This thesis focuses on modeling the selection effects on B cell antibody mutations to identify amino acids under strong selection. Site-wise selection coefficients are parameterized by the fitnesses of amino acids. First, we conduct simulation studies to evaluate the accuracy of the Monte Carlo p-value approach for identifying selection for specific amino acid/location combinations. Then, we adopt Bayesian methods to infer location-specific fitness parameters for each amino acid. In particular, we propose the use of a spike-and-slab prior and implement Markov chain Monte Carlo (MCMC) algorithms for posterior sampling. Further simulation studies are conducted to evaluate the performance of the proposed Bayesian methods in inferring fitness parameters and identifying strong selection. The results demonstrate the reliable inference and detection performance of the proposed Bayesian methods. Finally, an example using real antibody sequences is provided. This work can help identify important early mutations in B cell antibodies, which is crucial for developing an effective HIV vaccine.
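The Monte Carlo p-value approach mentioned in this abstract can be illustrated with a generic sketch. The abstract does not describe the thesis's test statistic or null simulation, so the binomial null below (a count of one amino acid at one site across `n_seq` sequences, each mutating with probability `p_neutral`) is a hypothetical stand-in, as are all function and parameter names.

```python
import random


def monte_carlo_p_value(observed_stat, simulate_stat, n_sim=999, seed=0):
    """One-sided Monte Carlo p-value: (1 + #{simulated >= observed}) / (1 + n_sim)."""
    rng = random.Random(seed)
    exceed = sum(1 for _ in range(n_sim) if simulate_stat(rng) >= observed_stat)
    return (1 + exceed) / (1 + n_sim)


def make_neutral_simulator(n_seq, p_neutral):
    """Hypothetical null model: each of n_seq sequences independently carries
    the amino acid of interest with probability p_neutral."""
    def simulate(rng):
        return sum(1 for _ in range(n_seq) if rng.random() < p_neutral)
    return simulate
```

For example, `monte_carlo_p_value(12, make_neutral_simulator(50, 0.1))` estimates how often a neutral model would produce at least 12 occurrences in 50 sequences. Adding 1 to the numerator and denominator keeps the estimate strictly positive, so the smallest attainable p-value is `1 / (1 + n_sim)`.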

Item Open Access Efficient Enumeration and Visualization of Helix-coil Ensembles. (bioRxiv, 2023-09-17) Schmidler, Scott C; Hughes, Roy Gene; Oas, Terrence G; Zhao, Shiwen

Helix-coil models are routinely used to interpret CD data of helical peptides or predict the helicity of naturally-occurring and designed polypeptides. However, a helix-coil model contains significantly more information than mean helicity alone, as it defines the entire ensemble - the equilibrium population of every possible helix-coil configuration - for a given sequence. Many desirable quantities of this ensemble are either not obtained as ensemble averages, or are not available using standard helicity-averaging calculations. Enumeration of the entire ensemble can allow calculation of a wider set of ensemble properties, but the exponential size of the configuration space typically renders this intractable. We present an algorithm that efficiently approximates the helix-coil ensemble to arbitrary accuracy, by sequentially generating a list of the M highest populated configurations in descending order of population. Truncating this list of (configuration, population) pairs at a desired accuracy provides an approximating sub-ensemble. We demonstrate several uses of this approach for providing insight into helix-coil ensembles and folding mechanisms, including landscape visualization.

Item Open Access Geometric Ergodicity of Two–dimensional Hamiltonian systems with a Lennard–Jones–like Repulsive Potential (arXiv preprint arXiv:1104.3842, 2011) Cooke, Ben; Mattingly, Jonathan C; McKinley, Scott A; Schmidler, Scott C

Item Open Access Intergenic and genic sequence lengths have opposite relationships with respect to gene expression. (PLoS One, 2008) Colinas, Juliette; Schmidler, Scott C; Bohrer, Gil; Iordanov, Borislav; Benfey, Philip N

Eukaryotic genomes are mostly composed of noncoding DNA whose role is still poorly understood. Studies in several organisms have shown correlations between the length of the intergenic and genic sequences of a gene and the expression of its corresponding mRNA transcript. Some studies have found a positive relationship between intergenic sequence length and expression diversity between tissues, and concluded that genes under greater regulatory control require more regulatory information in their intergenic sequences. Other reports found a negative relationship between expression level and gene length and the interpretation was that there is selection pressure for highly expressed genes to remain small. However, a correlation between gene sequence length and expression diversity, opposite to that observed for intergenic sequences, has also been reported, and to date there is no testable explanation for this observation. To shed light on these varied and sometimes conflicting results, we performed a thorough study of the relationships between sequence length and gene expression using cell-type (tissue) specific microarray data in Arabidopsis thaliana. We measured median gene expression across tissues (expression level), expression variability between tissues (expression pattern uniformity), and expression variability between replicates (expression noise). We found that intergenic (upstream and downstream) and genic (coding and noncoding) sequences have generally opposite relationships with respect to expression, whether it is tissue variability, median, or expression noise. To explain these results we propose a model, in which the lengths of the intergenic and genic sequences have opposite effects on the ability of the transcribed region of the gene to be epigenetically regulated for differential expression. These findings could shed light on the role and influence of noncoding sequences on gene expression.

Item Open Access Monitoring and Improving Markov Chain Monte Carlo Convergence by Partitioning (2015) VanDerwerken, Douglas

Since Bayes' Theorem was first published in 1762, many have argued for the Bayesian paradigm on purely philosophical grounds. For much of this time, however, practical implementation of Bayesian methods was limited to a relatively small class of "conjugate" or otherwise computationally tractable problems. With the development of Markov chain Monte Carlo (MCMC) and improvements in computers over the last few decades, the number of problems amenable to Bayesian analysis has increased dramatically. The ensuing spread of Bayesian modeling has led to new computational challenges as models become more complex and higher-dimensional, and both parameter sets and data sets become orders of magnitude larger. This dissertation introduces methodological improvements to deal with these challenges. These include methods for enhanced convergence assessment, for parallelization of MCMC, for estimation of the convergence rate, and for estimation of normalizing constants. A recurring theme across these methods is the utilization of one or more chain-dependent partitions of the state space.