Browsing by Department "Computational Biology and Bioinformatics"
Item Open Access A Bayesian Hierarchical Model with SNP-level Functional Priors Applied to a Pathway-wide Association Study. (2010) Huang, Weizi
Tremendous effort has been put into studying the etiology of complex diseases, including breast cancer, type 2 diabetes, cardiovascular diseases, and prostate cancer. Despite the large number of reported disease-associated loci, few have been replicated, and some true associations do not belong to the group of most significant loci reported. We built a Bayesian hierarchical model incorporating SNP-level functional data that can help identify associated SNPs in pathway-wide association studies.
We applied the model to an association study of serous invasive ovarian cancer based on the DNA repair and apoptosis pathways. We found that, using our model, blocks of SNPs located in regions enriched for missense SNPs or gene inversions were more likely to be identified as candidates for association.
Item Open Access A Cloud-Based Infrastructure for Cancer Genomics (2020) Panea, Razvan Ioan
The advent of new genomic approaches, particularly next generation sequencing (NGS), has resulted in explosive growth of biological data. As the size of biological data keeps growing at exponential rates, new methods for data management and data processing are becoming essential in bioinformatics and computational biology. Indeed, data analysis has now become the central challenge in genomics.
NGS has provided rich tools for defining genomic alterations that cause cancer. The processing time and computing requirements have now become a serious bottleneck to the characterization and analysis of these genomic alterations. Moreover, as the adoption of NGS continues to increase, the computing power required often exceeds what any single institution can provide, leading to major constraints on the type and number of analyses that can be performed.
Cloud computing represents a potential solution to this problem. On a cloud platform, computing resources can be available on-demand, thus allowing users to implement scalable and highly parallel methods. However, few centralized frameworks exist that enable the average researcher to apply bioinformatics workflows using cloud resources. Moreover, bioinformatics approaches are associated with multiple processing challenges, such as the variability in the methods or data used and the reproducibility requirements of the research analysis.
Here, we present CloudConductor, a software system that is specifically designed to harness the power of cloud computing to perform complex analysis pipelines on large biological datasets. CloudConductor was designed with five central features in mind: scalability, modularity, parallelism, reproducibility and platform agnosticism.
We demonstrate the processing power afforded by CloudConductor on a real-world genomics problem. Using CloudConductor, we processed and analyzed 101 whole genome tumor-normal paired samples from Burkitt lymphoma subtypes to identify novel genomic alterations. We identified a total of 72 driver genes associated with the disease. Somatic events were identified in both coding and non-coding regions of nearly all driver genes, notably in genes IGLL5, BACH2, SIN3A, and DNMT1. We have developed the analysis framework by implementing a graphical user interface, a back-end database system, a data loader and a workflow management system.
In this thesis, we develop the concepts and describe an implementation of automated cloud-based infrastructure to analyze genomics data, creating a fast and efficient analysis resource for genomics researchers.
Item Open Access A Computational Synthesis of Genes, Behavior, and Evolution Provides Insights into the Molecular Basis of Vocal Learning (2012) Pfenning, Andreas R
Vocal learning is the ability to modify vocal output based on auditory input and is the basis of human speech acquisition. It is shared by few distantly related bird and mammal orders, and is thus very likely to be an example of convergent evolution, having evolved independently in multiple lineages. This complex behavior is presumed to require networks of regulated genes to develop the necessary neural circuits for learning and maintaining vocalizations. Deciphering these networks has been limited by the lack of high throughput genomic tools in vocal learning avian species and the lack of a solid computational framework to understand the relationship between gene expression and behavior. This dissertation provides new insights into the evolution and mechanisms of vocal learning by taking a top-down, systems biology approach to understanding gene expression regulation across avian and mammalian species. First, I worked with colleagues to develop a zebra finch Agilent oligonucleotide microarray, including developing programs for more accurate annotation of oligonucleotides and genes. I then used these arrays and tools in multiple collaborative, but related projects, to measure transcriptome expression data in vocal learning and non-learning avian species, under a number of behavioral paradigms, with a focus on song production. To make sense of the avian microarray data, I compiled microarray data from other sources, including expression analyses across over 900 human brain regions generated by Allen Brain Institute.
To compare these data sets, I developed and performed a variety of computational analyses including clustering, linear models, gene set enrichment analysis, motif discovery, and phylogenetic inference, providing a novel framework to study the gene regulatory networks associated with a complex behavior. Using the developed framework, we are able to better understand vocal learning at different levels: how the brain regions for vocal learning evolved and how those brain regions function during the production of learned vocalizations. At the evolutionary level, we identified genes with unique expression patterns in the brains of vocal learning birds and humans. Interesting candidates include genes related to formation of neural connections, in particular the SLIT/ROBO axon guidance pathway. This algorithm also allowed us to identify the analogous regions that are a part of vocal learning circuit across species, providing the first quantitative evidence relating the human vocal learning circuit to the avian vocal learning circuit. With the avian song system verified as a model for human speech at the molecular level, we conducted an experiment to better understand what is happening in those brain regions during singing by profiling gene expression in a time course as birds are producing song. Surprisingly, an overwhelming majority of the gene expression identified was strongly enriched in a particular region. We also found a tight coupling between the behavioral function of a particular region and the gene expression pattern. To gain insight into the mechanisms of this gene regulation, we conducted a genomic scan of transcription factor binding sites in zebra finch. Many transcription factor binding sites were enriched in the promoters of genes with particular temporal patterns, several of which had already been hypothesized to play a role in the neural system.
Using this data set of gene expression profiles and transcription factor binding sites along with separate experiments conducted in mouse, we were able to uncover evidence that the transcription factor CARF plays a role in neuron homeostasis. These results have broadened our understanding of the molecular basis of vocal learning at multiple levels. Overall, this dissertation outlines a novel way of approaching the study of the relationship between genes and behavior.
Item Open Access Analysis and Error Correction in Structures of Macromolecular Interiors and Interfaces (2009) Headd, Jeffrey John
As of late 2009, the Protein Data Bank (PDB) has grown to contain over 70,000 models. This recent increase in the amount of structural data allows for more extensive explication of the governing principles of macromolecular folding and association to complement traditional studies focused on a single molecule or complex. PDB-wide characterization of structural features yields insights that are useful in prediction and validation of the 3D structure of macromolecules and their complexes. Here, these insights lead to a deeper understanding of protein-protein interfaces, full-atom critical assessment of increasingly more accurate structure predictions, a better defined library of RNA backbone conformers for validation and building 3D models, and knowledge-based automatic correction of errors in protein sidechain rotamers.
My study of protein-protein interfaces identifies amino acid pairing preferences in a set of 146 transient interfaces. Using a geometric interface surface definition devoid of arbitrary cutoffs common to previous studies of interface composition, I calculate inter- and intrachain amino acid pairing preferences. As expected, salt-bridges and hydrophobic patches are prevalent, but likelihood correction of observed pairing frequencies reveals some surprising pairing preferences, such as Cys-His interchain pairs and Met-Met intrachain pairs. To complement my statistical observations, I introduce a 2D visualization of the 3D interface surface that can display a variety of interface characteristics, including residue type, atomic distance and backbone/sidechain composition.
My study of protein interiors finds that 3D structure prediction from sequence (as part of the CASP experiment) is very close to full-atom accuracy. Validation of structure prediction should therefore consider all atom positions instead of the traditional Cα-only evaluation. I introduce six new full-model quality criteria to assess the accuracy of CASP predictions, which demonstrate that groups who use structural knowledge culled from the PDB to inform their prediction protocols produce the most accurate results.
My study of RNA backbone introduces a set of rotamer-like "suite" conformers. Initially hand-identified by the Richardson laboratory, these 7D conformers represent backbone segments that are found to be genuine and favorable. X-ray crystallographers can use backbone conformers for model building in often poor backbone density and in validation after refinement. Increasing amounts of high quality RNA data allow for improved conformer identification, but also complicate hand-curation. I demonstrate that affinity propagation successfully differentiates between two related but distinct suite conformers, and is a useful tool for automated conformer clustering.
My study of protein sidechain rotamers in X-ray structures identifies a class of systematic errors that results in sidechains misfit by approximately 180 degrees. I introduce Autofix, a method for automated detection and correction of such errors. Autofix corrects over 40% of errors for Leu, Thr, and Val residues, and a significant number of Arg residues. On average, Autofix made four corrections per PDB file in 945 X-ray structures. Autofix will be implemented into MolProbity and PHENIX for easy integration into X-ray crystallography workflows.
Item Open Access Automated Microscopy and High Throughput Image Analysis in Arabidopsis and Drosophila (2009) Mace, Daniel L.
Development of a single cell into an adult organism is accomplished through an elaborate and complex cascade of spatiotemporal gene expression. While methods exist for capturing spatiotemporal expression patterns (in situ hybridization, reporter constructs, fluorescent tags), these methods have been highly laborious, and results are frequently assessed by subjective qualitative comparisons. To address these issues, methods must be developed for automating the capture of images, as well as for the normalization and quantification of the resulting data. In this thesis, I design computational approaches for high throughput image analysis which can be grouped into three main areas. First, I develop methods for the capture of high resolution images from high throughput platforms. In addition to the informatics aspect of this problem, I also devise a novel multiscale probabilistic model that allows us to identify and segment objects in an automated fashion. Second, high resolution images must be registered and normalized to a common frame of reference for cross image comparisons. To address these issues, I implement approaches for image registration using statistical shape models and non-rigid registration. Lastly, I validate the spatial expression data obtained from microscopy images against other known spatial expression methods, and develop methods for comparing and calculating the significance between spatial expression patterns. I demonstrate these methods on two model developmental organisms: Arabidopsis and Drosophila.
Item Open Access Bayesian meta-analysis models for heterogeneous genomics data (2013) Zheng, Lingling
The accumulation of high-throughput data from vast sources has drawn considerable attention to the development of methods for extracting meaningful information from massive data. More interesting questions arise from how to combine the disparate information, which goes beyond modeling sparsity and dimension reduction. This dissertation focuses on innovations in the area of heterogeneous data integration.
Chapter 1 contextualizes this dissertation by introducing different aspects of meta-analysis and model frameworks for high-dimensional genomic data.
Chapter 2 introduces a novel technique, joint Bayesian sparse factor analysis model, to vertically integrate multi-dimensional genomic data from different platforms.
Chapter 3 extends the above model to a nonparametric Bayes formulation that directly infers the number of factors through a model-based approach.
On the other hand, chapter 4 deals with horizontal integration of diverse gene expression data; the model infers pathway activities across various experimental conditions.
All the methods mentioned above are demonstrated in both simulation studies and real data applications in chapters 2-4.
Finally, chapter 5 summarizes the dissertation and discusses future directions.
Item Open Access Bayesian modeling of microbial physiology (2017) Tonner, Peter
Microbial population growth measurements are widespread in the study of microorganisms, providing insight into areas including genetics, physiology, and engineering. The most common models of microbial population growth data are parametric, and are derived from specific assumptions about the underlying growth process. While useful in cases where these assumptions are valid, these models are inadequate in many cases typically found in microbial growth studies, including the presence of significant population death and the presence of multiple growth phases (e.g. diauxie). Here, we explore the use of Gaussian processes, a Bayesian non-parametric model, for microbial population growth. We first develop a general hypothesis test using Gaussian process regression and false-discovery rate corrected Bayes factor scores. We then explore a fully Bayesian model with Gaussian process priors that can capture the latent growth processes of many population measurements under a single model. Finally, we develop a hierarchical Bayesian model with GP priors in order to capture random effects in microbial population growth data.
Item Open Access Bayesian Multivariate Count Models for the Analysis of Microbiome Studies (2019) Silverman, Justin David
Advances in high-throughput DNA sequencing allow for rapid and affordable surveys of thousands of bacterial taxa across thousands of samples. The exploding availability of sequencing data has poised microbiota research to advance our understanding of fields as diverse as ecology, evolution, medicine, and agriculture. Yet, while microbiota data is now ubiquitous, methods for the analysis of such data remain underdeveloped. This gap reflects the challenge of analyzing sparse high-dimensional count data that contains compositional (relative abundance) information. To address these challenges this dissertation introduces a number of tools for Bayesian inference applied to microbiome data. A central theme throughout this work is the use of multinomial logistic-normal models which are found to concisely address these challenges. In particular, the connection between the logistic-normal distribution and the Aitchison geometry of the simplex is commonly used to develop interpretable tools for the analysis of microbiome data.
The structure of this dissertation is as follows. Chapter 1 introduces key challenges in the analysis of microbiome data. Chapter 2 introduces a novel log-ratio transform between the simplex and real space to enable the development of statistical tools for compositional data with phylogenetic structure. Chapter 3 introduces a multinomial logistic-normal generalized dynamic linear modelling framework for analysis of microbiome time-series data. Chapter 4 explores the analysis of zero values in sequence count data from a stochastic process perspective and demonstrates that zero-inflated models often produce counter-intuitive results in this regime. Finally, Chapter 5 introduces the theory of Marginally Latent Matrix-T Processes as a means of developing efficient, accurate inference for a large class of multinomial logistic-normal models, including linear regression, non-linear regression, and dynamic linear models. Notably, the inference schemes developed in Chapter 5 are found to often be orders of magnitude faster than Hamiltonian Monte Carlo without sacrificing accuracy in point estimation or uncertainty quantification.
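For readers unfamiliar with log-ratio transforms, the general idea of mapping compositions from the simplex to real space can be illustrated with the standard centered log-ratio (clr) transform from Aitchison's compositional data analysis. This is only a minimal sketch of that textbook transform, not the novel phylogenetic transform the dissertation introduces; the function names and the example counts are hypothetical.

```python
import math


def closure(parts):
    """Rescale positive parts so they sum to 1 (a point on the simplex)."""
    total = sum(parts)
    return [p / total for p in parts]


def clr(parts):
    """Centered log-ratio: log of each part minus the mean of the logs."""
    if any(p <= 0 for p in parts):
        # Logs are undefined at zero, so zero counts need separate handling.
        raise ValueError("clr requires strictly positive parts")
    logs = [math.log(p) for p in parts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


def clr_inverse(coords):
    """Map clr coordinates back to the simplex (exponentiate, then close)."""
    shift = max(coords)  # subtracting the max keeps exp() numerically stable
    return closure([math.exp(c - shift) for c in coords])


# Made-up counts for three taxa in one sample (illustrative only).
counts = [120.0, 30.0, 50.0]
coords = clr(closure(counts))
recovered = clr_inverse(coords)
```

The clr coordinates always sum to zero and are unchanged when all counts are multiplied by a constant, which is why the transform depends only on relative abundances.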
Item Open Access Characterization of Gene Interaction and Assessment of LD Matrix Measures for the Analysis of Biological Pathway Association (2009) Crosslin, David Russell
Leukotrienes are arachidonic acid derivatives long known for their inflammatory properties and their involvement with a number of human diseases, most notably asthma. Recently, leukotriene-based inflammation has also been implicated in atherosclerosis: ALOX5AP and LTA4H, two genes in the leukotriene biosynthesis pathway, have been associated with various cardiovascular disease (CVD) phenotypes. To assess the role of the leukotriene pathway in CVD pathogenesis, we performed genetic association studies of ALOX5AP and LTA4H in a non-familial data set of early onset coronary artery disease. Our results support a modest role for the leukotriene pathway in atherosclerosis pathogenesis, reveal important genomic interactions within the pathway, and suggest the importance of using pathway-based modeling for evaluating the genomics of atherosclerosis susceptibility. Motivated by this need, we investigated the statistical properties of a class of matrix-based statistics to assess epistasis. We simulated multiple two-variant disease models with haplotypes to gain an understanding of pathway interactions in terms of correlation patterns. Our goal was to detect an interaction between multiple disease-causing variants by means of their linkage disequilibrium (LD) patterns with other haplotype markers. The simulated models can be summarized into three categories: 1. No epistasis in the presence of marginal effects and LD; 2. Epistasis in the presence of LD and no marginal effects; and 3. Epistasis in the presence of marginal effects and LD. We then assessed previously introduced single-gene methods that compare whole matrices of Single Nucleotide Polymorphism (SNP) LD between two samples.
These methods include comparing two sets of principal components, a sum-of-squared-differences comparing pairwise LD, and a contrast test that controls for background LD. We also considered a partial least-squares (PLS) approach for modeling gene-gene interactions. Our results indicate that these measures can be used to assess epistasis as well as marginal effects under certain disease models. Understanding and quantifying whole-gene variation and association to disease using multiple SNPs remains a difficult task. Providing a single statistical measure per gene will facilitate combining multiple types of genomic data at a gene-level and will serve as an alternative approach to assess epistasis in genome-wide association studies. The matrix-based measures can also be used in pathway ascertainment tools that require scores on a gene-level.
Item Open Access Characterization of Gene-by-Age Interaction and Gene-by-Gene Interaction In Coronary Artery Disease (2012) Zhao, Yi
The success of genome-wide association studies (GWAS) has been limited by missing heritability and lack of biological relevance of identified variants. We sought to address these issues by characterizing interaction among genotypes and environment using case-control samples enrolled at Duke University Medical Center. First, we studied the impact of age on coronary artery disease (CAD). Gene-by-age (GxAGE) interactions were tested at genome-wide scale, along with genes' marginal effects in age-stratified groups. Based on the interaction model, age acts as a modifier of the age-CAD relationship. SNPs associated with CAD in both young and old demonstrate consistency in effect sizes and directions. In spite of these SNPs, vastly different CAD associated genes were discovered across age and race groups, suggesting age-dependent mechanisms of CAD onset. Second, we explored gene-by-gene interaction (GxG) using a statistical model and compared results to biological evidence. Specifically, we investigated GATA2 as a candidate gene transcription factor, and modeled the interaction with genome-wide SNPs. The genetic effects at interacting loci were modified by GATA2 genotype. Without taking GATA2 variants into account, no marginal main effects were detected. Open access ChIP-seq data was available for comparison with the statistical model, and to relate GWAS findings with biological mechanisms. The agreement between the statistical and biological models was very limited.
Item Open Access Chromatin Determinants of the Eukaryotic DNA Replication Program (2011) Eaton, Matthew Lucas
The accurate and timely replication of eukaryotic DNA during S-phase is of critical importance for the cell and for the inheritance of genetic information. Missteps in the replication program can activate cell cycle checkpoints or, worse, trigger the genomic instability and aneuploidy associated with diseases such as cancer. Eukaryotic DNA replication initiates asynchronously from hundreds to tens of thousands of replication origins spread across the genome. The origins are acted upon independently, but patterns emerge in the form of large-scale replication timing domains. Each of these origins must be localized, and the activation time determined by a system of signals that, though they have yet to be fully understood, are not dependent on the primary DNA sequence. This regulation of DNA replication has been shown to be extremely plastic, changing to fit the needs of cells in development or affected by replication stress.
We have investigated the role of chromatin in specifying the eukaryotic DNA replication program. Chromatin elements, including histone variants, histone modifications and nucleosome positioning, are an attractive candidate for DNA replication control, as they are not specified fully by sequence, and they can be modified to fit the unique needs of a cell without altering the DNA template. The origin recognition complex (ORC) specifies replication origin location by binding the DNA of origins. The S. cerevisiae ORC recognizes the ARS (autonomously replicating sequence) consensus sequence (ACS), but only a subset of potential genomic sites are bound, suggesting other chromosomal features influence ORC binding. Using high-throughput sequencing to map ORC binding and nucleosome positioning, we show that yeast origins are characterized by an asymmetric pattern of positioned nucleosomes flanking the ACS. The origin sequences are sufficient to maintain a nucleosome-free origin; however, ORC is required for the precise positioning of nucleosomes flanking the origin. These findings identify local nucleosomes as an important determinant for origin selection and function. Next, we describe the D. melanogaster replication program in the context of the chromatin and transcription landscape for multiple cell lines using data generated by the modENCODE consortium. We find that while the cell lines exhibit similar replication programs, there are numerous cell line-specific differences that correlate with changes in the chromatin architecture. We identify chromatin features that are associated with replication timing, early origin usage, and ORC binding. Primary sequence, activating chromatin marks, and DNA-binding proteins (including chromatin remodelers) contribute in an additive manner to specify ORC-binding sites. We also generate accurate and predictive models from the chromatin data to describe origin usage and strength between cell lines. 
Multiple activating chromatin modifications contribute to the function and relative strength of replication origins, suggesting that the chromatin environment does not regulate origins of replication as a simple binary switch, but rather acts as a tunable rheostat to regulate replication initiation events.
Taken together our data and analyses imply that the chromatin contains sufficient information to direct the DNA replication program.
Item Open Access Computational Inference of Genome-Wide Protein-DNA Interactions Using High-Throughput Genomic Data (2015) Zhong, Jianling
Transcriptional regulation has been studied intensively in recent decades. One important aspect of this regulation is the interaction between regulatory proteins, such as transcription factors (TF) and nucleosomes, and the genome. Different high-throughput techniques have been invented to map these interactions genome-wide, including ChIP-based methods (ChIP-chip, ChIP-seq, etc.), nuclease digestion methods (DNase-seq, MNase-seq, etc.), and others. However, a single experimental technique often only provides partial and noisy information about the whole picture of protein-DNA interactions. Therefore, the overarching goal of this dissertation is to provide computational developments for jointly modeling different experimental datasets to achieve a holistic inference on the protein-DNA interaction landscape.
We first present a computational framework that can incorporate the protein binding information in MNase-seq data into a thermodynamic model of protein-DNA interaction. We use a correlation-based objective function to model the MNase-seq data and a Markov chain Monte Carlo method to maximize the function. Our results show that the inferred protein-DNA interaction landscape is concordant with the MNase-seq data and provides a mechanistic explanation for the experimentally collected MNase-seq fragments. Our framework is flexible and can easily incorporate other data sources. To demonstrate this flexibility, we use prior distributions to integrate experimentally measured protein concentrations.
We also study the ability of DNase-seq data to position nucleosomes. Traditionally, DNase-seq has only been widely used to identify DNase hypersensitive sites, which tend to be open chromatin regulatory regions devoid of nucleosomes. We reveal for the first time that DNase-seq datasets also contain substantial information about nucleosome translational positioning, and that existing DNase-seq data can be used to infer nucleosome positions with high accuracy. We develop a Bayes-factor-based nucleosome scoring method to position nucleosomes using DNase-seq data. Our approach utilizes several effective strategies to extract nucleosome positioning signals from the noisy DNase-seq data, including jointly modeling data points across the nucleosome body and explicitly modeling the quadratic and oscillatory DNase I digestion pattern on nucleosomes. We show that our DNase-seq-based nucleosome map is highly consistent with previous high-resolution maps. We also show that the oscillatory DNase I digestion pattern is useful in revealing the nucleosome rotational context around TF binding sites.
Finally, we present a state-space model (SSM) for jointly modeling different kinds of genomic data to provide an accurate view of the protein-DNA interaction landscape. We also provide an efficient expectation-maximization algorithm to learn model parameters from data. We first show in simulation studies that the SSM can effectively recover underlying true protein binding configurations. We then apply the SSM to model real genomic data (both DNase-seq and MNase-seq data). Through incrementally increasing the types of genomic data in the SSM, we show that different data types can contribute complementary information for the inference of protein binding landscape and that the most accurate inference comes from modeling all available datasets.
This dissertation provides a foundation for future research by taking a step toward the genome-wide inference of protein-DNA interaction landscape through data integration.
Item Open Access Computational Methods for Comparative Analysis of Rare Cell Subsets in Flow Cytometry (2013) Frelinger, Jacob Jeffrey
Automated analysis techniques for flow cytometry data can address many of the limitations of manual analysis by providing an objective approach for the identification of cellular subsets. While automation has the potential to significantly improve flow cytometry analysis, challenges remain for automated methods in cross-sample analysis for large-scale studies. This thesis presents new methods for data normalization, sample enrichment for rare events of interest, and cell subset relabeling. These methods build upon and extend the use of Gaussian mixture models in automated flow cytometry analysis to enable practical large-scale cell subset identification.
Item Open Access Computational Methods For Functional Motif Identification and Approximate Dimension Reduction in Genomic Data (2011) Georgiev, Stoyan
Uncovering the DNA regulatory logic in complex organisms has been one of the important goals of modern biology in the post-genomic era. The sequencing of multiple genomes in combination with the advent of DNA microarrays and, more recently, of massively parallel high-throughput sequencing technologies has made possible the adoption of a global perspective to the inference of the regulatory rules governing the context-specific interpretation of the genetic code that complements the more focused classical experimental approaches. Extracting useful information and managing the complexity resulting from the sheer volume and the high-dimensionality of the data produced by these genomic assays has emerged as a major challenge which we attempt to address in this work by developing computational methods and tools, specifically designed for the study of the gene regulatory processes in this new global genomic context.
First, we focus on the genome-wide discovery of physical interactions between regulatory sequence regions and their cognate proteins at both the DNA and RNA level. We present a motif analysis framework that leverages the genome-wide evidence for sequence-specific interactions between trans-acting factors and their preferred cis-acting regulatory regions. The utility of the proposed framework is demonstrated on DNA and RNA cross-linking high-throughput data.
A second goal of this thesis is the development of scalable approaches to dimension reduction based on spectral decomposition and their application to the study of population structure in massive high-dimensional genetic data sets. We have developed computational tools and have performed theoretical and empirical analyses of their statistical properties with particular emphasis on the analysis of the individual genetic variation measured by Single Nucleotide Polymorphism (SNP) microarrays.
Item Open Access Computational Methods for Investigating Dendritic Cell Biology (2011) de Oliveira Sales, Ana Paula
The immune system is constantly faced with the daunting task of protecting the host from a large number of ever-evolving pathogens. In vertebrates, the immune response results from the interplay of two cellular systems: the innate immunity and the adaptive immunity. In the past decades, dendritic cells have emerged as major players in the modulation of the immune response, being one of the primary links between these two branches of the immune system.
Dendritic cells are pathogen-sensing cells that alert the rest of the immune system of the presence of infection. The signals sent by dendritic cells result in the recruitment of the appropriate cell types and molecules required for effectively clearing the infection. A question of utmost importance in our understanding of the immune response and our ability to manipulate it in the development of vaccines and therapies is: "How do dendritic cells translate the various cues they perceive from the environment into different signals that specifically activate the appropriate parts of the immune system that result in an immune response streamlined to clear the given pathogen?"
Here we have developed computational and statistical methods aimed at addressing specific aspects of this question. In particular, understanding how dendritic cells ultimately modulate the immune response requires an understanding of the subtleties of their maturation process in response to different environmental signals. Hence, the first part of this dissertation focuses on elucidating the changes in the transcriptional program of dendritic cells in response to the detection of two common pathogen-associated molecules, LPS and CpG. We have developed a method based on Langevin and Dirichlet processes to model and cluster gene expression temporal data, and have used it to identify, on a large scale, genes that present unique and common transcriptional behaviors in response to these two stimuli. Additionally, we have investigated a different, but related, aspect of dendritic cell modulation of the adaptive immune response. In the second part of this dissertation, we present a method to predict peptides that will bind to MHC molecules, a requirement for the activation of pathogen-specific T cells. Together, these studies contribute to the elucidation of important aspects of dendritic cell biology.
Item Open Access Computational Methods to Study Diversification in Pathogens, and Invertebrate and Vertebrate Immune Systems (2010) Munshaw, Supriya Shaunak
Pathogens and host immune systems use strikingly similar methods of diversification. Mechanisms such as point mutations and recombination help pathogens escape the host immune system and similar mechanisms help the host immune system attack rapidly evolving pathogens. Understanding the interplay between pathogen and immune system evolution is crucial to effective drug and vaccine development. In this thesis we employ various computational methods to study diversification in a pathogen, an invertebrate and a vertebrate immune system.
First, we develop a technique for phylogenetic inference in the presence of recombination based on the principle of minimum description length, which assigns a cost (the description length) to each network topology given the observed sequence data. We show that the method performs well on simulated data and demonstrate its application on HIV env gene sequence data from 8 human subjects.
Next, we demonstrate via phylogenetic analysis that the evolution of repeats in an immune-related gene family in Strongylocentrotus purpuratus is the result of recombination and duplication and/or deletion. These results support the evidence suggesting that invertebrate immune systems are highly complex and may employ similar mechanisms for diversification as higher vertebrates.
Third, we develop a probabilistic model of the immunoglobulin (Ig) rearrangement process and a Bayesian method for estimating posterior probabilities for the comparison of multiple plausible rearrangements. We validate the resulting software, SoDA2, using various datasets; in all tests, SoDA2 performed better than other available software.
Finally, we characterize the somatic population genetics of the nucleotide sequences of >1000 recombinant Ig pairs derived from the blood of 5 acute HIV-1 infected (AHI) subjects. We found that the Ig genes from the 20 day AHI PC showed extraordinary clonal relatedness among themselves; a single clone comprised 52 members, with observed and inferred precursor antibodies specific for HIV-1 Env gp41. Antibodies from AHI patients show a decreased CDR3H length and an increased mutation frequency when compared to influenza-vaccinated individuals. The high mutation frequency is coupled with a comparatively low synonymous to non-synonymous mutation ratio in the heavy chain. Our results may suggest the presence of positive antigenic selection in previously triggered non-HIV-1 memory B cells in AHI.
Taken together, the studies presented in this thesis provide methods to study diversification in pathogens, and invertebrate and vertebrate immune systems.
Item Open Access Computational Processing of Omics Data: Implications for Analysis (2013) Benjamin, Ashlee Marie
In this work, I present four studies across the range of 'omics data types: a Genome-Wide Association Study for gene-by-sex interaction of obesity traits, computational models for transcription start site classification, an assessment of reference-based mapping methods for RNA-Seq data from non-model organisms, and a statistical model for open-platform proteomics data alignment.
Obesity is an increasingly prevalent and severe health concern with a substantial heritable component and marked sex differences. We sought to determine if the effect of genetic variants also differed by sex by performing a genome-wide association study modeling the effect of genotype-by-sex interaction on obesity phenotypes. Genotype data from individuals in the Framingham Heart Study Offspring cohort were analyzed across five exams. Although no variants showed genome-wide significant gene-by-sex interaction in any individual exam, four polymorphisms displayed a consistent BMI association (P-values .00186 to .00010) across all five exams. These variants were clustered downstream of LYPLAL1, which encodes a lipase/esterase expressed in adipose tissue, a locus previously identified as having sex-specific effects on central obesity. Primary effects in males were opposite in direction to those in females and were replicated in Framingham Generation 3. Our data support a sex-influenced association between genetic variation at the LYPLAL1 locus and obesity-related traits.
The application of deep sequencing to map 5' capped transcripts has confirmed the existence of at least two distinct promoter classes in metazoans: focused promoters with transcription start sites (TSSs) that occur in a narrowly defined genomic span and dispersed promoters with TSSs that are spread over a larger window. Previous studies have explored the presence of genomic features, such as CpG islands and sequence motifs, in these promoter classes, and our collaborators recently investigated the relationship with chromatin features. It was found that promoter classes are significantly differentiated by nucleosome organization and chromatin structure. Here, we present computational models supporting the stronger contribution of chromatin features to the definition of dispersed promoters compared to focused start sites. Specifically, dispersed promoters display enrichment for well-positioned nucleosomes downstream of the TSS and a more clearly defined nucleosome free region upstream, while focused promoters have a less organized nucleosome structure, yet higher presence of RNA polymerase II. These differences extend to histone variants (H2A.Z) and marks (H3K4 methylation), as well as insulator binding (such as CTCF), independent of the expression levels of affected genes.
The application of next-generation sequencing technology to gene expression quantification analysis, namely, RNA-Sequencing, has transformed the way in which gene expression studies are conducted and analyzed. These advances are of particular interest to researchers studying non-model organisms, as the need for knowledge of sequence information is overcome. De novo assembly methods have gained widespread acceptance in the RNA-Seq community for non-model organisms with no true reference genome or transcriptome. While such methods have tremendous utility, computational complexity is still a significant challenge for organisms with large and complex genomes. Here we present a comparison of four reference-based mapping methods for non-human primate data. We explore mapping efficacy, correlation between computed expression values, and utility for differential expression analyses. We show that reference-based mapping methods indeed have utility in RNA-Seq analysis of mammalian data with no true reference, and that the details of mapping methods should be carefully considered when doing so. We find that shorter seed sequences, allowance of mismatches, and allowance of gapped alignments, in addition to splice junction gaps, result in more sensitive alignments of non-human primate RNA-Seq data.
Open-platform proteomics experiments seek to quantify and identify the proteins present in biological samples. Much like differential gene expression analyses, it is often of interest to determine how protein abundance differs in various physiological conditions. Label-free LC-MS/MS enables the rapid measurement of thousands of proteins, providing a wealth of peptide intensity information for differential analysis. However, the processing of raw proteomics data poses significant challenges that must be overcome prior to analysis. We specifically address the matching of peptide measurements across samples, an essential pre-processing step in every proteomics experiment. Presented here is a novel method for open-platform proteomics data alignment with the ability to incorporate previously unused aspects of the data, particularly ion mobility drift times and product ion data. Our results suggest that the inclusion of additional data results in higher numbers of more confident matches, without increasing the number of mismatches. We also show that the incorporation of product ion data can improve results dramatically. Based on these results, we argue that the incorporation of ion mobility drift times and product ion information is a worthy pursuit. In addition, alignment methods should be flexible enough to utilize all available data, particularly with recent advancements in experimental separation methods. The addition of drift times and/or high energy to alignment methods and accurate mass and time (AMT) tag databases can greatly improve experimenters' ability to identify measured peptides, reducing analysis costs and potentially the need to run additional experiments.
Item Open Access Computational Protein Design with Non-proteinogenic Amino Acids and Small Molecule Ligands, with Applications to Protein-protein Interaction Inhibitors, Anti-microbial Enzyme Inhibitors, and Antibody Design (2021) Wang, Siyu
Computational protein design is a leading-edge technology for designing novel proteins with novel functions, as well as for studying the structure and function of known proteins. Conventionally, most existing computational protein design methods and software focus only on modeling proteinogenic amino acids. However, in reality most biochemical systems are far more complicated. Many kinds of protein not only consist of proteinogenic amino acids, but also contain non-natural amino acids or post-translational modifications. For some proteins, function can only be fulfilled through interaction with small molecule ligands or cofactors, which are also beyond the scope of proteinogenic amino acids. In order to expand the capability of computational protein design methods, in this dissertation we incorporated the modeling of non-natural amino acids into OSPREY. OSPREY is a computational protein design software suite that is based on provable algorithms and developed in our lab. Furthermore, 3 human health related designs involving non-natural amino acids or small molecule ligands are presented in this dissertation: (1) design of novel cystic fibrosis therapeutics using non-natural amino acids, (2) re-design of HIV-1 broadly neutralizing antibodies for better potency and breadth, and (3) development of novel antibiotics fighting methicillin-resistant Staphylococcus aureus and the analysis of its resistance mechanism. Through extensive computational results and experimental data, we are able to demonstrate the success of our above designs.
Item Open Access Computational Systems Biology of Saccharomyces cerevisiae Cell Growth and Division (2014) Mayhew, Michael Benjamin
Cell division and growth are complex processes fundamental to all living organisms. In the budding yeast, Saccharomyces cerevisiae, these two processes are known to be coordinated with one another as a cell's mass must roughly double before division. Moreover, cell-cycle progression is dependent on cell size with smaller cells at birth generally taking more time in the cell cycle. This dependence is a signature of size control. Systems biology is an emerging field that emphasizes connections or dependencies between biological entities and processes over the characteristics of individual entities. Statistical models provide a quantitative framework for describing and analyzing these dependencies. In this dissertation, I take a statistical systems biology approach to study cell division and growth and the dependencies within and between these two processes, drawing on observations from richly informative microscope images and time-lapse movies. I review the current state of knowledge on these processes, highlighting key results and open questions from the biological literature. I then discuss my development of machine learning and statistical approaches to extract cell-cycle information from microscope images and to better characterize the cell-cycle progression of populations of cells. In addition, I analyze single cells to uncover correlation in cell-cycle progression, evaluate potential models of dependence between growth and division, and revisit classical assertions about budding yeast size control. This dissertation presents a unique perspective and approach towards comprehensive characterization of the coordination between growth and division.
Item Embargo Computationally Mining the Microbiome for Biologically Meaningful Results (2024) Jiang, Danting
The human microbiome (the diverse commensal microbial communities that reside in and on every individual) has received considerable attention over the past few decades for its vital role in modulating host physiology. Although recent years have witnessed an explosion of meta-omics data generated from high-throughput sequencing-based microbiome studies, our knowledge of how specific commensal bacteria regulate human health and disease remains disproportionately limited. Developing computational frameworks that allow one to coherently derive biologically meaningful results given a collection of distinct ‘omics datasets will contribute to an improved understanding of the complex, disease-modulatory host–microbiome interplay. In this dissertation, I established several bioinformatic pipelines that help accomplish this goal and applied them to investigate the role of the microbiome in different aspects of viral infectious diseases. Specifically, in Chapter 2, I demonstrated an analysis scheme that links phenotype-associated taxa to the metabolic pathways they significantly contribute to and correlated multiple high-dimensional datasets obtained from microbiome analysis, metabolomics, and immune phenotyping. Applying this workflow, I found that a commensal Sutterella species, its metabolic pathways, and the production of short-chain fatty acids and bile acids are associated with increased antibody responses to an HIV vaccine. In contrast, in Chapter 4, I provided an alternative procedure that connects differentially abundant metabolic pathways with their main contributing reactions, genes, and microbes and correlated the microbiome functional profile with the cytokine milieu quantified by multiplex immunoassays.
Applying this workflow, I found that symptomatic COVID-19 infection results in a less diverse gut microbiota marked by lower functional capacity for the tyrosine biosynthesis pathway, which, in addition to being associated with interferon alpha 2a levels, is driven by a reduction in prephenate dehydrogenase that can be primarily attributed to just a few bacterial species. Moreover, in Chapter 3, I extended a computational discovery platform our group has previously developed to create highly specific data-driven hypotheses that we further validated experimentally using in vitro assays and bacterial genetics. I also established a straightforward bioinformatic pipeline that efficiently extracts the gene homologs of interest from expansive abundance tables produced by metagenomic functional analysis, enabling us to quickly confirm the translational relevance of our findings in human-derived datasets. Ultimately, we delineated a complete microbiota-driven pathway (including identification of the specific bacterial taxa, bacterial gene, bacterial metabolite, and host receptor involved) that broadly inhibits viral infections. Last but not least, in Chapter 5, I presented an innovative method inspired by concepts from topological data analysis to compute the within-sample phylogenetic microbial biodiversity. Using data from earlier chapters for benchmarking, I showed that this new approach efficiently and accurately estimates Faith’s phylogenetic diversity compared to the standard path. Taken together, this dissertation offers several generalizable computational frameworks that assist the analysis of heterogeneous meta-omics datasets in a focused, directed, and interpretable fashion, with applications that have remarkably enhanced our knowledge of the microbiome in various facets of viral infectious diseases, including vaccine response, viral acquisition, and disease severity.
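For readers unfamiliar with the quantity mentioned above, Faith's phylogenetic diversity of a sample is commonly computed as the total branch length of the tree spanned by the taxa present in that sample. The sketch below shows only that standard definition, not the topological approach of the dissertation. The tree encoding (a child-to-parent map), the function name, and the toy tree are assumptions for illustration, and the path to the root is included, which is one common convention.

```python
def faith_pd(parent, present_taxa):
    """Sum the branch lengths on the paths from each present taxon to the root.

    parent: dict mapping node -> (parent_node, branch_length); the root has no entry.
    present_taxa: iterable of tip names observed in the sample.
    Each branch is counted once, even when several taxa share it.
    """
    visited = set()
    total = 0.0
    for taxon in present_taxa:
        node = taxon
        # Walk upward until reaching the root or a branch already counted.
        while node in parent and node not in visited:
            visited.add(node)
            up, length = parent[node]
            total += length
            node = up
    return total


# Hypothetical toy tree: root -> X (1.0), root -> C (3.0), X -> A (1.0), X -> B (2.0).
toy_tree = {
    "A": ("X", 1.0),
    "B": ("X", 2.0),
    "X": ("root", 1.0),
    "C": ("root", 3.0),
}
pd_ab = faith_pd(toy_tree, ["A", "B"])  # 1.0 + 2.0 + 1.0 = 4.0
pd_ac = faith_pd(toy_tree, ["A", "C"])  # 1.0 + 1.0 + 3.0 = 5.0
```

The shared branch above X is counted once for the sample containing A and B, which is what distinguishes this measure from a simple sum over individual root-to-tip paths.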