Browsing by Subject "ChIP-seq"
Item Open Access Chromatin Determinants of the Eukaryotic DNA Replication Program (2011) Eaton, Matthew Lucas
The accurate and timely replication of eukaryotic DNA during S-phase is of critical importance for the cell and for the inheritance of genetic information. Missteps in the replication program can activate cell cycle checkpoints or, worse, trigger the genomic instability and aneuploidy associated with diseases such as cancer. Eukaryotic DNA replication initiates asynchronously from hundreds to tens of thousands of replication origins spread across the genome. The origins are acted upon independently, but patterns emerge in the form of large-scale replication timing domains. Each of these origins must be localized, and the activation time determined by a system of signals that, though they have yet to be fully understood, are not dependent on the primary DNA sequence. This regulation of DNA replication has been shown to be extremely plastic, changing to fit the needs of cells in development or affected by replication stress.
We have investigated the role of chromatin in specifying the eukaryotic DNA replication program. Chromatin elements, including histone variants, histone modifications and nucleosome positioning, are attractive candidates for DNA replication control, as they are not specified fully by sequence, and they can be modified to fit the unique needs of a cell without altering the DNA template. The origin recognition complex (ORC) specifies replication origin location by binding the DNA of origins. The S. cerevisiae ORC recognizes the ARS (autonomously replicating sequence) consensus sequence (ACS), but only a subset of potential genomic sites are bound, suggesting that other chromosomal features influence ORC binding. Using high-throughput sequencing to map ORC binding and nucleosome positioning, we show that yeast origins are characterized by an asymmetric pattern of positioned nucleosomes flanking the ACS. The origin sequences are sufficient to maintain a nucleosome-free origin; however, ORC is required for the precise positioning of nucleosomes flanking the origin. These findings identify local nucleosomes as an important determinant for origin selection and function. Next, we describe the D. melanogaster replication program in the context of the chromatin and transcription landscape for multiple cell lines using data generated by the modENCODE consortium. We find that while the cell lines exhibit similar replication programs, there are numerous cell line-specific differences that correlate with changes in the chromatin architecture. We identify chromatin features that are associated with replication timing, early origin usage, and ORC binding. Primary sequence, activating chromatin marks, and DNA-binding proteins (including chromatin remodelers) contribute in an additive manner to specify ORC-binding sites. We also generate accurate and predictive models from the chromatin data to describe origin usage and strength between cell lines.
Multiple activating chromatin modifications contribute to the function and relative strength of replication origins, suggesting that the chromatin environment does not regulate origins of replication as a simple binary switch, but rather acts as a tunable rheostat to regulate replication initiation events.
Taken together, our data and analyses imply that the chromatin contains sufficient information to direct the DNA replication program.
Item Open Access Computational Methods For Functional Motif Identification and Approximate Dimension Reduction in Genomic Data (2011) Georgiev, Stoyan
Uncovering the DNA regulatory logic in complex organisms has been one of the important goals of modern biology in the post-genomic era. The sequencing of multiple genomes in combination with the advent of DNA microarrays and, more recently, of massively parallel high-throughput sequencing technologies has made possible the adoption of a global perspective to the inference of the regulatory rules governing the context-specific interpretation of the genetic code that complements the more focused classical experimental approaches. Extracting useful information and managing the complexity resulting from the sheer volume and the high-dimensionality of the data produced by these genomic assays has emerged as a major challenge which we attempt to address in this work by developing computational methods and tools, specifically designed for the study of the gene regulatory processes in this new global genomic context.
First, we focus on the genome-wide discovery of physical interactions between regulatory sequence regions and their cognate proteins at both the DNA and RNA level. We present a motif analysis framework that leverages the genome-wide evidence for sequence-specific interactions between trans-acting factors and their preferred cis-acting regulatory regions. The utility of the proposed framework is demonstrated on DNA and RNA cross-linking high-throughput data.
A second goal of this thesis is the development of scalable approaches to dimension reduction based on spectral decomposition and their application to the study of population structure in massive high-dimensional genetic data sets. We have developed computational tools and have performed theoretical and empirical analyses of their statistical properties with particular emphasis on the analysis of the individual genetic variation measured by Single Nucleotide Polymorphism (SNP) microarrays.
Item Open Access Developing Quantitative Models in Analyzing High-throughput Sequencing Data (2021) Kim, Young-Sook
Diverse functional genomics assays have been developed and have helped to investigate complex gene regulation under various biological conditions. For example, RNA-seq has been used to capture gene expression in diverse human tissues, helping to study tissue-common and tissue-specific gene regulation. ChIP-seq has been used to identify the genomic regions bound by numerous transcription factors, thus helping to identify collaborative and competitive binding mechanisms of the transcription factors. Despite this huge increase in the amount and the accessibility of genomic data, several challenges remain in analyzing those data with proper statistical methods. Some assays such as STARR-seq do not have a proper statistical model that detects both activated and repressed regulatory elements, making researchers depend on the statistical models developed for other assays. Some assays such as ChIP-seq and RNA-seq have limited joint analysis models that are flexible and computationally scalable, resulting in limited statistical power in identifying the genomic regions or genes shared by multiple biological conditions. To solve those challenges in analyzing high-throughput assays, we first developed a statistical model called correcting reads and analysis of differential active elements, or CRADLE, to analyze STARR-seq data. CRADLE removes technical biases that can confound quantification of regulatory activity and then detects both activated and repressed regulatory elements. We observed that the corrected read counts improved the visualization of regulatory activity, allowing for more accurate detection of regulatory elements.
Indeed, through a simulation study, we showed that CRADLE significantly improved precision and recall in detecting regulatory elements compared to the previous statistical approaches, and that the improvement was especially prominent in identifying repressed regulatory elements. Based on our work on developing CRADLE, we adapted the statistical framework of CRADLE and developed a joint analysis model of multiple data for biology, or JAMMY, that can be applied to diverse high-throughput sequencing data. JAMMY is a flexible statistical model that jointly analyzes multiple conditions, identifies condition-shared and condition-specific genomic regions, and then quantifies the preferential activity of a subset of biological conditions for each genomic region. We applied JAMMY to STARR-seq, ChIP-seq, and RNA-seq data, and observed that JAMMY overall improved the precision and recall in identifying condition-shared activity compared to the traditional condition-by-condition analysis. This gain of statistical power from the joint analysis led us to find a novel co-binding of two transcription factors in our study. These results show the substantial advantages of using a joint analysis model in integrating genomic data from multiple biological conditions.
Item Open Access Modeling Nuclease Digestion Data to Predict the Dynamics of Genome-wide Transcription Factor Occupancy (2016) Luo, Kaixuan
Identifying and deciphering the complex regulatory information embedded in the genome is critical to our understanding of biology and the etiology of complex diseases. The regulation of gene expression is governed largely by the occupancy of transcription factors (TFs) at various cognate binding sites. Characterizing TF binding is particularly challenging since TF occupancy is not just complex but also dynamic. Current genome-wide surveys of TF binding sites typically use chromatin immunoprecipitation (ChIP), which is limited to measuring one TF at a time and is thus less scalable in profiling the dynamics of TF occupancy across cell types or conditions. This dissertation develops novel computational frameworks to model sequencing data from DNase and/or MNase nuclease digestion assays that allow multiple TFs to be surveyed in a single experiment, in both human and yeast. We predicted occupancy landscapes and constructed a cell-type specificity map for many TFs across human cell types, revealed novel relationships between TF occupancy and TF expression, and monitored the occupancy dynamics of various TFs in response to androgen and estrogen hormone stimulations. The TF/cell type occupancy matrix generated from our model expands the total output of the ENCODE ChIP-seq efforts nearly 200-fold. These computational frameworks serve as an innovative and cost-effective strategy that enables efficient profiling of TF occupancy landscapes across different cell types or dynamic conditions in a high-throughput manner.