Browsing by Author "Ohler, Uwe"
Item Open Access A paired-end sequencing strategy to map the complex landscape of transcription initiation. (Nature methods, 2010-07) Ni, Ting; Corcoran, David L; Rach, Elizabeth A; Song, Shen; Spana, Eric P; Gao, Yuan; Ohler, Uwe; Zhu, Jun
Recent studies using high-throughput sequencing protocols have uncovered the complexity of mammalian transcription by RNA polymerase II, helping to define several initiation patterns in which transcription start sites (TSSs) cluster in both narrow and broad genomic windows. Here we describe a paired-end sequencing strategy, which enables more robust mapping and characterization of capped transcripts. We used this strategy to explore the transcription initiation landscape in the Drosophila melanogaster embryo. Extending the previous findings in mammals, we found that fly promoters exhibited distinct initiation patterns, which were linked to specific promoter sequence motifs. Furthermore, we identified many 5' capped transcripts originating from coding exons; our analyses support that they are unlikely the result of alternative TSSs, but rather the product of post-transcriptional modifications. We demonstrated paired-end TSS analysis to be a powerful method to uncover the transcriptional complexity of eukaryotic genomes.
Item Open Access A Semi-Supervised Predictive Model to Link Regulatory Regions to Their Target Genes (2015) Hafez, Dina Mohamed
Next generation sequencing technologies have provided us with a wealth of data profiling a diverse range of biological processes. In an effort to better understand the process of gene regulation, two predictive machine learning models specifically tailored for analyzing gene transcription and polyadenylation are presented.
Transcriptional enhancers are specific DNA sequences that act as "information integration hubs" to confer regulatory requirements on a given cell. These non-coding DNA sequences can regulate genes from long distances, or across chromosomes, and their relationships with their target genes are not limited to one-to-one. With thousands of putative enhancers and less than 14,000 protein-coding genes, detecting enhancer-gene pairs becomes a very complex machine learning and data analysis challenge.
In order to predict these specific sequences and link them to the genes they regulate, we developed McEnhancer. Using DNase I sensitivity data and annotated in-situ hybridization gene expression clusters, McEnhancer builds interpolated Markov models to learn enriched sequence content of known enhancer-gene pairs and predicts unknown interactions in a semi-supervised learning algorithm. Classification of predicted relationships was 73-98% accurate for gene sets with varying levels of initial known examples. Predicted interactions showed substantial overlap with Hi-C identified interactions. Enrichment of known functionally related TF binding motifs and enhancer-associated histone modification marks, along with the corresponding developmental time point, was highly evident.
On the other hand, pre-mRNA cleavage and polyadenylation is an essential step for 3'-end maturation and subsequent stability and degradation of mRNAs. This process is highly controlled by cis-regulatory elements surrounding the cleavage site (polyA site), which are frequently constrained by sequence content and position. More than 50% of human transcripts have multiple functional polyA sites, and the specific use of alternative polyA sites (APA) results in isoforms with variable 3'-UTRs, thus potentially affecting gene regulation. Elucidating the regulatory mechanisms underlying differential polyA preferences in multiple cell types has been hindered by the lack of appropriate tests for determining APAs with significant differences across multiple libraries.
We specified a linear effects regression model to identify tissue-specific biases indicating regulated APA; the significance of differences between tissue types was assessed by an appropriately designed permutation test. This combination allowed us to identify highly specific subsets of APA events in the individual tissue types. Predictive kernel-based SVM models successfully classified constitutive polyA sites from a biologically relevant background (auROC = 99.6%), as well as tissue-specific regulated sets from each other. The main cis-regulatory elements described for polyadenylation were found to be a strong, and highly informative, hallmark for constitutive sites only. Tissue-specific regulated sites were found to contain other regulatory motifs, with the canonical PAS signal being nearly absent at brain-specific sites. We applied this model to data on SRp20, an RNA-binding protein that might be involved in oncogene activation, and obtained interesting insights.
Together, these two models contribute to the understanding of enhancers and the key role they play in regulating tissue-specific expression patterns during development, as well as provide a better understanding of the diversity of post-transcriptional gene regulation in multiple tissue types.
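The permutation test mentioned in the abstract above can be illustrated with a minimal sketch. The abstract does not state the test statistic or the data layout, so everything here is an assumption: the function name `permutation_test` is hypothetical, the inputs are taken to be per-library usage fractions of one polyA site in two tissues, and the statistic is the absolute difference of group means.

```python
import random


def permutation_test(group_a, group_b, n_permutations=10000, seed=0):
    """Two-sided permutation test on the difference of group means.

    group_a, group_b: per-library values for two tissues (hypothetical
    layout; the thesis abstract does not specify one).
    Returns an estimated p-value.
    """
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_permutations):
        # Shuffle the tissue labels by shuffling the pooled values.
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:n_a], pooled[n_a:]
        diff = abs(sum(perm_a) / n_a - sum(perm_b) / len(perm_b))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_permutations + 1)


# Made-up usage fractions of a distal polyA site in two tissues.
brain = [0.90, 0.92, 0.95, 0.91, 0.93]
liver = [0.10, 0.12, 0.08, 0.11, 0.09]
p_value = permutation_test(brain, liver)
```

Shuffling the pooled values breaks any real association between tissue and site usage, so the fraction of shuffles with a difference at least as large as the observed one estimates how surprising the observed difference is under the null.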
Item Open Access Assessing the utility of thermodynamic features for microRNA target prediction under relaxed seed and no conservation requirements. (PLoS One, 2011) Lekprasert, Parawee; Mayhew, Michael; Ohler, Uwe
BACKGROUND: Many computational microRNA target prediction tools are focused on several key features, including complementarity to 5' seed of miRNAs and evolutionary conservation. While these features allow for successful target identification, not all miRNA target sites are conserved and adhere to canonical seed complementarity. Several studies have propagated the use of energy features of mRNA:miRNA duplexes as an alternative feature. However, different independent evaluations reported conflicting results on the reliability of energy-based predictions. Here, we reassess the usefulness of energy features for mammalian target prediction, aiming to relax or eliminate the need for perfect seed matches and conservation requirement. METHODOLOGY/PRINCIPAL FINDINGS: We detect significant differences of energy features at experimentally supported human miRNA target sites and at genome-wide sites of AGO protein interaction. This trend is confirmed on datasets that assay the effect of miRNAs on mRNA and protein expression changes, and a simple linear regression model leads to significant correlation of predicted versus observed expression change. Compared to 6-mer seed matches as baseline, application of our energy-based model leads to ∼3-5-fold enrichment on highly down-regulated targets, and allows for prediction of strictly imperfect targets with enrichment above baseline.
CONCLUSIONS/SIGNIFICANCE: In conclusion, our results indicate significant promise for energy-based miRNA target prediction that includes a broader range of targets without having to use conservation or impose stringent seed match rules.
Item Open Access Automated Microscopy and High Throughput Image Analysis in Arabidopsis and Drosophila (2009) Mace, Daniel L.
Development of a single cell into an adult organism is accomplished through an elaborate and complex cascade of spatiotemporal gene expression. While methods exist for capturing spatiotemporal expression patterns (in situ hybridization, reporter constructs, fluorescent tags), these methods have been highly laborious, and results are frequently assessed by subjective qualitative comparisons. To address these issues, methods must be developed for automating the capture of images, as well as for the normalization and quantification of the resulting data. In this thesis, I design computational approaches for high throughput image analysis which can be grouped into three main areas. First, I develop methods for the capture of high resolution images from high throughput platforms. In addition to the informatics aspect of this problem, I also devise a novel multiscale probabilistic model that allows us to identify and segment objects in an automated fashion. Second, high resolution images must be registered and normalized to a common frame of reference for cross image comparisons. To address these issues, I implement approaches for image registration using statistical shape models and non-rigid registration. Lastly, I validate the spatial expression data obtained from microscopy images to other known spatial expression methods, and develop methods for comparing and calculating the significance between spatial expression patterns. I demonstrate these methods on two model developmental organisms: Arabidopsis and Drosophila.
Item Open Access Automatic annotation of spatial expression patterns via sparse Bayesian factor models. (PLoS Comput Biol, 2011-07) Pruteanu-Malinici, Iulian; Mace, Daniel L; Ohler, Uwe
Advances in reporters for gene expression have made it possible to document and quantify expression patterns in 2D-4D. In contrast to microarrays, which provide data for many genes but averaged and/or at low resolution, images reveal the high spatial dynamics of gene expression. Developing computational methods to compare, annotate, and model gene expression based on images is imperative, considering that available data are rapidly increasing. We have developed a sparse Bayesian factor analysis model in which the observed expression diversity among a large set of high-dimensional images is modeled by a small number of hidden common factors. We apply this approach on embryonic expression patterns from a Drosophila RNA in situ image database, and show that the automatically inferred factors provide for a meaningful decomposition and represent common co-regulation or biological functions. The low-dimensional set of factor mixing weights is further used as features by a classifier to annotate expression patterns with functional categories. On human-curated annotations, our sparse approach reaches similar or better classification of expression patterns at different developmental stages, when compared to other automatic image annotation methods using thousands of hard-to-interpret features. Our study therefore outlines a general framework for large microscopy data sets, in which both the generative model itself, as well as its application for analysis tasks such as automated annotation, can provide insight into biological questions.
Item Open Access Computational Methods For Functional Motif Identification and Approximate Dimension Reduction in Genomic Data (2011) Georgiev, Stoyan
Uncovering the DNA regulatory logic in complex organisms has been one of the important goals of modern biology in the post-genomic era. The sequencing of multiple genomes in combination with the advent of DNA microarrays and, more recently, of massively parallel high-throughput sequencing technologies has made possible the adoption of a global perspective to the inference of the regulatory rules governing the context-specific interpretation of the genetic code that complements the more focused classical experimental approaches. Extracting useful information and managing the complexity resulting from the sheer volume and the high-dimensionality of the data produced by these genomic assays has emerged as a major challenge which we attempt to address in this work by developing computational methods and tools, specifically designed for the study of the gene regulatory processes in this new global genomic context.
First, we focus on the genome-wide discovery of physical interactions between regulatory sequence regions and their cognate proteins at both the DNA and RNA level. We present a motif analysis framework that leverages the genome-wide evidence for sequence-specific interactions between trans-acting factors and their preferred cis-acting regulatory regions. The utility of the proposed framework is demonstrated on DNA and RNA cross-linking high-throughput data.
A second goal of this thesis is the development of scalable approaches to dimension reduction based on spectral decomposition and their application to the study of population structure in massive high-dimensional genetic data sets. We have developed computational tools and have performed theoretical and empirical analyses of their statistical properties with particular emphasis on the analysis of the individual genetic variation measured by Single Nucleotide Polymorphism (SNP) microarrays.
Item Open Access Explicit DNase sequence bias modeling enables high-resolution transcription factor footprint detection. (Nucleic Acids Res, 2014-10-29) Yardımcı, Galip Gürkan; Frank, Christopher L; Crawford, Gregory E; Ohler, Uwe
DNaseI footprinting is an established assay for identifying transcription factor (TF)-DNA interactions with single base pair resolution. High-throughput DNase-seq assays have recently been used to detect in vivo DNase footprints across the genome. Multiple computational approaches have been developed to identify DNase-seq footprints as predictors of TF binding. However, recent studies have pointed to a substantial cleavage bias of DNase and its negative impact on predictive performance of footprinting. To assess the potential for using DNase-seq to identify individual binding sites, we performed DNase-seq on deproteinized genomic DNA and determined sequence cleavage bias. This allowed us to build bias corrected and TF-specific footprint models. The predictive performance of these models demonstrated that predicted footprints corresponded to high-confidence TF-DNA interactions. DNase-seq footprints were absent under a fraction of ChIP-seq peaks, which we show to be indicative of weaker binding, indirect TF-DNA interactions or possible ChIP artifacts. The modeling approach was also able to detect variation in the consensus motifs that TFs bind to.
Finally, cell type specific footprints were detected within DNase hypersensitive sites that are present in multiple cell types, further supporting that footprints can identify changes in TF binding that are not detectable using other strategies.
Item Open Access Integrative Regulatory Mapping Indicates that the RNA-Binding Protein HuR Couples Pre-mRNA Processing and mRNA Stability (MOLECULAR CELL, 2011-08-05) Mukherjee, Neelanjan; Corcoran, David L; Nusbaum, Jeffrey D; Reid, David W; Georgiev, Stoyan; Hafner, Markus; Ascano, Manuel; Tuschl, Thomas; Ohler, Uwe; Keene, Jack D
Item Open Access Mapping the complexity of transcription control in higher eukaryotes. (Genome Biol, 2010) Tomancak, Pavel; Ohler, Uwe
Recent genomic analyses suggest the importance of combinatorial regulation by broadly expressed transcription factors rather than expression domains characterized by highly specific factors.
Item Open Access MicroRNA Target Prediction via Duplex Formation Features and Direct Binding Evidence (2012) Lekprasert, Parawee
MicroRNAs (miRNAs) are small RNAs that have important roles in post-transcriptional gene regulation in a wide range of species. This regulation is controlled by having miRNAs directly bind to a target messenger RNA (mRNA), causing it to be destabilized and degraded, or translationally repressed. Identifying miRNA targets has been a large area of focus for study; however, a lack of generally high-throughput experiments to validate direct miRNA targeting has been a limiting factor. To overcome these limitations, computational methods have become crucial for understanding and predicting miRNA-gene target interactions.
While a variety of computational tools exist for predicting miRNA targets, many of them are focused on a similar feature set for their prediction. These commonly used features are complementarity to the 5' seed of miRNAs and evolutionary conservation. Unfortunately, not all miRNA target sites are conserved or adhere to canonical seed complementarity. Seeking to address these limitations, several studies have included energy features of mRNA:miRNA duplex formation as alternative features. However, different independent evaluations reported conflicting results on the reliability of energy-based predictions. Here, we reassess the usefulness of energy features for mammalian target prediction, aiming to relax or eliminate the need for perfect seed matches and conservation requirement.
We detect significant differences of energy features at experimentally supported human miRNA target sites and at genome-wide interaction sites to Argonaute (AGO) protein family members, which are essential parts of the miRNA machinery complex. This trend is confirmed on data sets that assay the effect of miRNAs on mRNA and protein expression changes, where a statistically significant change in expression is noted when compared to the control. Furthermore, our method also allows for prediction of strictly imperfect sites, as well as non-conserved targets.
Recently, new methods for identifying direct miRNA binding have been developed, which provide us with additional sources of information for miRNA target prediction. While some computational target prediction tools have begun to incorporate this information, they still rely on the presence of a seed match in the AGO-bound windows without accounting for the possibility of variations.
We investigate the usefulness of the site level direct binding evidence in miRNA target identification and propose a model that incorporates multiple different features along with the AGO-interaction data. Our method outperforms both an ad hoc strategy of seed match searches as well as an existing target prediction tool, while still allowing for predictions of sites other than a long perfect seed match. Additionally, we show supporting evidence for a class of non-canonical sites as bound targets. Our model can be extended to predict additional types of imperfect sites, and can also be readily modified to include additional features that may produce additional improvements.
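The baseline "seed match search" that this abstract compares against can be sketched in a few lines. This is only an illustration under stated assumptions, not the thesis method: it assumes the common convention that a 6-mer seed site is the reverse complement of miRNA nucleotides 2-7, the function names `seed_site` and `find_seed_matches` are hypothetical, and the miRNA and UTR strings are example inputs.

```python
# Watson-Crick complements for RNA bases.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def seed_site(mirna, start=2, end=7):
    """Return the mRNA site complementary to miRNA positions start..end.

    Positions are 1-based and inclusive, counted from the miRNA 5' end.
    The default 2..7 window is the assumed 6-mer seed convention.
    """
    seed = mirna[start - 1:end]
    # The site on the mRNA is the reverse complement of the seed.
    return "".join(COMPLEMENT[base] for base in reversed(seed))


def find_seed_matches(mirna, utr):
    """Return 0-based start positions of perfect 6-mer seed matches in utr."""
    site = seed_site(mirna)
    positions = []
    pos = utr.find(site)
    while pos != -1:
        positions.append(pos)
        # Continue one base further so overlapping matches are found too.
        pos = utr.find(site, pos + 1)
    return positions


# Illustrative inputs: a let-7a-like miRNA and a short made-up 3' UTR.
mirna = "UGAGGUAGUAGGUUGUAUAGUU"
utr = "AAACUACCUCAGGG"
matches = find_seed_matches(mirna, utr)
```

A search like this only reports perfect seed complementarity, which is exactly the restriction the abstract aims to relax by adding duplex energy features and AGO-interaction evidence.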
Item Open Access Modeling the evolution of regulatory elements by simultaneous detection and alignment with phylogenetic pair HMMs. (PLoS Comput Biol, 2010-12-16) Majoros, William H; Ohler, Uwe
The computational detection of regulatory elements in DNA is a difficult but important problem impacting our progress in understanding the complex nature of eukaryotic gene regulation. Attempts to utilize cross-species conservation for this task have been hampered both by evolutionary changes of functional sites and poor performance of general-purpose alignment programs when applied to non-coding sequence. We describe a new and flexible framework for modeling binding site evolution in multiple related genomes, based on phylogenetic pair hidden Markov models which explicitly model the gain and loss of binding sites along a phylogeny. We demonstrate the value of this framework for both the alignment of regulatory regions and the inference of precise binding-site locations within those regions. As the underlying formalism is a stochastic, generative model, it can also be used to simulate the evolution of regulatory elements. Our implementation is scalable in terms of numbers of species and sequence lengths and can produce alignments and binding-site predictions with accuracy rivaling or exceeding current systems that specialize in only alignment or only binding-site prediction. We demonstrate the validity and power of various model components on extensive simulations of realistic sequence data and apply a specific model to study Drosophila enhancers in as many as ten related genomes and in the presence of gain and loss of binding sites.
Different models and modeling assumptions can be easily specified, thus providing an invaluable tool for the exploration of biological hypotheses that can drive improvements in our understanding of the mechanisms and evolution of gene regulation.
Item Open Access Promoting developmental transcription. (Development, 2010-01) Ohler, Uwe; Wassarman, David A
Animal growth and development depend on the precise control of gene expression at the level of transcription. A central role in the regulation of developmental transcription is attributed to transcription factors that bind DNA enhancer elements, which are often located far from gene transcription start sites. Here, we review recent studies that have uncovered significant regulatory functions in developmental transcription for the TFIID basal transcription factors and for the DNA core promoter elements that are located close to transcription start sites.
Item Open Access Protocol for fast scRNA-seq raw data processing using scKB and non-arbitrary quality control with COPILOT. (STAR protocols, 2022-12) Hsu, Che-Wei; Shahan, Rachel; Nolan, Trevor M; Benfey, Philip N; Ohler, Uwe
We describe a protocol to perform fast and non-arbitrary quality control of single-cell RNA sequencing (scRNA-seq) raw data using scKB and COPILOT. scKB is a wrapper script of kallisto and bustools for accelerated alignment and transcript count matrix generation, which runs significantly faster than the popular tool Cell Ranger. COPILOT then offers non-arbitrary background noise removal by comparing distributions of low-quality and high-quality cells. Together, this protocol streamlines the processing workflow and provides an easy entry for new scRNA-seq users. For complete details on the use and execution of this protocol, please refer to Shahan et al. (2022).
Item Open Access Sustained-input switches for transcription factors and microRNAs are central building blocks of eukaryotic gene circuits. (Genome Biol, 2013-08-23) Megraw, Molly; Mukherjee, Sayan; Ohler, Uwe
WaRSwap is a randomization algorithm that for the first time provides a practical network motif discovery method for large multi-layer networks, for example those that include transcription factors, microRNAs, and non-regulatory protein coding genes. The algorithm is applicable to systems with tens of thousands of genes, while accounting for critical aspects of biological networks, including self-loops, large hubs, and target rearrangements. We validate WaRSwap on a newly inferred regulatory network from Arabidopsis thaliana, and compare outcomes on published Drosophila and human networks. Specifically, sustained input switches are among the few over-represented circuits across this diverse set of eukaryotes.
Item Open Access The Spatial and Temporal Regulatory Code of Transcription Initiation in Drosophila melanogaster (2010) Rach, Elizabeth Ann
Transcription initiation is a key component in the regulation of gene expression. Recent high-throughput sequencing techniques have enhanced our understanding of mammalian transcription by revealing narrow and broad patterns of transcription start sites (TSSs). Transcription initiation is central to the determination of condition specificity, as distinct repertoires of transcription factors (TFs) that assist in the recruitment of the RNA polymerase II to the DNA are present under different conditions. However, our understanding of the presence and spatiotemporal architecture of the promoter patterns in the fruit fly remains in its infancy. Nucleosome organization and transcription initiation have been considered hallmarks of gene expression, but their cooperative regulation is also not yet understood.
In this work, we applied a hierarchical clustering strategy on available 5' expressed sequence tags (ESTs), and developed an improved paired-end sequencing strategy to explore the transcription initiation landscape of the D. melanogaster genome. We distinguished three initiation patterns: 'Peaked or Narrow Peak TSSs', 'Broad Peak TSSs', and 'Broad TSS cluster groups or Weak Peak TSSs'. The promoters of peaked TSSs contained location-specific sequence elements, and were bound by TATA Binding Protein (TBP), while the promoters of broad TSS cluster groups were associated with non-location-specific elements, and were bound by the TATA-box related Factor 2 (TRF2).
Available ESTs and a tiling array time series enabled us to show that TSSs had distinct associations to conditions, and temporal patterns of embryonic activity differed across the majority of alternative promoters. Peaked promoters had an association to maternally inherited transcripts, and broad TSS cluster group promoters were more highly associated to zygotic utilization. The paired-end sequencing strategy identified a large number of 5' capped transcripts originating from coding exons that were unlikely the result of alternative TSSs, but rather the product of post-transcriptional modifications.
We applied an innovative search program called FREE to embryo, head, and testes specific core promoter sequences and identified 123 motifs: 16 novel and 107 supported by other motif sources. Motifs in the embryo specific core promoters were found at location hotspots from the TSS. A family of oligos was discovered that matched the Pause Button motif that is associated with RNA pol II stalling.
Lastly, we analyzed nucleosome organization, chromatin structure, and insulators across the three promoter patterns in the fruit fly and human genomes. The WP promoters showed higher associations with H2A.Z, DNase Hypersensitivity Sites (DHS), H3K4 methylations, and Class I insulators CTCF/BEAF32/CP190. Conversely, NP promoters had higher associations with polII and GAF binding. BP promoters exhibited a combination of features from both promoter patterns. Our study provides a comprehensive map of initiation sites and the conditions under which they are utilized in D. melanogaster. The presence of promoter specific histone replacements, chromatin modifications, and insulator elements supports the existence of two divergent strategies of transcriptional regulation in higher eukaryotes. Together, these data illustrate the complex regulatory code of transcription initiation.
Item Open Access Tracking Transcription Factors on the Genome by their DNase-seq Footprints (2014) Yardimci, Galip Gurkan
Transcription factors control numerous vital processes in the cell through their ability to control gene expression. Dysfunctional regulation by transcription factors leads to disorders and disease. Transcription factors regulate gene expression by binding to DNA sequences (motifs) on the genome and altering chromatin. DNase-seq footprinting is a well-established assay for identification of DNA sequences that bind to transcription factors. We developed computational techniques to analyze footprints and predict transcription factor binding. These transcription factor specific predictive models are able to correct for DNase sequence bias and characterize variation in DNA binding sequence. We found that DNase-seq footprints are able to identify cell-type or condition specific transcription factor activity and may offer information about the type of the interaction between DNA and transcription factor. Our DNase-seq footprint model is able to accurately discover high confidence transcription factor binding sites and discover alternative interactions between transcription factors and DNA. DNase-seq footprints can be used with ChIP-seq data to discover true binding sites and better understand transcription regulation.
Item Open Access Uncovering the Transcription Factor Network Underlying Mammalian Sex Determination (2014) Natarajan, Anirudh
Understanding transcriptional regulation in development and disease is one of the central questions in modern biology. The current working model is that Transcription Factors (TFs) combinatorially bind to specific regions of the genome and drive the expression of groups of genes in a cell-type specific fashion. In organisms with large genomes, particularly mammals, TFs bind to enhancer regions that are often several kilobases away from the genes they regulate, which makes identifying the regulators of gene expression difficult. In order to overcome these obstacles and uncover transcriptional regulatory networks, we used an approach combining expression profiling and genome-wide identification of enhancers followed by motif analysis. Further, we applied these approaches to uncover the TFs important in mammalian sex determination.
Using expression data from a panel of 19 human cell lines we identified genes showing patterns of cell-type specific up-regulation, down-regulation and constitutive expression. We then utilized matched DNase-seq data to assign DNase Hypersensitivity Sites (DHSs) to each gene based on proximity. These DHSs were scanned for matches to motifs and compiled to generate scores reflecting the presence of TF binding sites (TFBSs) in each gene's putative regulatory regions. We used a sparse logistic regression classifier to classify differentially regulated groups of genes. Comparing our approach to proximal promoter regions, we discovered that using sequence features in regions of open chromatin provided significant performance improvement. Crucially, we discovered both known and novel regulators of gene expression in different cell types. For some of these TFs, we found cell-type specific footprints indicating direct binding to their cognate motifs.
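The step of assigning DHSs to genes "based on proximity" can be pictured with a small sketch. The abstract does not give the exact rule, so this assumes the simplest reading: each DHS, represented by its midpoint, goes to the gene with the nearest transcription start site on the same chromosome. The function name `assign_dhs_to_genes`, the gene names, and the coordinates are all hypothetical.

```python
from bisect import bisect_left


def assign_dhs_to_genes(dhs_midpoints, gene_tss):
    """Map each DHS midpoint to the gene with the closest TSS.

    dhs_midpoints: iterable of integer coordinates on one chromosome.
    gene_tss: non-empty dict of gene name -> TSS coordinate (same chromosome).
    Ties go to the upstream (lower-coordinate) gene.
    """
    tss_sorted = sorted(gene_tss.items(), key=lambda item: item[1])
    positions = [pos for _, pos in tss_sorted]
    assignment = {}
    for dhs in dhs_midpoints:
        # Binary search for the insertion point, then compare the
        # neighbours on either side of it.
        i = bisect_left(positions, dhs)
        candidates = tss_sorted[max(i - 1, 0):i + 1]
        gene, _ = min(candidates, key=lambda item: abs(item[1] - dhs))
        assignment[dhs] = gene
    return assignment


# Made-up coordinates for three genes and five DHS midpoints.
genes = {"geneA": 1000, "geneB": 5000, "geneC": 9000}
result = assign_dhs_to_genes([1200, 4000, 8800, 100, 20000], genes)
```

Sorting the TSS positions once and using binary search keeps the lookup at O(log n) per DHS, which matters when many thousands of sites are assigned. A distance cutoff or a fixed window around each gene would be equally plausible readings of "proximity".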
The mammalian gonad is an excellent system to study cell fate determination processes and the dynamic regulation orchestrated by TFs in development. At embryonic day (E) 10.5, the bipotential gonad initiates either testis development in XY embryos, or ovarian development in XX embryos. Genetic studies over the last 3 decades have revealed about 30 genes important in this process, but there are still significant gaps in our understanding. Specifically, we do not know the network of TFs and their specific combinations that cause the rapid changes in gene expression observed during gonadal fate commitment. Further, more than half the cases of human sex reversal are as yet unexplained.
To apply the methods we developed to identify regulators of gene expression to the gonad, we took two approaches. First, we carried out a careful dissection of the transcriptional dynamics during gonad differentiation in the critical window between E11.0 and E12.0. We profiled the transcriptome at 6 equally spaced time points and developed a Hidden Markov Model to reveal the cascades of transcription that drive the differentiation of the gonad. Further, we discovered that while the ovary maintains its transcriptional state at this early stage, concurrent up- and down-regulation of hundreds of genes are orchestrated by the testis pathway. Further, we compared two different strains of mice with differential susceptibility to XY male-to-female sex reversal. This analysis revealed that in the C57BL/6J strain, the male pathway is delayed by ~5 hours, likely explaining the increased susceptibility to sex reversal in this strain. Finally, we validated the function of Lmo4, a transcriptional co-factor up-regulated in XY gonads at E11.6 in both strains. RNAi mediated knockdown of Lmo4 in primary gonadal cells led to the down-regulation of male pathway genes including key regulators such as Sox9 and Fgf9.
To find the enhancers in the XY gonad, we conducted DNase-seq in E13.5 XY supporting cells. In addition, we conducted ChIP-seq for H3K27ac, a mark correlated with enhancer activity. Further, we conducted motif analysis to reveal novel regulators of sex determination. Our work is an important step towards combining expression and chromatin profiling data to assemble transcriptional networks and is applicable to several systems.