Browsing by Author "Hartemink, Alexander J"
Item Open Access A branching process model for flow cytometry and budding index measurements in cell synchrony experiments. (Ann Appl Stat, 2009) Orlando, David A; Iversen, Edwin S; Hartemink, Alexander J; Haase, Steven B
We present a flexible branching process model for cell population dynamics in synchrony/time-series experiments used to study important cellular processes. Its formulation is constructive, based on an accounting of the unique cohorts in the population as they arise and evolve over time, allowing it to be written in closed form. The model can attribute effects to subsets of the population, providing flexibility not available using the models historically applied to these populations. It provides a tool for in silico synchronization of the population and can be used to deconvolve population-level experimental measurements, such as temporal expression profiles. It also allows for the direct comparison of assay measurements made from multiple experiments. The model can be fit either to budding index or DNA content measurements, or both, and is easily adaptable to new forms of data. The ability to use DNA content data makes the model applicable to almost any organism. We describe the model and illustrate its utility and flexibility in a study of cell cycle progression in the yeast Saccharomyces cerevisiae.
Item Open Access A Computational Synthesis of Genes, Behavior, and Evolution Provides Insights into the Molecular Basis of Vocal Learning (2012) Pfenning, Andreas R
Vocal learning is the ability to modify vocal output based on auditory input and is the basis of human speech acquisition. It is shared by few distantly related bird and mammal orders, and is thus very likely to be an example of convergent evolution, having evolved independently in multiple lineages. This complex behavior is presumed to require networks of regulated genes to develop the necessary neural circuits for learning and maintaining vocalizations. 
Deciphering these networks has been limited by the lack of high throughput genomic tools in vocal learning avian species and the lack of a solid computational framework to understand the relationship between gene expression and behavior. This dissertation provides new insights into the evolution and mechanisms of vocal learning by taking a top-down, systems biology approach to understanding gene expression regulation across avian and mammalian species. First, I worked with colleagues to develop a zebra finch Agilent oligonucleotide microarray, including developing programs for more accurate annotation of oligonucleotides and genes. I then used these arrays and tools in multiple collaborative, but related projects, to measure transcriptome expression data in vocal learning and non-learning avian species, under a number of behavioral paradigms, with a focus on song production. To make sense of the avian microarray data, I compiled microarray data from other sources, including expression analyses across over 900 human brain regions generated by Allen Brain Institute. To compare these data sets, I developed and performed a variety of computational analyses including clustering, linear models, gene set enrichment analysis, motif discovery, and phylogenetic inference, providing a novel framework to study the gene regulatory networks associated with a complex behavior. Using the developed framework, we are able to better understand vocal learning at different levels: how the brain regions for vocal learning evolved and how those brain regions function during the production of learned vocalizations. At the evolutionary level, we identified genes with unique expression patterns in the brains of vocal learning birds and humans. Interesting candidates include genes related to formation of neural connections, in particular the SLIT/ROBO axon guidance pathway. 
This algorithm also allowed us to identify the analogous regions that are part of the vocal learning circuit across species, providing the first quantitative evidence relating the human vocal learning circuit to the avian vocal learning circuit. With the avian song system verified as a model for human speech at the molecular level, we conducted an experiment to better understand what is happening in those brain regions during singing by profiling gene expression in a time course as birds are producing song. Surprisingly, an overwhelming majority of the gene expression identified was strongly enriched in a particular region. We also found a tight coupling between the behavioral function of a particular region and the gene expression pattern. To gain insight into the mechanisms of this gene regulation, we conducted a genomic scan of transcription factor binding sites in zebra finch. Many transcription factor binding sites were enriched in the promoters of genes with particular temporal patterns, several of which had already been hypothesized to play a role in the neural system. Using this data set of gene expression profiles and transcription factor binding sites along with separate experiments conducted in mouse, we were able to uncover evidence that the transcription factor CARF plays a role in neuron homeostasis. These results have broadened our understanding of the molecular basis of vocal learning at multiple levels. Overall, this dissertation outlines a novel way of approaching the study of the relationship between genes and behavior.
Item Open Access A nucleosome-guided map of transcription factor binding sites in yeast. (PLoS Comput Biol, 2007-11) Narlikar, Leelavati; Gordân, Raluca; Hartemink, Alexander J
Finding functional DNA binding sites of transcription factors (TFs) throughout the genome is a crucial step in understanding transcriptional regulation. Unfortunately, these binding sites are typically short and degenerate, posing a significant statistical challenge: many more matches to known TF motifs occur in the genome than are actually functional. However, information about chromatin structure may help to identify the functional sites. In particular, it has been shown that active regulatory regions are usually depleted of nucleosomes, thereby enabling TFs to bind DNA in those regions. Here, we describe a novel motif discovery algorithm that employs an informative prior over DNA sequence positions based on a discriminative view of nucleosome occupancy. When a Gibbs sampling algorithm is applied to yeast sequence-sets identified by ChIP-chip, the correct motif is found in 52% more cases with our informative prior than with the commonly used uniform prior. This is the first demonstration that nucleosome occupancy information can be used to improve motif discovery. The improvement is dramatic, even though we are using only a statistical model to predict nucleosome occupancy; we expect our results to improve further as high-resolution genome-wide experimental nucleosome occupancy data becomes increasingly available.
Item Open Access Advances to Bayesian network inference for generating causal networks from observational biological data. (Bioinformatics, 2004-12-12) Yu, Jing; Smith, V Anne; Wang, Paul P; Hartemink, Alexander J; Jarvis, Erich D
MOTIVATION: Network inference algorithms are powerful computational tools for identifying putative causal interactions among variables from observational data. 
Bayesian network inference algorithms hold particular promise in that they can capture linear, non-linear, combinatorial, stochastic and other types of relationships among variables across multiple levels of biological organization. However, challenges remain when applying these algorithms to limited quantities of experimental data collected from biological systems. Here, we use a simulation approach to make advances in our dynamic Bayesian network (DBN) inference algorithm, especially in the context of limited quantities of biological data. RESULTS: We test a range of scoring metrics and search heuristics to find an effective algorithm configuration for evaluating our methodological advances. We also identify sampling intervals and levels of data discretization that allow the best recovery of the simulated networks. We develop a novel influence score for DBNs that attempts to estimate both the sign (activation or repression) and relative magnitude of interactions among variables. When faced with limited quantities of observational data, combining our influence score with moderate data interpolation reduces a significant portion of false positive interactions in the recovered networks. Together, our advances allow DBN inference algorithms to be more effective in recovering biological networks from experimentally collected data. AVAILABILITY: Source code and simulated data are available upon request. SUPPLEMENTARY INFORMATION: http://www.jarvislab.net/Bioinformatics/BNAdvances/
Item Open Access Computational Inference of Genome-Wide Protein-DNA Interactions Using High-Throughput Genomic Data (2015) Zhong, Jianling
Transcriptional regulation has been studied intensively in recent decades. One important aspect of this regulation is the interaction between regulatory proteins, such as transcription factors (TF) and nucleosomes, and the genome. 
Different high-throughput techniques have been invented to map these interactions genome-wide, including ChIP-based methods (ChIP-chip, ChIP-seq, etc.), nuclease digestion methods (DNase-seq, MNase-seq, etc.), and others. However, a single experimental technique often only provides partial and noisy information about the whole picture of protein-DNA interactions. Therefore, the overarching goal of this dissertation is to provide computational developments for jointly modeling different experimental datasets to achieve a holistic inference on the protein-DNA interaction landscape.
We first present a computational framework that can incorporate the protein binding information in MNase-seq data into a thermodynamic model of protein-DNA interaction. We use a correlation-based objective function to model the MNase-seq data and a Markov chain Monte Carlo method to maximize the function. Our results show that the inferred protein-DNA interaction landscape is concordant with the MNase-seq data and provides a mechanistic explanation for the experimentally collected MNase-seq fragments. Our framework is flexible and can easily incorporate other data sources. To demonstrate this flexibility, we use prior distributions to integrate experimentally measured protein concentrations.
We also study the ability of DNase-seq data to position nucleosomes. Traditionally, DNase-seq has only been widely used to identify DNase hypersensitive sites, which tend to be open chromatin regulatory regions devoid of nucleosomes. We reveal for the first time that DNase-seq datasets also contain substantial information about nucleosome translational positioning, and that existing DNase-seq data can be used to infer nucleosome positions with high accuracy. We develop a Bayes-factor-based nucleosome scoring method to position nucleosomes using DNase-seq data. Our approach utilizes several effective strategies to extract nucleosome positioning signals from the noisy DNase-seq data, including jointly modeling data points across the nucleosome body and explicitly modeling the quadratic and oscillatory DNase I digestion pattern on nucleosomes. We show that our DNase-seq-based nucleosome map is highly consistent with previous high-resolution maps. We also show that the oscillatory DNase I digestion pattern is useful in revealing the nucleosome rotational context around TF binding sites.
Finally, we present a state-space model (SSM) for jointly modeling different kinds of genomic data to provide an accurate view of the protein-DNA interaction landscape. We also provide an efficient expectation-maximization algorithm to learn model parameters from data. We first show in simulation studies that the SSM can effectively recover underlying true protein binding configurations. We then apply the SSM to model real genomic data (both DNase-seq and MNase-seq data). Through incrementally increasing the types of genomic data in the SSM, we show that different data types can contribute complementary information for the inference of protein binding landscape and that the most accurate inference comes from modeling all available datasets.
This dissertation provides a foundation for future research by taking a step toward the genome-wide inference of protein-DNA interaction landscape through data integration.
Item Open Access Computational inference of neural information flow networks. (PLoS Comput Biol, 2006-11-24) Smith, V Anne; Yu, Jing; Smulders, Tom V; Hartemink, Alexander J; Jarvis, Erich D
Determining how information flows along anatomical brain pathways is a fundamental requirement for understanding how animals perceive their environments, learn, and behave. Attempts to reveal such neural information flow have been made using linear computational methods, but neural interactions are known to be nonlinear. Here, we demonstrate that a dynamic Bayesian network (DBN) inference algorithm we originally developed to infer nonlinear transcriptional regulatory networks from gene expression data collected with microarrays is also successful at inferring nonlinear neural information flow networks from electrophysiology data collected with microelectrode arrays. The inferred networks we recover from the songbird auditory pathway are correctly restricted to a subset of known anatomical paths, are consistent with timing of the system, and reveal both the importance of reciprocal feedback in auditory processing and greater information flow to higher-order auditory areas when birds hear natural as opposed to synthetic sounds. A linear method applied to the same data incorrectly produces networks with information flow to non-neural tissue and over paths known not to exist. To our knowledge, this study represents the first biologically validated demonstration of an algorithm to successfully infer neural information flow networks.
Item Open Access Computational Systems Biology of Saccharomyces cerevisiae Cell Growth and Division (2014) Mayhew, Michael Benjamin
Cell division and growth are complex processes fundamental to all living organisms. In the budding yeast, Saccharomyces cerevisiae, these two processes are known to be coordinated with one another as a cell's mass must roughly double before division. 
Moreover, cell-cycle progression is dependent on cell size with smaller cells at birth generally taking more time in the cell cycle. This dependence is a signature of size control. Systems biology is an emerging field that emphasizes connections or dependencies between biological entities and processes over the characteristics of individual entities. Statistical models provide a quantitative framework for describing and analyzing these dependencies. In this dissertation, I take a statistical systems biology approach to study cell division and growth and the dependencies within and between these two processes, drawing on observations from richly informative microscope images and time-lapse movies. I review the current state of knowledge on these processes, highlighting key results and open questions from the biological literature. I then discuss my development of machine learning and statistical approaches to extract cell-cycle information from microscope images and to better characterize the cell-cycle progression of populations of cells. In addition, I analyze single cells to uncover correlation in cell-cycle progression, evaluate potential models of dependence between growth and division, and revisit classical assertions about budding yeast size control. This dissertation presents a unique perspective and approach towards comprehensive characterization of the coordination between growth and division.
Item Open Access Convergent transcriptional specializations in the brains of humans and song-learning birds. (Science, 2014-12-12) Pfenning, Andreas R; Hara, Erina; Whitney, Osceola; Rivas, Miriam V; Wang, Rui; Roulhac, Petra L; Howard, Jason T; Wirthlin, Morgan; Lovell, Peter V; Ganapathy, Ganeshkumar; Mouncastle, Jacquelyn; Moseley, M Arthur; Thompson, J Will; Soderblom, Erik J; Iriki, Atsushi; Kato, Masaki; Gilbert, M Thomas P; Zhang, Guojie; Bakken, Trygve; Bongaarts, Angie; Bernard, Amy; Lein, Ed; Mello, Claudio V; Hartemink, Alexander J; Jarvis, Erich D
Song-learning birds and humans share independently evolved similarities in brain pathways for vocal learning that are essential for song and speech and are not found in most other species. Comparisons of brain transcriptomes of song-learning birds and humans relative to vocal nonlearners identified convergent gene expression specializations in specific song and speech brain regions of avian vocal learners and humans. The strongest shared profiles relate bird motor and striatal song-learning nuclei, respectively, with human laryngeal motor cortex and parts of the striatum that control speech production and learning. Most of the associated genes function in motor control and brain connectivity. 
Thus, convergent behavior and neural connectivity for a complex trait are associated with convergent specialized expression of multiple genes.
Item Open Access Core and region-enriched networks of behaviorally regulated genes and the singing genome. (Science, 2014-12-12) Whitney, Osceola; Pfenning, Andreas R; Howard, Jason T; Blatti, Charles A; Liu, Fang; Ward, James M; Wang, Rui; Audet, Jean-Nicoles; Kellis, Manolis; Mukherjee, Sayan; Sinha, Saurabh; Hartemink, Alexander J; West, Anne E; Jarvis, Erich D
Songbirds represent an important model organism for elucidating molecular mechanisms that link genes with complex behaviors, in part because they have discrete vocal learning circuits that have parallels with those that mediate human speech. We found that ~10% of the genes in the avian genome were regulated by singing, and we found a striking regional diversity of both basal and singing-induced programs in the four key song nuclei of the zebra finch, a vocal learning songbird. The region-enriched patterns were a result of distinct combinations of region-enriched transcription factors (TFs), their binding motifs, and presinging acetylation of histone 3 at lysine 27 (H3K27ac) enhancer activity in the regulatory regions of the associated genes. RNA interference manipulations validated the role of the calcium-response transcription factor (CaRF) in regulating genes preferentially expressed in specific song nuclei in response to singing. Thus, differential combinatorial binding of a small group of activity-regulated TFs and predefined epigenetic enhancer activity influences the anatomical diversity of behaviorally regulated gene networks.
Item Open Access Deciphering genome-wide chromatin occupancy, dynamics, and their connections to gene regulation (2022) Mitra, Sneha
The genomic DNA is bound by a myriad of proteins to form the chromatin inside the nucleus of the cell. 
The proteins can bind to the genome in different combinations leading to a combinatorial explosion in the number of possible chromatin configurations. The differences in the chromatin configurations for the same genome sequence give rise to distinct cell types. Likewise, cells of the same type also undergo changes in chromatin configurations under different environmental conditions. Key changes to the occupancy profile of the chromatin may dictate changes in gene regulation or vice versa. Therefore, it is important to decipher the chromatin occupancy profiles of the genome and understand how these configurations are related to the transcription of genes.
In this dissertation, we analyze chromatin using chromatin accessibility data sets, particularly MNase-seq and ATAC-seq, that describe the protein-bound and unbound regions of the genome. We first describe a state-space model that uses chromatin accessibility data to jointly infer the occupancy profile of hundreds of proteins binding to the genome. We apply our model to the yeast genome to study the occupancy profile of transcription factors and nucleosomes. We further extend our model to study chromatin dynamics of yeast cells subjected to cadmium stress. In doing so, we identify genomic regions exhibiting changes in the occupancy profile of transcription factors and nucleosomes. Upon comparing with available gene expression data we find that key changes in chromatin configuration occur around gene bodies that are differentially regulated during cadmium stress. Our analyses highlight how specific changes to the occupancy profiles relate to gene expression.
Building upon the interrelatedness of chromatin and transcription, we describe a regression-based approach that predicts transcription from chromatin accessibility data sets. We find that the chromatin accessibility in specific parts of the genome is highly correlated to gene expression. These genomic regions are potential regulatory regions that can lie far away from the gene body and interact with the genes due to the looping of the DNA. Our model identifies these regulatory regions in a gene-specific manner that helps us further understand the connections between chromatin and transcription.
Item Open Access Domain-oriented edge-based alignment of protein interaction networks. (Bioinformatics, 2009-06-15) Guo, Xin; Hartemink, Alexander J
MOTIVATION: Recent advances in high-throughput experimental techniques have yielded a large amount of data on protein-protein interactions (PPIs). Since these interactions can be organized into networks, and since separate PPI networks can be constructed for different species, a natural research direction is the comparative analysis of such networks across species in order to detect conserved functional modules. This is the task of network alignment. RESULTS: Most conventional network alignment algorithms adopt a node-then-edge-alignment paradigm: they first identify homologous proteins across networks and then consider interactions among them to construct network alignments. In this study, we propose an alternative direct-edge-alignment paradigm. Specifically, instead of explicit identification of homologous proteins, we directly infer plausibly alignable PPIs across species by comparing conservation of their constituent domain interactions. We apply our approach to detect conserved protein complexes in yeast-fly and yeast-worm PPI networks, and show that our approach outperforms two recent approaches in most alignment performance metrics. AVAILABILITY: Supplementary material and source code can be found at http://www.cs.duke.edu/~amink/.
Item Open Access Evaluating functional network inference using simulations of complex biological systems. (Bioinformatics, 2002) Smith, V Anne; Jarvis, Erich D; Hartemink, Alexander J
MOTIVATION: Although many network inference algorithms have been presented in the bioinformatics literature, no suitable approach has been formulated for evaluating their effectiveness at recovering models of complex biological systems from limited data. 
To overcome this limitation, we propose an approach to evaluate network inference algorithms according to their ability to recover a complex functional network from biologically reasonable simulated data. RESULTS: We designed a simulator to generate data representing a complex biological system at multiple levels of organization: behaviour, neural anatomy, brain electrophysiology, and gene expression of songbirds. About 90% of the simulated variables are unregulated by other variables in the system and are included simply as distracters. We sampled the simulated data at intervals as one would sample from a biological system in practice, and then used the sampled data to evaluate the effectiveness of an algorithm we developed for functional network inference. We found that our algorithm is highly effective at recovering the functional network structure of the simulated system, including the irrelevance of unregulated variables, from sampled data alone. To assess the reproducibility of these results, we tested our inference algorithm on 50 separately simulated sets of data and it consistently recovered almost perfectly the complex functional network structure underlying the simulated data. To our knowledge, this is the first approach for evaluating the effectiveness of functional network inference algorithms at recovering models from limited data. Our simulation approach also enables researchers a priori to design experiments and data-collection protocols that are amenable to functional network inference.
Item Open Access Finding regulatory DNA motifs using alignment-free evolutionary conservation information. (Nucleic Acids Res, 2010-04) Gordân, Raluca; Narlikar, Leelavati; Hartemink, Alexander J
As an increasing number of eukaryotic genomes are being sequenced, comparative studies aimed at detecting regulatory elements in intergenic sequences are becoming more prevalent. 
Most comparative methods for transcription factor (TF) binding site discovery make use of global or local alignments of orthologous regulatory regions to assess whether a particular DNA site is conserved across related organisms, and thus more likely to be functional. Since binding sites are usually short, sometimes degenerate, and often independent of orientation, alignment algorithms may not align them correctly. Here, we present a novel, alignment-free approach for using conservation information for TF binding site discovery. We relax the definition of conserved sites: we consider a DNA site within a regulatory region to be conserved in an orthologous sequence if it occurs anywhere in that sequence, irrespective of orientation. We use this definition to derive informative priors over DNA sequence positions, and incorporate these priors into a Gibbs sampling algorithm for motif discovery. Our approach is simple and fast. It requires neither sequence alignments nor the phylogenetic relationships between the orthologous sequences, yet it is more effective on real biological data than methods that do.
Item Open Access From Population to Single Cells: Deconvolution of Cell-cycle Dynamics (2012) Guo, Xin
The cell cycle is one of the fundamental processes in all living organisms, and all cells arise from the division of existing cells. To better understand the regulation of the cell cycle, synchrony experiments are widely used to monitor cellular dynamics during this process. In such experiments, a large population of cells is generally arrested or selected at one stage of the cycle, and then released to progress through subsequent division stages. Measurements are then taken in this population at a variety of time points after release to provide insight into the dynamics of the cell cycle. However, due to cell-to-cell variability and asymmetric cell division, cells in a synchronized population lose synchrony over time. 
As a result, the time-series measurements from the synchronized cell populations do not accurately reflect the underlying dynamics of cell-cycle processes.
In this thesis, we introduce a deconvolution algorithm that learns a more accurate view of cell-cycle dynamics, free from the convolution effects associated with imperfect cell synchronization. Through wavelet-basis regularization, our method sharpens signal without sharpening noise, and can remarkably increase both the dynamic range and the temporal resolution of time-series data. Though it can be applied to any such data, we demonstrate the utility of our method by applying it to a recent cell-cycle transcription time course in the eukaryote Saccharomyces cerevisiae. We show that our method more sensitively detects cell-cycle-regulated transcription, and reveals subtle timing differences that are masked in the original population measurements. Our algorithm also explicitly learns distinct transcription programs for both mother and daughter cells, enabling us to identify 82 genes transcribed almost entirely in the early G1 in a daughter-specific manner.
In addition to the cell-cycle deconvolution algorithm, we introduce DOMAIN, a protein-protein interaction (PPI) network alignment method, which employs a novel direct-edge-alignment paradigm to detect conserved functional modules (e.g., protein complexes, molecular pathways) from pairwise PPI networks. By applying our approach to detect protein complexes conserved in yeast-fly and yeast-worm PPI networks, we show that our approach outperforms two widely used approaches in most alignment performance metrics. We also show that our approach enables us to identify conserved cell-cycle-related functional modules across yeast-fly PPI networks.
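The deconvolution idea in the abstract above can be illustrated with a minimal sketch. It is not the thesis's algorithm: a plain ridge (L2) penalty is swapped in for the wavelet-basis regularization the abstract describes, and the pulse shape, the synchrony-loss kernel, the noise level, and the names `blur_matrix` and `ridge_deconvolve` are all hypothetical choices made for illustration. The sketch assumes NumPy is available.

```python
import numpy as np

def blur_matrix(n, kernel):
    """Build an n x n matrix K so that K @ x smears x forward in time by `kernel`."""
    K = np.zeros((n, n))
    for t in range(n):
        for lag, weight in enumerate(kernel):
            if t - lag >= 0:
                K[t, t - lag] = weight
    return K

def ridge_deconvolve(y, K, lam):
    """Closed-form minimiser of ||K x - y||^2 + lam * ||x||^2 (ridge stand-in)."""
    n = K.shape[1]
    return np.linalg.solve(K.T @ K + lam * np.eye(n), K.T @ y)

# Hypothetical single-cell expression pulse over 40 time points.
n = 40
t = np.arange(n)
single_cell = np.exp(-0.5 * ((t - 15) / 2.0) ** 2)

# Hypothetical synchrony-loss kernel: fractions of cells lagging by 0, 1, 2 steps.
kernel = np.array([0.5, 0.3, 0.2])
K = blur_matrix(n, kernel)

# Population-level measurement: blurred single-cell signal plus small noise.
rng = np.random.default_rng(0)
population = K @ single_cell + rng.normal(0.0, 0.002, size=n)

# Regularised estimate of the single-cell signal from the population data.
estimate = ridge_deconvolve(population, K, lam=1e-3)

err_population = np.linalg.norm(population - single_cell)
err_estimate = np.linalg.norm(estimate - single_cell)
print(f"error of raw population signal: {err_population:.3f}")
print(f"error of deconvolved estimate:  {err_estimate:.3f}")
```

In this toy setting the population trace is both delayed and flattened relative to the single-cell pulse, and the regularized inverse recovers much of the lost sharpness. With noisier data a larger `lam` (or a sparsity-promoting penalty closer to the wavelet approach) would likely be needed.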
Item Open Access Influence of network topology and data collection on network inference. (Pac Symp Biocomput, 2003) Smith, V Anne; Jarvis, Erich D; Hartemink, Alexander J
We recently developed an approach for testing the accuracy of network inference algorithms by applying them to biologically realistic simulations with known network topology. Here, we seek to determine the degree to which the network topology and data sampling regime influence the ability of our Bayesian network inference algorithm, NETWORKINFERENCE, to recover gene regulatory networks. NETWORKINFERENCE performed well at recovering feedback loops and multiple targets of a regulator with small amounts of data, but required more data to recover multiple regulators of a gene. When collecting the same number of data samples at different intervals from the system, the best recovery was produced by sampling intervals long enough such that sampling covered propagation of regulation through the network but not so long such that intervals missed internal dynamics. These results further elucidate the possibilities and limitations of network inference based on biological data.
Item Open Access Modeling Biological Systems from Heterogeneous Data (2008-04-24) Bernard, Allister P.
The past decades have seen rapid development of numerous high-throughput technologies to observe biomolecular phenomena. High-throughput biological data are inherently heterogeneous, providing information at the various levels at which organisms integrate inputs to arrive at an observable phenotype. Approaches are needed to not only analyze heterogeneous biological data, but also model the complex experimental observation procedures. We first present an algorithm for learning dynamic cell cycle transcriptional regulatory networks from gene expression and transcription factor binding data. 
We learn regulatory networks using dynamic Bayesian network inference algorithms that combine evidence from gene expression data through the likelihood and evidence from binding data through an informative structure prior. We next demonstrate how analysis of cell cycle measurements like gene expression data is obstructed by synchrony loss in synchronized cell populations. Due to synchrony loss, population-level cell cycle measurements are convolutions of the true measurements that would have been observed when monitoring individual cells. We introduce a fully parametric, probabilistic model, CLOCCS, capable of characterizing multiple sources of asynchrony in synchronized cell populations. Using CLOCCS, we formulate a constrained convex optimization deconvolution algorithm that recovers single cell estimates from observed population-level measurements. Our algorithm offers a solution for monitoring individual cells rather than a population of cells that lose synchrony over time. Using our deconvolution algorithm, we provide a global high resolution view of cell cycle gene expression in budding yeast, from an initial cell progressing through its cell cycle to the newly created mother and daughter cells. Proteins, and not gene expression, are responsible for all cellular functions, and we need to understand how proteins and protein complexes operate. We introduce PROCTOR, a statistical approach capable of learning the hidden interaction topology of protein complexes from direct protein-protein interaction data and indirect co-complexed protein interaction data. We provide a global view of the budding yeast interactome depicting how proteins interact with each other via their interfaces to form macromolecular complexes. 
We conclude by demonstrating how our algorithms, utilizing information from heterogeneous biological data, can provide a dynamic view of regulatory control in the budding yeast cell cycle.
Item Open Access Modeling Multi-factor Binding of the Genome (2010) Wasson, Todd Steven
Hundreds of different factors adorn the eukaryotic genome, binding to it in large numbers. These DNA binding factors (DBFs) include nucleosomes, transcription factors (TFs), and other proteins and protein complexes, such as the origin recognition complex (ORC). DBFs compete with one another for binding along the genome, yet many current models of genome binding do not consider different types of DBFs together simultaneously. Additionally, binding is a stochastic process that results in a continuum of binding probabilities at any position along the genome, but many current models tend to consider positions as being either binding sites or not.
Here, we present a model that allows a multitude of DBFs, each at different concentrations, to compete with one another for binding sites along the genome. The result is an 'occupancy profile', a probabilistic description of the DNA occupancy of each factor at each position. We implement our model efficiently as the software package COMPETE. We demonstrate genome-wide and at specific loci how modeling nucleosome binding alters TF binding, and vice versa, and illustrate how factor concentration influences binding occupancy. Binding cooperativity between nearby TFs arises implicitly via mutual competition with nucleosomes. Our method applies not only to TFs, but also recapitulates known occupancy profiles of a well-studied replication origin with and without ORC binding.
We then develop a statistical framework for tuning our model concentrations to further improve its predictions. Importantly, this tuning optimizes with respect to actual biological data. We take steps to ensure that our tuned parameters are biologically plausible.
Finally, we discuss novel extensions and applications of our model, suggesting next steps in its development and deployment.
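The competition idea in this abstract can be illustrated with a minimal sketch. It is not the COMPETE implementation: it assumes a simple equilibrium model in which each factor has a fixed footprint and a per-position statistical weight (concentration times affinity), and it computes occupancy with a forward-backward recursion over all non-overlapping configurations. The function name `occupancy_profile` and its parameters are hypothetical.

```python
import numpy as np

def occupancy_profile(weights, footprints):
    """Per-position occupancy of each factor under mutual exclusion.

    weights[k][s]  : assumed statistical weight (concentration * affinity)
                     of factor k binding with its left edge at position s.
    footprints[k]  : number of positions covered by factor k.
    Returns an array occ[k, i] = probability that position i is covered by k.
    """
    n = len(weights[0])

    # forward[i]: total weight of all configurations of positions 0..i-1
    forward = np.zeros(n + 1)
    forward[0] = 1.0
    for i in range(1, n + 1):
        total = forward[i - 1]                      # position i-1 left unbound
        for w, fp in zip(weights, footprints):
            if i - fp >= 0:
                total += forward[i - fp] * w[i - fp]  # factor ends at i-1
        forward[i] = total

    # backward[i]: total weight of all configurations of positions i..n-1
    backward = np.zeros(n + 1)
    backward[n] = 1.0
    for i in range(n - 1, -1, -1):
        total = backward[i + 1]                     # position i left unbound
        for w, fp in zip(weights, footprints):
            if i + fp <= n:
                total += w[i] * backward[i + fp]    # factor starts at i
        backward[i] = total

    z = forward[n]                                  # partition function
    occ = np.zeros((len(weights), n))
    for k, (w, fp) in enumerate(zip(weights, footprints)):
        for s in range(n - fp + 1):
            p_start = forward[s] * w[s] * backward[s + fp] / z
            occ[k, s:s + fp] += p_start             # spread over the footprint
    return occ

# Toy comparison: a TF alone versus a TF competing with a wider factor.
tf = np.ones(6)
wide = np.full(6, 2.0)
print(occupancy_profile([tf], [1])[0])
print(occupancy_profile([tf, wide], [1, 3])[0])
```

In this toy setting, adding the wider competitor lowers the TF's occupancy at every position, which is the qualitative effect the abstract describes for nucleosomes and TFs. For genome-scale sequences the recursion would need log-space or rescaled arithmetic to avoid overflow.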
Item Open Access Modeling Nuclease Digestion Data to Predict the Dynamics of Genome-wide Transcription Factor Occupancy (2016) Luo, Kaixuan
Identifying and deciphering the complex regulatory information embedded in the genome is critical to our understanding of biology and the etiology of complex diseases. The regulation of gene expression is governed largely by the occupancy of transcription factors (TFs) at various cognate binding sites. Characterizing TF binding is particularly challenging since TF occupancy is not just complex but also dynamic. Current genome-wide surveys of TF binding sites typically use chromatin immunoprecipitation (ChIP), which is limited to measuring one TF at a time and is thus less scalable in profiling the dynamics of TF occupancy across cell types or conditions. This dissertation develops novel computational frameworks to model sequencing data from DNase and/or MNase nuclease digestion assays that allow multiple TFs to be surveyed in a single experiment, in both human and yeast. We predicted occupancy landscapes and constructed a cell-type specificity map for many TFs across human cell types, revealed novel relationships between TF occupancy and TF expression, and monitored the occupancy dynamics of various TFs in response to androgen and estrogen hormone stimulations. The TF/cell type occupancy matrix generated from our model expands the total output of the ENCODE ChIP-seq efforts by a factor of nearly 200. These computational frameworks serve as an innovative and cost-effective strategy that enables efficient, high-throughput profiling of TF occupancy landscapes across different cell types or dynamic conditions.
Item Open Access Modeling Time-Varying Networks with Applications to Neural Flow and Genetic Regulation (2010) Robinson, Joshua Westly
Many biological processes are effectively modeled as networks, but a frequent assumption is that these networks do not change during data collection. However, that assumption does not hold for many phenomena, such as neural growth during learning or changes in genetic regulation during cell differentiation. Approaches are needed that explicitly model networks as they change in time and that characterize the nature of those changes.
In this work, we develop a new class of graphical models in which the conditional dependence structure of the underlying data-generation process is permitted to change over time. We first present the model, explain how to derive it from Bayesian networks, and develop an efficient MCMC sampling algorithm that easily generalizes under varying levels of uncertainty about the data generation process. We then characterize the nature of evolving networks in several biological datasets.
We initially focus on learning how neural information flow networks change in songbirds with implanted electrodes. We characterize how they change in response to different sound stimuli and during the process of habituation. We continue to explore the neurobiology of songbirds by identifying changes in neural information flow in another habituation experiment using fMRI data. Finally, we briefly examine evolving genetic regulatory networks involved in Drosophila muscle differentiation during development.
We conclude by suggesting new experimental directions and statistical extensions to the model for predicting novel neural flow results.
Item Open Access Online Learning of Non-Stationary Networks, with Application to Financial Data (2012) Hongo, Yasunori
In this paper, we propose a new learning algorithm for non-stationary Dynamic Bayesian Networks (DBNs). Although a number of effective learning algorithms for non-stationary DBNs have previously been proposed and applied in Signal Processing and Computational Biology, those algorithms are based on batch learning and cannot be applied to online time-series data. We therefore propose a learning algorithm based on a Particle Filtering approach that can be applied to online time-series data. To evaluate our algorithm, we apply it to a simulated data set and a real-world financial data set. The results on the simulated data set show that our algorithm makes accurate estimates and detects changes. The results on the real-world financial data set show several features suggested in previous research, which also supports the effectiveness of our algorithm.
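As a rough illustration of why particle filtering suits online, non-stationary data, the sketch below tracks a single parameter that occasionally jumps. It is not the thesis's algorithm, which learns network structure; it is a plain bootstrap particle filter on a scalar mean, and the function name, the jump model, and all default parameter values are assumptions chosen for the example.

```python
import numpy as np

def track_changing_mean(observations, n_particles=1000, jump_prob=0.05,
                        jump_scale=5.0, drift_scale=0.05, obs_scale=1.0,
                        seed=0):
    """Bootstrap particle filter for a mean that may jump between time steps.

    Processes one observation at a time, so it can run on streaming data.
    Returns the filtered posterior-mean estimate after each observation.
    """
    rng = np.random.default_rng(seed)
    # Initial particles drawn from the (assumed) prior over the mean.
    particles = rng.normal(0.0, jump_scale, n_particles)
    estimates = []
    for y in observations:
        # Propagate: with probability jump_prob a particle is redrawn from
        # the prior (a regime change); otherwise it drifts slightly.
        jumps = rng.random(n_particles) < jump_prob
        particles = np.where(
            jumps,
            rng.normal(0.0, jump_scale, n_particles),
            particles + rng.normal(0.0, drift_scale, n_particles),
        )
        # Weight by the Gaussian likelihood of the new observation.
        log_w = -0.5 * ((y - particles) / obs_scale) ** 2
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        estimates.append(float(np.sum(w * particles)))
        # Resample so that particles concentrate on plausible values.
        particles = particles[rng.choice(n_particles, size=n_particles, p=w)]
    return np.array(estimates)

# Simulated stream whose mean shifts from 0 to 5 halfway through.
data_rng = np.random.default_rng(1)
stream = np.concatenate([data_rng.normal(0.0, 1.0, 50),
                         data_rng.normal(5.0, 1.0, 50)])
filtered = track_changing_mean(stream)
print(filtered[49], filtered[99])
```

Because a small fraction of particles is always redrawn from the prior, the filter can move to a new regime shortly after the shift without reprocessing earlier data, which is the property a batch learner lacks.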