Browsing by Subject "Bayes factor"
Item Open Access A Bayesian Model for Nucleosome Positioning Using DNase-seq Data (2015) Zhong, Jianling
As fundamental structural units of the chromatin, nucleosomes are involved in virtually all aspects of genome function. Different methods have been developed to map genome-wide nucleosome positions, including MNase-seq and a recent chemical method requiring genetically engineered cells. However, these methods are either low resolution and prone to enzymatic sequence bias or require genetically modified cells. The DNase I enzyme has been used to probe nucleosome structure since the 1960s, but in the current high throughput sequencing era, DNase-seq has mainly been used to study regulatory sequences known as DNase hypersensitive sites. This thesis shows that DNase-seq data is also very informative about nucleosome positioning. The distinctive oscillatory DNase I cutting patterns on nucleosomal DNA are shown and discussed. Based on these patterns, a Bayes factor is proposed to be used for distinguishing nucleosomal and non-nucleosomal genome positions. The results show that this approach is highly sensitive and specific. A Bayesian method that simulates the data generation process and can provide more interpretable results is further developed based on the Bayes factor investigations. Preliminary results on a test genomic region show that the Bayesian model works well in identifying nucleosome positioning. Estimated posterior distributions also agree with some known biological observations from external data. Taken together, methods developed in this thesis show that DNase-seq can be used to identify nucleosome positioning, adding great value to this widely utilized protocol.
Item Open Access Computational Inference of Genome-Wide Protein-DNA Interactions Using High-Throughput Genomic Data (2015) Zhong, Jianling
Transcriptional regulation has been studied intensively in recent decades. One important aspect of this regulation is the interaction between regulatory proteins, such as transcription factors (TF) and nucleosomes, and the genome. Different high-throughput techniques have been invented to map these interactions genome-wide, including ChIP-based methods (ChIP-chip, ChIP-seq, etc.), nuclease digestion methods (DNase-seq, MNase-seq, etc.), and others. However, a single experimental technique often only provides partial and noisy information about the whole picture of protein-DNA interactions. Therefore, the overarching goal of this dissertation is to provide computational developments for jointly modeling different experimental datasets to achieve a holistic inference on the protein-DNA interaction landscape.
We first present a computational framework that can incorporate the protein binding information in MNase-seq data into a thermodynamic model of protein-DNA interaction. We use a correlation-based objective function to model the MNase-seq data and a Markov chain Monte Carlo method to maximize the function. Our results show that the inferred protein-DNA interaction landscape is concordant with the MNase-seq data and provides a mechanistic explanation for the experimentally collected MNase-seq fragments. Our framework is flexible and can easily incorporate other data sources. To demonstrate this flexibility, we use prior distributions to integrate experimentally measured protein concentrations.
We also study the ability of DNase-seq data to position nucleosomes. Traditionally, DNase-seq has only been widely used to identify DNase hypersensitive sites, which tend to be open chromatin regulatory regions devoid of nucleosomes. We reveal for the first time that DNase-seq datasets also contain substantial information about nucleosome translational positioning, and that existing DNase-seq data can be used to infer nucleosome positions with high accuracy. We develop a Bayes-factor-based nucleosome scoring method to position nucleosomes using DNase-seq data. Our approach utilizes several effective strategies to extract nucleosome positioning signals from the noisy DNase-seq data, including jointly modeling data points across the nucleosome body and explicitly modeling the quadratic and oscillatory DNase I digestion pattern on nucleosomes. We show that our DNase-seq-based nucleosome map is highly consistent with previous high-resolution maps. We also show that the oscillatory DNase I digestion pattern is useful in revealing the nucleosome rotational context around TF binding sites.
Finally, we present a state-space model (SSM) for jointly modeling different kinds of genomic data to provide an accurate view of the protein-DNA interaction landscape. We also provide an efficient expectation-maximization algorithm to learn model parameters from data. We first show in simulation studies that the SSM can effectively recover the underlying true protein binding configurations. We then apply the SSM to model real genomic data (both DNase-seq and MNase-seq data). By incrementally increasing the types of genomic data in the SSM, we show that different data types can contribute complementary information for the inference of the protein binding landscape and that the most accurate inference comes from modeling all available datasets.
This dissertation provides a foundation for future research by taking a step toward the genome-wide inference of the protein-DNA interaction landscape through data integration.
Item Open Access Simulation Study on Exchangeability and Significant Test on Survey Data (2015) Cao, Yong
The two years of the Master of Science in Statistical and Economic Modeling program were the most rewarding time of my life. This thesis serves as a portfolio of the project and applied experience I gained while enrolled in the program. It summarizes my graduate study in two parts: Simulation Study of Exchangeability for Binary Data, and Summary of Summer Internship at Center for Responsible Lending. The Simulation Study of Exchangeability for Binary Data contains material from a team project, which was jointly performed by Sheng Jiang, Xuan Sun, and me. Abstracts for both projects follow in order.
(1) Simulation Study of Exchangeability for Binary Data
To investigate tractable Bayesian tests of exchangeability, this project considers a special case of nonexchangeable random sequences: Markov chains. Asymptotic results for the Bayes factor (BF) are derived. When the null hypothesis is true, the Bayes factor in favor of the null goes to infinity at a geometric rate (when the true odds is not one half). When the null hypothesis is not true, the Bayes factor in favor of the null goes to 0 faster than a geometric rate. The results are robust under misspecification. Simulation studies are used to assess the performance of the test when the sample size is small, when prior beliefs change, and when the true parameters change.
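As an illustration of the kind of Bayes factor discussed in this abstract, the following is a minimal sketch, not the thesis's actual procedure. It assumes a binary sequence, an exchangeable null model (i.i.d. Bernoulli with a Beta prior), and a first-order Markov chain alternative with independent Beta priors on each transition row. The function names, the Beta(1, 1) default priors, and the assumption that the first symbol is equally likely to be 0 or 1 under the alternative are all illustrative choices not taken from the source.

```python
import math


def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_marginal_iid(seq, a=1.0, b=1.0):
    """Log marginal likelihood under i.i.d. Bernoulli(p), p ~ Beta(a, b)."""
    ones = sum(seq)
    zeros = len(seq) - ones
    return log_beta(a + ones, b + zeros) - log_beta(a, b)


def log_marginal_markov(seq, a=1.0, b=1.0):
    """Log marginal likelihood under a first-order Markov chain.

    Each row P(next = 1 | prev = i) gets its own Beta(a, b) prior.
    Assumption: the first symbol is 0 or 1 with probability 1/2.
    """
    counts = [[0, 0], [0, 0]]  # counts[prev][cur]
    for prev, cur in zip(seq, seq[1:]):
        counts[prev][cur] += 1
    total = math.log(0.5)
    for row in counts:
        total += log_beta(a + row[1], b + row[0]) - log_beta(a, b)
    return total


def log_bayes_factor(seq, a=1.0, b=1.0):
    """Log Bayes factor in favor of the exchangeable (i.i.d.) null."""
    return log_marginal_iid(seq, a, b) - log_marginal_markov(seq, a, b)


if __name__ == "__main__":
    alternating = [0, 1] * 50  # strongly non-exchangeable pattern
    print("log BF (alternating):", log_bayes_factor(alternating))
```

A strongly negative log Bayes factor, as for the alternating sequence, counts as evidence against exchangeability under these assumed priors; changing `a` and `b` is one way to explore the sensitivity to prior beliefs that the abstract mentions.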
(2) Summary of Summer Internship at Center for Responsible Lending
My summer internship dealt with survey data from Social Science Research Solution about auto financing. The dataset includes about one thousand valid responses and 114 variables for each response. My exploratory statistical analysis uncovered many interesting findings. For example, African Americans and Latinos receive a 2.02% higher APR on average than white buyers, excluding the effects of relevant variables. In addition, Fisher's exact test is used extensively to assess the significance of a series of variables. Results are presented in organized tables, and findings are included in weekly reports. One example finding is that warranty add-ons on a financed car have a significant impact on all three aspects of a loan: Annual Percentage Rate, Loan Amount, and Monthly Payment.
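For readers unfamiliar with Fisher's exact test, the following is a minimal sketch of the two-sided test on a 2x2 contingency table, written with the Python standard library only. It is not the analysis code from the internship. The function name is illustrative, and the example table is hypothetical, not drawn from the survey data.

```python
from math import comb


def fisher_exact_two_sided(table):
    """Two-sided Fisher's exact test p-value for a 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    (a, b), (c, d) = table
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(k):
        # Probability that the top-left cell equals k given fixed margins.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_obs = prob(a)
    # Small relative tolerance guards against floating-point ties.
    return sum(p for p in (prob(k) for k in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))


if __name__ == "__main__":
    # Hypothetical counts: rows = add-on purchased (yes/no),
    # columns = loan outcome above/below some threshold.
    hypothetical = [[3, 1], [1, 3]]
    print("p-value:", fisher_exact_two_sided(hypothetical))
```

Because the test enumerates exact hypergeometric probabilities instead of relying on a large-sample approximation, it remains valid when some cells of the table have small counts.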