Developing Quantitative Models in Analyzing High-throughput Sequencing Data

Diverse functional genomics assays have been developed to investigate complex gene regulation under various biological conditions. For example, RNA-seq has been used to capture gene expression in diverse human tissues, helping to study tissue-common and tissue-specific gene regulation. ChIP-seq has been used to identify the genomic regions bound by numerous transcription factors, helping to identify collaborative and competitive binding mechanisms among those transcription factors. Despite this large increase in the amount and accessibility of genomic data, several challenges remain in analyzing these data with proper statistical methods. Some assays, such as STARR-seq, lack a statistical model that detects both activated and repressed regulatory elements, so researchers depend on statistical models developed for other assays. Other assays, such as ChIP-seq and RNA-seq, have few joint analysis models that are both flexible and computationally scalable, which limits the statistical power to identify genomic regions or genes shared by multiple biological conditions. To address these challenges in analyzing high-throughput assays, we first developed a statistical model called correcting reads and analysis of differential active elements (CRADLE) to analyze STARR-seq data. CRADLE removes technical biases that can confound quantification of regulatory activity and then detects both activated and repressed regulatory elements. We observed that the corrected read counts improved the visualization of regulatory activity, allowing for more accurate detection of regulatory elements. Indeed, through a simulation study, we showed that CRADLE significantly improved precision and recall in detecting regulatory elements compared to previous statistical approaches, and that the improvement was especially prominent in identifying repressed regulatory elements.
Building on our work on CRADLE, we adapted its statistical framework and developed a joint analysis model of multiple data for biology (JAMMY) that can be applied to diverse high-throughput sequencing data. JAMMY is a flexible statistical model that jointly analyzes multiple conditions, identifies condition-shared and condition-specific genomic regions, and then quantifies, for each genomic region, the preferential activity of a subset of biological conditions. We applied JAMMY to STARR-seq, ChIP-seq, and RNA-seq data and observed that JAMMY generally improved precision and recall in identifying condition-shared activity compared to traditional condition-by-condition analysis. This gain in statistical power from the joint analysis led us to find a novel co-binding of two transcription factors in our study. These results show the substantial advantages of using a joint analysis model to integrate genomic data from multiple biological conditions.





Kim, Young-Sook (2021). Developing Quantitative Models in Analyzing High-throughput Sequencing Data. Dissertation, Duke University. Retrieved from


Duke's student scholarship is made available to the public using a Creative Commons Attribution / Non-commercial / No derivative (CC-BY-NC-ND) license.