Developing Quantitative Models in Analyzing High-throughput Sequencing Data

Kim, Young-Sook

dc.contributor.advisor	Reddy, Timothy E
dc.contributor.author	Kim, Young-Sook
dc.date.accessioned	2022-02-11T21:38:48Z
dc.date.available	2023-01-18T09:17:10Z
dc.date.issued	2021
dc.department	Biostatistics and Bioinformatics Doctor of Philosophy
dc.description.abstract	Diverse functional genomics assays have been developed to investigate complex gene regulation under various biological conditions. For example, RNA-seq has been used to measure gene expression in diverse human tissues, helping to study tissue-common and tissue-specific gene regulation. ChIP-seq has been used to identify the genomic regions bound by numerous transcription factors, helping to identify collaborative and competitive binding mechanisms among those factors. Despite this large increase in the amount and accessibility of genomic data, analyzing those data with proper statistical methods still poses several challenges. Some assays, such as STARR-seq, lack a statistical model that detects both activated and repressed regulatory elements, so researchers depend on models developed for other assays. For other assays, such as ChIP-seq and RNA-seq, few joint analysis models are both flexible and computationally scalable, which limits statistical power in identifying the genomic regions or genes shared by multiple biological conditions. To address these challenges, we first developed a statistical model for analyzing STARR-seq data called correcting reads and analysis of differential active elements (CRADLE). CRADLE removes technical biases that can confound quantification of regulatory activity and then detects both activated and repressed regulatory elements. We observed that the corrected read counts improved the visualization of regulatory activity, allowing for more accurate detection of regulatory elements. Indeed, in a simulation study, CRADLE significantly improved precision and recall in detecting regulatory elements compared with previous statistical approaches, and the improvement was especially prominent for repressed regulatory elements.
Building on CRADLE, we adapted its statistical framework to develop a joint analysis model of multiple data for biology (JAMMY) that can be applied to diverse high-throughput sequencing data. JAMMY is a flexible statistical model that jointly analyzes multiple conditions, identifies condition-shared and condition-specific genomic regions, and then quantifies the preferential activity of a subset of biological conditions for each genomic region. We applied JAMMY to STARR-seq, ChIP-seq, and RNA-seq data and observed that it generally improved precision and recall in identifying condition-shared activity compared with traditional condition-by-condition analysis. This gain in statistical power led us to find a novel co-binding of two transcription factors in our study. These results show the substantial advantages of using a joint analysis model to integrate genomic data from multiple biological conditions.
dc.identifier.uri	https://hdl.handle.net/10161/24384
dc.subject	Bioinformatics
dc.subject	ChIP-seq
dc.subject	joint analysis
dc.subject	quantitative models
dc.subject	RNA-seq
dc.subject	STARR-seq
dc.title	Developing Quantitative Models in Analyzing High-throughput Sequencing Data
dc.type	Dissertation
duke.embargo.months	11.178082191780822

Files

Original bundle

Name: Kim_duke_0066D_16473.pdf
Size: 7.45 MB
Format: Adobe Portable Document Format

Collections

Dissertations