A Computational Framework for Understanding Transcription Factor-DNA Binding and Effects of Non-coding Variants

Date

2025

Abstract

Genome-wide association studies (GWAS) have identified hundreds of thousands of genetic variants associated with human traits and diseases, yet the vast majority of these variants lie in non-coding regions of the genome, and their molecular consequences remain largely unknown. A major mechanism through which non-coding variants act is by altering transcription factor (TF) binding. TFs are sequence-specific DNA-binding proteins that orchestrate gene regulation by recognizing specific DNA sequences, or motifs, within regulatory elements, reshaping chromatin environments, and guiding the activity of RNA polymerase. They are central to almost all aspects of cellular function: controlling the cell cycle, modulating responses to environmental stimuli, guiding cell development and differentiation, and enabling cellular reprogramming. Because TFs are essential for these functions, disruptions of TF binding caused by non-coding variants can propagate through gene regulatory networks and lead to changes in phenotype or contribute to disease risk. Understanding how non-coding variants affect TF binding therefore remains a central challenge in functional genomics. This dissertation addresses this challenge in four parts. Chapter 1 (introduction) provides the biological and computational background for the problem. It reviews the roles of TFs in gene regulation, the importance of modeling the effects of non-coding variants on TF binding, and the major experimental platforms for studying TF binding, including in vivo technologies, which map TF occupancy in cellular contexts, and in vitro technologies, which characterize intrinsic binding preferences. The chapter also summarizes existing computational approaches, highlighting both their contributions and their limitations. Chapter 2 introduces QBiC-SELEX, a modular computational method designed to predict the effects of genetic variants on TF binding using HT-SELEX data.
QBiC-SELEX addresses systematic biases across SELEX-based platforms by explicitly modeling experimental artifacts. Trained on over 4,000 HT-SELEX experiments covering more than 1,000 human TFs, with a robust model curation strategy, QBiC-SELEX consistently outperforms existing methods across in vitro and in vivo benchmarks. Importantly, QBiC-SELEX not only provides a large-scale resource of curated variant-effect models for more than 1,000 TFs but also offers new insights into how SELEX-based experiments can be designed more effectively and how computational methods can make better use of this type of data. Chapter 3 presents the second method, CtrlF-TF, which addresses the complementary problem of calling TF binding sites across a broad affinity range. Unlike traditional methods such as position weight matrices (PWMs), which prioritize high-affinity binding sites, CtrlF-TF uses k-mer data derived from universal protein binding microarrays (PBMs) and processed HT-SELEX data generated by QBiC-SELEX to better capture medium- and low-affinity binding sites, which are functionally important for regulatory activity. Benchmarks with PBM, ChIP-seq, and genomic footprinting data demonstrate that CtrlF-TF calls substantially more TF binding sites than PWMs while maintaining lower false positive rates. Chapter 4 summarizes these contributions and outlines a path forward. One potential opportunity is to integrate curated in vitro TF-DNA binding data into a small, TF-specialized pretrained genomic language model, which can then be fine-tuned with in vivo datasets such as ChIP-seq and ATAC-seq. This integrative approach has the potential to capture both the intrinsic binding preferences of TFs and context-specific occupancy with high specificity. Together, QBiC-SELEX and CtrlF-TF provide valuable community resources for variant-effect prediction and TF binding discovery.
They deepen our understanding of how SELEX-based data can be effectively modeled and demonstrate principles for building robust and interpretable methods in functional genomics, laying the foundation for future integrative approaches such as a genomic language model that unifies all TF data.
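
For readers unfamiliar with the PWM baseline mentioned in the abstract, the following is a minimal sketch of how a position weight matrix can be used to score the effect of a single-nucleotide variant on a binding site. It is a generic illustration only and is not the QBiC-SELEX or CtrlF-TF method. The matrix values, the uniform background frequency, the example sequences, and the function names (`pwm_score`, `best_score`, `variant_effect`) are all assumptions made up for this example.

```python
import math

# Hypothetical 4-position PWM (per-base probabilities at each position).
# The values are invented for illustration and describe no real TF.
PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
BACKGROUND = 0.25  # uniform background base frequency (assumption)


def pwm_score(site):
    """Log-odds score of a site with the same length as the PWM."""
    return sum(math.log2(col[base] / BACKGROUND) for col, base in zip(PWM, site))


def best_score(seq):
    """Highest log-odds score over all windows of seq (forward strand only)."""
    width = len(PWM)
    return max(pwm_score(seq[i:i + width]) for i in range(len(seq) - width + 1))


def variant_effect(ref_seq, alt_seq):
    """Change in the best PWM score caused by the variant (alt minus ref)."""
    return best_score(alt_seq) - best_score(ref_seq)


# Hypothetical reference sequence containing the consensus site ACGT,
# and an alternate allele that changes G to C inside that site.
ref = "TTACGTTT"
alt = "TTACCTTT"
print(variant_effect(ref, alt))  # negative: the variant weakens the best site
```

Because this kind of scoring rewards matches to the consensus, it tends to emphasize high-affinity sites, which is the limitation the abstract attributes to PWM-based approaches.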

Subjects

Bioinformatics, Genetics, Biostatistics, Functional Genomics, Gene Regulation, Non-coding Variants, Transcription Factor

Citation

Li, Shengyu (2025). A Computational Framework for Understanding Transcription Factor-DNA Binding and Effects of Non-coding Variants. Dissertation, Duke University. Retrieved from https://hdl.handle.net/10161/34141.

Except where otherwise noted, student scholarship that was shared on DukeSpace after 2009 is made available to the public under a Creative Commons Attribution / Non-commercial / No derivatives (CC-BY-NC-ND) license. All rights in student work shared on DukeSpace before 2009 remain with the author and/or their designee, whose permission may be required for reuse.