Protein-DNA Binding: Discovering Motifs and Distinguishing Direct from Indirect Interactions

The initiation of two major processes in the eukaryotic cell, gene transcription and DNA replication, is regulated largely through interactions between proteins or protein complexes and DNA. Although much is known about the interacting proteins and their roles in regulating transcription and replication, the specific DNA binding motifs of many regulatory proteins and complexes remain to be determined. For this purpose, many computational tools for DNA motif discovery have been developed in the last two decades. These tools employ a variety of strategies, from exhaustive search to sampling techniques, in the hope of finding over-represented motifs in sets of co-regulated or co-bound sequences. Despite the variety of computational tools aimed at solving the problem of motif discovery, their ability to correctly detect known DNA motifs is still limited. The motifs are usually short and often degenerate, which makes them difficult to distinguish from genomic background. We believe the most efficient strategy for improving the performance of motif discovery is not to use increasingly complex computational and statistical methods and models, but to incorporate more of the biology into the computational techniques, in a principled manner. To this end, we propose a novel motif discovery algorithm: PRIORITY. Based on a general Gibbs sampling framework, PRIORITY has a major advantage over other motif discovery tools: it can incorporate different types of biological information (e.g., nucleosome positioning information) to guide the search for DNA binding sites toward regions where these sites are more likely to occur (e.g., nucleosome-free regions).
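To make the idea of guiding a Gibbs sampler with biological information concrete, the following is a minimal sketch and not the PRIORITY implementation. It assumes exactly one binding site per sequence, a uniform background model, and a user-supplied table of per-position prior weights (for example, higher weights in nucleosome-free regions). The function name gibbs_motif_search and all of its parameters are hypothetical and chosen only for illustration.

```python
import math
import random

ALPHABET = "ACGT"


def gibbs_motif_search(seqs, priors, width, iters=200, pseudo=0.5, seed=0):
    """Toy Gibbs sampler for motif discovery with a positional prior.

    Assumptions (illustrative only, not taken from the dissertation):
    one site per sequence, uniform background of 0.25 per base.
    seqs   : list of DNA strings over ACGT
    priors : priors[i][j] is a non-negative weight for a site starting
             at position j of seqs[i]; each row needs a positive entry
    width  : motif width
    Returns the sampled start position of the site in each sequence.
    """
    rng = random.Random(seed)
    n_pos = [len(s) - width + 1 for s in seqs]
    starts = [rng.randrange(n) for n in n_pos]

    for _ in range(iters):
        for i, seq in enumerate(seqs):
            # Estimate the motif from the current sites in all other sequences.
            counts = [{b: pseudo for b in ALPHABET} for _ in range(width)]
            for k, other in enumerate(seqs):
                if k == i:
                    continue
                for c in range(width):
                    counts[c][other[starts[k] + c]] += 1
            totals = [sum(col.values()) for col in counts]

            # Weight each candidate start by prior * motif/background ratio.
            weights = []
            for j in range(n_pos[i]):
                log_ratio = 0.0
                for c in range(width):
                    p = counts[c][seq[j + c]] / totals[c]
                    log_ratio += math.log(p / 0.25)
                weights.append(priors[i][j] * math.exp(log_ratio))

            # Resample the site position for sequence i.
            starts[i] = rng.choices(range(n_pos[i]), weights=weights)[0]
    return starts


if __name__ == "__main__":
    demo_seqs = ["ACGTACGTAC", "TTTTACGTTT", "GGACGTGGGG"]
    # Uniform prior: every start position is equally likely a priori.
    uniform = [[1.0] * (len(s) - 4 + 1) for s in demo_seqs]
    print(gibbs_motif_search(demo_seqs, uniform, width=4))
```

With a uniform prior the sampler relies on sequence content alone. Replacing the uniform weights with weights derived from other data is where outside information enters: positions with zero prior weight are never sampled, and positions with larger weights are favored in proportion to how well they also match the current motif estimate.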

We use transcription factor (TF) binding data from yeast chromatin immunoprecipitation (ChIP-chip) experiments to test the performance of our motif discovery algorithm when incorporating three types of biological information: information about nucleosome positioning, information about DNA double-helical stability, and evolutionary conservation information. In each case, incorporating additional biological information has proven very useful in increasing the accuracy of motif finding, with the number of correctly identified motifs increasing by up to 52%. PRIORITY is not restricted to TF binding data. In this work, we also analyze origin recognition complex (ORC) binding data and show that PRIORITY can utilize DNA structural information to predict the binding specificity of the yeast ORC.

Despite the improvement obtained using additional biological information, the success of motif discovery algorithms in identifying known motifs is still limited, especially when applied to sequences bound in vivo (such as those from ChIP-chip experiments), because the observed protein-DNA interactions are not necessarily direct. Some TFs associate with DNA only indirectly via protein partners, while others exhibit both direct and indirect binding. We propose a novel method to distinguish between direct and indirect TF-DNA interactions, integrating in vivo TF binding data, in vivo nucleosome occupancy data, and in vitro motifs from protein binding microarrays. When applied to yeast ChIP-chip data, our method reveals that only 48% of the ChIP-chip data sets can be readily explained by direct binding of the profiled TF, while 16% can be explained by indirect DNA binding. In the remaining 36%, we found that none of the motifs used in our analysis was able to explain the ChIP-chip data, either because the data were too noisy or because the set of motifs was incomplete. As more in vitro motifs become available, our method can be used to build a complete catalog of direct and indirect TF-DNA interactions.





Gordan, Raluca Mihaela (2009). Protein-DNA Binding: Discovering Motifs and Distinguishing Direct from Indirect Interactions. Dissertation, Duke University. Retrieved from


Duke's student scholarship is made available to the public using a Creative Commons Attribution / Non-commercial / No derivative (CC-BY-NC-ND) license.