Towards a Complete Transcriptional Regulatory Code: Improved Motif Discovery Using Informative Priors
Transcriptional regulation is the primary mechanism employed by the cell to ensure coordinated expression of its numerous genes. A key component of this process is the binding of proteins called transcription factors (TFs) to corresponding regulatory sites on the DNA. Understanding exactly where these TFs bind, under what conditions they are active, and which genes they regulate are all part of deciphering the transcriptional regulatory code. An important step towards solving this problem is the identification of DNA binding specificities, represented as motifs, for all TFs. In spite of an explosion of TF binding data from high-throughput technologies, the problem of motif discovery remains unsolved because binding sites are short and degenerate.
We introduce PRIORITY, a Gibbs sampling-based approach that incorporates informative positional priors into a probabilistic framework to find significant motifs in high-throughput TF binding data. We use different data sources to build our positional priors and apply them to yeast ChIP-chip data:
* TFs can be classified into several structural classes based on their DNA-binding domains. Using a Bayesian learning algorithm, we show that it is possible to predict the class of a TF with remarkable accuracy, using information solely from its DNA binding sites. We further incorporate these results in the form of informative priors into PRIORITY, which learns the structural class of the TF in addition to its motif.
* In the nucleus, DNA is present in the form of chromatin--wrapped around nucleosomes--with certain regions being more accessible to TFs than others. It has been shown that functional binding sites are generally located in nucleosome-free regions. We use nucleosome occupancy predictions to compute a novel positional prior that biases the search towards the more accessible regions, thereby enriching the motif signal.
* Functional elements are often conserved across related species. Most conventional methods that exploit this fact use alignments. However, multiple alignments cannot always capture relocation and reversed orientation of binding sites across species. We propose a new alignment-free technique that not only accounts for these transformations, but is much faster than conventional methods.
All our priors significantly outperform conventional methods, finding motifs that match the literature for 52 TFs. We produce a genome-wide map of TF binding sites in yeast based on these and other novel motif predictions.
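To make the idea of a positional prior concrete, the following is a minimal sketch, not PRIORITY's actual implementation. It assumes a per-base nucleosome occupancy score between 0 and 1, a fixed motif width, and a simple position weight matrix; the function names (`accessibility_prior`, `site_posterior`, `sample_site`) and the averaging rule used to turn occupancy into a prior are illustrative assumptions. It shows how one sampling step could weight each candidate start position by both its prior and its motif-versus-background likelihood ratio.

```python
import random


def accessibility_prior(occupancy, width):
    """Turn per-base nucleosome occupancy (0..1) into a prior over motif starts.

    Assumption: a window's accessibility is the mean of (1 - occupancy)
    over its bases; the prior is that accessibility, normalized to sum to 1.
    """
    n_starts = len(occupancy) - width + 1
    scores = [
        sum(1.0 - o for o in occupancy[i:i + width]) / width
        for i in range(n_starts)
    ]
    total = sum(scores)
    if total == 0:
        # Fully occupied sequence: fall back to a uniform prior.
        return [1.0 / n_starts] * n_starts
    return [s / total for s in scores]


def site_posterior(seq, pwm, background, prior):
    """Posterior over start positions: prior times motif/background ratio."""
    width = len(pwm)
    weights = []
    for i, p in enumerate(prior):
        ratio = 1.0
        for j in range(width):
            base = seq[i + j]
            ratio *= pwm[j][base] / background[base]
        weights.append(p * ratio)
    total = sum(weights)
    return [w / total for w in weights]


def sample_site(posterior, rng):
    """Draw one start position, as a single Gibbs sampling step would."""
    return rng.choices(range(len(posterior)), weights=posterior, k=1)[0]


# Hypothetical toy data: the motif TGA occurs at positions 0 and 7,
# but only the second copy lies in a low-occupancy (accessible) region.
seq = "TGACCCCTGA"
occupancy = [0.9] * 5 + [0.1] * 5
background = {b: 0.25 for b in "ACGT"}
pwm = [
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
]

prior = accessibility_prior(occupancy, width=len(pwm))
posterior = site_posterior(seq, pwm, background, prior)
start = sample_site(posterior, random.Random(0))
```

In this toy case both copies of the motif have the same likelihood ratio, so the prior alone decides between them and the accessible copy receives most of the posterior mass. A full sampler would also re-estimate the motif from the currently sampled sites and iterate over all sequences, which this sketch leaves out.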
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.