Population Genetic Annotation of the Human Genome: Identifying Pathogenic Mutations
In the past decade, there has been a series of breakthroughs in human genetics. The advent of next-generation sequencing (NGS) has made it possible, for the first time, to sequence an entire human genome inexpensively and efficiently. The affordability and ease of NGS have led to an explosion of data. The largest hurdle in human genetics has now shifted from the technological limitations of sequencing to developing a framework for interpreting the large amount of data that has been generated. Specifically, in medical genetics, the key challenge is recognizing which of a given patient's many mutations may be contributing to disease.
The most successful methodologies for this problem rely on evolutionary conservation. However, conservation cannot capture human-specific intolerance to variation. Other methodologies rely on biochemical annotations, which can indicate a genomic region's functional role but do not directly assess its intolerance to variation in the context of disease. As a result, even with these methodologies available, detecting causal variation remains very challenging.
In my thesis, I describe three methodologies, based on population genetics and standing human variation, which can help identify the regions of the genome that are most likely to cause disease when mutated. The first, subRVIS, focuses on sub-regions within genes. The second, ncRVIS, focuses on the regulatory regions of genes. The third, Orion, tackles the daunting problem of interpreting and prioritizing variants across the entire genome.
In Chapter 1, we review some of the history that has brought us to this point and some of the methodologies currently in use for detecting disease-causing variants.
In Chapter 2, we describe subRVIS, a methodology that divides the gene into sub-regions based on sequence homology to known protein domains, and then ranks those sub-regions based on their tolerance to functional variation. We show that this ranking is associated with the sub-region's likelihood of carrying a previously known pathogenic mutation. Further, we demonstrate that the biological division into domains adds significant information in comparison to dividing the gene into random regions matched in size. This methodology is useful in localizing where pathogenic mutations are most likely to fall within genes.
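As a rough illustration of how sub-regions might be ranked by tolerance, the sketch below regresses each sub-region's count of functional variants on its total variant count and ranks sub-regions by the residual, so that sub-regions carrying fewer functional variants than expected come first. The residual-based score, the function names, and the domain labels are assumptions made for illustration; the abstract does not specify the statistic used by subRVIS.

```python
from statistics import mean


def residual_scores(total_counts, functional_counts):
    """Fit functional ~ total by ordinary least squares and return the residuals.

    Assumption: a negative residual is read as "fewer functional variants
    than expected", i.e. a more intolerant sub-region.
    """
    x_bar = mean(total_counts)
    y_bar = mean(functional_counts)
    sxx = sum((x - x_bar) ** 2 for x in total_counts)
    if sxx == 0:
        raise ValueError("total_counts must not all be equal")
    sxy = sum(
        (x - x_bar) * (y - y_bar)
        for x, y in zip(total_counts, functional_counts)
    )
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return [
        y - (intercept + slope * x)
        for x, y in zip(total_counts, functional_counts)
    ]


def rank_subregions(names, total_counts, functional_counts):
    """Return (name, score) pairs sorted from most to least intolerant."""
    scores = residual_scores(total_counts, functional_counts)
    return sorted(zip(names, scores), key=lambda pair: pair[1])


# Hypothetical sub-regions of one gene, with made-up variant counts.
names = ["domA", "domB", "domC", "domD"]
totals = [10, 20, 30, 40]
functional = [5, 4, 15, 20]
ranked = rank_subregions(names, totals, functional)
```

With these made-up counts, the sub-region labelled domB is ranked first because it has the largest shortfall of functional variants relative to the fitted line.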
Chapter 3 describes ncRVIS, a methodology to rank genes based on the likelihood that mutations falling in their regulatory regions are pathogenic. We demonstrate that this ranking is associated with whether or not a gene is sensitive to changes in its dosage. This methodology is useful in assessing the pathogenicity of mutations occurring in known regulatory regions that have been associated with genes.
In Chapter 4, we tackle one of the most intimidating and challenging problems in the field of medical genetics: detecting intolerance to variation across the entire human genome. Using a sliding window, we generate a score per base to highlight the regions of the genome that are intolerant to variation, with higher scores corresponding to more intolerant sequence. We term this approach Orion. We demonstrate that exons and DNase hypersensitive sites are enriched for higher Orion scores. This methodology will transform the way whole-genome sequence data are interpreted, by giving researchers the ability to assess the pathogenicity of variants in regions of the genome that are not yet fully understood.
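The sliding-window idea can be sketched as follows. Each base receives a score computed from a window centered on it, and a higher score marks sequence with fewer variants than expected. The statistic used here (expected minus observed variant count under a uniform per-base rate), the parameter names, and the toy input are placeholders chosen for illustration; they are not the statistic that Orion uses, which the abstract does not describe.

```python
def sliding_window_scores(variant_flags, window_size, expected_rate):
    """Score every base from a window centered on it.

    variant_flags: one 0/1 entry per base (1 = a variant is observed there).
    window_size:   odd number of bases per window (truncated at the ends).
    expected_rate: assumed uniform per-base probability of a variant.

    Returns one score per base: expected minus observed variant count,
    so higher scores indicate a larger depletion of variation.
    """
    if window_size < 1 or window_size % 2 == 0:
        raise ValueError("window_size must be a positive odd integer")

    n = len(variant_flags)
    half = window_size // 2

    # Prefix sums give the variant count of any window in constant time.
    prefix = [0]
    for flag in variant_flags:
        prefix.append(prefix[-1] + flag)

    scores = []
    for i in range(n):
        start = max(0, i - half)
        end = min(n, i + half + 1)  # exclusive upper bound
        observed = prefix[end] - prefix[start]
        expected = expected_rate * (end - start)
        scores.append(expected - observed)
    return scores


# Toy sequence: variants cluster at both ends and the middle is depleted.
flags = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
scores = sliding_window_scores(flags, window_size=3, expected_rate=0.5)
```

In this toy input the variant-free middle of the sequence receives the highest scores, matching the convention that higher scores correspond to more intolerant sequence.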
We have developed methodologies to tackle the key problem of detecting disease-causing variation in patients' sequence data. In an era overwhelmed by NGS data, these methodologies bring us closer to understanding the genetics of disease.
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Rights for Collection: Duke Dissertations