Prioritizing Genomic Sequences Using Evolutionary Signatures in Humans and Near-Human Primates

Limited Access
This item is unavailable until:
2025-06-06

Date

2024


Abstract

The identification of genes subject to purifying selection has important applications in disease genetics. The rapid increase in the number of available human genomes has greatly increased the power of population genetic analyses to detect such genes. This study accounts for variability in mutability across the genome and for the fact that the impact of different types of mutations on a gene depends on the gene's specific functions. We explicitly modeled sequence-context-dependent mutation rates and directly compared the standing variants of each gene to neutral expectations given those mutation rates. We developed a likelihood ratio test (LRT) based on a multinomial model with five classes of variants to test, for each gene, whether the observed composition of variant types deviates from expectation. The p-value of the LRT was taken as the measure of the gene's overall constraint. By comparing constraint across variant classes, we categorized genes into different mechanistic functional groups. The results were validated by AUC analyses of several functional and disease-relevant gene sets. Our LRT outperformed several existing constraint methods on most of the gene sets. Importantly for diagnostic applications, we find that genes that are more constrained for missense variants than for loss-of-function (LOF) variants are more likely to contain more ClinVar pathogenic missense variants than LOF variants. Differences between missense and LOF constraint are correlated with differences between pathogenic missense and LOF variant counts in ClinVar. We also showed that non-eGenes from GTEx have greater intolerance than eGenes. Finally, we applied the functional constraint idea to the non-coding regions of the human genome and saw an increase in accuracy when the threshold for calling functional variants was more stringent. Our analysis is based on genetic data from a population-level database (gnomAD v2) and can be used with other large datasets in the future.
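To make the multinomial likelihood ratio test concrete, the following is a minimal sketch of a generic goodness-of-fit LRT (G-test) on variant-class counts. It is not the dissertation's implementation: the abstract does not name the five classes or say how the neutral expectations are derived, so the function names `multinomial_lrt` and `chi2_sf_even_df`, the input counts, and the expected proportions are all illustrative assumptions. The sketch assumes the expected class proportions are fixed in advance (for example, from a mutation rate model), so the test has one fewer degree of freedom than there are classes.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable with an even df.

    For df = 2k the survival function has the closed form
    exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!, so no external library is needed.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only covers positive even df")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total


def multinomial_lrt(observed, expected_probs):
    """Likelihood ratio test of observed class counts against fixed proportions.

    observed       -- variant counts per class for one gene (hypothetical input)
    expected_probs -- neutral expected share of each class (hypothetical input)
    Returns (G statistic, degrees of freedom, p-value).
    """
    if len(observed) != len(expected_probs):
        raise ValueError("observed and expected_probs must have equal length")
    n = sum(observed)
    scale = sum(expected_probs)
    probs = [p / scale for p in expected_probs]  # normalize to sum to 1
    g = 0.0
    for count, p in zip(observed, probs):
        if count > 0:  # a zero count contributes nothing to the statistic
            g += 2.0 * count * math.log(count / (n * p))
    df = len(observed) - 1
    return g, df, chi2_sf_even_df(g, df)


if __name__ == "__main__":
    # Made-up counts for five unnamed variant classes in one gene.
    counts = [50, 10, 5, 25, 10]
    neutral = [0.2, 0.2, 0.2, 0.2, 0.2]
    g_stat, dof, p_value = multinomial_lrt(counts, neutral)
    print(g_stat, dof, p_value)
```

With five classes the statistic is referred to a chi-square distribution with four degrees of freedom, which is why the even-df closed form suffices here; a smaller p-value indicates a stronger departure from the neutral composition.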
Another way to understand the essential regions of the human genome is to conduct comparative multi-omic analyses across multiple near-human species. We focused on non-coding regulatory regions, aiming to functionally characterize how cis and trans non-coding genetic variation affects chromatin accessibility. In this study, we jointly analyzed DNase-seq/ATAC-seq, whole genome sequencing (WGS), and RNA-seq data across five primate species: human, chimpanzee, gorilla, orangutan, and macaque. We developed a cross-species analysis pipeline to minimize bias while maximizing data retention. The pipeline includes cross-species and cross-replicate alignment, and functionality estimates for SNVs and indels. We also developed methods to partition the variation in accessibility into cis and trans effects. The pipeline, along with our variance partitioning techniques, is adaptable for use with other multi-omic datasets from various species. For cis effects, we focused on transcription factor (TF) binding, including the presence of binding sites and the changes in binding affinity due to variants (including indels) across species and replicates. We identified TFs associated with either increased or decreased chromatin accessibility. For trans effects, we analyzed TF abundance using RNA-seq data. Principal component analysis (PCA) was used to extract a low-dimensional matrix to simplify both the cis and trans effects. This was followed by canonical correlation analysis, alongside orthogonalization, regression, and type II ANOVA with precision weighting, to dissect the variation. We identified species-specific differential regions and ubiquitously open regions, and intersected those regions with evolutionary constraint and conservation scores as well as with the cis and trans effect partitioning.
This comprehensive analysis led to a biological interpretation that offers new insights into the evolution of non-coding regions in humans and closely related primates.
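As an illustration of the cis and trans variance partitioning described above, here is a minimal sketch of type II sums of squares for a single region with one cis predictor and one trans predictor. It is a simplification under stated assumptions, not the dissertation's pipeline: it omits the PCA, canonical correlation, and precision-weighting steps, and the function names (`solve`, `rss`, `type2_partition`) and the toy data are hypothetical. In a type II decomposition, each term is credited with the drop in residual sum of squares when it is added to a model that already contains the other term.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        rest = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - rest) / m[i][i]
    return x


def rss(y, predictors):
    """Residual sum of squares of an OLS fit of y on an intercept plus predictors."""
    cols = [[1.0] * len(y)] + [list(p) for p in predictors]
    k = len(cols)
    xtx = [[sum(u * v for u, v in zip(cols[i], cols[j])) for j in range(k)]
           for i in range(k)]
    xty = [sum(u * v for u, v in zip(cols[i], y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(beta[i] * cols[i][r] for i in range(k)) for r in range(len(y))]
    return sum((obs - fit) ** 2 for obs, fit in zip(y, fitted))


def type2_partition(y, cis, trans):
    """Fractions of total variation in y attributed to cis, trans, and residual.

    y     -- accessibility of one region across samples (hypothetical input)
    cis   -- a cis score per sample, e.g. a binding affinity summary (assumed)
    trans -- a trans score per sample, e.g. a TF abundance summary (assumed)
    """
    total = rss(y, [])              # variation around the mean
    full = rss(y, [cis, trans])     # variation left after both terms
    ss_cis = rss(y, [trans]) - full     # cis adjusted for trans
    ss_trans = rss(y, [cis]) - full     # trans adjusted for cis
    return {
        "cis": ss_cis / total,
        "trans": ss_trans / total,
        "residual": full / total,
    }


if __name__ == "__main__":
    # Toy data: orthogonal cis and trans scores, y = 2 * cis + 1 * trans.
    cis_score = [-1.0, 1.0, -1.0, 1.0]
    trans_score = [-1.0, -1.0, 1.0, 1.0]
    access = [2.0 * c + t for c, t in zip(cis_score, trans_score)]
    print(type2_partition(access, cis_score, trans_score))
```

In this toy case the two predictors are orthogonal, so the fractions sum to one, with about 80% attributed to cis and about 20% to trans. When predictors are correlated, the type II fractions need not sum to one, because variation shared between cis and trans is credited to neither term.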

Citation

Duan, Yuncheng (2024). Prioritizing Genomic Sequences Using Evolutionary Signatures in Humans and Near-Human Primates. Dissertation, Duke University. Retrieved from https://hdl.handle.net/10161/30905.

Except where otherwise noted, student scholarship that was shared on DukeSpace after 2009 is made available to the public under a Creative Commons Attribution / Non-commercial / No derivatives (CC-BY-NC-ND) license. All rights in student work shared on DukeSpace before 2009 remain with the author and/or their designee, whose permission may be required for reuse.