Prioritizing Genomic Sequences Using Evolutionary Signatures in Humans and Near-Human Primates

dc.contributor.advisor

Wray, Greg GAW

dc.contributor.advisor

Allen, Andrew ASA

dc.contributor.author

Duan, Yuncheng

dc.date.accessioned

2024-06-06T13:45:09Z

dc.date.issued

2024

dc.department

Biology

dc.description.abstract

The identification of genes subject to purifying selection has important applications in disease genetics. The rapid growth in the number of available human genomes has greatly increased the power of population genetic analyses to detect such genes. This study takes into account the variability in mutability across the genome and the fact that the impact of different types of mutations on a gene is influenced by the gene's specific functions. We explicitly modeled sequence-context-dependent mutation rates and directly compared the standing variants of genes to neutral expectations given those mutation rates. We developed a likelihood ratio test (LRT) based on a multinomial model involving five classes of variants to test, for each gene, the deviation of the observed composition of variant types from expectation. The p-value of the LRT was taken as the measure of overall constraint of the gene. By comparing constraints across variant classes, we categorized genes into different mechanistic functional groups. The results were validated by AUC analyses of several functional and disease-relevant gene sets. Our LRT outperformed several existing constraint methods in most of the gene sets. Importantly for diagnostic applications, we find that genes more constrained for missense variants than for loss-of-function (LOF) variants are more likely to contain more ClinVar pathogenic missense variants than pathogenic LOF variants. Differences in missense and LOF constraint are correlated with the differences in the pathogenic missense and LOF variant counts in ClinVar. We also showed that non-eGenes from GTEx show greater intolerance than eGenes. Finally, we applied the functional constraint idea to non-coding regions of the human genome and saw accuracy increase as the threshold for calling functional variants became more stringent. Our analysis is based on genetic data from a population-level database (gnomAD v2) and can be used with other large datasets in the future.
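As an illustration of the kind of test described above, the following is a minimal sketch of a multinomial likelihood ratio (G) test on five variant-class counts. It is not the dissertation's implementation: the function names, the example counts, and the assumption that neutral expectations arrive as fixed per-class proportions are all hypothetical, and the five classes are left unnamed because the abstract does not list them. With five classes the test has four degrees of freedom, for which the chi-square tail probability has a closed form.

```python
import math


def chi2_sf_even_df(x, df):
    """Chi-square survival function, closed form valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j          # builds (x/2)^j / j!
        total += term
    return math.exp(-half) * total


def multinomial_lrt(observed, expected_props):
    """G-test of observed class counts against fixed expected proportions.

    observed       -- variant counts per class (hypothetical input)
    expected_props -- neutral expectation per class, e.g. derived from a
                      mutation-rate model (assumed to be supplied)
    Returns (G statistic, p-value) with df = number of classes - 1.
    """
    if len(observed) != len(expected_props):
        raise ValueError("observed and expected must have the same length")
    n = sum(observed)
    scale = sum(expected_props)
    g = 0.0
    for count, prop in zip(observed, expected_props):
        if count > 0:             # a zero count contributes nothing to G
            g += 2.0 * count * math.log(count / (n * prop / scale))
    g = max(g, 0.0)               # guard against tiny negative rounding error
    return g, chi2_sf_even_df(g, len(observed) - 1)


# Hypothetical gene: five class counts versus made-up neutral proportions.
g_stat, p_value = multinomial_lrt([4, 12, 55, 20, 9], [0.2, 0.2, 0.3, 0.2, 0.1])
print(g_stat, p_value)
```

A smaller p-value here means the observed mix of variant classes departs more strongly from the neutral expectation, which is the sense in which the abstract uses the p-value as a constraint measure.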
Another way to understand the essential regions of the human genome is to conduct comparative multi-omic analyses across multiple near-human species. Our focus was particularly on non-coding regulatory regions, aiming to functionally characterize how cis and trans non-coding genetic variation affects chromatin accessibility. In this study, we jointly analyzed DNase-seq/ATAC-seq, whole genome sequencing (WGS), and RNA-seq data across five primate species: human, chimpanzee, gorilla, orangutan, and macaque. We developed a cross-species analysis pipeline to minimize bias while maximizing data retention. The pipeline includes cross-species and cross-replicate alignment, and functionality estimates for SNVs and indels. We also developed methods to partition the variation in accessibility into cis and trans effects. The pipeline, along with our variance partitioning techniques, is adaptable for use with other multi-omic datasets from various species. For cis effects, we focused on transcription factor (TF) binding, including the presence of binding sites and the changes in binding affinity due to variants (including indels) across species and replicates. We identified TFs associated with either increased or decreased chromatin accessibility. For trans effects, we analyzed TF abundance using RNA-seq data. Principal component analysis (PCA) was used to extract low-dimensional representations of both the cis and trans effects. This was followed by canonical correlation analysis, alongside orthogonalization, regression, and type II ANOVA with precision weighting, to dissect the variation. We identified species-specific differential regions and ubiquitously open regions, and intersected those regions with evolutionary constraint and conservation scores as well as with the cis and trans effects partitioning.
This comprehensive analysis led to a biological interpretation that offers new insights into the evolution of non-coding regions in humans and closely related primates.
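The cis/trans partitioning step can be sketched, under strong simplifying assumptions, as a type II sum-of-squares decomposition in a weighted regression. The sketch below is not the dissertation's pipeline: it skips the PCA and canonical correlation steps and assumes each sample already has a single cis score and a single trans score (for example, a leading component), plus a precision weight. All names and the toy data are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("design matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def weighted_rss(columns, y, w):
    """Weighted residual sum of squares for y ~ intercept + columns."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(rows[0])
    xtwx = [[sum(w[i] * rows[i][a] * rows[i][b] for i in range(n))
             for b in range(k)] for a in range(k)]
    xtwy = [sum(w[i] * rows[i][a] * y[i] for i in range(n)) for a in range(k)]
    beta = solve_linear(xtwx, xtwy)
    fitted = [sum(bj * xj for bj, xj in zip(beta, row)) for row in rows]
    return sum(w[i] * (y[i] - fitted[i]) ** 2 for i in range(n))


def type2_partition(cis, trans, y, w):
    """Type II sums of squares: each term is adjusted for the other."""
    full = weighted_rss([cis, trans], y, w)
    return {
        "cis": weighted_rss([trans], y, w) - full,    # cis given trans
        "trans": weighted_rss([cis], y, w) - full,    # trans given cis
        "residual": full,
    }


# Toy data: accessibility driven entirely by the cis score.
cis_score = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
trans_score = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
accessibility = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]
weights = [1.0] * 6   # precision weights, e.g. inverse variances (assumed)
print(type2_partition(cis_score, trans_score, accessibility, weights))
```

Type II sums of squares are used here because each effect is assessed after adjusting for the other, so the result does not depend on the order in which the cis and trans terms enter the model.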

dc.identifier.uri

https://hdl.handle.net/10161/30905

dc.rights.uri

https://creativecommons.org/licenses/by-nc-nd/4.0/

dc.subject

Biology

dc.subject

Bioinformatics

dc.subject

Biostatistics

dc.subject

Comparative Multi-Omics

dc.subject

Functional Constraint

dc.subject

Genomics

dc.subject

Human Diseases

dc.subject

Primates

dc.subject

Regulatory Regions

dc.title

Prioritizing Genomic Sequences Using Evolutionary Signatures in Humans and Near-Human Primates

dc.type

Dissertation

duke.embargo.months

12

duke.embargo.release

2025-06-06T13:45:09Z

Files

Original bundle

Name:
Duan_duke_0066D_17900.pdf
Size:
11.97 MB
Format:
Adobe Portable Document Format
