Computational Processing of Omics Data: Implications for Analysis

Thumbnail Image




Wray, Gregory A

Journal Title

Journal ISSN

Volume Title

Repository Usage Stats



In this work, I present four studies across the range of 'omics data types - a Genome- Wide Association Study for gene-by-sex interaction of obesity traits, computational models for transcription start site classification, an assessment of reference-based mapping methods for RNA-Seq data from non-model organisms, and a statistical model for open-platform proteomics data alignment.

Obesity is an increasingly prevalent and severe health concern with a substantial heritable component, and marked sex differences. We sought to determine if the effect of genetic variants also differed by sex by performing a genome-wide association study modeling the effect of genotype-by-sex interaction on obesity phenotypes. Genotype data from individuals in the Framingham Heart Study Offspring cohort were analyzed across five exams. Although no variants showed genome-wide significant gene-by-sex interaction in any individual exam, four polymorphisms displayed a consistent BMI association (P-values .00186 to .00010) across all five exams. These variants were clustered downstream of LYPLAL1, which encodes a lipase/esterase expressed in adipose tissue, a locus previously identified as having sex-specific effects on central obesity. Primary effects in males were in the opposite direction as females and were replicated in Framingham Generation 3. Our data support a sex-influenced association between genetic variation at the LYPLAL1 locus and obesity-related traits.

The application of deep sequencing to map 5' capped transcripts has confirmed the existence of at least two distinct promoter classes in metazoans: focused promot- ers with transcription start sites (TSSs) that occur in a narrowly defined genomic span and dispersed promoters with TSSs that are spread over a larger window. Pre- vious studies have explored the presence of genomic features, such as CpG islands and sequence motifs, in these promoter classes, and our collaborators recently inves- tigated the relationship with chromatin features. It was found that promoter classes are significantly differentiated by nucleosome organization and chromatin structure. Here, we present computational models supporting the stronger contribution of chro- matin features to the definition of dispersed promoters compared to focused start sites. Specifically, dispersed promoters display enrichment for well-positioned nucleosomes downstream of the TSS and a more clearly defined nucleosome free region upstream, while focused promoters have a less organized nucleosome structure, yet higher presence of RNA polymerase II. These differences extend to histone vari- ants (H2A.Z) and marks (H3K4 methylation), as well as insulator binding (such as CTCF), independent of the expression levels of affected genes.

The application of next-generation sequencing technology to gene expression quantification analysis, namely, RNA-Sequencing, has transformed the way in which gene expression studies are conducted and analyzed. These advances are of partic- ular interest to researchers studying non-model organisms, as the need for knowl- edge of sequence information is overcome. De novo assembly methods have gained widespread acceptance in the RNA-Seq community for non-model organisms with no true reference genome or transcriptome. While such methods have tremendous utility, computational complexity is still a significant challenge for organisms with large and complex genomes. Here we present a comparison of four reference-based mapping methods for non-human primate data. We explore mapping efficacy, correlation between computed expression values, and utility for differential expression analyses. We show that reference-based mapping methods indeed have utility in RNA-Seq analysis of mammalian data with no true reference, and that the details of mapping methods should be carefully considered when doing so. We find that shorter seed sequences, allowance of mismatches, and allowance of gapped alignments, in addition to splice junction gaps result in more sensitive alignments of non-human primate RNA-Seq data.

Open-platform proteomics experiments seek to quantify and identify the proteins present in biological samples. Much like differential gene expression analyses, it is often of interest to determine how protein abundance differs in various physiological conditions. Label free LC-MS/MS enables the rapid measurement of thousands of proteins, providing a wealth of peptide intensity information for differential analysis. However, the processing of raw proteomics data poses significant challenges that must be overcome prior to analysis. We specifically address the matching of peptide measurements across samples - an essential pre-processing step in every proteomics experiment. Presented here is a novel method for open-platform proteomics data alignment with the ability to incorporate previously unused aspects of the data, particularly ion mobility drift times and product ion data. Our results suggest that the inclusion of additional data results in higher numbers of more confident matches, without increasing the number of mismatches. We also show that the incorporation of product ion data can improve results dramatically. Based on these results, we argue that the incorporation of ion mobility drift times and product ion information are worthy pursuits. In addition, alignment methods should be flexible enough to utilize all available data, particularly with recent advancements in experimental separation methods. The addition of drift times and/or high energy to alignment methods and accurate mass and time (AMT) tag databases can greatly improve experimenters ability to identify measured peptides, reducing analysis costs and potentially the need to run additional experiments.





Benjamin, Ashlee Marie (2013). Computational Processing of Omics Data: Implications for Analysis. Dissertation, Duke University. Retrieved from


Dukes student scholarship is made available to the public using a Creative Commons Attribution / Non-commercial / No derivative (CC-BY-NC-ND) license.