dc.description.abstract |
<p>In this work, I present four studies across the range of 'omics data types: a
genome-wide association study for gene-by-sex interaction of obesity traits, computational
models for transcription start site classification, an assessment of reference-based
mapping methods for RNA-Seq data from non-model organisms, and a statistical model
for open-platform proteomics data alignment.</p><p>Obesity is an increasingly prevalent
and severe health concern with a substantial heritable component and marked sex differences.
We sought to determine if the effect of genetic variants also differed by sex by performing
a genome-wide association study modeling the effect of genotype-by-sex interaction
on obesity phenotypes. Genotype data from individuals in the Framingham Heart Study
Offspring cohort were analyzed across five exams. Although no variants showed genome-wide
significant gene-by-sex interaction in any individual exam, four polymorphisms displayed
a consistent BMI association (P-values 0.00186 to 0.00010) across all five exams. These
variants were clustered downstream of LYPLAL1, which encodes a lipase/esterase expressed
in adipose tissue, a locus previously identified as having sex-specific effects on
central obesity. Primary effects in males were opposite in direction to those in females
and were replicated in Framingham Generation 3. Our data support a sex-influenced
association between genetic variation at the LYPLAL1 locus and obesity-related traits.</p><p>The
application of deep sequencing to map 5' capped transcripts has confirmed the existence
of at least two distinct promoter classes in metazoans: focused promoters with transcription
start sites (TSSs) that occur in a narrowly defined genomic span and dispersed promoters
with TSSs that are spread over a larger window. Previous studies have explored the
presence of genomic features, such as CpG islands and sequence motifs, in these promoter
classes, and our collaborators recently investigated the relationship with chromatin
features. It was found that promoter classes are significantly differentiated by nucleosome
organization and chromatin structure. Here, we present computational models supporting
the stronger contribution of chromatin features to the definition of dispersed promoters
compared to focused start sites. Specifically, dispersed promoters display enrichment
for well-positioned nucleosomes downstream of the TSS and a more clearly defined nucleosome-free
region upstream, while focused promoters have a less organized nucleosome structure,
yet a higher presence of RNA polymerase II. These differences extend to histone variants
(H2A.Z) and marks (H3K4 methylation), as well as insulator binding (such as CTCF),
independent of the expression levels of affected genes.</p><p>The application of next-generation
sequencing technology to gene expression quantification analysis, namely, RNA-Sequencing,
has transformed the way in which gene expression studies are conducted and analyzed.
These advances are of particular interest to researchers studying non-model organisms,
as the need for knowledge of sequence information is overcome. De novo assembly
methods have gained widespread acceptance in the RNA-Seq community for non-model organisms
with no true reference genome or transcriptome. While such methods have tremendous
utility, computational complexity is still a significant challenge for organisms with
large and complex genomes. Here we present a comparison of four reference-based mapping
methods for non-human primate data. We explore mapping efficacy, correlation between
computed expression values, and utility for differential expression analyses. We show
that reference-based mapping methods indeed have utility in RNA-Seq analysis of mammalian
data with no true reference, and that the details of mapping methods should be carefully
considered when doing so. We find that shorter seed sequences, allowance of mismatches,
and allowance of gapped alignments in addition to splice junction gaps result in
more sensitive alignments of non-human primate RNA-Seq data.</p><p>Open-platform proteomics
experiments seek to quantify and identify the proteins present in biological samples.
Much like differential gene expression analyses, it is often of interest to determine
how protein abundance differs in various physiological conditions. Label-free LC-MS/MS
enables the rapid measurement of thousands of proteins, providing a wealth of peptide
intensity information for differential analysis. However, the processing of raw proteomics
data poses significant challenges that must be overcome prior to analysis. We specifically
address the matching of peptide measurements across samples, an essential pre-processing
step in every proteomics experiment. Presented here is a novel method for open-platform
proteomics data alignment with the ability to incorporate previously unused aspects
of the data, particularly ion mobility drift times and product ion data. Our results
suggest that the inclusion of additional data results in higher numbers of more confident
matches, without increasing the number of mismatches. We also show that the incorporation
of product ion data can improve results dramatically. Based on these results, we argue
that the incorporation of ion mobility drift times and product ion information is
a worthy pursuit. In addition, alignment methods should be flexible enough to utilize
all available data, particularly with recent advancements in experimental separation
methods. The addition of drift times and/or high energy to alignment methods and accurate
mass and time (AMT) tag databases can greatly improve experimenters' ability to identify
measured peptides, reducing analysis costs and potentially the need to run additional
experiments.</p>
|
|