Browsing by Subject "Databases, Genetic"
Item Open Access
A Genocentric Approach to Discovery of Mendelian Disorders. (American journal of human genetics, 2019-11)
Hansen, Adam W; Murugan, Mullai; Li, He; Khayat, Michael M; Wang, Liwen; Rosenfeld, Jill; Andrews, B Kim; Jhangiani, Shalini N; Coban Akdemir, Zeynep H; Sedlazeck, Fritz J; Ashley-Koch, Allison E; Liu, Pengfei; Muzny, Donna M; Task Force for Neonatal Genomics; Davis, Erica E; Katsanis, Nicholas; Sabo, Aniko; Posey, Jennifer E; Yang, Yaping; Wangler, Michael F; Eng, Christine M; Sutton, V Reid; Lupski, James R; Boerwinkle, Eric; Gibbs, Richard A
The advent of inexpensive, clinical exome sequencing (ES) has led to the accumulation of genetic data from thousands of samples from individuals affected with a wide range of diseases, but for whom the underlying genetic and molecular etiology of their clinical phenotype remains unknown. In many cases, detailed phenotypes are unavailable or poorly recorded and there is little family history to guide study. To accelerate discovery, we integrated ES data from 18,696 individuals referred for suspected Mendelian disease, together with relatives, in an Apache Hadoop data lake (Hadoop Architecture Lake of Exomes [HARLEE]) and implemented a genocentric analysis that rapidly identified 154 genes harboring variants suspected to cause Mendelian disorders. The approach did not rely on case-specific phenotypic classifications but was driven by optimization of gene- and variant-level filter parameters utilizing historical Mendelian disease-gene association discovery data.
Variants in 19 of the 154 candidate genes were subsequently reported as causative of a Mendelian trait and additional data support the association of all other candidate genes with disease endpoints.

Item Open Access
An Atlas of Genetic Variation Linking Pathogen-Induced Cellular Traits to Human Disease. (Cell host & microbe, 2018-08)
Wang, Liuyang; Pittman, Kelly J; Barker, Jeffrey R; Salinas, Raul E; Stanaway, Ian B; Williams, Graham D; Carroll, Robert J; Balmat, Tom; Ingham, Andy; Gopalakrishnan, Anusha M; Gibbs, Kyle D; Antonia, Alejandro L; eMERGE Network; Heitman, Joseph; Lee, Soo Chan; Jarvik, Gail P; Denny, Joshua C; Horner, Stacy M; DeLong, Mark R; Valdivia, Raphael H; Crosslin, David R; Ko, Dennis C
Pathogens have been a strong driving force for natural selection. Therefore, understanding how human genetic differences impact infection-related cellular traits can mechanistically link genetic variation to disease susceptibility. Here we report the Hi-HOST Phenome Project (H2P2): a catalog of cellular genome-wide association studies (GWAS) comprising 79 infection-related phenotypes in response to 8 pathogens in 528 lymphoblastoid cell lines. Seventeen loci surpass genome-wide significance for infection-associated phenotypes ranging from pathogen replication to cytokine production. We combined H2P2 with clinical association data from patients to identify a SNP near CXCL10 as a risk factor for inflammatory bowel disease. A SNP in the transcriptional repressor ZBTB20 demonstrated pleiotropy, likely through suppression of multiple target genes, and was associated with viral hepatitis.
These data are available on a web portal to facilitate interpreting human genome variation through the lens of cell biology and should serve as a rich resource for the research community.

Item Open Access
Analysis of the mouse transcriptome based on functional annotation of 60,770 full-length cDNAs. (Nature, 2002-12-05)
Okazaki, Y; Furuno, M; Kasukawa, T; Adachi, J; Bono, H; Kondo, S; Nikaido, I; Osato, N; Osato, N; Saito, R; Suzuki, H; Yamanaka, I; Kiyosawa, H; Yagi, K; Tomaru, Y; Hasegawa, Y; Nogami, A; Schönbach, C; Gojobori, T; Baldarelli, R; Hill, DP; Bult, C; Hume, DA; Hume, DA; Quackenbush, J; Schriml, LM; Kanapin, A; Matsuda, H; Batalov, S; Beisel, KW; Blake, JA; Bradt, D; Brusic, V; Chothia, C; Corbani, LE; Cousins, S; Dalla, E; Dragani, TA; Fletcher, CF; Forrest, A; Frazer, KS; Gaasterland, T; Gariboldi, M; Gissi, C; Godzik, A; Gough, J; Grimmond, S; Gustincich, S; Hirokawa, N; Jackson, IJ; Jarvis, ED; Kanai, A; Kawaji, H; Kawasawa, Y; Kedzierski, RM; King, BL; Konagaya, A; Kurochkin, IV; Lee, Y; Lenhard, B; Lyons, PA; Maglott, DR; Maltais, L; Marchionni, L; McKenzie, L; Miki, H; Nagashima, T; Numata, K; Okido, T; Pavan, WJ; Pertea, G; Pesole, G; Petrovsky, N; Pillai, R; Pontius, JU; Qi, D; Ramachandran, S; Ravasi, T; Reed, JC; Reed, DJ; Reid, J; Ring, BZ; Ringwald, M; Sandelin, A; Schneider, C; Semple, CAM; Setou, M; Shimada, K; Sultana, R; Takenaka, Y; Taylor, MS; Teasdale, RD; Tomita, M; Verardo, R; Wagner, L; Wahlestedt, C; Wang, Y; Watanabe, Y; Wells, C; Wilming, LG; Wynshaw-Boris, A; Yanagisawa, M; Yang, I; Yang, L; Yuan, Z; Zavolan, M; Zhu, Y; Zimmer, A; Carninci, P; Hayatsu, N; Hirozane-Kishikawa, T; Konno, H; Nakamura, M; Sakazume, N; Sato, K; Shiraki, T; Waki, K; Kawai, J; Aizawa, K; Arakawa, T; Fukuda, S; Hara, A; Hashizume, W; Imotani, K; Ishii, Y; Itoh, M; Kagawa, I; Miyazaki, A; Sakai, K; Sasaki, D; Shibata, K; Shinagawa, A; Yasunishi, A; Yoshino, M; Waterston, R; Lander, ES; Rogers, J; Birney, E; Hayashizaki, Y; FANTOM Consortium; RIKEN Genome Exploration Research Group Phase I & II Team
Only a small proportion of the mouse genome is transcribed into mature messenger RNA transcripts. There is an international collaborative effort to identify all full-length mRNA transcripts from the mouse, and to ensure that each is represented in a physical collection of clones. Here we report the manual annotation of 60,770 full-length mouse complementary DNA sequences. These are clustered into 33,409 'transcriptional units', contributing 90.1% of a newly established mouse transcriptome database. Of these transcriptional units, 4,258 are new protein-coding and 11,665 are new non-coding messages, indicating that non-coding RNA is a major component of the transcriptome. 41% of all transcriptional units showed evidence of alternative splicing. In protein-coding transcripts, 79% of splice variations altered the protein product. Whole-transcriptome analyses resulted in the identification of 2,431 sense-antisense pairs. The present work, completely supported by physical clones, provides the most comprehensive survey of a mammalian transcriptome so far, and is a valuable resource for functional genomics.

Item Open Access
Analysis of the mouse transcriptome for genes involved in the function of the nervous system. (Genome Res, 2003-06)
Gustincich, Stefano; Batalov, Serge; Beisel, Kirk W; Bono, Hidemasa; Carninci, Piero; Fletcher, Colin F; Grimmond, Sean; Hirokawa, Nobutaka; Jarvis, Erich D; Jegla, Tim; Kawasawa, Yuka; LeMieux, Julianna; Miki, Harukata; Raviola, Elio; Teasdale, Rohan D; Tominaga, Naoko; Yagi, Ken; Zimmer, Andreas; Hayashizaki, Yoshihide; Okazaki, Yasushi; RIKEN GER Group; GSL Members
We analyzed the mouse Representative Transcript and Protein Set for molecules involved in brain function. We found full-length cDNAs of many known brain genes and discovered new members of known brain gene families, including Family 3 G-protein coupled receptors, voltage-gated channels, and connexins.
We also identified previously unknown candidates for secreted neuroactive molecules. The existence of a large number of unique brain ESTs suggests an additional molecular complexity that remains to be explored. A list of genes containing CAG stretches in the coding region represents a first step in the potential identification of candidates for hereditary neurological disorders.

Item Open Access
Application of a rank-based genetic association test to age-at-onset data from the Collaborative Study on the Genetics of Alcoholism study. (BMC Genet, 2005-12-30)
Li, YJ; Martin, ER; Zhang, L; Allen, AS
Association studies of quantitative traits have often relied on methods in which a normal distribution of the trait is assumed. However, quantitative phenotypes from complex human diseases are often censored, highly skewed, or contaminated with outlying values. We recently developed a rank-based association method that takes into account censoring and makes no distributional assumptions about the trait. In this study, we applied our new method to age-at-onset data on ALDX1 and ALDX2. Both traits are highly skewed (skewness > 1.9) and often censored. We performed a whole genome association study of age at onset of the ALDX1 trait using Illumina single-nucleotide polymorphisms. Only slightly more than 5% of markers were significant. However, we identified two regions on chromosomes 14 and 15, which each have at least four significant markers clustering together. These two regions may harbor genes that regulate age at onset of ALDX1 and ALDX2.
Future fine mapping of these two regions with densely spaced markers is warranted.

Item Open Access
Avianbase: a community resource for bird genomics. (Genome Biol, 2015-01-29)
Eöry, Lél; Gilbert, M Thomas P; Li, Cai; Li, Bo; Archibald, Alan; Aken, Bronwen L; Zhang, Guojie; Jarvis, Erich; Flicek, Paul; Burt, David W
Giving access to sequence and annotation data for genome assemblies is important because, while facilitating research, it places both assembly and annotation quality under scrutiny, resulting in improvements to both. Therefore we announce Avianbase, a resource for bird genomics, which provides access to data released by the Avian Phylogenomics Consortium.

Item Open Access
Characterization of the standard and recommended CODIS markers. (Journal of forensic sciences, 2013-01)
Katsanis, Sara H; Wagner, Jennifer K
As U.S. courts grapple with constitutional challenges to DNA identification applications, judges are resting legal decisions on the fingerprint analogy, questioning whether the information from a DNA profile could, in light of scientific advances, reveal biomedically relevant information. While CODIS loci were selected largely because they lack phenotypic associations, how this criterion was assessed is unclear. To clarify their phenotypic relevance, we describe the standard and recommended CODIS markers within the context of what is known currently about the genome. We characterize the genomic regions and phenotypic associations of the 24 standard and suggested CODIS markers. None of the markers are within exons, although 12 are intragenic. No CODIS genotypes are associated with known phenotypes.
This study provides clarification of the genomic significance of the key identification markers and supports--independent of the forensic scientific community--that the CODIS profiles provide identification but not sensitive or biomedically relevant information.

Item Open Access
Detecting local haplotype sharing and haplotype association. (Genetics, 2014-07)
Xu, Hanli; Guan, Yongtao
A novel haplotype association method is presented, and its power is demonstrated. Relying on a statistical model for linkage disequilibrium (LD), the method first infers ancestral haplotypes and their loadings at each marker for each individual. The loadings are then used to quantify local haplotype sharing between individuals at each marker. A statistical model was developed to link the local haplotype sharing and phenotypes to test for association. We devised a novel method to fit the LD model, reducing the complexity from putatively quadratic to linear (in the number of ancestral haplotypes). Therefore, the LD model can be fitted to all study samples simultaneously, and, consequently, our method is applicable to big data sets. Compared to existing haplotype association methods, our method integrated out phase uncertainty, avoided arbitrariness in specifying haplotypes, and had the same number of tests as the single-SNP analysis. We applied our method to data from the Wellcome Trust Case Control Consortium and discovered eight novel associations between seven gene regions and five disease phenotypes. Among these, GRIK4, which encodes a protein that belongs to the glutamate-gated ionic channel family, is strongly associated with both coronary artery disease and rheumatoid arthritis.
A software package implementing methods described in this article is freely available at http://www.haplotype.org.

Item Open Access
Evolutionary characters, phenotypes and ontologies: curating data from the systematic biology literature. (PLoS One, 2010-05-20)
Dahdul, Wasila M; Balhoff, James P; Engeman, Jeffrey; Grande, Terry; Hilton, Eric J; Kothari, Cartik; Lapp, Hilmar; Lundberg, John G; Midford, Peter E; Vision, Todd J; Westerfield, Monte; Mabee, Paula M
BACKGROUND: The wealth of phenotypic descriptions documented in the published articles, monographs, and dissertations of phylogenetic systematics is traditionally reported in a free-text format, and it is therefore largely inaccessible for linkage to biological databases for genetics, development, and phenotypes, and difficult to manage for large-scale integrative work. The Phenoscape project aims to represent these complex and detailed descriptions with rich and formal semantics that are amenable to computation and integration with phenotype data from other fields of biology. This entails reconceptualizing the traditional free-text characters into the computable Entity-Quality (EQ) formalism using ontologies. METHODOLOGY/PRINCIPAL FINDINGS: We used ontologies and the EQ formalism to curate a collection of 47 phylogenetic studies on ostariophysan fishes (including catfishes, characins, minnows, knifefishes) and their relatives with the goal of integrating these complex phenotype descriptions with information from an existing model organism database (zebrafish, http://zfin.org). We developed a curation workflow for the collection of character, taxonomic and specimen data from these publications.
A total of 4,617 phenotypic characters (10,512 states) for 3,449 taxa, primarily species, were curated into EQ formalism (for a total of 12,861 EQ statements) using anatomical and taxonomic terms from teleost-specific ontologies (Teleost Anatomy Ontology and Teleost Taxonomy Ontology) in combination with terms from a quality ontology (Phenotype and Trait Ontology). Standards and guidelines for consistently and accurately representing phenotypes were developed in response to the challenges that were evident from two annotation experiments and from feedback from curators. CONCLUSIONS/SIGNIFICANCE: The challenges we encountered and many of the curation standards and methods for improving consistency that we developed are generally applicable to any effort to represent phenotypes using ontologies. This is because an ontological representation of the detailed variations in phenotype, whether between mutant or wildtype, among individual humans, or across the diversity of species, requires a process by which a precise combination of terms from domain ontologies are selected and organized according to logical relations. The efficiencies that we have developed in this process will be useful for any attempt to annotate complex phenotypic descriptions using ontologies. We also discuss some ramifications of EQ representation for the domain of systematics.

Item Open Access
Genes with high penetrance for syndromic and non-syndromic autism typically function within the nucleus and regulate gene expression. (Molecular autism, 2016-01)
Casanova, Emily L; Sharp, Julia L; Chakraborty, Hrishikesh; Sumi, Nahid Sultana; Casanova, Manuel F
BACKGROUND: Intellectual disability (ID), autism, and epilepsy share frequent yet variable comorbidities with one another. In order to better understand potential genetic divergence underlying this variable risk, we studied genes responsible for monogenic IDs, grouped according to their autism and epilepsy comorbidities.
METHODS: Utilizing 465 different forms of ID with known molecular origins, we accessed available genetic databases in conjunction with gene ontology (GO) to determine whether the genetics underlying ID diverge according to its comorbidities with autism and epilepsy and if genes highly penetrant for autism or epilepsy share distinctive features that set them apart from genes that confer comparatively variable or no apparent risk. RESULTS: The genetics of ID with autism are relatively enriched in terms associated with nervous system-specific processes and structural morphogenesis. In contrast, we find that ID with highly comorbid epilepsy (HCE) is modestly associated with lipid metabolic processes while ID without autism or epilepsy comorbidity (ID only) is enriched at the Golgi membrane. Highly comorbid autism (HCA) genes, on the other hand, are strongly enriched within the nucleus, are typically involved in regulation of gene expression, and, along with IDs with more variable autism, share strong ties with a core protein-protein interaction (PPI) network integral to basic patterning of the CNS. CONCLUSIONS: According to GO terminology, autism-related gene products are integral to neural development. While it is difficult to draw firm conclusions regarding IDs unassociated with autism, it is clear that the majority of HCA genes are tightly linked with general dysregulation of gene expression, suggesting that disturbances to the chronology of neural maturation and patterning may be key in conferring susceptibility to autism spectrum conditions.

Item Open Access
Genetic variation associated with childhood and adult stature and risk of MYCN-amplified neuroblastoma. (Cancer medicine, 2020-11)
Semmes, Eleanor C; Shen, Erica; Cohen, Jennifer L; Zhang, Chenan; Wei, Qingyi; Hurst, Jillian H; Walsh, Kyle M
Background
Neuroblastoma is the most common pediatric solid tumor. MYCN-amplification is an important negative prognostic indicator and inherited genetic contributions to risk are incompletely understood. Genetic determinants of stature increase risk of several adult and childhood cancers, but have not been studied in neuroblastoma despite elevated neuroblastoma incidence in children with congenital overgrowth syndromes.
Methods
We investigated the association between genetic determinants of height and neuroblastoma risk in 1538 neuroblastoma cases, stratified by MYCN-amplification status, and compared to 3390 European-ancestry controls using polygenic scores for birth length (five variants), childhood height (six variants), and adult height (413 variants). We further examined the UK Biobank to evaluate the association of known neuroblastoma risk loci and stature.
Results
An increase in the polygenic score for childhood stature, corresponding to a ~0.5 cm increase in pre-pubertal height, was associated with greater risk of MYCN-amplified neuroblastoma (OR = 1.14, P = .047). An increase in the polygenic score for adult stature, corresponding to a ~1.7 cm increase in adult height attainment, was associated with decreased risk of MYCN-amplified neuroblastoma (OR = 0.87, P = .047). These associations persisted in case-case analyses comparing MYCN-amplified to MYCN-unamplified neuroblastoma. No polygenic height scores were associated with MYCN-unamplified neuroblastoma risk. Previously identified genome-wide association study hits for neuroblastoma (N = 10) were significantly enriched for association with both childhood (P = 4.0 × 10^-3) and adult height (P = 8.9 × 10^-3) in >250 000 UK Biobank study participants.
Conclusions
Genetic propensity to taller childhood height and shorter adult height were associated with MYCN-amplified neuroblastoma risk, suggesting that biological pathways affecting growth trajectories and pubertal timing may contribute to MYCN-amplified neuroblastoma etiology.

Item Open Access
Genome-wide association study of acute kidney injury after coronary bypass graft surgery identifies susceptibility loci. (Kidney Int, 2015-10)
Stafford-Smith, Mark; Li, Yi-Ju; Mathew, Joseph P; Li, Yen-Wei; Ji, Yunqi; Phillips-Bute, Barbara G; Milano, Carmelo A; Newman, Mark F; Kraus, William E; Kertai, Miklos D; Shah, Svati H; Podgoreanu, Mihai V; Duke Perioperative Genetics and Safety Outcomes (PEGASUS) Investigative Team
Acute kidney injury (AKI) is a common, serious complication of cardiac surgery. Since prior studies have supported a genetic basis for postoperative AKI, we conducted a genome-wide association study (GWAS) for AKI following coronary bypass graft (CABG) surgery. The discovery data set consisted of 873 nonemergent CABG surgery patients with cardiopulmonary bypass (PEGASUS), while a replication data set had 380 cardiac surgical patients (CATHGEN). Single-nucleotide polymorphism (SNP) data were based on Illumina Human610-Quad (PEGASUS) and OMNI1-Quad (CATHGEN) BeadChips. We used linear regression with adjustment for a clinical AKI risk score to test SNP associations with the postoperative peak rise relative to preoperative serum creatinine concentration as a quantitative AKI trait. Nine SNPs meeting significance in the discovery set were detected. The rs13317787 in GRM7|LMCD1-AS1 intergenic region (3p21.6) and rs10262995 in BBS9 (7p14.3) were replicated with significance in the CATHGEN data set and exhibited significantly strong overall association following meta-analysis. Additional fine mapping using imputed SNPs across these two regions and meta-analysis found genome-wide significance at the GRM7|LMCD1-AS1 locus and a significantly strong association at BBS9.
Thus, through an unbiased GWAS approach, we found two new loci associated with post-CABG AKI providing new insights into the pathogenesis of perioperative AKI.

Item Open Access
GREAT improves functional interpretation of cis-regulatory regions. (Nature biotechnology, 2010-05-02)
McLean, Cory Y; Bristor, Dave; Hiller, Michael; Clarke, Shoa L; Schaar, Bruce T; Lowe, Craig B; Wenger, Aaron M; Bejerano, Gill
We developed the Genomic Regions Enrichment of Annotations Tool (GREAT) to analyze the functional significance of cis-regulatory regions identified by localized measurements of DNA binding events across an entire genome. Whereas previous methods took into account only binding proximal to genes, GREAT is able to properly incorporate distal binding sites and control for false positives using a binomial test over the input genomic regions. GREAT incorporates annotations from 20 ontologies and is available as a web application. Applying GREAT to data sets from chromatin immunoprecipitation coupled with massively parallel sequencing (ChIP-seq) of multiple transcription-associated factors, including SRF, NRSF, GABP, Stat3 and p300 in different developmental contexts, we recover many functions of these factors that are missed by existing gene-based tools, and we generate testable hypotheses. The utility of GREAT is not limited to ChIP-seq, as it could also be applied to open chromatin, localized epigenomic markers and similar functional data sets, as well as comparative genomics sets.

Item Open Access
Inter-chromosomal level of genome organization and longevity-related phenotypes in humans. (Age (Dordr), 2013-04)
Kulminski, Alexander M; Culminskaya, Irina; Yashin, Anatoli I
Studies focusing on unraveling the genetic origin of health span in humans assume that polygenic, aging-related phenotypes are inherited through Mendelian mechanisms of inheritance of individual genes.
We use the Framingham Heart Study (FHS) data to examine whether non-Mendelian mechanisms of inheritance can drive linkage of loci on non-homologous chromosomes and whether such mechanisms can be relevant to longevity-related phenotypes. We report on genome-wide inter-chromosomal linkage disequilibrium (LD) and on chromosome-wide intra-chromosomal LD and show that these are real phenomena in the FHS data. Genetic analysis of inheritance in families based on Mendelian segregation reveals that the alleles of single nucleotide polymorphisms (SNPs) in LD at loci on non-homologous chromosomes are inherited as a complex resembling haplotypes of a genetic unit. This result implies that the inter-chromosomal LD is likely caused by non-random assortment of non-homologous chromosomes during meiosis. The risk allele haplotypes can be subject to dominant-negative selection primarily through the mechanisms of non-Mendelian inheritance. They can go to extinction within two human generations. The set of SNPs in inter-chromosomal LD (N=68) is nearly threefold enriched, with high significance (p = 1.6 × 10^-9), on non-synonymous coding variants (N=28) compared to the entire qualified set of the studied SNPs. Genes for the tightly linked SNPs are involved in fundamental biological processes in an organism. Survival analyses show that the revealed non-genetic linkage is associated with heritable complex phenotype of premature death. Our results suggest the presence of inter-chromosomal level of functional organization in the human genome and highlight a challenging problem of genomics of human health and aging.

Item Open Access
Interpreting Incidentally Identified Variants in Genes Associated With Catecholaminergic Polymorphic Ventricular Tachycardia in a Large Cohort of Clinical Whole-Exome Genetic Test Referrals. (Circulation. Arrhythmia and electrophysiology, 2017-04)
Landstrom, AP; Dailey-Schwartz, AL; Rosenfeld, JA; Yang, Y; McLean, MJ; Miyake, CY; Valdes, SO; Fan, Y; Allen, HD; Penny, DJ; Kim, JJ
BACKGROUND: The rapid expansion of genetic testing has led to increased utilization of clinical whole-exome sequencing (WES). Clinicians and genetic researchers are being faced with assessing risk of disease vulnerability from incidentally identified genetic variants which is typified by variants found in genes associated with sudden death-predisposing catecholaminergic polymorphic ventricular tachycardia (CPVT). We sought to determine whether incidentally identified variants in genes associated with CPVT from WES clinical testing represent disease-associated biomarkers. METHODS AND RESULTS: CPVT-associated genes RYR2 and CASQ2 variants were identified in one of the world's largest collections of clinical WES referral tests (N=6517, Baylor Miraca Genetics Laboratories) and compared with a control cohort of ostensibly healthy individuals (N=60 706) and a case cohort of CPVT cases (N=155). Within the WES cohort, the rate of rare variants in CPVT-associated genes was 8.8% compared with 6.0% among controls and 60.0% among cases. There was a predominance of variants of undetermined significance (97.7%). After protein topology mapping, WES variants colocalized more frequently to residues with variants found in controls compared with cases. Retrospective clinical evaluation of individuals referred to our institution with WES-positive variants demonstrated no evidence of clinical CPVT in individuals with a low pretest clinical suspicion for CPVT. CONCLUSIONS: The prevalence of incidentally identified CPVT-associated variants is ≈9% among WES tests.
Variants of undetermined significance in CPVT-associated genes in WES genetic testing, in the absence of clinical suspicion for CPVT, are unlikely to represent markers of CPVT pathogenicity.

Item Open Access
Latent factor analysis to discover pathway-associated putative segmental aneuploidies in human cancers. (PLoS Comput Biol, 2010-09-02)
Lucas, Joseph E; Kung, Hsiu-Ni; Chi, Jen-Tsan A
Tumor microenvironmental stresses, such as hypoxia and lactic acidosis, play important roles in tumor progression. Although gene signatures reflecting the influence of these stresses are powerful approaches to link expression with phenotypes, they do not fully reflect the complexity of human cancers. Here, we describe the use of latent factor models to further dissect the stress gene signatures in a breast cancer expression dataset. The genes in these latent factors are coordinately expressed in tumors and depict distinct, interacting components of the biological processes. The genes in several latent factors are highly enriched in chromosomal locations. When these factors are analyzed in independent datasets with gene expression and array CGH data, the expression values of these factors are highly correlated with copy number alterations (CNAs) of the corresponding BAC clones in both the cell lines and tumors. Therefore, variation in the expression of these pathway-associated factors is at least partially caused by variation in gene dosage and CNAs among breast cancers. We have also found the expression of two latent factors without any chromosomal enrichment is highly associated with 12q CNA, likely an instance of "trans"-variations in which CNA leads to the variations in gene expression outside of the CNA region. In addition, we have found that factor 26 (1q CNA) is negatively correlated with HIF-1alpha protein and hypoxia pathways in breast tumors and cell lines.
This agrees with, and for the first time links, known good prognosis associated with both a low hypoxia signature and the presence of CNA in this region. Taken together, these results suggest the possibility that tumor segmental aneuploidy makes significant contributions to variation in the lactic acidosis/hypoxia gene signatures in human cancers and demonstrate that latent factor analysis is a powerful means to uncover such a linkage.

Item Open Access
Next-generation polyploid phylogenetics: rapid resolution of hybrid polyploid complexes using PacBio single-molecule sequencing. (The New phytologist, 2017-01)
Rothfels, CJ; Pryer, KM; Li, F
Difficulties in generating nuclear data for polyploids have impeded phylogenetic study of these groups. We describe a high-throughput protocol and an associated bioinformatics pipeline (Pipeline for Untangling Reticulate Complexes (Purc)) that is able to generate these data quickly and conveniently, and demonstrate its efficacy on accessions from the fern family Cystopteridaceae. We conclude with a demonstration of the downstream utility of these data by inferring a multi-labeled species tree for a subset of our accessions. We amplified four c. 1-kb-long nuclear loci and sequenced them in a parallel-tagged amplicon sequencing approach using the PacBio platform. Purc infers the final sequences from the raw reads via an iterative approach that corrects PCR and sequencing errors and removes PCR-mediated recombinant sequences (chimeras). We generated data for all gene copies (homeologs, paralogs, and segregating alleles) present in each of three sets of 50 mostly polyploid accessions, for four loci, in three PacBio runs (one run per set). From the raw sequencing reads, Purc was able to accurately infer the underlying sequences.
This approach makes it easy and economical to study the phylogenetics of polyploids, and, in conjunction with recent analytical advances, facilitates investigation of broad patterns of polyploid evolution.

Item Open Access
One thousand plant transcriptomes and the phylogenomics of green plants. (Nature, 2019-10-23)
One Thousand Plant Transcriptomes Initiative
Green plants (Viridiplantae) include around 450,000-500,000 species of great diversity and have important roles in terrestrial and aquatic ecosystems. Here, as part of the One Thousand Plant Transcriptomes Initiative, we sequenced the vegetative transcriptomes of 1,124 species that span the diversity of plants in a broad sense (Archaeplastida), including green plants (Viridiplantae), glaucophytes (Glaucophyta) and red algae (Rhodophyta). Our analysis provides a robust phylogenomic framework for examining the evolution of green plants. Most inferred species relationships are well supported across multiple species tree and supermatrix analyses, but discordance among plastid and nuclear gene trees at a few important nodes highlights the complexity of plant genome evolution, including polyploidy, periods of rapid speciation, and extinction. Incomplete sorting of ancestral variation, polyploidization and massive expansions of gene families punctuate the evolutionary history of green plants. Notably, we find that large expansions of gene families preceded the origins of green plants, land plants and vascular plants, whereas whole-genome duplications are inferred to have occurred repeatedly throughout the evolution of flowering plants and ferns.
The increasing availability of high-quality plant genome sequences and advances in functional genomics are enabling research on genome evolution across the green tree of life.

Item Open Access
Screening the human exome: a comparison of whole genome and whole transcriptome sequencing. (Genome Biol, 2010)
Cirulli, Elizabeth T; Singh, Abanish; Shianna, Kevin V; Ge, Dongliang; Smith, Jason P; Maia, Jessica M; Heinzen, Erin L; Goedert, James J; Goldstein, David B; Center for HIV/AIDS Vaccine Immunology (CHAVI)
BACKGROUND: There is considerable interest in the development of methods to efficiently identify all coding variants present in large sample sets of humans. There are three approaches possible: whole-genome sequencing, whole-exome sequencing using exon capture methods, and RNA-Seq. While whole-genome sequencing is the most complete, it remains sufficiently expensive that cost effective alternatives are important. RESULTS: Here we provide a systematic exploration of how well RNA-Seq can identify human coding variants by comparing variants identified through high coverage whole-genome sequencing to those identified by high coverage RNA-Seq in the same individual. This comparison allowed us to directly evaluate the sensitivity and specificity of RNA-Seq in identifying coding variants, and to evaluate how key parameters such as the degree of coverage and the expression levels of genes interact to influence performance. We find that although only 40% of exonic variants identified by whole genome sequencing were captured using RNA-Seq, this number rose to 81% when concentrating on genes known to be well-expressed in the source tissue. We also find that a high false positive rate can be problematic when working with RNA-Seq data, especially at higher levels of coverage.
CONCLUSIONS: We conclude that as long as a tissue relevant to the trait under study is available and suitable quality control screens are implemented, RNA-Seq is a fast and inexpensive alternative approach for finding coding variants in genes with sufficiently high expression levels.

Item Open Access
SNPpy--database management for SNP data from genome wide association studies. (PLoS One, 2011)
Mitha, Faheem; Herodotou, Herodotos; Borisov, Nedyalko; Jiang, Chen; Yoder, Josh; Owzar, Kouros
BACKGROUND: We describe SNPpy, a hybrid script database system using the Python SQLAlchemy library coupled with the PostgreSQL database to manage genotype data from Genome-Wide Association Studies (GWAS). This system makes it possible to merge study data with HapMap data and merge across studies for meta-analyses, including data filtering based on the values of phenotype and Single-Nucleotide Polymorphism (SNP) data. SNPpy and its dependencies are open source software. RESULTS: The current version of SNPpy offers utility functions to import genotype and annotation data from two commercial platforms. We use these to import data from two GWAS studies and the HapMap Project. We then export these individual datasets to standard data format files that can be imported into statistical software for downstream analyses. CONCLUSIONS: By leveraging the power of relational databases, SNPpy offers integrated management and manipulation of genotype and phenotype data from GWAS studies. The analysis of these studies requires merging across GWAS datasets as well as patient and marker selection. To this end, SNPpy enables the user to filter the data and output the results as standardized GWAS file formats. It does low level and flexible data validation, including validation of patient data. SNPpy is a practical and extensible solution for investigators who seek to deploy central management of their GWAS data.
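
Illustrative note on the SNPpy record above: the abstract describes filtering genotype data by phenotype and marker in a relational database and exporting the result to a flat file. The sketch below shows that general idea only. It is not SNPpy's code or schema: it substitutes Python's built-in sqlite3 module for SQLAlchemy and PostgreSQL so that it runs without extra dependencies, and the table names, column names, sample data, and the functions build_demo_db, select_calls, and to_tsv are all hypothetical.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with two hypothetical GWAS tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE phenotype (sample_id TEXT PRIMARY KEY, status TEXT);
        CREATE TABLE genotype (sample_id TEXT, snp_id TEXT, alleles TEXT);
        """
    )
    # Made-up samples and genotype calls, for illustration only.
    conn.executemany(
        "INSERT INTO phenotype VALUES (?, ?)",
        [("S1", "case"), ("S2", "control"), ("S3", "case")],
    )
    conn.executemany(
        "INSERT INTO genotype VALUES (?, ?, ?)",
        [
            ("S1", "rs1", "AA"), ("S1", "rs2", "AG"),
            ("S2", "rs1", "AG"), ("S2", "rs2", "GG"),
            ("S3", "rs1", "GG"), ("S3", "rs2", "AG"),
        ],
    )
    return conn


def select_calls(conn, status, snp_ids):
    """Return (sample_id, snp_id, alleles) rows for samples with the given
    phenotype status, restricted to the listed markers."""
    placeholders = ",".join("?" for _ in snp_ids)
    query = f"""
        SELECT g.sample_id, g.snp_id, g.alleles
        FROM genotype AS g
        JOIN phenotype AS p ON p.sample_id = g.sample_id
        WHERE p.status = ? AND g.snp_id IN ({placeholders})
        ORDER BY g.sample_id, g.snp_id
    """
    return conn.execute(query, [status, *snp_ids]).fetchall()


def to_tsv(rows):
    """Serialize rows as tab-separated text, one row per line."""
    return "\n".join("\t".join(row) for row in rows)


if __name__ == "__main__":
    conn = build_demo_db()
    print(to_tsv(select_calls(conn, "case", ["rs1"])))
```

Keeping patient selection (the phenotype join) and marker selection (the IN list) in a single parameterized query reflects the relational approach the abstract credits; a real deployment would export to whatever standardized GWAS file format its downstream software expects rather than this ad hoc tab-separated layout.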