Browsing by Subject "Data Mining"
Now showing 1 - 7 of 7
Item Open Access
A cross-sectional analysis of HIV and hepatitis C clinical trials 2007 to 2010: the relationship between industry sponsorship and randomized study design. (Trials, 2014-01)
Goswami, Neela D; Tsalik, Ephraim L; Naggie, Susanna; Miller, William C; Horton, John R; Pfeiffer, Christopher D; Hicks, Charles B

Background
The proportion of clinical research sponsored by industry will likely continue to expand as federal funds for academic research decrease, particularly in the fields of HIV/AIDS and hepatitis C (HCV). While HIV and HCV continue to burden the US population, few data exist on how industry sponsorship affects clinical trials involving these infectious diseases. Debate exists about whether pharmaceutical companies undertake more market-driven research practices to promote therapeutics, or instead conduct more rigorous trials than their non-industry counterparts because of increased resources and scrutiny. The ClinicalTrials.gov registry, which allows investigators to fulfill a federal mandate for public trial registration, provides an opportunity for critical evaluation of study designs for industry-sponsored trials, independent of publication status. As part of a large public policy effort, the Clinical Trials Transformation Initiative (CTTI) recently transformed the ClinicalTrials.gov registry into a searchable dataset to facilitate research on clinical trials themselves.
Methods
We conducted a cross-sectional analysis of 477 HIV and HCV drug treatment trials, registered with ClinicalTrials.gov from 1 October 2007 to 27 September 2010, to study the relationship of study sponsorship with randomized study design. The likelihood of using randomization given industry (versus non-industry) sponsorship was reported with prevalence ratios (PR). PRs were estimated using crude and stratified tabular analysis and Poisson regression adjusting for presence of a data monitoring committee, enrollment size, study phase, number of study sites, inclusion of foreign study sites, exclusion of persons older than age 65, and disease condition.
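The Methods name the estimator directly: prevalence ratios from crude tabular analysis and from covariate-adjusted Poisson regression with robust standard errors. Below is a minimal sketch of that calculation, not the authors' code; the file name and column names (`randomized`, `industry`, `dmc`, and so on) are hypothetical stand-ins for the registry-derived fields.

```python
# Sketch: crude and adjusted prevalence ratios (PR) for a binary
# outcome. Column names are hypothetical stand-ins for the
# ClinicalTrials.gov fields described in the Methods.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

trials = pd.read_csv("hiv_hcv_trials.csv")  # hypothetical input file

# Crude PR: P(randomized | industry) / P(randomized | non-industry)
p_ind = trials.loc[trials.industry == 1, "randomized"].mean()
p_non = trials.loc[trials.industry == 0, "randomized"].mean()
print(f"crude PR = {p_ind / p_non:.2f}")

# Adjusted PR: exp(coefficient) from a Poisson model with robust
# (sandwich) standard errors, adjusting for the listed covariates.
fit = smf.glm(
    "randomized ~ industry + dmc + enrollment + C(phase)"
    " + n_sites + foreign_sites + excludes_over_65 + C(condition)",
    data=trials,
    family=sm.families.Poisson(),
).fit(cov_type="HC0")
ci = np.exp(fit.conf_int().loc["industry"])
print(f"adjusted PR = {np.exp(fit.params['industry']):.2f} "
      f"(95% CI {ci[0]:.2f}, {ci[1]:.2f})")
```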
Results
The crude PR was 1.17 (95% CI 0.94, 1.45). Adjusted Poisson models produced a PR of 1.13 (95% CI 0.82, 1.56). There was a trend toward mild effect measure modification by study phase, but this was not statistically significant. In stratified tabular analysis the adjusted PR was 1.14 (95% CI 0.78, 1.68) among phase 2/3 trials and 1.06 (95% CI 0.50, 2.22) among phase 4 trials.
Conclusions
No significant relationship was found between industry sponsorship and the use of randomization in trial design in this cross-sectional study. Prospective studies evaluating other aspects of trial design may shed further light on the relationship between industry sponsorship and appropriate trial methodology.

Item Open Access
Acquisition, Analysis, and Sharing of Data in 2015 and Beyond: A Survey of the Landscape: A Conference Report From the American Heart Association Data Summit 2015. (J Am Heart Assoc, 2015-11-05)
Antman, Elliott M; Benjamin, Emelia J; Harrington, Robert A; Houser, Steven R; Peterson, Eric D; Bauman, Mary Ann; Brown, Nancy; Bufalino, Vincent; Califf, Robert M; Creager, Mark A; Daugherty, Alan; Demets, David L; Dennis, Bernard P; Ebadollahi, Shahram; Jessup, Mariell; Lauer, Michael S; Lo, Bernard; MacRae, Calum A; McConnell, Michael V; McCray, Alexa T; Mello, Michelle M; Mueller, Eric; Newburger, Jane W; Okun, Sally; Packer, Milton; Philippakis, Anthony; Ping, Peipei; Prasoon, Prad; Roger, Véronique L; Singer, Steve; Temple, Robert; Turner, Melanie B; Vigilante, Kevin; Warner, John; Wayte, Patrick; American Heart Association Data Sharing Summit Attendees

BACKGROUND: A 1.5-day interactive forum was convened to discuss critical issues in the acquisition, analysis, and sharing of data in the field of cardiovascular and stroke science. The discussion will serve as the foundation for the American Heart Association's (AHA's) near-term and future strategies in the Big Data area. The concepts evolving from this forum may also inform other fields of medicine and science. METHODS AND RESULTS: A total of 47 participants representing stakeholders from 7 domains (patients, basic scientists, clinical investigators, population researchers, clinicians and healthcare system administrators, industry, and regulatory authorities) participated in the conference. Presentation topics included updates on data as viewed from conventional medical and nonmedical sources, building and using Big Data repositories, articulation of the goals of data sharing, and principles of responsible data sharing. Facilitated breakout sessions examined what each of the 7 stakeholder domains wants from Big Data under ideal circumstances and the possible roles that the AHA might play in meeting their needs. High-priority areas for further study regarding Big Data include the methodology for acquiring and analyzing findings, validation of the veracity of discoveries from such research, and the integration of those discoveries into the investigative and clinical care aspects of future cardiovascular and stroke medicine. Potential roles that the AHA might consider include facilitating a standards discussion (eg, tools, methodology, and appropriate data use), providing education (eg, healthcare providers, patients, investigators), and helping build an interoperable digital ecosystem in cardiovascular and stroke science. CONCLUSION: There was a consensus across stakeholder domains that Big Data holds great promise for revolutionizing the way cardiovascular and stroke research is conducted and clinical care is delivered; however, there is a clear need to create a vision of how to use it to achieve the desired goals. Potential roles for the AHA center around facilitating a discussion of standards, providing education, and helping establish a cardiovascular digital ecosystem.
This ecosystem should be interoperable and needs to interface with the rapidly growing digital object environment of the modern-day healthcare system.

Item Open Access
Annotation of phenotypes using ontologies: a gold standard for the training and evaluation of natural language processing systems. (Database: the journal of biological databases and curation, 2018-01)
Dahdul, Wasila; Manda, Prashanti; Cui, Hong; Balhoff, James P; Dececchi, T Alexander; Ibrahim, Nizar; Lapp, Hilmar; Vision, Todd; Mabee, Paula M

Natural language descriptions of organismal phenotypes, a principal object of study in biology, are abundant in the biological literature. Expressing these phenotypes as logical statements using ontologies would enable large-scale analysis of phenotypic information from diverse systems. However, considerable human effort is required to make these phenotype descriptions amenable to machine reasoning. Natural language processing tools have been developed to facilitate this task, and the training and evaluation of these tools depend on the availability of high-quality, manually annotated gold standard data sets. We describe the development of an expert-curated gold standard data set of annotated phenotypes for evolutionary biology. The gold standard was developed for the curation of complex comparative phenotypes for the Phenoscape project. It was created by consensus among three curators and consists of entity-quality expressions of varying complexity. We use the gold standard to evaluate annotations created by human curators and those generated by the Semantic CharaParser tool. Using four annotation accuracy metrics that can account for any level of relationship between terms from two phenotype annotations, we found that machine-human consistency, or similarity, was significantly lower than inter-curator (human-human) consistency. Surprisingly, allowing curators access to external information did not significantly increase the similarity of their annotations to the gold standard or have a significant effect on inter-curator consistency. We found that the similarity of machine annotations to the gold standard increased after new relevant ontology terms had been added. Evaluation by the original authors of the character descriptions indicated that the gold standard annotations came closer to representing their intended meaning than did either the curator or machine annotations. These findings point toward ways to better design software to augment human curators, and the use of the gold standard corpus will allow training and assessment of new tools to improve phenotype annotation accuracy at scale.
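The four accuracy metrics are not defined in the abstract, so the sketch below is only a generic illustration of partial-credit scoring between ontology terms: Jaccard similarity over is-a ancestor closures. The tiny parent map is an invented toy standing in for a real anatomy ontology such as Uberon.

```python
# Sketch: partial-credit similarity between two ontology term
# annotations, scored as Jaccard overlap of their ancestor sets.
# The parent map below is a hypothetical toy hierarchy.
def ancestors(term, parents):
    """Return {term} plus all of its ancestors in the parent map."""
    seen = {term}
    stack = [term]
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def jaccard(term_a, term_b, parents):
    a, b = ancestors(term_a, parents), ancestors(term_b, parents)
    return len(a & b) / len(a | b)

parents = {  # toy is-a hierarchy
    "basihyal bone": ["hyoid bone"],
    "hyoid bone": ["bone element"],
    "bone element": ["skeletal element"],
}
# An exact match scores 1.0; a near-miss (parent term) scores
# partial credit instead of zero.
print(jaccard("basihyal bone", "hyoid bone", parents))  # 0.75
```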
Item Open Access
Automatic identification of variables in epidemiological datasets using logic regression. (BMC medical informatics and decision making, 2017-04)
Lorenz, Matthias W; Abdi, Negin Ashtiani; Scheckenbach, Frank; Pflug, Anja; Bülbül, Alpaslan; Catapano, Alberico L; Agewall, Stefan; Ezhov, Marat; Bots, Michiel L; Kiechl, Stefan; Orth, Andreas; PROG-IMT study group

Background
For an individual participant data (IPD) meta-analysis, multiple datasets must be transformed into a consistent format, e.g. using uniform variable names. When large numbers of datasets have to be processed, this can be a time-consuming and error-prone task. Automated or semi-automated identification of variables can help to reduce the workload and improve data quality. For semi-automation, high sensitivity in recognizing matching variables is particularly important: it allows software to present, for each target variable, a choice of candidate source variables from which a user picks the matching one, with only a low risk that a correct source variable has been missed.
Methods
For each variable in a set of target variables, a number of simple rules were manually created. With logic regression, an optimal Boolean combination of these rules was sought for every target variable, using a random subset of a large database of epidemiological and clinical cohort data (construction subset). These optimal rule combinations were then validated in a second subset of the database (validation subset).
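The study fits logic regression proper (Boolean trees searched by simulated annealing, as in the R package LogicReg); the toy sketch below merely illustrates the idea by exhaustively scoring one- and two-rule Boolean combinations against known labels. All rule names and variable metadata are invented.

```python
# Toy stand-in for logic regression: exhaustively score AND/OR
# combinations of simple, hand-written matching rules for one
# hypothetical target variable (systolic blood pressure).
from itertools import combinations

rules = {
    "name_has_sys": lambda v: "sys" in v["name"].lower(),
    "unit_is_mmhg": lambda v: v.get("unit") == "mmHg",
    "label_has_pressure": lambda v: "pressure" in v.get("label", "").lower(),
}

def candidates(rules):
    # single rules, then every two-rule AND/OR combination
    yield from rules.items()
    for (na, ra), (nb, rb) in combinations(rules.items(), 2):
        yield f"({na} AND {nb})", lambda v, ra=ra, rb=rb: ra(v) and rb(v)
        yield f"({na} OR {nb})", lambda v, ra=ra, rb=rb: ra(v) or rb(v)

def best_rule(variables, labels, rules):
    def accuracy(rule):
        return sum(rule(v) == y for v, y in zip(variables, labels)) / len(labels)
    return max(((name, accuracy(r)) for name, r in candidates(rules)),
               key=lambda t: t[1])

variables = [  # invented source-variable metadata
    {"name": "SYS_BP",   "unit": "mmHg", "label": "systolic pressure"},
    {"name": "sys_date", "unit": "",     "label": "system date"},
    {"name": "BP_S",     "unit": "mmHg", "label": "blood pressure systolic"},
]
labels = [True, False, True]  # is this variable the target?
print(best_rule(variables, labels, rules))  # ('unit_is_mmhg', 1.0)
```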
Results
In the construction sample, the 41 target variables were allocated with an average positive predictive value (PPV) of 34% and a negative predictive value (NPV) of 95%. In the validation sample, PPV was 33%, whereas NPV remained at 94%. PPV was 50% or less for 63% of all variables in the construction sample and for 71% in the validation sample.
Conclusions
We demonstrated that the application of logic regression to a complex data management task in large epidemiological IPD meta-analyses is feasible. However, the performance of the algorithm is poor, which may require backup strategies.

Item Open Access
GREAT improves functional interpretation of cis-regulatory regions. (Nature biotechnology, 2010-05-02)
McLean, Cory Y; Bristor, Dave; Hiller, Michael; Clarke, Shoa L; Schaar, Bruce T; Lowe, Craig B; Wenger, Aaron M; Bejerano, Gill

We developed the Genomic Regions Enrichment of Annotations Tool (GREAT) to analyze the functional significance of cis-regulatory regions identified by localized measurements of DNA binding events across an entire genome. Whereas previous methods took into account only binding proximal to genes, GREAT is able to properly incorporate distal binding sites and control for false positives using a binomial test over the input genomic regions. GREAT incorporates annotations from 20 ontologies and is available as a web application. Applying GREAT to data sets from chromatin immunoprecipitation coupled with massively parallel sequencing (ChIP-seq) of multiple transcription-associated factors, including SRF, NRSF, GABP, Stat3 and p300 in different developmental contexts, we recover many functions of these factors that are missed by existing gene-based tools, and we generate testable hypotheses. The utility of GREAT is not limited to ChIP-seq: it can also be applied to open chromatin, localized epigenomic markers and similar functional data sets, as well as comparative genomics sets.
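The abstract states GREAT's core statistic: a binomial test over the input genomic regions. A minimal sketch of that test, with made-up numbers in place of real annotation coverage:

```python
# Sketch: GREAT-style binomial test. If a fraction f of the genome
# lies in the regulatory domains of genes annotated with some
# ontology term, and k of n input regions fall in those domains,
# the enrichment p-value is P(X >= k) for X ~ Binomial(n, f).
# All numbers below are made up for illustration.
from scipy.stats import binom

n = 5000    # input genomic regions (e.g. ChIP-seq peaks)
f = 0.012   # fraction of genome covered by the term's regulatory domains
k = 97      # input regions landing in those domains

p_value = binom.sf(k - 1, n, f)  # survival function gives P(X >= k)
print(f"expected {n * f:.1f} hits, observed {k}, p = {p_value:.3g}")
```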
Item Open Access
Moving the mountain: analysis of the effort required to transform comparative anatomy into computable anatomy. (Database: the journal of biological databases and curation, 2015-01)
Dahdul, Wasila; Dececchi, T Alexander; Ibrahim, Nizar; Lapp, Hilmar; Mabee, Paula

The diverse phenotypes of living organisms have been described for centuries, and though they may be digitized, they are not readily available in a computable form. Using over 100 morphological studies, the Phenoscape project has demonstrated that by annotating characters with community ontology terms, links can be made between novel species anatomy and the genes that may underlie it. But given the enormity of the legacy literature, how can this largely unexploited wealth of descriptive data be rendered amenable to large-scale computation? To identify the bottlenecks, we quantified the time involved in the major aspects of phenotype curation as we annotated characters from the vertebrate phylogenetic systematics literature. This involves attaching fully computable logical expressions consisting of ontology terms to the descriptions in character-by-taxon matrices. The workflow consists of: (i) data preparation, (ii) phenotype annotation, (iii) ontology development and (iv) curation team discussions and software development feedback. Our results showed that the completion of this work required two person-years by a team of two post-docs, a lead data curator, and students. Manual data preparation required close to 13% of the effort. This part in particular could be reduced substantially with better community data practices, such as depositing fully populated matrices in public repositories. Phenotype annotation required ∼40% of the effort. We are working to make this more efficient with Natural Language Processing tools. Ontology development (40%), however, remains a highly manual task requiring domain (anatomical) expertise and use of specialized software. The large overhead required for data preparation and ontology development contributed to a low annotation rate of approximately two characters per hour, compared with 14 characters per hour when activity was restricted to character annotation. Unlocking the potential of the vast stores of morphological descriptions requires better tools for efficiently processing natural language, and better community practices towards a born-digital morphology. Database URL: http://kb.phenoscape.org

Item Open Access
SplicerAV: a tool for mining microarray expression data for changes in RNA processing. (BMC Bioinformatics, 2010-02-25)
Robinson, Timothy J; Dinan, Michaela A; Dewhirst, Mark; Garcia-Blanco, Mariano A; Pearson, James L

BACKGROUND: Over the past two decades more than fifty thousand unique clinical and biological samples have been assayed using the Affymetrix HG-U133 and HG-U95 GeneChip microarray platforms. This substantial repository has been used extensively to characterize changes in gene expression between biological samples, but has not been previously mined en masse for changes in mRNA processing. We explored the possibility of using HG-U133 microarray data to identify changes in alternative mRNA processing in several available archival datasets. RESULTS: Data from these and other gene expression microarrays can now be mined for changes in transcript isoform abundance using a program described here, SplicerAV. Using in vivo and in vitro breast cancer microarray datasets, SplicerAV was able to perform both gene and isoform specific expression profiling within the same microarray dataset. Our reanalysis of Affymetrix U133 plus 2.0 data generated by in vitro over-expression of HRAS, E2F3, beta-catenin (CTNNB1), SRC, and MYC identified several hundred oncogene-induced mRNA isoform changes, one of which revealed a previously unknown mechanism of EGFR family activation. Using clinical data, SplicerAV predicted 241 isoform changes between low and high grade breast tumors, with changes enriched among genes coding for guanyl-nucleotide exchange factors, metalloprotease inhibitors, and mRNA processing factors. Isoform changes in 15 genes were associated with aggressive cancer across the three breast cancer datasets. CONCLUSIONS: Using SplicerAV, we identified several hundred previously uncharacterized isoform changes induced by in vitro oncogene over-expression and revealed a previously unknown mechanism of EGFR activation in human mammary epithelial cells. We analyzed Affymetrix GeneChip data from over 400 human breast tumors in three independent studies, making this the largest clinical dataset analyzed for en masse changes in alternative mRNA processing. The capacity to detect RNA isoform changes in archival microarray data using SplicerAV allowed us to carry out the first analysis of isoform specific mRNA changes directly associated with cancer survival.
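The abstract does not spell out SplicerAV's scoring, so the sketch below illustrates the general idea with a different, generic technique: a splicing-index-style comparison of each probe set's signal, relative to its gene's mean, between two sample groups. A probe set that shifts relative to its gene suggests differential isoform usage. The data here are simulated.

```python
# Sketch: splicing-index-style detection of isoform change on
# simulated data (a generic illustration, not SplicerAV's method).
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
# log2 expression: rows = probe sets of one gene, cols = samples
group_a = rng.normal(8.0, 0.3, size=(3, 10))
group_b = rng.normal(8.0, 0.3, size=(3, 10))
group_b[2] += 1.5  # simulate one probe set (isoform) up in group B

def splicing_index(g):
    # each probe set's signal relative to the gene's overall signal
    return g - g.mean(axis=0)

si_a, si_b = splicing_index(group_a), splicing_index(group_b)
t, p = ttest_ind(si_a, si_b, axis=1)  # per-probe-set group comparison
for i, (ti, pi) in enumerate(zip(t, p)):
    print(f"probe set {i}: t = {ti:+.2f}, p = {pi:.3g}")
```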