Browsing by Subject "Natural Language Processing"
Now showing 1 - 4 of 4
Item Open Access
Annotation of phenotypes using ontologies: a gold standard for the training and evaluation of natural language processing systems.
(Database : the journal of biological databases and curation, 2018-01)
Dahdul, Wasila; Manda, Prashanti; Cui, Hong; Balhoff, James P; Dececchi, T Alexander; Ibrahim, Nizar; Lapp, Hilmar; Vision, Todd; Mabee, Paula M
Natural language descriptions of organismal phenotypes, a principal object of study in biology, are abundant in the biological literature. Expressing these phenotypes as logical statements using ontologies would enable large-scale analysis on phenotypic information from diverse systems. However, considerable human effort is required to make these phenotype descriptions amenable to machine reasoning. Natural language processing tools have been developed to facilitate this task, and the training and evaluation of these tools depend on the availability of high quality, manually annotated gold standard data sets. We describe the development of an expert-curated gold standard data set of annotated phenotypes for evolutionary biology. The gold standard was developed for the curation of complex comparative phenotypes for the Phenoscape project. It was created by consensus among three curators and consists of entity-quality expressions of varying complexity. We use the gold standard to evaluate annotations created by human curators and those generated by the Semantic CharaParser tool. Using four annotation accuracy metrics that can account for any level of relationship between terms from two phenotype annotations, we found that machine-human consistency, or similarity, was significantly lower than inter-curator (human-human) consistency. Surprisingly, allowing curators access to external information did not significantly increase the similarity of their annotations to the gold standard or have a significant effect on inter-curator consistency.
We found that the similarity of machine annotations to the gold standard increased after new relevant ontology terms had been added. Evaluation by the original authors of the character descriptions indicated that the gold standard annotations came closer to representing their intended meaning than did either the curator or machine annotations. These findings point toward ways to better design software to augment human curators and the use of the gold standard corpus will allow training and assessment of new tools to improve phenotype annotation accuracy at scale.

Item Open Access
Moving the mountain: analysis of the effort required to transform comparative anatomy into computable anatomy.
(Database : the journal of biological databases and curation, 2015-01)
Dahdul, Wasila; Dececchi, T Alexander; Ibrahim, Nizar; Lapp, Hilmar; Mabee, Paula
The diverse phenotypes of living organisms have been described for centuries, and though they may be digitized, they are not readily available in a computable form. Using over 100 morphological studies, the Phenoscape project has demonstrated that by annotating characters with community ontology terms, links between novel species anatomy and the genes that may underlie them can be made. But given the enormity of the legacy literature, how can this largely unexploited wealth of descriptive data be rendered amenable to large-scale computation? To identify the bottlenecks, we quantified the time involved in the major aspects of phenotype curation as we annotated characters from the vertebrate phylogenetic systematics literature. This involves attaching fully computable logical expressions consisting of ontology terms to the descriptions in character-by-taxon matrices. The workflow consists of: (i) data preparation, (ii) phenotype annotation, (iii) ontology development and (iv) curation team discussions and software development feedback.
Our results showed that the completion of this work required two person-years by a team of two post-docs, a lead data curator, and students. Manual data preparation required close to 13% of the effort. This part in particular could be reduced substantially with better community data practices, such as depositing fully populated matrices in public repositories. Phenotype annotation required ∼40% of the effort. We are working to make this more efficient with Natural Language Processing tools. Ontology development (40%), however, remains a highly manual task requiring domain (anatomical) expertise and use of specialized software. The large overhead required for data preparation and ontology development contributed to a low annotation rate of approximately two characters per hour, compared with 14 characters per hour when activity was restricted to character annotation. Unlocking the potential of the vast stores of morphological descriptions requires better tools for efficiently processing natural language, and better community practices towards a born-digital morphology.
Database URL: http://kb.phenoscape.org

Item Open Access
Prediction of Bitcoin prices using Twitter Data and Natural Language Processing
(2021-12-16)
Wong, Eugene Lu Xian
Social media platforms like Twitter have long been perceived as a bellwether of Bitcoin prices. This paper investigates whether tweets can be modeled using two different approaches, namely the Naïve Bayes and LSTM models, to compute sentiment scores in order to predict the Bitcoin price signal.
Through the experiments conducted, the LSTM model indicates some degree of predictive advantage compared to the Naïve Bayes model.

Item Open Access
The Press and Peace
(2024-05-10)
Bussey, Jakobe
This study utilizes state-of-the-art BERT (Bidirectional Encoder Representations from Transformers) models to perform sentiment analysis on Wall Street Journal and New York Times articles about the Iraq War published between 2002 and 2012 and further categorize them using advanced unsupervised machine learning techniques. By utilizing statistical analysis and quartic regression models, this paper concludes that the two newspapers report on the Iraq War differently, with both exhibiting a predominantly negative-neutral tone overall. Additionally, the analysis reveals significant fluctuations in negativity from both outlets over time as the war progresses. Furthermore, this study examines the objectivity of reporting between editorial and non-editorial articles, finding that non-editorials tend to report more objectively, and the neutrality of editorials remains relatively constant while the objectivity of non-editorials fluctuates in response to war events. Finally, the paper investigates variations in sentiment across different topics, uncovering substantial variations in positive, neutral, and negative sentiments across topics and their evolution over time.
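
Illustrative note on the quartic regression mentioned in the abstract of "The Press and Peace": the listing does not give the study's data or fitting procedure, so the following is only a minimal sketch of what fitting a degree-4 polynomial to a sentiment time series could look like. The variable names, the time axis (years since 2002), and the synthetic negativity values are all assumptions made for illustration and are not results from the study.

```python
import numpy as np


def fit_quartic(t, y):
    """Least-squares fit of a degree-4 polynomial to (t, y).

    Returns the five coefficients, highest degree first.
    """
    return np.polyfit(t, y, deg=4)


# Hypothetical time axis: years since 2002, covering 2002 through 2012.
t = np.arange(0, 11, dtype=float)

# Synthetic "share of negative articles" generated from a known quartic,
# used only so the fit can be checked; these are NOT the study's numbers.
true_coeffs = np.array([0.0002, -0.004, 0.02, 0.01, 0.40])
y = np.polyval(true_coeffs, t)

coeffs = fit_quartic(t, y)
fitted = np.polyval(coeffs, t)  # fitted curve evaluated at the same years
```

With real data, `y` would be replaced by the per-period sentiment measure, and the fitted curve could be compared between outlets. A quartic has up to three turning points, which is one reason it may suit a series that rises and falls more than once.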