Moving the mountain: analysis of the effort required to transform comparative anatomy into computable anatomy.
Abstract
The diverse phenotypes of living organisms have been described for centuries, and
though they may be digitized, they are not readily available in a computable form.
Using over 100 morphological studies, the Phenoscape project has demonstrated that
by annotating characters with community ontology terms, links between novel species
anatomy and the genes that may underlie them can be made. But given the enormity of
the legacy literature, how can this largely unexploited wealth of descriptive data
be rendered amenable to large-scale computation? To identify the bottlenecks, we quantified
the time involved in the major aspects of phenotype curation as we annotated characters
from the vertebrate phylogenetic systematics literature. This involves attaching fully
computable logical expressions consisting of ontology terms to the descriptions in
character-by-taxon matrices. The workflow consists of: (i) data preparation, (ii)
phenotype annotation, (iii) ontology development and (iv) curation team discussions
and software development feedback. Our results showed that the completion of this
work required two person-years by a team of two post-docs, a lead data curator, and
students. Manual data preparation required close to 13% of the effort. This part in
particular could be reduced substantially with better community data practices, such
as depositing fully populated matrices in public repositories. Phenotype annotation
required ∼40% of the effort. We are working to make this more efficient with Natural
Language Processing tools. Ontology development (40%), however, remains a highly manual
task requiring domain (anatomical) expertise and use of specialized software. The
large overhead required for data preparation and ontology development contributed
to a low annotation rate of approximately two characters per hour, compared with 14
characters per hour when activity was restricted to character annotation. Unlocking
the potential of the vast stores of morphological descriptions requires better tools
for efficiently processing natural language, and better community practices towards
a born-digital morphology.
Database URL: http://kb.phenoscape.org
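The effort figures quoted in the abstract can be tied together with a little arithmetic. The following is a minimal sketch, not part of the original article: the variable names and the `hours_for` helper are hypothetical, the percentages are the rounded values given above, and attributing the unitemized remainder to team discussions and software feedback is an inference from the four-step workflow rather than a figure the abstract states.

```python
# Effort shares reported in the abstract, in percent of total effort.
# These are rounded values ("close to 13%", "~40%"), so results are approximate.
effort_share = {
    "data preparation": 13,
    "phenotype annotation": 40,
    "ontology development": 40,
}

# Assumption: the share not itemized in the abstract corresponds to the
# fourth workflow step (curation team discussions and software feedback).
remaining_share = 100 - sum(effort_share.values())

# Annotation rates reported in the abstract, in characters per hour.
overall_rate = 2           # all workflow activities included
annotation_only_rate = 14  # activity restricted to character annotation

# How many times slower the full workflow is than annotation alone.
slowdown = annotation_only_rate / overall_rate


def hours_for(n_characters, rate_per_hour):
    """Hypothetical helper: hours needed to annotate n_characters at a given rate."""
    return n_characters / rate_per_hour


if __name__ == "__main__":
    print(f"Unitemized effort share: about {remaining_share}%")
    print(f"Full workflow is about {slowdown:.0f}x slower than annotation alone")
    print(f"100 characters, full workflow: {hours_for(100, overall_rate):.0f} h")
    print(f"100 characters, annotation only: {hours_for(100, annotation_only_rate):.1f} h")
```

Read this way, the overhead of data preparation and ontology development, rather than annotation itself, accounts for most of the gap between the two reported rates.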
Type
Journal article
Subject
Animals
Humans
Anatomy, Comparative
Natural Language Processing
Databases, Factual
Data Mining
Biological Ontologies
Data Curation
Permalink
https://hdl.handle.net/10161/26581
Published Version (Please cite this version)
10.1093/database/bav040
Publication Info
Dahdul, Wasila; Dececchi, T Alexander; Ibrahim, Nizar; Lapp, Hilmar; & Mabee, Paula (2015). Moving the mountain: analysis of the effort required to transform comparative anatomy into computable anatomy. Database: the journal of biological databases and curation, 2015. pp. bav040. 10.1093/database/bav040. Retrieved from https://hdl.handle.net/10161/26581.
This is constructed from limited available data and may be imprecise. To cite this article, please review & use the official citation provided by the journal.
Scholars@Duke
Hilmar Lapp
Dir, IT
