Data curation of a findable, accessible, interoperable, reusable polymer nanocomposites data resource - MaterialsMine

Date

2022

Abstract

A polymer nanocomposite (PNC) is a composite material consisting of a polymer matrix and stiff fillers with at least one dimension smaller than 100 nm. With the addition of a small amount of filler to the polymer matrix, a PNC demonstrates substantial enhancement of mechanical, viscoelastic, dielectric, thermal, optical, and other physicochemical properties compared to the pure polymer or the pure fillers alone. PNCs have thus attracted significant research interest in recent years. To accelerate materials design, we need findable, accessible, interoperable, and reusable (FAIR) data resources that provide sufficient data for data-driven approaches to replace the traditional trial-and-error style of laboratory exploration. With the goal of building a FAIR data resource for the PNC community, we built NanoMine in 2016, which later evolved into MaterialsMine with the addition of MetaMine for the metamaterial domain. To be FAIR, a resource needs a clear and extensible data representation that enables interoperable knowledge exchange. We thus designed the NanoMine XML schema. With the data framework and data representation in place, we still need tools and a user-friendly interface for data curation. This dissertation describes in detail the tools and data interfaces we developed to ensure a smooth data curation pathway for NanoMine/MaterialsMine. To reduce and prevent curation errors and thus improve data quality, we need data validation mechanisms. To address this need, we discuss the validation mechanisms embedded both during and after curation. On many occasions, even without human-caused curation errors, the data resource cannot perform to its full capacity due to data inconsistencies. Examples include the inconsistent indexing of polymers caused by the lack of uniformity in how polymer names are expressed, and the inconsistent use of mass fraction and volume fraction in specifying the composite composition.
To address the need for data standardization, this dissertation discusses in detail tools developed to bypass manual curation, the mass fraction – volume fraction conversion agent, and ChemProps, a RESTful API-enabled, multi-algorithm polymer/filler name mapping methodology. To create truly powerful and transformative materials design paradigms, and to move toward a sustainable future for MaterialsMine, we need to harness the power of AI to efficiently extract a significant set of data from the published, archival literature. Natural Language Processing (NLP) offers an opportunity to make these data accessible and readily reusable by humans and machines. The first step is to generate a sample list from which curators can easily find the number of samples, their compositions, and the properties reported in the paper. The task is handled in a pretraining-finetuning fashion. Downstream tasks include Named Entity Recognition (NER), which detects sample codes, sample compositions, properties, and group references to samples in the captions, and Relation Extraction (RE), which predicts the relations between pairs of detected named entities. This dissertation provides a detailed discussion of how the two corpora for pretraining and finetuning are constructed. A T5-base model pretrained on the caption-mention corpus and finetuned for the NER and RE tasks is proposed. We evaluated it along with an array of BERT-based models and seq2seq models for potential applications in a semi-automated curation pipeline for MaterialsMine.
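The mass fraction – volume fraction inconsistency mentioned in the abstract can be illustrated with the standard two-phase conversion. The following is a minimal sketch, not the dissertation's conversion agent: it assumes a single filler in a single matrix, additive volumes with no voids, and known densities in the same units. The function names and the density values in the usage lines are hypothetical and chosen only for illustration.

```python
def mass_to_volume_fraction(mass_fraction, filler_density, matrix_density):
    """Convert filler mass fraction to filler volume fraction.

    Assumes a two-phase composite (one filler, one matrix), additive
    volumes, and densities given in the same units.
    """
    if not 0.0 <= mass_fraction <= 1.0:
        raise ValueError("mass_fraction must be between 0 and 1")
    if filler_density <= 0 or matrix_density <= 0:
        raise ValueError("densities must be positive")
    # Volume of each phase per unit total mass.
    filler_volume = mass_fraction / filler_density
    matrix_volume = (1.0 - mass_fraction) / matrix_density
    return filler_volume / (filler_volume + matrix_volume)


def volume_to_mass_fraction(volume_fraction, filler_density, matrix_density):
    """Convert filler volume fraction to filler mass fraction (inverse of the above)."""
    if not 0.0 <= volume_fraction <= 1.0:
        raise ValueError("volume_fraction must be between 0 and 1")
    if filler_density <= 0 or matrix_density <= 0:
        raise ValueError("densities must be positive")
    # Mass of each phase per unit total volume.
    filler_mass = volume_fraction * filler_density
    matrix_mass = (1.0 - volume_fraction) * matrix_density
    return filler_mass / (filler_mass + matrix_mass)


# Hypothetical densities (arbitrary but consistent units), for illustration only.
phi = mass_to_volume_fraction(0.5, filler_density=2.0, matrix_density=1.0)
w = volume_to_mass_fraction(phi, filler_density=2.0, matrix_density=1.0)
```

Because the two fractions coincide only when the filler and matrix densities are equal, a resource that mixes them without conversion cannot compare compositions across papers directly, which is the kind of inconsistency the abstract describes.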

Subjects

Materials Science, data curation, Materials informatics, MaterialsMine, Natural language processing

Citation

Hu, Bingyin (2022). Data curation of a findable, accessible, interoperable, reusable polymer nanocomposites data resource - MaterialsMine. Dissertation, Duke University. Retrieved from https://hdl.handle.net/10161/26880.

Except where otherwise noted, student scholarship that was shared on DukeSpace after 2009 is made available to the public under a Creative Commons Attribution / Non-commercial / No derivatives (CC-BY-NC-ND) license. All rights in student work shared on DukeSpace before 2009 remain with the author and/or their designee, whose permission may be required for reuse.