Machine Learning to Estimate Exposure and Effects of Emerging Chemicals and Other Consumer Product Ingredients

Chemicals in consumer products can influence our risk of developing adverse health conditions. This research addresses knowledge gaps in our ability to evaluate chemical safety, particularly for emerging substances on the market. To address the need for more high-throughput exposure and hazard models to support risk assessment, this work tested computational frameworks that leverage machine learning strategies and "big data" from public databases and mass social data sources.

First, to understand consumer exposure, we require a better understanding of ingredient concentrations in products. A computational framework was developed to estimate chemical weight fractions for consumer products containing emerging substances. Nanomaterial-enabled products were used as a case study to represent such substances with limited physicochemical property data. Feature variables included chemical properties, functional use categories (e.g., antimicrobial), the type of product, and its matrix. Weight fractions were classified as low, medium, or high using a random forest or nonlinear support vector classifier. Performance of the machine learning models was qualitatively compared with that of models from a second framework trained on data-rich, bulk-scale organic chemical product data. The models stratified material-product observations into weight fraction bins with moderate success: the best model achieved an average balanced accuracy of 73% on nanomaterial product data. Chemical functional use features served as particularly insightful predictors, suggesting that functional use data may be useful in evaluating the safety and sustainability of emerging chemicals. Continued investment in chemical and product data collection could further improve such machine learning models.
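
As a rough illustration of the evaluation metric named above, the following Python sketch bins weight fractions into low, medium, and high classes and computes balanced accuracy, which is the mean of the per-class recalls. The bin cutoffs (0.01 and 0.10) and the function names are assumptions made for this example only; the abstract does not state the cutoffs used in the dissertation.

```python
from collections import defaultdict


def bin_weight_fraction(wf, low_cut=0.01, high_cut=0.10):
    """Map a weight fraction in [0, 1] to 'low', 'medium', or 'high'.

    The cutoffs are hypothetical placeholders, not values from the source.
    """
    if not 0.0 <= wf <= 1.0:
        raise ValueError("weight fraction must lie in [0, 1]")
    if wf < low_cut:
        return "low"
    if wf < high_cut:
        return "medium"
    return "high"


def balanced_accuracy(y_true, y_pred):
    """Return the mean per-class recall over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Illustrative labels only; these are not data from the study.
y_true = ["low", "low", "medium", "high"]
y_pred = ["low", "medium", "medium", "high"]
score = balanced_accuracy(y_true, y_pred)
```

Because each class contributes equally to the average, balanced accuracy is less flattering than plain accuracy when one weight fraction bin dominates the data, which is a common situation for product composition data.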

Shifting focus to the impact of chemicals on consumers, data on personal care products, ingredients, and customer reviews were collected from online retailers and databases to test whether certain chemicals might increase the risk of adverse reactions to products. The study scope was narrowed to shampoo products for hypothesis testing. Processing steps in the data pipeline included informatics and machine learning methods, namely natural language processing for interpreting product reviews, text extraction from images of product labels, and feature reduction using chemical structure and ingredient source data. Fifty-one ingredient clusters were identified as having a significant correlation with higher adverse reaction rates in consumers when present in shampoos. Among these were a few common plant-based ingredients and synthetic preservatives known to cause skin sensitivity or irritation. In comparison with other constituents, however, the positively correlated ingredient groups generally lacked published structural, physicochemical property, and toxicity data. The results suggest an urgent need for targeted, higher-throughput chemical evaluations to safeguard consumers.
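
The comparison behind "higher adverse reaction rates" can be sketched in Python as a difference in review-level rates between products that contain an ingredient cluster and products that do not. This is a minimal sketch under stated assumptions: the record layout (`clusters`, `n_reviews`, `n_adverse`), the function names, and the sample values are all hypothetical, and the abstract does not specify the statistical test used to establish significance, so none is shown here.

```python
def adverse_rate(products):
    """Return adverse reviews divided by total reviews, pooled over products."""
    reviews = sum(p["n_reviews"] for p in products)
    adverse = sum(p["n_adverse"] for p in products)
    return adverse / reviews if reviews else 0.0


def rate_difference(products, cluster):
    """Return the adverse rate with the cluster minus the rate without it."""
    with_cluster = [p for p in products if cluster in p["clusters"]]
    without_cluster = [p for p in products if cluster not in p["clusters"]]
    return adverse_rate(with_cluster) - adverse_rate(without_cluster)


# Illustrative records only; these are not data from the study.
products = [
    {"clusters": {"A"}, "n_reviews": 100, "n_adverse": 10},
    {"clusters": {"B"}, "n_reviews": 200, "n_adverse": 4},
    {"clusters": {"A", "B"}, "n_reviews": 100, "n_adverse": 6},
]
diff_a = rate_difference(products, "A")
```

A positive difference only flags a cluster for closer inspection; a real analysis would also need a significance test and some control for ingredients that tend to appear together.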

Together, these proof-of-concept studies advance our ability to quantify the exposure and hazard of emerging and data-poor substances in consumer products. The outcomes of the computational frameworks can help prioritize potentially problematic substances for additional study to characterize risk.





Thornton, Luka (2023). Machine Learning to Estimate Exposure and Effects of Emerging Chemicals and Other Consumer Product Ingredients. Dissertation, Duke University. Retrieved from


Duke's student scholarship is made available to the public using a Creative Commons Attribution / Non-commercial / No derivative (CC-BY-NC-ND) license.