Demographic distribution matching between real-world and virtual phantom population.
Date
2026-03
Abstract
Background
The adoption of virtual imaging trials (VITs) is rapidly expanding, offering a cost-effective and ethically viable alternative to large-scale clinical trials for imaging system evaluation. However, differences in demographic composition between virtual phantom populations and real-world clinical cohorts can introduce bias in imaging performance assessments, particularly for underrepresented populations. Such discrepancies, if unaddressed, can limit the translational relevance of VIT findings by misrepresenting diagnostic performance across diverse patient groups.
Purpose
To address this limitation, we introduce DISTINCT (Distributional Subsampling for Covariate-Targeted Alignment), a statistical framework for selecting demographically aligned subsamples from large clinical datasets to support robust comparisons with virtual cohorts.
Methods
We applied DISTINCT to the National Lung Screening Trial (NLST) and a companion virtual trial dataset (VLST). The algorithm jointly aligned typical continuous (age, BMI) and categorical (sex, race, ethnicity) variables by constructing multidimensional bins based on discretized covariates. For a given target size, DISTINCT samples individuals to match the joint demographic distribution of the reference population. We evaluated the demographic similarity between VLST and progressively larger NLST subsamples using Wasserstein and Kolmogorov-Smirnov (K-S) distances to identify the maximal subsample size with acceptable alignment. After demographic alignment, we evaluated lung cancer risk prediction performance by applying two established NLST risk scores to the aligned subsamples and assessing their stability with receiver operating characteristic (ROC) analysis.
Results
The DISTINCT algorithm identified a maximal demographically aligned NLST subsample of 9974 participants that preserved similarity to the VLST population. To assess whether such aligned subsets were sufficient for downstream applications, we applied two established NLST lung cancer risk scores and evaluated their performance using ROC analysis. Area under the curve (AUC) estimates stabilized once subsample sizes exceeded approximately 6000 participants, demonstrating that moderately sized aligned subsets provide reliable predictive model evaluation. Stratified analyses revealed demographic-specific variations in AUC, underscoring the importance of covariate alignment for fair and representative comparisons.
Conclusion
DISTINCT provides a statistically rigorous and scalable approach for covariate alignment between real and virtual imaging cohorts based on demographic factors of variability. Although demonstrated for lung cancer screening with low-dose CT, the framework is broadly applicable to other imaging modalities and diseases, and across wide ranges of factors of variability. By enabling fair and representative performance assessments, DISTINCT advances the integration of VITs into imaging research and protocol optimization workflows.
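The binning, sampling, and distance checks described in the Methods can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`joint_bin`, `matched_subsample`, `ks_distance`, `wasserstein_distance_1d`), the record layout, the 5-unit bin widths for age and BMI, and the proportional quota rule are all assumptions made for illustration, and the published algorithm may differ in each of these choices.

```python
import random
from bisect import bisect_right
from collections import Counter, defaultdict


def joint_bin(person, age_width=5.0, bmi_width=5.0):
    """Map a record to a joint bin: discretized age and BMI plus categorical fields.

    The bin widths are illustrative assumptions, not values from the paper.
    """
    return (
        int(person["age"] // age_width),
        int(person["bmi"] // bmi_width),
        person["sex"],
        person["race"],
        person["ethnicity"],
    )


def matched_subsample(source, reference, target_size, seed=0):
    """Draw about target_size records from source, following the
    reference population's joint bin proportions (assumed quota rule)."""
    rng = random.Random(seed)
    ref_counts = Counter(joint_bin(p) for p in reference)
    pools = defaultdict(list)
    for p in source:
        pools[joint_bin(p)].append(p)
    sample = []
    for key, count in ref_counts.items():
        quota = round(target_size * count / len(reference))
        pool = pools.get(key, [])
        # A bin that is scarce in the source caps how many records can be drawn.
        sample.extend(rng.sample(pool, min(quota, len(pool))))
    return sample


def _ecdf(sorted_values, v):
    """Empirical CDF of sorted_values evaluated at v."""
    return bisect_right(sorted_values, v) / len(sorted_values)


def ks_distance(xs, ys):
    """Two-sample K-S statistic: largest gap between the empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    return max(abs(_ecdf(xs, v) - _ecdf(ys, v)) for v in set(xs) | set(ys))


def wasserstein_distance_1d(xs, ys):
    """1-D Wasserstein distance: area between the two empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    grid = sorted(set(xs) | set(ys))
    return sum(
        abs(_ecdf(xs, a) - _ecdf(ys, a)) * (b - a)
        for a, b in zip(grid, grid[1:])
    )
```

Under these assumptions, one could call `matched_subsample` for a series of increasing `target_size` values and compute `ks_distance` and `wasserstein_distance_1d` on a covariate such as age against the reference cohort. When a bin runs out of source records, the drawn sample falls short of its quota and the distances would be expected to grow, which is one way to read the notion of a maximal aligned subsample size.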
Publication Info
Ghosh, Dhrubajyoti, Fakrul Tushar, Lavsen Dahal, Liesbeth Vancoillie, Kyle J Lafata, Ehsan Samei, Joseph Y Lo, Sheng Luo, et al. (2026). Demographic distribution matching between real-world and virtual phantom population. Medical Physics, 53(3), e70364. doi:10.1002/mp.70364. Retrieved from https://hdl.handle.net/10161/34351.
This is constructed from limited available data and may be imprecise. To cite this article, please review & use the official citation provided by the journal.
Scholars@Duke
Kyle Jon Lafata
Kyle Lafata is the Thaddeus V. Samulski Associate Professor at Duke University with faculty appointments in Radiation Oncology, Radiology, Medical Physics, Electrical & Computer Engineering, and Mathematics. He joined the faculty at Duke in 2020 following postdoctoral training at the US Department of Veterans Affairs. His dissertation work focused on the applied analysis of stochastic partial differential equations and high-dimensional image phenotyping, where he developed physics-based computational methods and soft-computing paradigms to interrogate images. These included stochastic modeling, self-organization, and quantum machine learning (i.e., an emerging branch of research that explores the methodological and structural similarities between quantum systems and learning systems).
Prof. Lafata has worked in various areas of computational medicine and biology, resulting in over 80 academic papers, 30 invited talks, and more than 100 national conference presentations. At Duke, the Lafata Laboratory focuses on the theory, development, and application of computational oncology. The lab interrogates disease at different length-scales of its biological organization via high-performance computing, multiscale modeling, advanced imaging technology, and the applied analysis of stochastic partial differential equations. Current research interests include tumor topology, cellular dynamics, tumor immune microenvironment, drivers of radiation resistance and immune dysregulation, molecular insight into tissue heterogeneity, and biologically guided adaptive treatment strategies.
Joseph Yuan-Chieh Lo
My research is at the intersection of computer vision, machine learning, and medical imaging, with a dual focus on mammography and computed tomography (CT). Together with our industry partner, we developed deep learning algorithms for breast cancer screening with 2D/3D mammography, and that product is now undergoing FDA approval with anticipated rollout to clinics worldwide. We also pioneer the creation of "digital twin" anatomical models from patient imaging data, using these models to forge new paths in CT scan analysis through virtual readers and deep learning techniques. Additionally, we're developing a computer-aided triage system for detecting diseases across multiple organs in body CT scans, leveraging hospital-scale datasets and integrating natural language processing with deep learning for comprehensive disease classification.
Sheng Luo
Unless otherwise indicated, scholarly articles published by Duke faculty members are made available here with a CC-BY-NC (Creative Commons Attribution Non-Commercial) license, as enabled by the Duke Open Access Policy. If you wish to use the materials in ways not already permitted under CC-BY-NC, please consult the copyright owner. Other materials are made available here through the author’s grant of a non-exclusive license to make their work openly accessible.
