Demographic distribution matching between real-world and virtual phantom population.

Abstract

Background

The adoption of virtual imaging trials (VITs) is rapidly expanding, offering a cost-effective and ethically viable alternative to large-scale clinical trials for imaging system evaluation. However, differences in demographic composition between virtual phantom populations and real-world clinical cohorts can introduce bias in imaging performance assessments, particularly for underrepresented populations. Such discrepancies, if unaddressed, can limit the translational relevance of VIT findings by misrepresenting diagnostic performance across diverse patient groups.

Purpose

To address this limitation, we introduce DISTINCT (Distributional Subsampling for Covariate-Targeted Alignment), a statistical framework for selecting demographically aligned subsamples from large clinical datasets to support robust comparisons with virtual cohorts.

Methods

We applied DISTINCT to the National Lung Screening Trial (NLST) and a companion virtual trial dataset (VLST). The algorithm jointly aligned continuous (age, BMI) and categorical (sex, race, ethnicity) covariates by constructing multidimensional bins from the discretized covariates. For a given target size, DISTINCT samples individuals to match the joint demographic distribution of the reference population. We evaluated the demographic similarity between VLST and progressively larger NLST subsamples using Wasserstein and Kolmogorov-Smirnov (K-S) distances to identify the maximal subsample size with acceptable alignment. After demographic alignment, we evaluated lung cancer risk prediction performance by applying two established NLST risk scores to the aligned subsamples and assessing their stability with receiver operating characteristic (ROC) analysis.
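The binning and sampling steps described above can be illustrated with a minimal sketch. This is not the published DISTINCT implementation: the bin edges, the record field names, the function names, and the choice to cap a bin's quota at the number of available records are all assumptions made for illustration. The two distance functions are plain one-dimensional versions of the K-S statistic and the Wasserstein-1 distance.

```python
import random
from bisect import bisect_right
from collections import Counter, defaultdict

# Hypothetical bin edges; the abstract does not state the discretization used.
AGE_EDGES = [55, 60, 65, 70, 75]
BMI_EDGES = [18.5, 25.0, 30.0, 35.0]


def bin_index(value, edges):
    """Count the edges that are <= value (0 means below the first edge)."""
    return sum(value >= e for e in edges)


def joint_bin(person):
    """Map one record to a multidimensional bin key (assumed field names)."""
    return (
        bin_index(person["age"], AGE_EDGES),
        bin_index(person["bmi"], BMI_EDGES),
        person["sex"],
        person["race"],
        person["ethnicity"],
    )


def aligned_subsample(source, reference, target_size, seed=0):
    """Sample from source so bin proportions follow those of reference."""
    rng = random.Random(seed)
    ref_counts = Counter(joint_bin(p) for p in reference)
    pools = defaultdict(list)
    for p in source:
        pools[joint_bin(p)].append(p)
    sample = []
    for key, count in ref_counts.items():
        quota = round(target_size * count / len(reference))
        # Assumption: a bin with too few source records is capped, so the
        # result can be smaller and less well aligned than requested.
        quota = min(quota, len(pools[key]))
        sample.extend(rng.sample(pools[key], quota))
    return sample


def _cdf_gaps(xs, ys):
    """Yield (point, |F_x - F_y|) at every distinct observed value."""
    xs, ys = sorted(xs), sorted(ys)
    for t in sorted(set(xs) | set(ys)):
        yield t, abs(bisect_right(xs, t) / len(xs) - bisect_right(ys, t) / len(ys))


def ks_distance(xs, ys):
    """Two-sample K-S statistic: largest gap between the empirical CDFs."""
    return max(gap for _, gap in _cdf_gaps(xs, ys))


def wasserstein_distance(xs, ys):
    """1-D Wasserstein-1 distance: area between the empirical CDFs."""
    pts = list(_cdf_gaps(xs, ys))
    return sum(gap * (t_next - t) for (t, gap), (t_next, _) in zip(pts, pts[1:]))
```

Under these assumptions, one way to look for a maximal aligned size is to call `aligned_subsample` for increasing `target_size` values and track `ks_distance` and `wasserstein_distance` on age and BMI against the reference cohort; the acceptance threshold is not given in the abstract and would have to be chosen separately. In practice, `scipy.stats.ks_2samp` and `scipy.stats.wasserstein_distance` provide tested equivalents of the two distance functions.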

Results

The DISTINCT algorithm identified a maximal demographically aligned NLST subsample of 9974 participants that preserved similarity to the VLST population. To assess whether such aligned subsets were sufficient for downstream applications, we applied two established NLST lung cancer risk scores and evaluated their performance using ROC analysis. Area under the curve (AUC) estimates stabilized once subsample sizes exceeded approximately 6000 participants, demonstrating that moderately sized aligned subsets provide reliable predictive model evaluation. Stratified analyses revealed demographic-specific variations in AUC, underscoring the importance of covariate alignment for fair and representative comparisons.
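The stability check on AUC can be sketched as follows. This is an illustrative outline, not the authors' analysis code: the function names are hypothetical, plain random subsampling stands in for the demographically aligned subsamples, and AUC is computed from average ranks (the Mann-Whitney formulation), with tied scores sharing a rank.

```python
import random


def roc_auc(labels, scores):
    """AUC from average ranks; labels are 0/1, higher score means higher risk."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Find the run of tied scores starting at position i.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank across the tie
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def auc_by_size(labels, scores, sizes, seed=0):
    """Compute AUC on subsamples of each requested size.

    Assumption: subsamples are drawn uniformly at random here; in the
    study they would be the demographically aligned subsamples.
    """
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    result = {}
    for n in sizes:
        chosen = rng.sample(indices, n)
        result[n] = roc_auc([labels[i] for i in chosen],
                            [scores[i] for i in chosen])
    return result
```

Plotting the returned AUC values against subsample size, ideally over several seeds, shows where the estimates stop changing appreciably. The abstract reports that this happened above roughly 6000 participants.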

Conclusion

DISTINCT provides a statistically rigorous and scalable approach for covariate alignment between real and virtual imaging cohorts based on demographic factors of variability. Although demonstrated for lung cancer screening with low-dose CT, the framework is broadly applicable to other imaging modalities and diseases, and across wide ranges of factors of variability. By enabling fair and representative performance assessments, DISTINCT advances the integration of VITs into imaging research and protocol optimization workflows.

Subjects

Humans; Lung Neoplasms; Phantoms, Imaging; Demography; Aged; Middle Aged; Female; Male

Citation

Published Version (Please cite this version)

10.1002/mp.70364

Publication Info

Ghosh, Dhrubajyoti, Fakrul Tushar, Lavsen Dahal, Liesbeth Vancoillie, Kyle J Lafata, Ehsan Samei, Joseph Y Lo, Sheng Luo, et al. (2026). Demographic distribution matching between real-world and virtual phantom population. Medical Physics, 53(3), e70364. 10.1002/mp.70364. Retrieved from https://hdl.handle.net/10161/34351.

This citation is constructed from limited available data and may be imprecise. To cite this article, please review and use the official citation provided by the journal.

Scholars@Duke

Kyle Jon Lafata

Thaddeus V. Samulski Associate Professor of Radiation Oncology

Kyle Lafata is the Thaddeus V. Samulski Associate Professor at Duke University with faculty appointments in Radiation Oncology, Radiology, Medical Physics, Electrical & Computer Engineering, and Mathematics. He joined the faculty at Duke in 2020 following postdoctoral training at the US Department of Veterans Affairs. His dissertation work focused on the applied analysis of stochastic partial differential equations and high-dimensional image phenotyping, where he developed physics-based computational methods and soft-computing paradigms to interrogate images. These included stochastic modeling, self-organization, and quantum machine learning (i.e., an emerging branch of research that explores the methodological and structural similarities between quantum systems and learning systems).

Prof. Lafata has worked in various areas of computational medicine and biology, resulting in over 80 academic papers, 30 invited talks, and more than 100 national conference presentations. At Duke, the Lafata Laboratory focuses on the theory, development, and application of computational oncology. The lab interrogates disease at different length-scales of its biological organization via high-performance computing, multiscale modeling, advanced imaging technology, and the applied analysis of stochastic partial differential equations. Current research interests include tumor topology, cellular dynamics, tumor immune microenvironment, drivers of radiation resistance and immune dysregulation, molecular insight into tissue heterogeneity, and biologically guided adaptive treatment strategies.

Joseph Yuan-Chieh Lo

Professor in Radiology

My research is at the intersection of computer vision, machine learning, and medical imaging, with a dual focus on mammography and computed tomography (CT). Together with our industry partner, we developed deep learning algorithms for breast cancer screening with 2D/3D mammography, and that product is now undergoing FDA approval with anticipated rollout to clinics worldwide. We also pioneer the creation of "digital twin" anatomical models from patient imaging data, using these models to forge new paths in CT scan analysis through virtual readers and deep learning techniques. Additionally, we're developing a computer-aided triage system for detecting diseases across multiple organs in body CT scans, leveraging hospital-scale datasets and integrating natural language processing with deep learning for comprehensive disease classification.

Sheng Luo

Professor of Biostatistics & Bioinformatics

Unless otherwise indicated, scholarly articles published by Duke faculty members are made available here with a CC-BY-NC (Creative Commons Attribution Non-Commercial) license, as enabled by the Duke Open Access Policy. If you wish to use the materials in ways not already permitted under CC-BY-NC, please consult the copyright owner. Other materials are made available here through the author’s grant of a non-exclusive license to make their work openly accessible.