Explainable Artificial Intelligence Techniques in Medical Imaging Analysis

Artificial intelligence (AI), including classic machine learning (ML) and deep learning (DL), has recently made an impact on advanced medical image analysis. Classic ML learns the data representation through manual image feature engineering, namely radiomics, based on experts' domain knowledge. DL learns image features directly from the input data through hierarchical data modeling. Both classic ML and DL models have emerged as promising AI tools for medical image analysis. Despite promising academic research in which algorithms are beginning to outperform humans, AI involvement in clinical radiography analysis remains limited. One issue in current AI development (for both classic ML and DL) is the lack of model explainability, i.e., the extent to which the internal mechanics of an AI model can be explained in human terms from a clinical perspective. The unexplained issues include, but are not limited to, model confidence ('Can we trust the results with some clues?'), data utilization ('Do we need this as a part of the model?'), and model generalization ('How do I know if it works?'). Without such explainability, AI models remain a black box in implementation, which leads to a lack of accountability and confidence in clinical application. We hypothesize that current medical domain knowledge, both in theory and in practice, can be incorporated into AI designs to provide explainability. Therefore, the objective of this dissertation is to explore potential techniques to enhance AI model explainability. Specifically, three novel AI models were developed:

• The first model aimed to explore a radiomic filtering model to quantify and visualize radiomic features associated with pulmonary ventilation from lung computed tomography (CT). In this model, lung volume was segmented on 46 CT images, and a 3D sliding window kernel was applied across the lung volume to capture spatially encoded image information.
Fifty-three radiomic features were extracted within the kernel, resulting in a 4th-order tensor object. As such, each voxel coordinate of the original lung was represented as a 53-dimensional feature vector, such that radiomic features could be viewed as feature maps within the lungs. To test the technique as a potential pulmonary ventilation biomarker, the radiomic feature maps were compared to paired functional images (Galligas positron emission tomography (PET) or DTPA single photon emission computed tomography (SPECT)) based on Spearman correlation (ρ) analysis. From the results, the radiomic feature maps Gray Level Run Length Matrix (GLRLM)-based Run-Length Non-Uniformity and Gray Level Co-occurrence Matrix (GLCM)-based Sum Average were found to be highly correlated with functional imaging. The achieved ρ (median [range]) values for the two features were 0.46 [0.05, 0.67] and 0.45 [0.21, 0.65], respectively, across 46 patients and 2 functional imaging modalities. The results provide evidence that local regions of sparsely encoded heterogeneous lung parenchyma on CT are associated with diminished radiotracer uptake and measured lung ventilation defects on PET/SPECT imaging. Collectively, these findings demonstrate the potential of radiomic filtering to provide a visual explanation of lung CT radiomic features associated with lung ventilation. The developed technique may serve as a complementary tool to current lung quantification techniques and provide hypothesis-generating data for future studies.

• The second model aimed to explore a neural ordinary differential equation (ODE)-based segmentation model to observe deep neural network (DNN) behavior in multi-parametric magnetic resonance imaging (MRI)-based glioma segmentation.
In this model, by hypothesizing that deep feature extraction can be modeled as a spatiotemporally continuous process, we implemented a novel DL model, neural ODE, in which deep feature extraction was governed by an ODE parameterized by a neural network. The dynamics of 1) MR images after interactions with the DNN and 2) segmentation formation can thus be visualized after solving the ODE. An accumulative contribution curve (ACC) was designed to quantitatively evaluate each MR image’s utilization by the DNN toward the final segmentation results. The proposed neural ODE model was demonstrated using 369 glioma patients with a 4-modality multi-parametric MRI protocol: T1, contrast-enhanced T1 (T1-Ce), T2, and fluid-attenuated inversion recovery (FLAIR). Three neural ODE models were trained to segment enhancing tumor (ET), tumor core (TC), and whole tumor (WT), respectively. The key MR modalities with significant utilization by DNNs were identified based on ACC analysis. Segmentation results by DNNs using only the key MR modalities were compared to those using all 4 MR modalities in terms of Dice coefficient, accuracy, sensitivity, and specificity. From the results, all neural ODE models successfully illustrated image dynamics as expected. ACC analysis identified T1-Ce as the only key modality in ET and TC segmentations, while both FLAIR and T2 were key modalities in WT segmentation. Compared to the U-Net results using all 4 MR modalities, the Dice coefficients showed only small differences when only the key modalities were used: ET (0.784→0.775), TC (0.760→0.758), and WT (0.841→0.837). Collectively, the neural ODE model offers a new tool for optimizing DL model inputs with enhanced explainability in data utilization. The presented methodology can be generalized to other medical image-related DL applications.
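The four segmentation metrics named above (Dice coefficient, accuracy, sensitivity, and specificity) all derive from voxel-wise confusion-matrix counts. The following is a minimal sketch of those standard definitions, not the dissertation's implementation. The function name `segmentation_metrics`, the flattened 0/1 mask representation, and the NaN convention for undefined ratios are assumptions made for illustration, and the toy masks are made up.

```python
from typing import Dict, Sequence


def segmentation_metrics(pred: Sequence[int], truth: Sequence[int]) -> Dict[str, float]:
    """Compare two binary masks given as flattened sequences of 0/1 voxels."""
    if len(pred) != len(truth):
        raise ValueError("masks must contain the same number of voxels")

    # Voxel-wise confusion-matrix counts.
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)

    def ratio(numerator: int, denominator: int) -> float:
        # Assumed convention: report NaN when a metric is undefined
        # (for example, sensitivity on a mask with no tumor voxels).
        return numerator / denominator if denominator else float("nan")

    return {
        "dice": ratio(2 * tp, 2 * tp + fp + fn),
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
    }


# Toy masks (made-up values, not data from the dissertation).
truth = [0, 0, 1, 1, 1, 0, 0, 0]
pred = [0, 1, 1, 1, 0, 0, 0, 0]
metrics = segmentation_metrics(pred, truth)
```

For these toy masks there are 2 true positives, 1 false positive, 1 false negative, and 4 true negatives, so the Dice coefficient is 4/6 and the specificity is 4/5. A real 3D segmentation would be flattened to one voxel sequence per tumor subregion (ET, TC, or WT) before the call.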
• The third model aimed to explore a multi-feature-combined (MFC) model to quantify the role of radiomic features, DL image features, and their combination in predicting local failure from pre-treatment CT images of early-stage non-small cell lung cancer (NSCLC) patients after either lung surgery or stereotactic body radiation therapy (SBRT). The MFC model comprised three key steps. (1) Extraction of 92 handcrafted radiomic features from the gross tumor volume (GTV) segmented on pre-treatment CT images. (2) Extraction of 512 deep features from a pre-trained DL U-Net encoder structure. Specifically, the 512 latent activation values from the last fully connected layers were studied. (3) The extracted 92 handcrafted radiomic features and 512 deep features, along with 4 items of patient demographic information (i.e., gender, age, tumor volume, and Charlson comorbidity index), were concatenated as a multi-dimensional input to three classifiers: logistic regression (LR), support vector machine (SVM), and random forest (RF), to predict local failure. Two NSCLC patient cohorts from our institution were investigated: (1) the surgery cohort included 83 patients who underwent segmentectomy or wedge resection (with 7 local failures), and (2) the SBRT cohort included 84 patients who received lung SBRT (with 9 local failures). The MFC model was developed and evaluated independently for both patient cohorts. For each cohort, the MFC model was also compared against (1) the R model: LR/SVM/RF prediction models using only radiomic features, (2) the PI model: LR/SVM/RF prediction models using only patient demographic information, and (3) the DL model: a DL design that directly predicts local failure based on the U-Net encoder. All models were tested using two validation methods: leave-one-out cross-validation (LOOCV) and 100-fold Monte Carlo cross-validation (MCCV) with a 70%-30% train-test ratio.
Receiver operating characteristic (ROC) analysis with the area under the curve (AUC) was adopted as the main measure of prediction performance. Student’s t-test was performed to identify statistically significant differences when applicable. In LOOCV, the AUC range of the proposed MFC model (for three classifiers) was 0.811-0.956 for the surgery cohort and 0.913-0.981 for the SBRT cohort, which was higher than the other studied models: the AUC range was 0.356-0.480 (surgery) and 0.295-0.347 (SBRT) for the PI models, 0.388-0.655 (surgery) and 0.648-0.747 (SBRT) for the R models, and 0.816 (surgery) and 0.842 (SBRT) for the DL models. Similar results were observed in the 100-fold MCCV: the MFC model again showed the highest AUC results (surgery: 0.831-0.841, SBRT: 0.860-0.947), which were significantly higher than the PI models (surgery: 0.464-0.564, SBRT: 0.457-0.519), R models (surgery: 0.546-0.653, SBRT: 0.559-0.667), and DL models (surgery: 0.690, SBRT: 0.773). Collectively, the developed MFC model improves the ability to predict the occurrence of local failure for both surgery and SBRT patient cohorts with enhanced explainability in the role of different feature sources. It may hold the potential to help clinicians optimize treatment procedures in the future.

In summary, the three developed models provide substantial contributions to enhancing the explainability of current classic ML and DL models. The concepts and techniques developed in this dissertation, as well as the understanding and inspiration drawn from the key results, provide valuable knowledge for the future development of AI techniques toward wide clinical trust and acceptance.
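The evaluation scheme described for the third model (concatenating 92 radiomic, 512 deep, and 4 demographic features, then scoring predictions by AUC over 100 Monte Carlo splits with a 70%-30% ratio) can be sketched roughly as follows. This is an illustrative outline only and not the dissertation's code: the function names are hypothetical, the stratified splitting is an assumption (the abstract states only the ratio and the number of repetitions), and classifier fitting is omitted.

```python
import random
from typing import List, Sequence, Tuple


def concatenate_features(radiomic: Sequence[float], deep: Sequence[float],
                         clinical: Sequence[float]) -> List[float]:
    """Join the three feature sources into one input vector per patient."""
    return list(radiomic) + list(deep) + list(clinical)


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the chance that an event scores above a non-event (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one event and one non-event")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mccv_splits(labels: Sequence[int], n_repeats: int = 100,
                test_fraction: float = 0.3,
                seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return (train_idx, test_idx) pairs for Monte Carlo cross-validation.

    Assumption: splits are stratified by outcome so that every test set
    holds at least one event; the source does not say how splits were drawn.
    """
    rng = random.Random(seed)
    groups = ([i for i, y in enumerate(labels) if y == 1],
              [i for i, y in enumerate(labels) if y == 0])
    splits = []
    for _ in range(n_repeats):
        test: List[int] = []
        for group in groups:
            shuffled = group[:]
            rng.shuffle(shuffled)
            n_test = max(1, round(test_fraction * len(shuffled)))
            test.extend(shuffled[:n_test])
        held_out = set(test)
        train = [i for i in range(len(labels)) if i not in held_out]
        splits.append((train, sorted(test)))
    return splits


# Toy demonstration with made-up values (not data from the dissertation).
vector = concatenate_features([0.0] * 92, [0.0] * 512, [0.0] * 4)
toy_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
toy_labels = [1] * 9 + [0] * 75  # 84 cases with 9 events, like the SBRT cohort size
splits = mccv_splits(toy_labels, n_repeats=100, test_fraction=0.3, seed=0)
```

In a full pipeline, each split would fit a classifier (LR, SVM, or RF) on the training indices, score the held-out patients, and pass those scores to `roc_auc`; the 100 resulting AUC values could then be compared across models. With only 7 or 9 events per cohort, an unstratified 30% draw could produce a test set with no events, which is why stratification is assumed here.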





Yang, Zhenyu (2023). Explainable Artificial Intelligence Techniques in Medical Imaging Analysis. Dissertation, Duke University. Retrieved from https://hdl.handle.net/10161/27597.


Duke's student scholarship is made available to the public using a Creative Commons Attribution / Non-commercial / No derivative (CC-BY-NC-ND) license.