Improving Clinical Prediction Models with Statistical Representation Learning
This dissertation studies novel statistical machine learning approaches for healthcare risk prediction in challenging scenarios such as rare events, noisy observations, data imbalance, missingness, and censoring. Such scenarios arise frequently in practice, and they compromise the validity of standard predictive models, which typically expect clean and complete data. Alleviating the negative impact of these real-world data challenges is therefore of great significance and constitutes the overarching goal of this dissertation, which investigates novel strategies to (i) account for data uncertainties and the statistical characteristics of low-prevalence events; (ii) re-balance and augment the representations of minority data under proper causal assumptions; and (iii) dynamically score and attend to the observed units to derive robust features. By integrating ideas from representation learning, variational Bayes, causal inference, and contrastive training, this dissertation builds risk modeling frameworks that are robust to various peculiarities of real-world datasets and yield reliable individualized risk evaluations.
This dissertation begins with a systematic review of classical risk prediction models in Chapter 1 and discusses the new opportunities and challenges presented by the big data era. With the increasing availability of healthcare data and the rapid development of machine learning models, clinical decision support systems have new opportunities to improve clinical practice. However, in healthcare risk prediction applications, statistical analysis is challenged not only by data incompleteness and skewed distributions but also by the complexity of the inputs. To motivate the subsequent developments, the chapter discusses the limitations of risk minimization methods, robustness to high-dimensional incomplete data, and the need for individualization.
As a concrete example addressing a canonical problem, Chapter 2 proposes a variational disentanglement approach that learns semi-parametrically from heavily imbalanced binary classification datasets. In this new method, named Variational Inference for Extremals (VIE), we apply extreme value theory to enable efficient learning from few observations. By organically integrating generalized additive models and isotonic neural nets, VIE enjoys improved robustness, interpretability, and generalizability for the accurate prediction of rare events. An analysis of the COVID-19 cohort from Duke Hospitals demonstrates that the proposed approach outperforms competing solutions. Chapter 3 investigates the more general setting of multi-class classification with heavily imbalanced data, from the perspective of causal machine learning, to promote sample efficiency and model generalization. Our solution, named Energy-based Causal Representation Transfer (ECRT), posits a meta-distributional scenario in which the data-generating mechanism for label-conditional features is invariant across labels. This causal assumption enables efficient knowledge transfer from the dominant classes to their under-represented counterparts, even when their feature distributions show apparent disparities. It allows us to leverage a causally informed data augmentation procedure, based on nonlinear independent component analysis, that enriches the representations of minority classes and simultaneously whitens the data. The effectiveness and enhanced prediction accuracy of ECRT are demonstrated on synthetic data and real-world benchmarks against state-of-the-art models.
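The peaks-over-threshold idea from extreme value theory that motivates learning from few extreme observations can be sketched as follows. This is a hypothetical illustration on synthetic data, not the VIE implementation: exceedances over a high threshold are approximately generalized-Pareto distributed, so a tail fit extrapolates rare-event probabilities beyond the bulk of the data.

```python
import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(0)
scores = rng.exponential(scale=1.0, size=10_000)  # synthetic risk scores

u = np.quantile(scores, 0.95)           # high threshold
exceedances = scores[scores > u] - u    # peaks over the threshold

# Fit a generalized Pareto distribution to the exceedances (location fixed at 0).
shape, _, scale = genpareto.fit(exceedances, floc=0.0)

def tail_prob(x):
    """Estimate P(score > x) for x above the threshold via the fitted GPD tail."""
    p_exceed = np.mean(scores > u)      # empirical P(score > u)
    return p_exceed * genpareto.sf(x - u, shape, loc=0.0, scale=scale)

# For exponential(1) data the true tail is exp(-x); the GPD estimate is close
# even though only ~5% of the sample informs the fit.
print(f"estimated P(score > 6) = {tail_prob(6.0):.4f}, true = {np.exp(-6.0):.4f}")
```

Note that only the exceedances enter the likelihood, which is what makes this style of tail modeling sample-efficient for rare events.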
Chapter 4 addresses time-to-event prediction with censored (missing outcomes) and incomplete (missing covariates) observations. Also known as survival analysis, time-to-event prediction plays a crucial role in many clinical applications, yet classical survival solutions scale poorly with data complexity and incompleteness. To better handle sophisticated modern health data and alleviate the impact of real-world data challenges, we introduce a self-attention-based model, called Energy-based Latent Self-Attentive Survival Analysis (ELSSA), that captures information helpful for time-to-event prediction. A key novelty of this approach is the integration of a contrastive mutual-information-based loss that non-parametrically maximizes the informativeness of the learned data representations. The effectiveness of our approach is extensively validated on synthetic and real-world benchmarks, showing improved performance relative to competing solutions.
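A contrastive mutual-information objective of the kind described above is commonly instantiated as an InfoNCE-style loss. The sketch below is illustrative only (all names are hypothetical, and it is not the ELSSA code): each representation is pushed to identify its own positive pair among a batch of candidates, which lower-bounds the mutual information between the two views.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE loss: each anchor should match its own positive among all candidates."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                 # pairwise cosine similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))          # cross-entropy on matched pairs

rng = np.random.default_rng(0)
z = rng.normal(size=(32, 8))                       # a batch of representations
aligned = info_nce(z, z + 0.01 * rng.normal(size=(32, 8)))  # positives near anchors
random_ = info_nce(z, rng.normal(size=(32, 8)))             # unrelated positives
print(aligned < random_)  # aligned pairs yield a lower loss
```

Minimizing such a loss makes representations informative about their paired views without any parametric density assumption, which is the sense in which the maximization is non-parametric.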
In summary, this dissertation presents flexible risk prediction frameworks that acknowledge representation uncertainty, data heterogeneity, and incompleteness. Altogether, it makes three contributions: more efficient learning from imbalanced data, enhanced robustness to missing data, and better generalizability to out-of-sample subjects.
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.