Advancements in Probabilistic Machine Learning and Causal Inference for Personalized Medicine


In this dissertation, we present four novel contributions to the field of statistics with the shared goal of personalizing medicine to individual patients. These methods are developed to directly address problems in health care through two subfields of statistics: probabilistic machine learning and causal inference. The projects include improving predictions of adverse events after surgery and learning the effectiveness of treatments for specific subgroups and for individuals. We begin in Chapter 1 with a discussion of personalized medicine, the use of electronic health record (EHR) data, and a brief overview of learning heterogeneous treatment effects. In Chapter 2, we present a novel algorithm, Predictive Hierarchical Clustering (PHC), for agglomerative hierarchical clustering of current procedural terminology (CPT) codes. PHC clusters the subgroups found within our data, rather than individual observations, so that the discovered clusters yield optimal performance of a classification model, specifically one predicting surgical complications. In Chapter 3, we develop a hierarchical infinite latent factor model (HIFM) to appropriately account for the covariance structure across subpopulations in data. We propose a novel Hierarchical Dirichlet Process shrinkage prior on the loadings matrix that flexibly captures the underlying structure of our data across subpopulations while sharing information to improve inference and prediction. We apply this work to the problem of predicting surgical complications using EHR data for geriatric patients at Duke University Health System (DUHS). The last chapters of the dissertation address personalized medicine from a causal perspective, where the goal is to understand how interventions affect individuals rather than full populations.
In Chapter 4, we address heterogeneous treatment effects across subgroups, where guidance for observational comparisons within subgroups is lacking, as is a connection to classic design principles for propensity score (PS) analyses. We address these shortcomings by proposing a novel propensity score method for subgroup analysis (SGA) that seeks to balance existing strategies in an automatic and efficient way. With the use of overlap weights, we prove that an overspecified propensity model including interactions between subgroups and all covariates results in exact covariate balance within subgroups. This is paired with variable selection approaches to adjust for a possibly overspecified propensity score model. Finally, Chapter 5 presents our last contribution, a longitudinal matching algorithm that aims to predict individual treatment effects of a medication change for diabetes patients. This project aims to develop a novel and generalizable causal inference framework for learning heterogeneous treatment effects from EHR data. The key methodological innovation is to cast the sparse and irregularly spaced EHR time series into functional data analysis in the design stage to adjust for confounding that changes over time. We conclude the dissertation and discuss future work in Chapter 6, outlining directions for continued research on these topics.
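
The exact-balance property of overlap weights described for Chapter 4 can be checked numerically. The following is a minimal sketch under stated assumptions, not the dissertation's implementation: the simulated data, the three subgroups, the two covariates, and the helper names (`fit_logistic_mle`, `subgroup_design`, `weighted_mean_diffs`) are all hypothetical, and the variable selection step mentioned in the abstract is omitted. It fits an unpenalized logistic propensity model with a separate intercept and separate covariate slopes for each subgroup, forms overlap weights (1 - e for treated units, e for control units), and compares weighted covariate means between treatment arms within each subgroup.

```python
import numpy as np

def fit_logistic_mle(X, z, n_iter=50, tol=1e-10):
    """Unpenalized logistic regression fitted by Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (z - p)                        # score vector
        hess = X.T @ (X * (p * (1 - p))[:, None])   # observed information
        step = np.linalg.solve(hess, grad)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def subgroup_design(covs, group):
    """Design matrix with an intercept and covariate slopes per subgroup."""
    blocks = []
    for g in np.unique(group):
        ind = (group == g).astype(float)[:, None]
        blocks.append(ind)          # subgroup-specific intercept
        blocks.append(ind * covs)   # subgroup-by-covariate interactions
    return np.hstack(blocks)

def weighted_mean_diffs(covs, z, group, w):
    """Weighted treated-minus-control covariate mean difference per subgroup."""
    out = {}
    for g in np.unique(group):
        t = (group == g) & (z == 1)
        c = (group == g) & (z == 0)
        mean_t = np.average(covs[t], axis=0, weights=w[t])
        mean_c = np.average(covs[c], axis=0, weights=w[c])
        out[int(g)] = mean_t - mean_c
    return out

# Hypothetical simulated data: treatment assignment depends on the
# covariates, and differently so in subgroup 1.
rng = np.random.default_rng(0)
n = 2000
group = rng.integers(0, 3, size=n)
covs = rng.normal(size=(n, 2))
logit = 0.5 * covs[:, 0] - 0.3 * covs[:, 1] + 0.4 * (group == 1) * covs[:, 0]
z = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

# Propensity scores from the fully interacted model.
X = subgroup_design(covs, group)
e_hat = 1.0 / (1.0 + np.exp(-X @ fit_logistic_mle(X, z)))

# Overlap weights: 1 - e for treated units, e for control units.
w_overlap = np.where(z == 1, 1.0 - e_hat, e_hat)

raw_diffs = weighted_mean_diffs(covs, z, group, np.ones(n))
ow_diffs = weighted_mean_diffs(covs, z, group, w_overlap)

for g in ow_diffs:
    print(g, "unweighted:", raw_diffs[g], "overlap-weighted:", ow_diffs[g])
```

In this sketch the overlap-weighted differences vanish up to numerical precision because the maximum likelihood score equations force the sum of x(z - e) to be zero for every column of the design matrix, and the interacted design contains each subgroup's indicator and covariates. The balance holds only for terms included in the propensity model and only for an unpenalized maximum likelihood fit; a penalized or variable-selected fit would generally give approximate rather than exact balance.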





Lorenzi, Elizabeth Catherine (2019). Advancements in Probabilistic Machine Learning and Causal Inference for Personalized Medicine. Dissertation, Duke University. Retrieved from


Duke's student scholarship is made available to the public using a Creative Commons Attribution / Non-commercial / No derivative (CC-BY-NC-ND) license.