Making Model Aware: Pattern Recognition and Analysis in Environmental and Healthcare Data with Machine Learning Models

Date

2024

Abstract

Discovering intrinsic patterns in environmental and healthcare data helps in analyzing correlations and causal relationships among variables and in making informed predictions about targets of interest. Traditional methodologies typically involve defining intervention sources and boundary conditions and conducting simulations through mathematical models. However, these approaches face significant challenges when applied to high-dimensional data because of their substantial computational demands. By contrast, machine learning (ML) models have emerged as a prominent alternative. This thesis proposes and evaluates several ML techniques for pattern recognition and analysis in environmental and healthcare datasets, with the aim of applying these methods to real-world scenarios.

Specifically, when working with environmental data such as satellite imagery, deep learning (DL) models are often a suitable choice for capturing spatial relationships. Among these DL models, convolutional neural networks (CNNs) are a promising technique for predicting highly localized levels of fine particulate matter (PM2.5) from high-resolution satellite imagery. Unfortunately, CNNs typically require large amounts of labeled data to perform well, whereas this application generally offers abundant unlabeled data (all satellite imagery) and relatively sparse labeled data (measurements from ground sensors). Previous work used transfer learning from another visual task to initialize the CNN weights; however, I hypothesize that standard transfer learning strategies would bias the CNN toward image details that are irrelevant for real-world applications. Instead, I develop a novel framework called Spatiotemporal Contrastive Learning (SCL) to pre-train the CNN. I then test both regular contrastive learning and SCL on predicting PM2.5 levels from satellite images in two cities, Delhi and Beijing, and compare them with CNNs whose parameters are initialized randomly or by transfer learning. The results show that both regular contrastive learning and SCL capture the spatial variation of ground-level PM2.5 concentrations better than traditional initialization schemes, and that this performance gap widens as the number of ground sensors decreases, implying that the approach will be even more valuable in cities with fewer ground sensors. My work demonstrates that contrastive learning is a powerful pre-training technique for building better spatial maps of PM2.5 and can be broadly applied in related situations.
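As a rough illustration of what contrastive pre-training optimizes, the sketch below computes a generic InfoNCE-style loss in NumPy. It is not the SCL objective from the thesis, which the abstract does not specify; the function name `info_nce_loss`, the temperature value, and the random embeddings standing in for CNN outputs are all assumptions made for the example. Row i of `anchors` and row i of `positives` are treated as a positive pair (for instance, two views of the same location), and every other row acts as a negative.

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.1):
    """Generic InfoNCE loss: row i of `anchors` should match row i of `positives`."""
    # L2-normalize so that dot products are cosine similarities.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                 # (n, n) similarity matrix
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy with the matching row as the correct "class".
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))  # hypothetical embeddings, not real CNN outputs

# Positives that are slightly perturbed copies of the anchors give a low loss.
aligned_loss = info_nce_loss(emb, emb + 0.01 * rng.normal(size=emb.shape))
# Positives paired with the wrong anchors give a high loss.
shuffled_loss = info_nce_loss(emb, emb[::-1].copy())
print(aligned_loss, shuffled_loss)
```

Minimizing a loss of this kind pulls paired embeddings together and pushes unpaired ones apart, which needs no ground-sensor labels and is why it suits settings where unlabeled imagery is plentiful.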

As mentioned earlier, contrastive learning is well-suited for capturing spatial variation. However, certain properties, such as spatial smoothness and seasonality, can be modeled more effectively by selecting an appropriate kernel in a Gaussian process (GP). Many deep learning applications could be enhanced by modeling such known properties. For example, CNNs are frequently used in remote sensing, which is subject to strong seasonal effects. I propose to blend the strengths of neural networks (NNs) with the explicit modeling capabilities of GPs by using a composite kernel that combines a kernel implicitly defined by a neural network with a second kernel function chosen to model known properties (e.g., seasonality). I implement this idea, called Implicit Composite Kernel (ICK), by combining a deep network with an efficient mapping function based on either the Nyström approximation or random Fourier features. I then adopt a sample-then-optimize approach to approximate the full GP posterior distribution. I demonstrate that ICK has superior performance and flexibility on both synthetic and real-world datasets, including a remote sensing dataset. The ICK framework can be used to incorporate prior information into neural networks in many applications.
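The two ingredients named above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the ICK implementation: the helper `rff_features`, the RBF choice for the second kernel, the random one-layer `tanh` map standing in for a trained network, and the elementwise product used to combine the two Gram matrices are all choices made for the example. It shows that random Fourier features approximate a stationary kernel and that the combined matrix remains a valid (positive semidefinite) kernel matrix.

```python
import numpy as np

def rff_features(t, num_features, lengthscale, rng):
    """Random Fourier features whose inner products approximate an RBF kernel.

    `t` must have shape (n, d).
    """
    w = rng.normal(scale=1.0 / lengthscale, size=(t.shape[1], num_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)
    return np.sqrt(2.0 / num_features) * np.cos(t @ w + b)

rng = np.random.default_rng(0)
lengthscale = 0.3

# Second input with known structure, e.g. a normalized time stamp (hypothetical).
t = np.linspace(0.0, 1.0, 20).reshape(-1, 1)
phi = rff_features(t, 5000, lengthscale, rng)
k_rff = phi @ phi.T
k_exact = np.exp(-0.5 * (t - t.T) ** 2 / lengthscale ** 2)
max_err = np.abs(k_rff - k_exact).max()  # shrinks as num_features grows

# Stand-in for a kernel implicitly defined by a network: inner products of
# hidden features from a random, untrained one-layer map (assumption).
x = rng.normal(size=(20, 5))
h = np.tanh(x @ rng.normal(size=(5, 32)))
k_nn = h @ h.T

# One possible composite: the elementwise product of two PSD Gram matrices,
# which is again PSD by the Schur product theorem.
k_composite = k_nn * k_rff
min_eig = np.linalg.eigvalsh(k_composite).min()
print(max_err, min_eig)
```

The product form means two samples are treated as similar only when both the learned features and the structured input (here, time) agree, which is one way to encode a prior such as seasonality alongside a learned representation.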

Unlike environmental data, patterns in healthcare data are often better described through causal relationships. Estimating causal effects is therefore crucial for assessing the impact of interventions in healthcare settings. Although numerous methods have been developed for causal effect estimation, few are effective at handling data with complex structures, such as medical images. To fill this gap, I propose Causal Multi-task Deep Ensemble (CMDE), a novel framework that learns both shared and group-specific information from the study population. I also prove that CMDE is equivalent a priori to a multi-task Gaussian process (GP) with a coregionalization kernel. Compared to a multi-task GP, CMDE efficiently handles high-dimensional and multi-modal covariates and provides pointwise uncertainty estimates of causal effects. I then evaluate this method across various types of datasets and tasks and find that CMDE outperforms state-of-the-art methods on many of these tasks.
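For readers unfamiliar with coregionalization kernels, the following sketch builds one for two groups (for example, control and treated). It only illustrates the kernel structure mentioned above and says nothing about how CMDE itself is implemented; the loadings `w`, the group-specific variances `kappa`, and the RBF input kernel are hypothetical values chosen for the example.

```python
import numpy as np

def rbf(x, lengthscale=1.0):
    """RBF kernel matrix over the rows of `x`."""
    sq = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq / lengthscale ** 2)

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 3))   # hypothetical covariates for 6 individuals
k_x = rbf(x)

# Coregionalization matrix B = w w^T + diag(kappa):
# w w^T carries information shared across groups,
# diag(kappa) carries group-specific variation.
w = np.array([[1.0], [0.7]])
kappa = np.array([0.2, 0.2])
b = w @ w.T + np.diag(kappa)

# K((x, i), (x', j)) = B[i, j] * k(x, x'), arranged as a block matrix.
k_multi = np.kron(b, k_x)     # shape (12, 12)

# The off-diagonal block couples the two groups through the shared component.
cross_block = k_multi[:6, 6:]
print(np.allclose(cross_block, b[0, 1] * k_x))
```

The off-diagonal entries of `b` control how much the outcome surfaces of the two groups inform each other; setting them to zero would reduce the model to two independent GPs.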

Causal mediation analysis (CMA) is another commonly used technique in healthcare data analysis, allowing researchers to decompose the total treatment effect into direct and mediated effects. This is critical for identifying the mechanisms underlying a treatment's impact in many scientific applications. However, in many cases, the mediator is unobserved, though related measurements may be available. For example, we may want to identify how changes in brain activity or structure mediate an antidepressant's effect on behavior, but we may only have access to electrophysiological or imaging brain measurements. To date, most CMA methods assume the mediator is one-dimensional and observable, which oversimplifies such real-world scenarios. To overcome this limitation, I introduce a CMA framework, based on the identifiable variational autoencoder (iVAE) architecture, that can handle complex and indirectly observed mediators. I show, both theoretically and empirically, that the joint distribution over observed and latent variables is identifiable with this method. In addition, this framework captures a disentangled representation of the indirectly observed mediator and yields accurate estimates of the direct and mediated effects in synthetic and semi-synthetic experiments, providing evidence of its potential utility in practical applications.
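The decomposition into direct and mediated effects is easiest to see in the classical setting that the thesis generalizes: a one-dimensional, directly observed mediator in a linear model. The sketch below simulates such data and recovers the effects by ordinary least squares. It is not the iVAE-based framework described above; the coefficients `a`, `b`, `c`, the sample size, and the helper `ols` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
a, b, c = 0.8, 0.5, 0.3   # hypothetical true coefficients

t = rng.integers(0, 2, size=n).astype(float)   # randomized binary treatment
m = a * t + rng.normal(size=n)                 # mediator:  T -> M
y = c * t + b * m + rng.normal(size=n)         # outcome:   T -> Y and M -> Y

def ols(columns, target):
    """Least-squares slopes (intercept fitted but dropped)."""
    design = np.column_stack([np.ones(len(target)), *columns])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef[1:]

(a_hat,) = ols([t], m)          # effect of treatment on mediator
c_hat, b_hat = ols([t, m], y)   # direct effect, and mediator effect on outcome
(total_hat,) = ols([t], y)      # total effect of treatment on outcome

direct_effect = c_hat
mediated_effect = a_hat * b_hat
print(direct_effect, mediated_effect, total_hat)
```

In this linear case the total effect equals the direct effect plus the mediated effect (here about 0.3 + 0.8 × 0.5 = 0.7). When the mediator is high-dimensional or only indirectly measured, this simple regression recipe no longer applies, which is the gap the proposed framework targets.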

Lastly, while causal reasoning is also highly valuable in environmental and spatial data analysis, it presents challenges due to the difficulty of observing all confounders and the potential for spatial interference, where the outcome for one spatial unit can be influenced by the treatment applied to a nearby unit. To address this, I propose a neural network (NN)-based framework integrated with an approximate Gaussian process (GP) to manage spatial interference and unobserved confounding. Additionally, I adopt a generalized propensity-score-based approach to address partially observed outcomes when estimating causal effects with continuous treatments. I then evaluate this framework using synthetic, semi-synthetic, and real-world data inferred from satellite imagery. The results demonstrate that NN-based models significantly outperform linear spatial regression models in estimating causal effects. Furthermore, in real-world case studies, NN-based models offer more reasonable predictions of causal effects, facilitating decision-making in relevant applications such as urban planning.

Subjects

Environmental science, Health sciences, Computer science, Causal Inference, Environmental Data Analysis, Healthcare Data Analysis, Machine Learning, Pattern Recognition

Citation

Jiang, Ziyang (2024). Making Model Aware: Pattern Recognition and Analysis in Environmental and Healthcare Data with Machine Learning Models. Dissertation, Duke University. Retrieved from https://hdl.handle.net/10161/32600.

Except where otherwise noted, student scholarship that was shared on DukeSpace after 2009 is made available to the public under a Creative Commons Attribution / Non-commercial / No derivatives (CC-BY-NC-ND) license. All rights in student work shared on DukeSpace before 2009 remain with the author and/or their designee, whose permission may be required for reuse.