Robust and Scalable Causal Analysis: Domain Alignment, Time-series Counterfactuals, Multi-Treatment & Multi-Outcome Causal Effects, and LLM-Integrated Causal Discovery
Date
2025
Authors
Liu, Yiling
Abstract
Modern decision-making in science and industry increasingly hinges on drawing reliable causal conclusions from complex, high-dimensional, and distributionally shifted observational data. However, prevailing methods often falter when: (i) sufficient overlap between differing groups (e.g., treated/control, source/target domains) is compromised by distribution shifts, a foundational challenge for valid generalization; (ii) confounders evolve dynamically in time-series settings, distorting sequential treatment effect estimation; (iii) many treatments/interventions and outcomes interact in complex ways, demanding scalable and holistic modeling; and (iv) Large Language Models (LLMs) applied to causal discovery handle observational data poorly: they frequently ignore it, and their reasoning can even be undermined when presented with statistical summaries of that data, degrading the accuracy of the inferred causal structures. This dissertation addresses these challenges by developing a cohesive suite of methodologies for Robust and Scalable Causal Inference.
To address these limitations, my doctoral research first establishes a rigorous theoretical understanding of sub-domain alignment in Unsupervised Domain Adaptation (UDA) to tackle insufficient overlap and distribution shifts (i). This work demonstrates that sub-domain-based methods optimize a stronger error bound and introduces an algorithmic adaptation for robustness under sub-domain marginal weight shifts, providing a foundational strategy for improving alignment across differing groups (e.g., treated/control, source/target domains). Building upon this sub-domain alignment theory, we then address Temporal Counterfactuals (ii) by proposing a novel framework that synergistically integrates (a) Sub-treatment Group Alignment (SGA), which uses iterative treatment-agnostic clustering to identify and align fine-grained sub-treatment groups for improved deconfounding, with (b) Random Temporal Masking (RTM) for enhanced temporal generalization. For Multi-Treatment and Multi-Outcome Causal Effects (iii), we introduce MOTTO, a Mixture-of-Experts-based model that explicitly learns inter-treatment and inter-outcome relationships and ensures robust counterfactual balancing by selectively aligning treatment-shared expert representations, demonstrating scalability in complex ecosystems. Finally, to overcome the identified LLM deficiencies and achieve effective LLM-Integrated Causal Discovery (iv), we develop CARE, a supervised fine-tuning framework that teaches LLMs to effectively utilize outputs from established causal discovery algorithms, enabling them to integrate their own knowledge with these external algorithmic clues and thereby significantly improve their causal reasoning from observational data. The methodologies developed in this thesis are substantiated through theoretical analyses and extensive empirical evaluations on synthetic datasets, diverse real-world benchmarks, and large-scale industrial applications.
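To make the SGA and RTM ideas above concrete, the following minimal Python sketch illustrates them on toy data. It is not the dissertation's implementation: the function names, tensor shapes, the single k-means clustering step, and the mean-difference alignment penalty are illustrative assumptions standing in for the actual iterative treatment-agnostic clustering and balancing objectives described in the thesis.

```python
# Minimal sketch (illustrative only) of two ingredients described above:
#   - Random Temporal Masking (RTM): randomly zero out whole time steps of each
#     sequence during training to encourage temporal generalization.
#   - Sub-treatment Group Alignment (SGA): cluster learned representations without
#     using the treatment label, then penalize the gap between treated and control
#     means within each cluster so fine-grained sub-groups are aligned.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

def random_temporal_mask(x, mask_prob=0.15):
    """x: (batch, time, features). Drop each time step with probability `mask_prob`."""
    keep = rng.random(x.shape[:2]) >= mask_prob          # (batch, time) boolean mask
    return x * keep[..., None]

def sub_group_alignment_penalty(reps, treatment, n_clusters=4):
    """reps: (batch, dim) representations; treatment: (batch,) binary labels.
    Clustering is treatment-agnostic: it only sees `reps`, never `treatment`."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(reps)
    penalty = 0.0
    for c in range(n_clusters):
        in_c = labels == c
        treated = reps[in_c & (treatment == 1)]
        control = reps[in_c & (treatment == 0)]
        if len(treated) and len(control):                # skip clusters missing a group
            penalty += np.linalg.norm(treated.mean(0) - control.mean(0))
    return penalty

# Toy usage: 64 sequences, 10 time steps, 8 features.
x = rng.normal(size=(64, 10, 8))
t = rng.integers(0, 2, size=64)
x_masked = random_temporal_mask(x)                       # RTM-style augmentation
reps = x_masked.mean(axis=1)                             # stand-in for a sequence encoder
print(sub_group_alignment_penalty(reps, t))
```

In a full training loop, the masking would presumably be applied as an augmentation at each step and the alignment penalty added to the factual outcome loss, with the clustering refreshed iteratively as the representations improve.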
The contributions of my dissertation advance robust and scalable causal inference. The theoretical framework for sub-domain alignment offers a stronger foundation for domain adaptation and, by extension, for improving overlap in causal settings, with the SGA method derived from this theory effectively deconfounding time-series data. MOTTO provides a scalable solution for complex, multi-faceted causal estimation, and CARE pioneers a new direction for enhancing LLM causal reasoning. Collectively, the developed methodologies, validated through rigorous analysis and experiments, offer effective solutions to pressing challenges in domain alignment, time-series counterfactual outcome estimation, multi-treatment & multi-outcome treatment effect estimation, and LLM-integrated causal discovery, paving the way for more reliable, insightful, and scalable data-driven decision-making in complex real-world environments.
Citation
Liu, Yiling (2025). Robust and Scalable Causal Analysis: Domain Alignment, Time-series Counterfactuals, Multi-Treatment & Multi-Outcome Causal Effects, and LLM-Integrated Causal Discovery. Dissertation, Duke University. Retrieved from https://hdl.handle.net/10161/33372.
Except where otherwise noted, student scholarship that was shared on DukeSpace after 2009 is made available to the public under a Creative Commons Attribution / Non-commercial / No derivatives (CC-BY-NC-ND) license. All rights in student work shared on DukeSpace before 2009 remain with the author and/or their designee, whose permission may be required for reuse.
