Causal Inference Theory and Methods for Design and Analysis of Observational Studies

Limited Access
This item is unavailable until:
2026-05-19

Date

2025

Abstract

This dissertation develops novel theoretical and methodological contributions to the design and analysis of observational studies in causal inference, addressing several longstanding challenges in the field. Specifically, it contributes to three major areas: (i) principal stratification analysis for handling noncompliance in time-to-event outcomes subject to censoring; (ii) estimation methodologies that integrate external data to improve the efficiency of clinical trials; and (iii) power analysis and sample size calculation for the design of clinical trials. These methodological advances are motivated by real-world applications in which existing approaches exhibit limitations. We validate all of these methods through extensive simulation studies and apply them to real data.

Randomized clinical trials are the gold standard for evaluating and comparing the effects of different treatments. However, post-randomization events, also known as intercurrent events, such as treatment noncompliance and censoring due to a terminal event, are common in clinical trials. Principal stratification is a framework for causal inference in the presence of intercurrent events. The existing literature on principal stratification lacks generally applicable and accessible methods for time-to-event outcomes. In Chapter 2, we focus on the noncompliance setting. We specify two causal estimands for time-to-event outcomes in principal stratification and provide a nonparametric identification formula. For estimation, we adopt the latent mixture modeling approach and illustrate the general strategy with a mixture of Bayesian parametric Weibull-Cox proportional hazards models for the outcome. We use the Stan programming language to obtain automatic posterior sampling of the model parameters. We provide analytical forms of the causal estimands as functions of the model parameters, and an alternative numerical method for cases in which analytical forms are not available. We apply the proposed method to the ADAPTABLE trial to evaluate the causal effect of taking 81 mg versus 325 mg aspirin on the risk of major adverse cardiovascular events. We develop the corresponding R package PStrata.
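As a rough illustration of what "analytical forms of the causal estimands as functions of the model parameters" and "an alternative numerical method" can look like, the sketch below evaluates a Weibull proportional hazards survival curve within a single principal stratum. The parameterization S(t) = exp(-exp(log_scale + log_hr) * t^shape), the function names, and the two summaries shown (a survival probability difference and a restricted mean survival time) are assumptions made for this example; the abstract does not state which estimands or parameterization the dissertation or the PStrata package uses.

```python
import math


def weibull_ph_survival(t, shape, log_scale, log_hr=0.0):
    """Survival probability at time t under a Weibull proportional hazards model.

    Assumed parameterization: S(t) = exp(-exp(log_scale + log_hr) * t**shape),
    where log_hr is the log hazard ratio for treatment (0.0 means control).
    """
    hazard_scale = math.exp(log_scale + log_hr)
    return math.exp(-hazard_scale * t ** shape)


def survival_difference(t, shape, log_scale, log_hr):
    """Closed-form difference in survival probability at time t, treated minus control."""
    treated = weibull_ph_survival(t, shape, log_scale, log_hr)
    control = weibull_ph_survival(t, shape, log_scale)
    return treated - control


def restricted_mean_survival(tau, shape, log_scale, log_hr=0.0, steps=2000):
    """Area under the survival curve on [0, tau], by the trapezoid rule.

    Numerical integration stands in for a closed form when none is convenient.
    """
    width = tau / steps
    values = [
        weibull_ph_survival(i * width, shape, log_scale, log_hr)
        for i in range(steps + 1)
    ]
    return width * (sum(values) - 0.5 * (values[0] + values[-1]))


# Hypothetical parameter values for one stratum (for example, compliers).
diff_at_two = survival_difference(t=2.0, shape=1.2, log_scale=-1.0, log_hr=-0.3)
rmst_gain = (
    restricted_mean_survival(5.0, 1.2, -1.0, log_hr=-0.3)
    - restricted_mean_survival(5.0, 1.2, -1.0)
)
```

In a Bayesian fit, such functions would be evaluated at each posterior draw of the parameters to obtain a posterior distribution for the estimand.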

It is increasingly common to augment a randomized controlled trial with external controls from observational data to evaluate the treatment effect of an intervention. Traditional approaches to treatment effect estimation involve ambiguous estimands and strong or unrealistic assumptions, such as mean exchangeability. In Chapter 3, we introduce a double-indexed notation for potential outcomes to define causal estimands transparently and clarify distinct sources of implicit bias. We show that the concurrent control arm is critical in assessing the plausibility of assumptions and providing unbiased causal estimation. We derive a consistent and locally efficient estimator for a class of weighted average treatment effect estimands that combines concurrent and external data without assuming mean exchangeability. This estimator incorporates an estimate of the systematic difference in outcomes between the concurrent and external units, which we propose to obtain with a Frisch-Waugh-Lovell style partial regression method. We compare the proposed methods with existing methods in extensive simulations and apply them to cardiovascular clinical trials.
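The partial regression idea can be sketched on toy data. By the Frisch-Waugh-Lovell theorem, the least-squares coefficient on an indicator in a regression that also includes covariates equals the coefficient from regressing the covariate-residualized outcome on the covariate-residualized indicator. The example below uses a single covariate, a hypothetical `external` indicator for control units drawn from the external source, and noiseless made-up outcomes; it illustrates the theorem only and is not the estimator derived in the dissertation.

```python
def residualize(values, covariate):
    """Residuals from least-squares regression of values on an intercept and one covariate."""
    n = len(values)
    mean_v = sum(values) / n
    mean_c = sum(covariate) / n
    sxx = sum((c - mean_c) ** 2 for c in covariate)
    sxy = sum((c - mean_c) * (v - mean_v) for c, v in zip(covariate, values))
    slope = sxy / sxx
    return [(v - mean_v) - slope * (c - mean_c) for c, v in zip(covariate, values)]


def partial_regression_coefficient(outcome, indicator, covariate):
    """Coefficient on indicator after partialling the covariate out of both variables."""
    outcome_resid = residualize(outcome, covariate)
    indicator_resid = residualize(indicator, covariate)
    numerator = sum(d * y for d, y in zip(indicator_resid, outcome_resid))
    denominator = sum(d * d for d in indicator_resid)
    return numerator / denominator


# Toy control-arm data (all values are made up for illustration).
x = [0.1, 0.4, 0.5, 0.9, 1.2, 1.5, 1.7, 2.0]
external = [0, 1, 0, 1, 1, 0, 0, 1]  # 1 = external control, 0 = concurrent control
# Outcomes built with a systematic shift of 0.5 for external units and no noise.
y = [1.0 + 2.0 * xi + 0.5 * di for xi, di in zip(x, external)]

shift = partial_regression_coefficient(y, external, x)
```

Because the toy outcomes are an exact linear function of the covariate and the indicator, `shift` recovers the built-in difference of 0.5 up to floating-point error.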

Sample size calculation is of great importance in trial design. For randomized trials, sample size calculation follows directly from the two-sample t-test; for observational studies, however, it remains unclear what information is necessary for accurate calculation of sample size, and current approaches are often ad hoc. In Chapter 4, we develop theoretically justified analytical formulas for sample size and power calculation in the propensity score analysis of causal inference using observational data. By analyzing the variance of the inverse probability weighting estimator of the average treatment effect (ATE), we clarify the three key components for sample size calculations: the propensity score distribution, the potential outcome distribution, and their correlation. We devise analytical procedures to identify these components based on commonly available and interpretable summary statistics. We elucidate the critical role of covariate overlap between treatment groups in determining the sample size. In particular, we propose to use the Bhattacharyya coefficient as a measure of covariate overlap, which, together with the treatment proportion, leads to a uniquely identifiable and easily computable propensity score distribution. The proposed method is applicable to both continuous and binary outcomes. We show that the standard two-sample z-test and variance inflation factor methods often lead to inaccurate, sometimes vastly inaccurate, sample size estimates, especially with limited overlap. We also derive formulas for the average treatment effect for the treated (ATT) and the average treatment effect for the overlap population (ATO) estimands. We provide simulated and real examples to illustrate the proposed method. We develop an associated R package PSpower.
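For orientation, the sketch below computes two of the ingredients mentioned above in their textbook forms: the Bhattacharyya coefficient between two univariate normal densities, and the per-group sample size from the standard two-sample z-test with equal allocation. The Gaussian assumption, the function names, and the example inputs are choices made for this illustration. This is the baseline calculation that the abstract reports as often inaccurate for observational studies, not the formula implemented in PSpower.

```python
import math
from statistics import NormalDist


def bhattacharyya_coefficient_normal(mean_1, sd_1, mean_2, sd_2):
    """Bhattacharyya coefficient between N(mean_1, sd_1^2) and N(mean_2, sd_2^2).

    The coefficient is the integral of sqrt(p * q); it equals 1 for identical
    densities and decreases toward 0 as the overlap shrinks.
    """
    var_sum = sd_1 ** 2 + sd_2 ** 2
    distance = (mean_1 - mean_2) ** 2 / (4.0 * var_sum) + 0.5 * math.log(
        var_sum / (2.0 * sd_1 * sd_2)
    )
    return math.exp(-distance)


def z_test_sample_size_per_group(effect, sd, alpha=0.05, power=0.8):
    """Per-group sample size for a two-sided, two-sample z-test with equal allocation."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2.0 * (sd * (z_alpha + z_power) / effect) ** 2)


# Hypothetical inputs: unit-variance covariate distributions two SDs apart,
# and a standardized effect of 0.5.
overlap = bhattacharyya_coefficient_normal(0.0, 1.0, 2.0, 1.0)  # exp(-0.5), about 0.607
naive_n = z_test_sample_size_per_group(effect=0.5, sd=1.0)  # 63 per group
```

The z-test calculation takes no overlap input at all, which is one way to see why it can be badly off when the treatment groups overlap poorly.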

These methods fill methodological gaps and address longstanding challenges in several areas of causal inference with observational studies. In addition, we provide R packages with user-friendly interfaces to facilitate the application of these methods in clinical research.

Subjects

Statistics, Mathematics

Citation

LIU, BO (2025). Causal Inference Theory and Methods for Design and Analysis of Observational Studies. Dissertation, Duke University. Retrieved from https://hdl.handle.net/10161/32708.

Except where otherwise noted, student scholarship that was shared on DukeSpace after 2009 is made available to the public under a Creative Commons Attribution / Non-commercial / No derivatives (CC-BY-NC-ND) license. All rights in student work shared on DukeSpace before 2009 remain with the author and/or their designee, whose permission may be required for reuse.