Volfovsky, Alexander
Tierney, Graham
2023-06-08
2023
https://hdl.handle.net/10161/27701

The central theme of this dissertation is causal inference for complex data, highlighting how, for certain estimation problems, collecting more data has limited benefit. The primary application areas are natural language data and multivariate time series. For text, large language models are trained on predictive tasks that are not necessarily well suited to causal inference. Moreover, documents that vary in some treatment feature often also vary systematically in other, unknown ways that prevent attributing causal effects to the feature of interest. Multivariate time series, even with high-quality contemporaneous predictors, exhibit positive dependence, so that even with many treated and control units the amount of information available for estimating causal quantities is quite low.

Chapter 2 builds a model for short text, as is typically found on social media platforms. Chapter 3 analyzes a randomized experiment that paired Democrats and Republicans for a conversation about politics, then develops a sensitivity procedure to test for mediation effects attributable to the politeness of the conversation. Chapter 4 expands on the limitations of observational, model-based methods for causal inference with text and designs an experiment to assess how consequential those limitations are. Chapter 5 covers experimentation with multivariate time series.

The general conclusion from these chapters is that causal inference always requires untestable assumptions. A researcher trying to draw causal conclusions needs to understand the underlying structure of the problem under study in order to assess whether those assumptions hold. The work here shows how causal analysis can still be conducted when commonly made assumptions are violated.

Statistics
Bayesian analysis
Causal inference
Natural language processing
Time series

Causal Inference for Natural Language Data and Multivariate Time Series

Dissertation