Browsing by Author "Volfovsky, Alexander"
Item Open Access Adaptive Hyper-box Matching for Interpretable Individualized Treatment Effect Estimation. (CoRR, 2020) Morucci, Marco; Orlandi, Vittorio; Rudin, Cynthia; Roy, Sudeepa; Volfovsky, Alexander
We propose a matching method for observational data that matches units with others in unit-specific, hyper-box-shaped regions of the covariate space. These regions are large enough that many matches are created for each unit and small enough that the treatment effect is roughly constant throughout. The regions are found either as the solution to a mixed integer program or using a (fast) approximation algorithm. The result is an interpretable and tailored estimate of a causal effect for each unit.
Item Open Access An Investigation into the Bias and Variance of Almost Matching Exactly Methods (2021) Morucci, Marco
The development of interpretable causal estimation methods is a fundamental problem for high-stakes decision settings in which results must be explainable. Matching methods are highly explainable, but often lack the accuracy of black-box nonparametric models for causal effects. In this work, we propose to investigate theoretically the statistical bias and variance of Almost Matching Exactly (AME) methods for causal effect estimation. These methods aim to overcome the inaccuracy of matching by learning, on a separate training dataset, an optimal metric on which to match units. While these methods are both powerful and interpretable, we currently lack an understanding of their statistical properties. In this work we present a theoretical characterization of the finite-sample and asymptotic properties of AME. We show that AME with discrete data has bounded bias in finite samples, and is asymptotically normal and consistent at a root-n rate. Additionally, we show that AME methods for matching on networked data also have bounded bias and variance in finite samples, and achieve asymptotic consistency in sufficiently sparse graphs.
Our results can be used to motivate the construction of approximate confidence intervals around AME causal estimates, providing a way to quantify their uncertainty.
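As a rough illustration of the almost-matching-exactly idea for discrete covariates, the Python sketch below matches units exactly on as many covariates as possible, dropping the least important covariate whenever units remain unmatched. It is only a toy under stated assumptions, not the authors' implementation: the function names, the data layout, and the hand-fixed covariate ordering are hypothetical, whereas the abstract describes learning what to match on from a separate training dataset.

```python
from collections import defaultdict


def match_on(units, covariates):
    """Group units that agree exactly on `covariates`; keep groups with both arms."""
    groups = defaultdict(list)
    for u in units:
        groups[tuple(u["x"][c] for c in covariates)].append(u)
    return [g for g in groups.values()
            if any(u["treated"] for u in g) and not all(u["treated"] for u in g)]


def group_effect(group):
    """Difference in mean outcome between treated and control units of one group."""
    treated = [u["y"] for u in group if u["treated"]]
    control = [u["y"] for u in group if not u["treated"]]
    return sum(treated) / len(treated) - sum(control) / len(control)


def almost_exact_matching(units, covariates_by_importance):
    """Return {unit id: (covariates matched on, effect estimate of its group)}.

    `covariates_by_importance` is ordered from most to least important
    (an assumption of this sketch; it is fixed by hand, not learned).
    """
    covariates = list(covariates_by_importance)
    remaining = list(units)
    result = {}
    while covariates and remaining:
        for group in match_on(remaining, covariates):
            effect = group_effect(group)
            for u in group:
                result[u["id"]] = (tuple(covariates), effect)
        # Matched units leave the pool; the rest are retried on fewer covariates.
        remaining = [u for u in remaining if u["id"] not in result]
        covariates.pop()  # drop the least important covariate still in use
    return result


# Hypothetical toy data: two discrete covariates, a binary treatment, an outcome.
units = [
    {"id": 1, "x": {"age": "young", "smoker": "no"}, "treated": True, "y": 5.0},
    {"id": 2, "x": {"age": "young", "smoker": "no"}, "treated": False, "y": 3.0},
    {"id": 3, "x": {"age": "old", "smoker": "yes"}, "treated": True, "y": 9.0},
    {"id": 4, "x": {"age": "old", "smoker": "no"}, "treated": False, "y": 6.0},
    {"id": 5, "x": {"age": "young", "smoker": "yes"}, "treated": True, "y": 4.0},
]
matches = almost_exact_matching(units, ["age", "smoker"])
```

In this toy data, units 1 and 2 match on both covariates, units 3 and 4 match only after `smoker` is dropped, and unit 5 stays unmatched because no control unit shares its age group once units 1 and 2 have left the pool. Each matched group is directly readable, which is the interpretability argument made in the abstract.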
Item Open Access Causal Inference for Natural Language Data and Multivariate Time Series (2023) Tierney, Graham
The central theme of this dissertation is causal inference for complex data, with an emphasis on how, for certain estimation problems, collecting more data has limited benefit. The central application areas are natural language data and multivariate time series. For text, large language models are trained on predictive tasks not necessarily well-suited for causal inference. Moreover, documents that vary in some treatment feature will often also vary systematically in other, unknown ways that prohibit attribution of causal effects to the feature of interest. Multivariate time series, even with high-quality contemporaneous predictors, still exhibit positive dependencies such that even with many treated and control units, the amount of information available to estimate causal quantities is quite low.
Chapter 2 builds a model for short text, as is typically found on social media platforms. Chapter 3 analyzes a randomized experiment that paired Democrats and Republicans to have a conversation about politics, then develops a sensitivity procedure to test for mediation effects attributable to the politeness of the conversation. Chapter 4 expands on the limitations of observational, model-based methods for causal inference with text and designs an experiment to validate how significant those limitations are. Chapter 5 covers experimentation with multivariate time series.
The general conclusion from these chapters is that causal inference always requires untestable assumptions. A researcher trying to make causal conclusions needs to understand the underlying structure of the problem they are studying to validate whether those assumptions hold. The work here shows how to still conduct causal analysis when commonly made assumptions are violated.
Item Open Access Communities in Social Networks: Detection, Heterogeneity and Experimentation (2022) Mathews, Heather
The study of network data in the social and health sciences frequently concentrates on understanding how and why connections form. In particular, the task of determining latent mechanisms driving connection has received a lot of attention across statistics, machine learning, and information theory. In social networks, this mechanism often manifests as community structure. As a result, this work provides methods for discovering and leveraging these communities to better understand networks and the data they generate.
We provide three main contributions. First, we present methodology for performing community detection in challenging regimes. Existing literature has focused on modeling the spectral embedding of a network using Gaussian mixture models (GMMs) in scaling regimes where the ability to detect community memberships improves with the size of the network. However, these regimes are not very realistic. As such, we provide tractable methodology motivated by new theoretical results for networks with non-vanishing noise by using GMMs that incorporate truncation and shrinkage effects.
Further, when covariate information is available, often we want to understand how covariates impact connections. It is likely that the effects of covariates on edge formation differ between communities (e.g. age might play a different role in friendship formation in communities across a city). To address this issue, we introduce a latent space network model where coefficients associated with certain covariates can depend on latent community membership of the nodes. We show that ignoring such structure can lead to either over- or under-estimation of covariate importance to edge formation and propose a Markov Chain Monte Carlo approach for simultaneously learning the latent community structure and the community specific coefficients.
Finally, we consider how community structure can impact experimentation. It is evident that communities can act in different ways, and it is natural that this propagates into experimental design. This observation motivates our development of community-informed experimental design. This design recognizes that information between individuals likely flows along within-community edges rather than across-community edges. We demonstrate that this design improves estimation of the global average treatment effect, even when the community structure of the graph needs to be estimated.
Item Open Access Comparison of Bayesian Inference Methods for Probit Network Models (2021) Shen, YueMing
This thesis explores and compares Bayesian inference procedures for probit network models. Network data typically exhibit high dyadic correlation due to reciprocity. For binary network data, the presence of dyadic correlation often makes a basic implementation of Markov chain Monte Carlo (MCMC) inefficient. We first explore variational inference as a fast approximation to the posterior distribution. Aware of its insufficiency in quantifying posterior uncertainties, we propose an alternative MCMC algorithm which is more efficient and accurate. In particular, we propose to update the dyadic correlation parameter $\rho$ using the marginal likelihood, which does not condition on the latent relations $Z$. This reduces autocorrelations in the posterior samples of $\rho$ and hence improves mixing. A simulation study and real data examples are provided to compare the performance of these Bayesian inference methods.
Item Open Access Corrupted Data and the Illicit Arms Trade (2020-04) Graves, Rose
With the ever-increasing advancements in weapons technology, the illicit arms trade has steadily become a greater threat to international security. The small arms trade, consisting of portable weapons and their parts, is not only profitable, but also a method of gaining power through violent and threatening means. Being able to identify when and which countries are engaging in illicit arms trade is essential in order to make informed policy decisions. The driving question behind this project is: how do we recognize corrupted network data and how does corrupted network data impact our statistical analyses? The arms trade takes the form of network data consisting of actors (nodes) and the relationships between them (edges). This analysis of methods initially looks at simulated data. We show that if data are sampled from a pre-specified model, then increasing the amount of corrupt data present impacts posterior statistics such as the intercept and row and column coefficients, as well as posterior predictive descriptive statistics such as degree distributions, triangle counts, betweenness, and eigenvector centrality. This analysis demonstrates that if data are corrupt, replacing the corrupted values with NAs allows these missing values to be imputed from the true pre-specified model, so that they do not impact inference. These methods are then applied to actual small arms trade data, to see which nations may be engaging in illicit arms trading.
Item Open Access dame-flame: A Python Library Providing Fast Interpretable Matching for Causal Inference. (CoRR, 2021) Gupta, Neha R; Orlandi, Vittorio; Chang, Chia-Rui; Wang, Tianyu; Morucci, Marco; Dey, Pritam; Howell, Thomas J; Sun, Xian; Ghosal, Angikar; Roy, Sudeepa; Rudin, Cynthia; Volfovsky, Alexander
dame-flame is a Python package for performing matching for observational causal inference on datasets containing discrete covariates.
This package implements the Dynamic Almost Matching Exactly (DAME) and Fast Large-Scale Almost Matching Exactly (FLAME) algorithms, which match treatment and control units on subsets of the covariates. The resulting matched groups are interpretable, because the matches are made on covariates (rather than, for instance, propensity scores), and high-quality, because machine learning is used to determine which covariates are important to match on. DAME solves an optimization problem that matches units on as many covariates as possible, prioritizing matches on important covariates. FLAME approximates the solution found by DAME via a much faster backward feature selection procedure. The package provides several adjustable parameters to adapt the algorithms to specific applications, and can calculate treatment effects after matching. Descriptions of these parameters, details on estimating treatment effects, and further examples can be found in the documentation at https://almost-matching-exactly.github.io/DAME-FLAME-Python-Package/
Item Open Access Exposure to opposing views on social media can increase political polarization. (Proceedings of the National Academy of Sciences of the United States of America, 2018-09) Bail, Christopher A; Argyle, Lisa P; Brown, Taylor W; Bumpus, John P; Chen, Haohan; Hunzaker, MB Fallin; Lee, Jaemin; Mann, Marcus; Merhout, Friedolin; Volfovsky, Alexander
There is mounting concern that social media sites contribute to political polarization by creating "echo chambers" that insulate people from opposing views about current events. We surveyed a large sample of Democrats and Republicans who visit Twitter at least three times each week about a range of social policy issues.
One week later, we randomly assigned respondents to a treatment condition in which they were offered financial incentives to follow a Twitter bot for 1 month that exposed them to messages from those with opposing political ideologies (e.g., elected officials, opinion leaders, media organizations, and nonprofit groups). Respondents were resurveyed at the end of the month to measure the effect of this treatment, and at regular intervals throughout the study period to monitor treatment compliance. We find that Republicans who followed a liberal Twitter bot became substantially more conservative posttreatment. Democrats exhibited slight increases in liberal attitudes after following a conservative Twitter bot, although these effects are not statistically significant. Notwithstanding important limitations of our study, these findings have significant implications for the interdisciplinary literature on political polarization and the emerging field of computational social science.
Item Open Access Modeling Heterogeneity With Bayesian Additive Regression Trees (2023) Orlandi, Vittorio
This work focuses on using Bayesian Additive Regression Trees (BART), a flexible and computationally efficient regression method, to model heterogeneity in data. In particular, we focus on the closely related tasks of hierarchical modeling, latent variable modeling, and density regression. We begin by introducing BART in Chapter 2, presenting the prior, various extensions, and an in-depth case study using BART to analyze the impact of ABO-incompatible cardiac transplant on infants. Chapter 3 describes a methodological contribution, in which we use BART to model data structured within known groups by allowing for group-specific forests, each of which is only updated using units corresponding to that group. We further introduce an intercept forest common to all units and a hierarchical prior across the leaf variances in order to allow for sharing of information.
We find that such an approach yields more parsimonious models than other BART-based approaches in the literature, which in turn translates to better out-of-sample accuracy, at virtually no added computational cost. In Chapter 4, we consider models involving latent variables within BART. The original motivation is to extend the known-group approach in Chapter 3 to a setting where group information is unavailable. However, this idea lends itself well to many different analyses, including those involving continuous omitted or latent variables. Another application is a generalization of a BART-based approach to sensitivity analysis, in which we allow for the unobserved confounder to flexibly influence the outcome. The latent variable framework we consider is computationally efficient, can help BART model data much more accurately than if restricting oneself to observed covariates, and is widely applicable to many different settings. In Chapter 5, we study one such application in great detail: using BART for density regression. By integrating out the latent variable in our model, we can model conditional densities in a way that outperforms a variety of other approaches on simulated tasks, and also allows us to bound its posterior concentration rate. We hope that the tools we develop in this work are useful to practitioners seeking to model heterogeneity in their data and also serve as a foundation for future methodological advances.
Item Open Access Optimizing the Network Sampling With Memory Algorithm (2022) Le Barbenchon, Claire
Network Sampling with Memory (NSM), a novel sampling method that extends existing Respondent Driven Sampling (RDS) methods, is becoming increasingly attractive to sociologists, demographers, and others who want to sample hidden populations while attempting to explore the full network of the target population. This highlights the importance of improving and testing the method under various conditions. Since its introduction, three large-scale data collection efforts have employed NSM to efficiently sample from Chinese immigrant populations. Given increased interest in using these methods in the field, this thesis contributes to the literature on the Network Sampling with Memory algorithm in three key ways: (1) it provides a user-friendly version of the sampler for future researchers in the form of R functions that can be run easily in RStudio; (2) it tests the NSM sampler under different parameter configurations with two different target outcome variables, to help guide the practical researcher in selecting the optimal parameter configuration depending on the goals of the project; (3) it tests a directed NSM sampling algorithm which preferentially selects nodes that have certain characteristics. We show that different parameter configurations can improve estimation and lower design effects, depending on the outcome variable of interest. We also show that a directed sampler is feasible to implement, and that it can have low design effects at smaller sample sizes.
Item Open Access Stochastic Process Models on Dynamic Networks (2021) Bu, Fan
We present novel model frameworks and inference procedures for stochastic point processes on dynamic networks. The point process can be defined for a random phenomenon that spreads among the network nodes, and for the temporally evolving network itself. Methods development is motivated by the needs of health and social science data, where partial observations or latent structures are common and create challenges to likelihood-based inference. In this dissertation, we will discuss parameter estimation techniques that can handle these latent variables and make effective use of the complete data likelihood for efficient inference. We start with developing individualized continuous time Markov chain models for stochastic epidemics on a dynamic contact network. Data-augmentation algorithms are designed to address partial observations (such as missing infection and recovery times) in epidemic data while accommodating the network dynamics. We apply the frameworks to the study of non-pharmaceutical interventions in a college population. Next, we move on to study the higher-order latent structures of dynamic inter-personal interactions by combining a multi-resolution spatio-temporal stochastic process with a latent factor model for a dynamic social network. We apply it to analyzing basketball data where the discovered latent structure defines a metric that helps evaluate the quality of game play. Finally, we discuss extensions to a non-Markovian setting of self and mutually exciting point processes (Hawkes processes). We utilize the branching structure of the Hawkes processes to uncover the latent replying structure of a group conversation, which can be further employed to quantitatively measure individual social impact.
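To make the branching structure of a Hawkes process concrete, the Python sketch below simulates a univariate process through its cluster representation and records the parent of every event. In the group-conversation reading, a parent could be thought of as the message being replied to. It is a minimal illustration under assumptions not taken from the dissertation: an exponential triggering kernel, and the hypothetical parameter names `mu` (baseline rate), `alpha` (expected offspring per event, kept below 1) and `beta` (decay rate).

```python
import random


def poisson_count(rng, mean):
    """Draw a Poisson(mean) count by counting unit-rate arrivals before `mean`."""
    count, t = 0, rng.expovariate(1.0)
    while t < mean:
        count += 1
        t += rng.expovariate(1.0)
    return count


def simulate_hawkes(mu, alpha, beta, horizon, seed=0):
    """Simulate events on [0, horizon) as a list of (time, parent_time) pairs.

    Immigrant events arrive at constant rate `mu` and have parent None.
    Every event spawns Poisson(alpha) offspring, each delayed by an
    Exponential(beta) waiting time; offspring record their parent's time.
    """
    rng = random.Random(seed)
    pending = []
    t = rng.expovariate(mu)
    while t < horizon:  # immigrants: a homogeneous Poisson process
        pending.append((t, None))
        t += rng.expovariate(mu)

    events = []
    while pending:  # expand every event into its offspring
        time, parent = pending.pop()
        events.append((time, parent))
        for _ in range(poisson_count(rng, alpha)):
            child_time = time + rng.expovariate(beta)
            if child_time < horizon:
                pending.append((child_time, time))

    events.sort(key=lambda e: e[0])  # chronological order
    return events


events = simulate_hawkes(mu=0.5, alpha=0.6, beta=2.0, horizon=50.0, seed=1)
```

In real data the parent of each event is latent, and recovering it is the inference problem the abstract refers to. The simulation only shows what that hidden structure looks like when it is known.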