Browsing by Author "Roy, Sudeepa"
Item Open Access
Adaptive Hyper-box Matching for Interpretable Individualized Treatment Effect Estimation (CoRR, 2020) Morucci, Marco; Orlandi, Vittorio; Rudin, Cynthia; Roy, Sudeepa; Volfovsky, Alexander
We propose a matching method for observational data that matches units with others in unit-specific, hyper-box-shaped regions of the covariate space. These regions are large enough that many matches are created for each unit and small enough that the treatment effect is roughly constant throughout. The regions are found either as the solution to a mixed integer program or via a fast approximation algorithm. The result is an interpretable and tailored estimate of a causal effect for each unit.

Item Open Access
Causal Inference for High-Stakes Decisions (2023) Parikh, Harsh J
Causal inference methods are commonly used across domains to aid high-stakes decision-making. The validity of causal studies often relies on strong assumptions that might not be realistic in high-stakes scenarios. Inferences based on incorrect assumptions frequently result in suboptimal decisions with high penalties and long-term consequences. Unlike prediction or machine learning methods, it is particularly challenging to evaluate the performance of causal methods using just the observed data, because the ground-truth causal effects are missing for all units. My research presents frameworks that enable validation of causal inference methods in one of three ways: (i) auditing the estimation procedure with a domain expert, (ii) studying performance using synthetic data, and (iii) using placebo tests to identify biases. This work enables decision-makers to reason about the validity of the estimation procedure by thinking carefully about the underlying assumptions. Our Learning-to-Match framework is an auditable and accurate approach that learns an optimal distance metric for estimating heterogeneous treatment effects.
We augment the Learning-to-Match framework with pharmacological mechanistic knowledge to study the long-term effects of untreated seizure-like brain activity in critically ill patients. Here, the auditability of the estimator allowed neurologists to qualitatively validate the analysis via chart review. We also propose Credence, a synthetic-data-based framework for validating causal inference methods. Credence simulates data that is stochastically indistinguishable from the observed data while allowing for user-designed treatment effects and selection biases. We demonstrate Credence's ability to accurately assess the relative performance of causal estimation techniques in an extensive simulation study and two real-world data applications. Finally, we discuss an approach that combines experimental and observational studies, providing a principled way to test for violations of the no-unobserved-confounder assumption and to estimate treatment effects under such violations.
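The synthetic-data validation idea behind Credence can be illustrated with a minimal sketch (this is not the Credence implementation; the data-generating process and the two estimators below are hypothetical stand-ins): generate data with a user-chosen treatment effect and a known selection bias, run candidate estimators, and score each against the known ground truth.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n=5000, true_effect=2.0):
    """Synthetic data with a user-designed treatment effect and
    selection bias (treatment probability depends on covariate x)."""
    x = rng.normal(size=n)
    p_treat = 1 / (1 + np.exp(-x))          # selection bias
    t = rng.binomial(1, p_treat)
    y = 1.5 * x + true_effect * t + rng.normal(size=n)
    return x, t, y

def naive_estimator(x, t, y):
    """Difference in means -- biased under selection on x."""
    return y[t == 1].mean() - y[t == 0].mean()

def adjusted_estimator(x, t, y):
    """OLS adjusting for x -- consistent under this design."""
    X = np.column_stack([np.ones_like(x), x, t])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[2]

x, t, y = simulate(true_effect=2.0)
print("naive error:   ", abs(naive_estimator(x, t, y) - 2.0))
print("adjusted error:", abs(adjusted_estimator(x, t, y) - 2.0))
```

Because the true effect is designed in, the two estimators' errors can be compared directly, which is the relative-performance assessment the abstract describes.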
Item Open Access
dame-flame: A Python Library Providing Fast Interpretable Matching for Causal Inference (CoRR, 2021) Gupta, Neha R; Orlandi, Vittorio; Chang, Chia-Rui; Wang, Tianyu; Morucci, Marco; Dey, Pritam; Howell, Thomas J; Sun, Xian; Ghosal, Angikar; Roy, Sudeepa; Rudin, Cynthia; Volfovsky, Alexander
dame-flame is a Python package for performing matching for observational causal inference on datasets containing discrete covariates. This package implements the Dynamic Almost Matching Exactly (DAME) and Fast Large-Scale Almost Matching Exactly (FLAME) algorithms, which match treatment and control units on subsets of the covariates. The resulting matched groups are interpretable, because the matches are made on covariates (rather than, for instance, propensity scores), and high-quality, because machine learning is used to determine which covariates are important to match on. DAME solves an optimization problem that matches units on as many covariates as possible, prioritizing matches on important covariates. FLAME approximates the solution found by DAME via a much faster backward feature selection procedure. The package provides several adjustable parameters to adapt the algorithms to specific applications, and it can calculate treatment effects after matching. Descriptions of these parameters, details on estimating treatment effects, and further examples can be found in the documentation at https://almost-matching-exactly.github.io/DAME-FLAME-Python-Package/

Item Open Access
Detecting and Reducing Resource Interferences in Data Analytics Frameworks (2019) Kalmegh, Prajakta
Data analytics frameworks on shared clusters host a large number of diverse workloads submitted by multiple tenants. Modern cluster schedulers incentivize users to share cluster resources by promising fairness and isolation along with high performance and high resource utilization.
Nevertheless, these guarantees are hard to meet: resource contention among collocated workloads causes significant performance issues and is one of the key reasons for unpredictable performance and missed workload Service-Level Agreements (SLAs) in data analytics frameworks. Despite meticulous measures to prevent interference, contention for unmanaged resources still prevails and causes undue waiting times for queries. It is highly important to safeguard the progress of data analytics queries from such variable impacts caused by ad-hoc jobs. In general, for any dataflow-based execution of queries on these frameworks, analyzing inter-query resource interactions is critical to answering questions like "who is creating resource conflicts for my query?" Today, cluster schedulers face the challenge of accurately detecting such concurrency-related slowdowns and acting on them in a timely manner to reduce multi-resource interference in an online workload.
We present a novel approach to detecting and reducing resource contention in an online workload that helps system and database administrators narrow down and regulate the many possible causes of query performance degradation. It uses data from historical executions, runtime cluster metrics, and heuristics applicable to a pipelined execution model to build a robust and scalable solution. Our solution (i) models the resource conflicts between concurrent queries using a multi-level directed acyclic graph and, from the resource-blocked times faced by a query, derives a blame attribution metric that helps users identify concurrency-related problems in the execution of a query; and (ii) feeds pairwise task disutility preferences to a contention-aware scheduler that creates fair and stable task co-locations on the cluster to minimize resource waiting times. We have also developed a novel dynamic resource re-allocation scheme that detects the point in a query's execution timeline after which it is vulnerable to impacts from concurrency problems, and safeguards the query's progress thereafter. We evaluate our system on a wide range of workloads using TPC-DS benchmark queries and show that our approach to contention analysis and blame attribution is substantially more accurate than other methods based on the overlap time between concurrent queries. We also demonstrate that it significantly reduces scheduling queue wait times and resource-blocked times of queries and provides more predictable query execution with improved performance. Our solution outperforms dataflow-agnostic and contention-oblivious schedulers while improving cluster utilization.
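The blame-attribution idea can be sketched in a deliberately simplified form (the proportional-overlap rule and the interval layout here are hypothetical simplifications, not the dissertation's actual multi-level DAG model): a query's resource-blocked time is split among the concurrent queries holding the contended resource, in proportion to how long each one overlapped the blocked interval.

```python
from dataclasses import dataclass

@dataclass
class Interval:
    query: str
    start: float   # seconds
    end: float

def overlap(a: Interval, b: Interval) -> float:
    """Length of the time overlap between two intervals."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))

def blame(blocked: Interval, holders: list[Interval]) -> dict[str, float]:
    """Split the interval in which `blocked.query` waited on a resource
    among the queries holding that resource, proportionally to overlap."""
    weights = {h.query: overlap(blocked, h) for h in holders}
    total = sum(weights.values())
    if total == 0:
        return {q: 0.0 for q in weights}
    duration = blocked.end - blocked.start
    return {q: duration * w / total for q, w in weights.items()}

# Query Q1 was blocked on disk I/O for 10 s; Q2 and Q3 held the disk.
blocked = Interval("Q1", 0.0, 10.0)
holders = [Interval("Q2", 0.0, 8.0), Interval("Q3", 6.0, 10.0)]
print(blame(blocked, holders))   # Q2 is blamed for 8/12, Q3 for 4/12 of the 10 s
```

A metric of this shape directly answers "who is creating resource conflicts for my query?" by ranking concurrent queries by attributed blame.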
Item Open Access
Interpretable Almost-Matching Exactly with Instrumental Variables (2019) Liu, Yameng
We aim to create the highest possible quality of treatment-control matches for categorical data in the potential outcomes framework.
The method proposed in this work matches units on a weighted Hamming distance, taking into account the relative importance of the covariates. To match units on as many relevant variables as possible, the algorithm creates a hierarchy of covariate combinations on which to match (similar to downward closure), in the process solving an optimization problem for each unit in order to construct the optimal matches. The algorithm uses a single dynamic program to solve all of the units' optimization problems simultaneously. Notable advantages of our method over existing matching procedures are its high-quality interpretable matches, its versatility in handling different data distributions that may have irrelevant variables, and its ability to handle missing data by matching on as many available covariates as possible. We also adapt the matching framework, using instrumental variables (IV), to the presence of observed categorical confounding that breaks the randomization assumption, and propose an approximate algorithm that quickly generates high-quality interpretable solutions. We show that our algorithms construct better matches than other existing methods on simulated datasets and produce interesting results in applications to crime intervention and political canvassing.
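The weighted Hamming distance at the core of the method above can be sketched in a few lines (a minimal illustration; the weights here are fixed by hand, whereas in the dissertation covariate importance is learned, and the actual algorithm solves the matching via a dynamic program rather than brute-force search):

```python
import numpy as np

def weighted_hamming(u, v, w):
    """Weighted Hamming distance between two categorical covariate
    vectors: the sum of weights on coordinates where they differ."""
    u, v, w = np.asarray(u), np.asarray(v), np.asarray(w)
    return float(np.sum(w * (u != v)))

def best_matches(unit, pool, w):
    """Indices of the pool units closest to `unit` under the
    weighted Hamming distance (brute force, for illustration)."""
    dists = [weighted_hamming(unit, c, w) for c in pool]
    m = min(dists)
    return [i for i, d in enumerate(dists) if d == m]

# Hypothetical importance weights: the first covariate matters most.
w = [5.0, 3.0, 1.0]
treated_unit = [1, 0, 2]
controls = [[1, 0, 0],   # differs only on the least important covariate
            [0, 0, 2],   # differs on the most important covariate
            [1, 1, 2]]   # differs on the middle covariate
print(best_matches(treated_unit, controls, w))  # -> [0]
```

The weighting is what makes a match that disagrees only on an unimportant covariate preferable to one that disagrees on an important one.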
Item Open Access
Simplifying Human-in-the-loop Data Science Pipeline: Explanations, Debugging, and Data Preparation (2022) Miao, Zhengjie
Data science has been reshaping almost every field in the past decade. The emerging data science pipeline is empowering a broad range of users with varying programming experience to gain insights from raw data. Meanwhile, bottlenecks remain in crucial components of the data science pipeline, preventing users from manipulating, analyzing, and understanding their data easily. The goal of this dissertation is to reduce the user's burden in the data science pipeline. Specifically, we present approaches that assist users by simplifying critical steps of the pipeline: (i) by helping users write correct database queries, (ii) by providing explanations for unexpected aggregate query results, and (iii) by reducing the training-data requirement for building machine learning models for data preparation.
For (i), we developed systems that find small counterexamples pointing out user query errors and allow users to trace how the query executes, thereby helping users fix wrong queries. For (ii), we developed systems that explain surprising aggregation outcomes using contextual information that is not captured in data provenance. For (iii), we showed how to leverage limited training examples to generate new ones, thus reducing the burden on human users in building machine-learning-based data preparation solutions. In experiments for performance evaluation, our query debugging tools provide explanations for wrong queries at interactive speeds (under 200 ms on average), our systems for explaining query results scale better than baseline approaches, and our data augmentation approach outperforms state-of-the-art entity matching and data cleaning systems in low-resource settings (with only hundreds of labels available). Our qualitative evaluation and user studies show that our query debugging tools are effective in helping users spot and understand bugs in database queries, and that the explanations produced by our systems are more meaningful than those of existing approaches. The work in this dissertation has practical impact and is in real use: our tools for debugging database queries have been used by students in undergraduate and graduate database courses at Duke over the past several years.
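The counterexample idea from (i) can be illustrated with a minimal sketch (the table and queries are hypothetical examples, not taken from the systems described above): given an intended query and a buggy variant, a small database instance on which their results differ pinpoints the bug.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE enrolled (student TEXT, course TEXT);
    -- A tiny counterexample instance: one student in two courses.
    INSERT INTO enrolled VALUES ('alice', 'db'), ('alice', 'ml');
""")

# Intended: how many distinct students are enrolled in some course?
intended = "SELECT COUNT(DISTINCT student) FROM enrolled"
# Buggy: the student forgot DISTINCT, so rows are double-counted.
buggy = "SELECT COUNT(student) FROM enrolled"

a = conn.execute(intended).fetchone()[0]
b = conn.execute(buggy).fetchone()[0]
print(a, b)   # 1 2 -- the two-row instance exposes the missing DISTINCT
```

On a large real dataset the two counts would also differ, but a two-row counterexample like this makes the cause of the discrepancy immediately readable, which is the point of counterexample-based debugging.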