Simplifying Human-in-the-loop Data Science Pipeline: Explanations, Debugging, and Data Preparation

Date

2022

Abstract

Data science has reshaped almost every field in the past decade. The emerging data science pipeline empowers a broad range of users with varying levels of programming experience to gain insights from raw data. Meanwhile, bottlenecks remain in crucial components of the pipeline, preventing users from easily manipulating, analyzing, and understanding their data. The goal of this dissertation is to reduce the user burden in the data science pipeline. Specifically, we present approaches that assist users by simplifying critical steps of the pipeline: (i) helping users write correct database queries, (ii) providing explanations for unexpected aggregate query results, and (iii) reducing the training data required to build machine learning models for data preparation.

For (i), we developed systems that find small counterexamples pointing out user query errors and allow users to trace how the query executes, thereby helping users fix wrong queries. For (ii), we developed systems that explain surprising aggregation outcomes using contextual information that is not captured in data provenance. For (iii), we showed how to leverage limited training examples to generate new ones, thus reducing the burden on human users in building machine-learning-based data preparation solutions. In performance experiments, our query debugging tools provide explanations for wrong queries at interactive speeds ($<200$ ms on average), our systems for explaining query results scale better than baseline approaches, and our data augmentation approach outperforms state-of-the-art entity matching and data cleaning systems in low-resource settings (with only hundreds of labels available). Our qualitative evaluation and user studies show that our query debugging tools are effective in helping users spot and understand bugs in database queries, and that the explanations produced by our systems are more meaningful than those of existing approaches. The work in this dissertation has had practical impact and is in real use: our tools for debugging database queries have been used by students in undergraduate and graduate database courses at Duke for the past several years.

Citation

Miao, Zhengjie (2022). Simplifying Human-in-the-loop Data Science Pipeline: Explanations, Debugging, and Data Preparation. Dissertation, Duke University. Retrieved from https://hdl.handle.net/10161/26796.

Except where otherwise noted, student scholarship that was shared on DukeSpace after 2009 is made available to the public under a Creative Commons Attribution / Non-commercial / No derivatives (CC-BY-NC-ND) license. All rights in student work shared on DukeSpace before 2009 remain with the author and/or their designee, whose permission may be required for reuse.