Synthesizing Linked Data and Detecting Per-Query Gaps Under Differential Privacy
Abstract
The abundance of data, often containing private and sensitive information, coupled with an ever-growing interest in using that data for research and for driving business value at scale, has raised concerns about privacy protection. Formal policies make access to such data heavily regulated, and users often wait months or years before they can even begin analyzing the data to determine whether it fits their tasks. Synthetic data generation has recently emerged as an alternative. A synthetic dataset is a collection of records designed to capture the structural and statistical properties of the sensitive dataset while ensuring that private properties of individuals are not revealed. Synthetic data generation, especially under Differential Privacy, is gaining popularity as a way to produce copies of sensitive data for testing, benchmarking, data analysis, and model training while preserving privacy.
This dissertation tackles two challenges in the field of synthetic data generation: (1) generating synthetic data for multi-relational data while preserving the links between relations, subject to cardinality constraints (CCs) and integrity constraints (ICs), and (2) estimating per-query error, so that a user can know whether their queries are answered reliably on the synthetic data. More specifically, the former problem pertains to imputing a missing foreign key column, FK, in a relation with respect to CCs on the join view and ICs that restrict which tuples can share an FK value. We give a framework based on declarative CCs and ICs, prove that the problem is NP-hard, and propose a novel two-phase solution that satisfies the ICs. Next, we study how to judge the quality of aggregate statistics on synthetic data, given that standard data generation systems do not provide "per-query" guarantees. We present a novel framework, DP-PQD, to privately detect whether query answers on the private and synthetic data are within a user-specified threshold of each other. We introduce the notion of effectiveness to capture the range of thresholds for which a per-query decider is expected to do well. We give a suite of private algorithms, analyze their properties, and evaluate them experimentally. We illustrate and explain why accuracy may not be monotonic in the privacy budget.
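To make the per-query idea concrete, the following is a minimal illustrative sketch in Python of a noisy threshold check, assuming a counting query with sensitivity 1 and the Laplace mechanism; it is not the dissertation's DP-PQD algorithm, and the function and parameter names are hypothetical.

```python
# Hypothetical sketch: privately decide whether a query answer on the private
# data is within a user-specified threshold of the answer on the synthetic data.
# This is NOT the DP-PQD algorithm from the dissertation; it only illustrates
# the general idea of a per-query gap decider under differential privacy.
import numpy as np

def noisy_gap_decider(q_private, q_synthetic, threshold, epsilon, sensitivity=1.0):
    """Return True if the noisy gap |q_private - q_synthetic| is within `threshold`.

    Laplace noise with scale `sensitivity / epsilon` is added to the gap, so the
    decision satisfies epsilon-differential privacy with respect to the private
    data, assuming the query has the stated sensitivity and the synthetic answer
    is treated as public (already released).
    """
    gap = abs(q_private - q_synthetic)
    noisy_gap = gap + np.random.laplace(scale=sensitivity / epsilon)
    return noisy_gap <= threshold

# Example usage with made-up numbers: a counting query answered on both datasets.
# decision = noisy_gap_decider(q_private=1042, q_synthetic=1010,
#                              threshold=50, epsilon=0.5)
```

The added noise means the decision can be wrong when the true gap is close to the threshold, which is why the dissertation introduces effectiveness to characterize the range of thresholds for which such a decider can be expected to do well.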
Citation
Patwa, Shweta Jayant (2023). Synthesizing Linked Data and Detecting Per-Query Gaps Under Differential Privacy. Dissertation, Duke University. Retrieved from https://hdl.handle.net/10161/29155.
Except where otherwise noted, student scholarship that was shared on DukeSpace after 2009 is made available to the public under a Creative Commons Attribution / Non-commercial / No derivatives (CC-BY-NC-ND) license. All rights in student work shared on DukeSpace before 2009 remain with the author and/or their designee, whose permission may be required for reuse.