Synthesizing Linked Data and Detecting Per-Query Gaps Under Differential Privacy
Abstract
The abundance of data, often containing private and sensitive information, coupled with an ever-growing interest in using that data for research and for driving business value at scale, has raised concerns about privacy protection. Formal policies make access to such data heavily regulated, and users often wait months or years before they can even begin analyzing the data to determine whether it fits their tasks. Synthetic data generation has recently emerged as an alternative. A synthetic dataset is a collection of records designed to capture the structural and statistical properties of the sensitive dataset while ensuring that private properties of individuals are not revealed. Synthetic data generation, especially under Differential Privacy, is gaining popularity as a way to produce copies of sensitive data for testing, benchmarking, data analysis, and model training while preserving privacy.
This dissertation tackles two challenges in the field of synthetic data generation: (1) generating synthetic data for multi-relational data while preserving the links between relations, subject to cardinality constraints (CCs) and integrity constraints (ICs), and (2) estimating per-query error, so that a user can know whether their queries are answered reliably on the synthetic data. More specifically, the former problem pertains to imputing a missing foreign key column, FK, in a relation with respect to CCs on the join view and ICs that restrict which tuples can share an FK value. We give a framework based on declarative CCs and ICs, prove that the problem is NP-hard, and propose a novel two-phase solution that satisfies the ICs. Next, we study how to judge the quality of aggregate statistics on synthetic data, given that standard data generation systems do not provide "per-query" guarantees. We present a novel framework, DP-PQD, to privately detect whether query answers on the private and synthetic data are within a user-specified threshold of each other. We introduce the notion of effectiveness to capture the range of thresholds for which a per-query decider is expected to do well. We give a suite of private algorithms, analyze their properties, and evaluate them experimentally. We illustrate and explain why accuracy may not be monotonic in the privacy budget.
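To make the per-query idea concrete, the following is a minimal illustrative sketch in Python of a noisy threshold check, assuming a counting query with sensitivity 1 and the Laplace mechanism; it is not the dissertation's DP-PQD algorithm, and the function and parameter names are hypothetical.

```python
# Hypothetical sketch: privately decide whether a query answer on the private
# data is within a user-specified threshold of the answer on the synthetic data.
# This is NOT the DP-PQD algorithm from the dissertation; it only illustrates
# the general idea of a per-query gap decider under differential privacy.
import numpy as np

def noisy_gap_decider(q_private, q_synthetic, threshold, epsilon, sensitivity=1.0):
    """Return True if the noisy gap |q_private - q_synthetic| is within `threshold`.

    Laplace noise with scale `sensitivity / epsilon` is added to the gap, so the
    decision satisfies epsilon-differential privacy with respect to the private
    data, assuming the query has the stated sensitivity and the synthetic answer
    is treated as public (already released).
    """
    gap = abs(q_private - q_synthetic)
    noisy_gap = gap + np.random.laplace(scale=sensitivity / epsilon)
    return noisy_gap <= threshold

# Example usage with made-up numbers: a counting query answered on both datasets.
# decision = noisy_gap_decider(q_private=1042, q_synthetic=1010,
#                              threshold=50, epsilon=0.5)
```

The added noise means the decision can be wrong when the true gap is close to the threshold, which is why the dissertation introduces effectiveness to characterize the range of thresholds for which such a decider can be expected to do well.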
Citation
Patwa, Shweta Jayant (2023). Synthesizing Linked Data and Detecting Per-Query Gaps Under Differential Privacy. Dissertation, Duke University. Retrieved from https://hdl.handle.net/10161/29155.
Except where otherwise noted, student scholarship that was shared on DukeSpace after 2009 is made available to the public under a Creative Commons Attribution / Non-commercial / No derivatives (CC-BY-NC-ND) license. All rights in student work shared on DukeSpace before 2009 remain with the author and/or their designee, whose permission may be required for reuse.