Synthesizing Linked Data and Detecting Per-Query Gaps Under Differential Privacy

dc.contributor.advisor

Machanavajjhala, Ashwin

dc.contributor.author

Patwa, Shweta Jayant

dc.date.accessioned

2023-10-03T13:36:04Z

dc.date.issued

2023

dc.department

Computer Science

dc.description.abstract

The abundance of data, often containing private and sensitive information, coupled with an ever-growing interest in using the data for research and driving business value at scale has raised concerns about privacy protections. Formal policies have made access to such data heavily regulated, often resulting in users waiting months or years before they can even start analyzing the data to determine fit for their tasks. In the recent past, generation of synthetic data has emerged as an alternative. A synthetic dataset is a collection of records that is designed to capture structural and statistical properties in the sensitive dataset while ensuring that private properties of individuals are not revealed. Synthetic data generation, especially under Differential Privacy, is gaining popularity for making copies of sensitive data for testing, benchmarking, data analysis, training models and privacy preservation.

This dissertation tackles two challenges in the field of synthetic data generation: (1) generating synthetic data for multi-relational data while preserving the links between relations, subject to cardinality constraints (CCs) and integrity constraints (ICs), and (2) estimating per-query error, so a user can know if their queries are answered reliably on the synthetic data. More specifically, the former problem pertains to imputing a missing foreign key column, FK, in a relation with respect to CCs on the join view and ICs that restrict which tuples can share an FK value. We give a novel framework based on declarative CCs and ICs, prove NP-hardness and propose a novel two-phase solution satisfying ICs. Next, we study how to judge the quality of aggregated statistics on synthetic data when it is known that the standard data generation systems do not provide “per-query" guarantees. We present a novel framework DP-PQD to privately detect if query answers on the private and synthetic data are within a user-specified threshold of each other. We introduce the notion of effectiveness to capture the range of thresholds for which a per-query decider is expected to do well. We give a suite of private algorithms, analyze their properties, and evaluate them experimentally. We illustrate and explain why accuracy may not be monotonic in the privacy budget.

dc.identifier.uri

https://hdl.handle.net/10161/29155

dc.subject

Computer science

dc.subject

Data Privacy and Security

dc.subject

Differential privacy

dc.subject

Information Integration and Data Quality

dc.subject

Performance Evaluation and Benchmarking

dc.subject

Provenance and Workflows

dc.title

Synthesizing Linked Data and Detecting Per-Query Gaps Under Differential Privacy

dc.type

Dissertation

duke.embargo.months

12

duke.embargo.release

2024-09-14T00:00:00Z

Files

Original bundle

Now showing 1 - 1 of 1
Loading...
Thumbnail Image
Name:
Patwa_duke_0066D_17540.pdf
Size:
2.05 MB
Format:
Adobe Portable Document Format

Collections