Synthesizing Linked Data and Detecting Per-Query Gaps Under Differential Privacy

Limited Access
The abundance of data, often containing private and sensitive information, together with ever-growing interest in using that data for research and for driving business value at scale, has raised concerns about privacy protection. Formal policies heavily regulate access to such data, and users often wait months or years before they can even begin analyzing a dataset to determine its fit for their tasks. Synthetic data generation has recently emerged as an alternative. A synthetic dataset is a collection of records designed to capture the structural and statistical properties of a sensitive dataset while ensuring that private properties of individuals are not revealed. Synthetic data generation, especially under Differential Privacy, is gaining popularity as a way to produce stand-ins for sensitive data for testing, benchmarking, data analysis, and model training while preserving privacy.

This dissertation tackles two challenges in synthetic data generation: (1) generating synthetic data for multi-relational schemas while preserving the links between relations, subject to cardinality constraints (CCs) and integrity constraints (ICs), and (2) estimating per-query error, so that a user can tell whether their queries are answered reliably on the synthetic data. More specifically, the first problem concerns imputing a missing foreign-key column, FK, in a relation, subject to CCs on the join view and ICs that restrict which tuples can share an FK value. We give a framework based on declarative CCs and ICs, prove the problem NP-hard, and propose a novel two-phase solution that satisfies the ICs. Next, we study how to judge the quality of aggregate statistics computed on synthetic data, given that standard data-generation systems do not provide "per-query" guarantees. We present DP-PQD, a novel framework for privately detecting whether query answers on the private and synthetic data are within a user-specified threshold of each other. We introduce the notion of effectiveness to capture the range of thresholds for which a per-query decider is expected to do well. We give a suite of private algorithms, analyze their properties, and evaluate them experimentally. We also illustrate and explain why accuracy may not be monotonic in the privacy budget.
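The per-query detection idea described above can be illustrated with a generic Laplace-mechanism sketch. This is only an illustration of the general technique, not the dissertation's actual DP-PQD algorithms; the function name `within_threshold` and its parameters are hypothetical, and a query sensitivity of 1 (e.g., a counting query) is assumed.

```python
import math
import random


def laplace_noise(scale: float) -> float:
    """Draw one sample from Laplace(0, scale) via inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)


def within_threshold(private_answer: float,
                     synthetic_answer: float,
                     threshold: float,
                     epsilon: float,
                     sensitivity: float = 1.0) -> bool:
    """Noisily decide whether the gap between the query answer on the
    private data and on the synthetic data is within a user-specified
    threshold, spending privacy budget epsilon.

    Only the private answer depends on sensitive records, so Laplace
    noise with scale sensitivity/epsilon is added to the gap before
    the comparison (a standard Laplace-mechanism construction)."""
    noisy_gap = abs(private_answer - synthetic_answer) \
        + laplace_noise(sensitivity / epsilon)
    return noisy_gap <= threshold
```

Because the decision is taken on a noisy gap, a smaller epsilon (more noise) makes the decider less reliable near the threshold, which hints at why accuracy need not improve monotonically as the privacy budget is varied.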

Patwa, Shweta Jayant (2023). Synthesizing Linked Data and Detecting Per-Query Gaps Under Differential Privacy. Dissertation, Duke University. Retrieved from


Duke's student scholarship is made available to the public under a Creative Commons Attribution-NonCommercial-NoDerivatives (CC BY-NC-ND) license.