Browsing by Author "Machanavajjhala, Ashwin"
Item Open Access Answering and Explaining SQL Queries Privately (2022) Tao, Yuchao
Data privacy has received increasing attention in recent years. While large-scale personal information is collected for scientific research and commercial products, a privacy breach is not an acceptable trade-off. In the last decade, differential privacy has become the gold standard for protecting data privacy and has been adopted by many organizations. Past work focused on building differentially private SQL query answering systems as building blocks for wider applications. However, answering counting queries with joins under differential privacy remains a challenge: the join operator allows a single user to have an unbounded impact on the query result, which makes it difficult for differential privacy to hide the existence of that user. On the other hand, introducing differential privacy into query answering also prevents users from interpreting query results correctly, since they must distinguish the effect of the privacy noise from the contribution of the data.
In this thesis, we study two problems: answering and explaining SQL queries privately. First, we present efficient algorithms to compute the local sensitivities of counting queries with joins, an important prerequisite for answering these queries under differential privacy. We track the sensitivity contributed by each tuple and, based on this, propose a truncation mechanism that answers counting queries with joins privately and with high utility. Second, we propose DPXPlain, a three-phase framework that allows users to obtain explanations for group-by COUNT/SUM/AVG query results while preserving differential privacy. We utilize confidence intervals to help users understand the uncertainty that differential privacy introduces into query results, and further provide top-k explanations under differential privacy to explain the contribution of the data to the results.
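As a rough illustration of the truncation idea, here is a minimal sketch (not the thesis's algorithm; the function name and interface are invented, and it assumes each tuple's join contribution is independent, e.g. no self-joins):

```python
import numpy as np

def truncated_join_count(contributions, tau, epsilon, rng=None):
    """Sketch: answer a join counting query under DP via truncation.

    contributions[i] is the number of join results that base-table tuple i
    contributes. Tuples contributing more than tau are dropped, which caps
    the query's sensitivity at tau (assuming independent contributions), so
    Laplace noise of scale tau/epsilon suffices for epsilon-DP. tau must be
    chosen independently of the data, or itself selected privately.
    """
    rng = rng or np.random.default_rng()
    truncated = sum(c for c in contributions if c <= tau)
    return truncated + rng.laplace(scale=tau / epsilon)
```

The key tradeoff is bias (dropped tuples) versus noise (a larger tau means more noise), which is what per-tuple sensitivity tracking helps navigate.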
Item Open Access Applying Differential Privacy with Sparse Vector Technique (2018) Chen, Yan
In today's fast-paced digital world, a wide range of services such as web services, social networks, and mobile devices collect a large amount of personal data from their users. Although sharing and mining large-scale personal data can help improve the functionality of these services, it also raises privacy concerns for the individuals who contribute the data.
Differential privacy has emerged as the de facto standard for analyzing sensitive data with strong, provable privacy guarantees for individuals. A rich literature has developed differentially private algorithms for numerous data analysis tasks. The privacy proofs of these algorithms rest mainly on (a) the privacy guarantees of a small number of primitives, and (b) a set of composition theorems that help reason about an algorithm's privacy guarantee in terms of the primitives it uses. In this dissertation, we focus on one popular differentially private primitive, the Sparse Vector Technique, which can support multiple queries at limited privacy cost.
First, we revisit the original Sparse Vector Technique and its variants, proving that many of the variants violate the definition of differential privacy. Furthermore, we design an attack algorithm demonstrating that an adversary with access to these "broken" variants can reconstruct the true database with high accuracy.
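For reference, a minimal sketch of one standard, privacy-preserving formulation of the Sparse Vector Technique (following common presentations in the literature, not necessarily the dissertation's exact pseudocode; the even budget split is an assumption):

```python
import numpy as np

def sparse_vector(queries, database, threshold, epsilon, c=1, rng=None):
    """Answer above/below-threshold bits for sensitivity-1 queries,
    halting after c 'above' answers; the whole run costs epsilon."""
    rng = rng or np.random.default_rng()
    eps1, eps2 = epsilon / 2, epsilon / 2          # assumed even split
    noisy_threshold = threshold + rng.laplace(scale=1.0 / eps1)
    answers, aboves = [], 0
    for q in queries:
        nu = rng.laplace(scale=2.0 * c / eps2)     # fresh noise per query
        if q(database) + nu >= noisy_threshold:
            answers.append(True)
            aboves += 1
            if aboves >= c:                        # budget for 'above' answers exhausted
                break
        else:
            answers.append(False)                  # 'below' answers cost no extra budget
    return answers
```

The broken variants typically deviate in seemingly minor details, such as how noise is added or when the algorithm halts, which is enough to void the guarantee.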
Next, we utilize the original Sparse Vector Technique primitive to design new solutions for practical problems. We propose the first algorithms to publish regression diagnostics under differential privacy for evaluating regression models. Specifically, we create differentially private versions of residual plots for linear regression, as well as receiver operating characteristic (ROC) curves and binned residual plots for logistic regression. Comprehensive empirical studies show these algorithms are effective and enable users to evaluate the correctness of their model assumptions.
We then use the Sparse Vector Technique as a key primitive in a novel algorithm for differentially private stream processing, supporting queries on streaming data. The algorithm is data adaptive and can simultaneously support multiple query types, such as unit counts, sliding windows, and event monitoring, over single or multiple stream resolutions. Through extensive evaluations, we show that this new technique outperforms state-of-the-art algorithms that are specialized to particular query types.
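One simple data-adaptive strategy in this spirit (an illustrative sketch only; the dissertation's algorithm and its privacy accounting are more careful) republishes the last released count unless the stream has drifted past a noisy threshold:

```python
import numpy as np

def adaptive_release(increments, epsilon, drift_threshold, rng=None):
    """Sketch: release a running count at every tick, spending budget on a
    fresh noisy count only when an SVT-style test detects significant drift.
    Budget accounting across repeated publishes is omitted here.
    """
    rng = rng or np.random.default_rng()
    eps_test, eps_pub = epsilon / 2, epsilon / 2
    noisy_thresh = drift_threshold + rng.laplace(scale=2.0 / eps_test)
    total, last_release, releases = 0, 0.0, []
    for x in increments:
        total += x
        drift = abs(total - last_release) + rng.laplace(scale=4.0 / eps_test)
        if drift >= noisy_thresh:                  # significant change detected
            last_release = total + rng.laplace(scale=1.0 / eps_pub)
            noisy_thresh = drift_threshold + rng.laplace(scale=2.0 / eps_test)
        releases.append(last_release)
    return releases
```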
Item Open Access Enabling Fine-Grained Permissions in Smartphones (2019) Raval, Nisarg
The increasing popularity of smart devices that continuously monitor various aspects of users' lives, and the prevalence of third-party services that utilize these data feeds, have resulted in a serious threat to users' privacy. A one-sided focus on the utility of these applications (apps) and the lack of proper access control mechanisms often lead to inadvertent (or deliberate) leaks of sensitive information about users. At the core of protecting user data on smart devices lies the permissions framework, which arbitrates apps' access to resources on the device. The existing permissions frameworks in smartphones are largely coarse-grained, allowing apps to collect more information than is required for their functionality and thereby putting users' privacy at risk.
In this dissertation, we address these privacy concerns by proposing an extensible permissions framework that gives users fine-grained control over the resources accessed by apps. It uses permissions plugins, special modules that govern an app's access to resources on the device. We develop a number of permissions plugins to arbitrate access to key resources, including location, contacts, camera, and external storage. Moreover, we show that existing privacy solutions can be easily integrated into our framework via plugins. We also develop two novel privacy frameworks that help users balance privacy-utility tradeoffs and allow them to make informed decisions about sharing their data with apps in order to obtain services in return. We envision a repository of permissions plugins where privacy experts publish plugins customized to the needs of users as well as apps, and users simply install the plugins they are interested in to protect their privacy.
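To make the plugin idea concrete, here is a minimal sketch of what a permissions-plugin interface might look like (the names and the Python rendering are purely illustrative; real plugins would run inside the device's permissions framework):

```python
from abc import ABC, abstractmethod

class PermissionsPlugin(ABC):
    """A plugin inspects each resource access by an app and may allow it,
    deny it, or transform the data the app receives."""

    @abstractmethod
    def on_access(self, app_id: str, resource: str, data):
        """Return the (possibly transformed) data, or None to deny access."""

class CoarseLocationPlugin(PermissionsPlugin):
    """Example plugin: release location only at roughly 1 km granularity."""

    def on_access(self, app_id, resource, data):
        if resource != "location":
            return data                        # not our concern; pass through
        lat, lon = data
        return (round(lat, 2), round(lon, 2))  # ~1 km grid
```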
Item Open Access ENCRYPTED DATA MANAGEMENT SYSTEMS WITH TUNABLE PRIVACY (2023) Wang, Chenghong
Encrypted data management systems allow untrusted platforms to manage and process encrypted data, thus enabling collaborative computations over confidential data. However, the pursuit of robust privacy protection and the mitigation of privacy loss from cryptographic leakage often require significant compromises on essential system guarantees, such as accuracy and performance, making real-world adoption of these systems highly unlikely.
This thesis presents a practical approach to constructing encrypted data management systems that effectively balance privacy and system guarantees. The fundamental design principle is to establish quantifiable and adjustable privacy guarantees, allowing privacy levels to be fine-tuned so that practitioners and system designers can navigate system tradeoffs according to their actual needs. First, we discuss a novel encrypted growing database framework, DP-Sync, which interoperates with a large class of existing encrypted databases and supports efficient updates while providing provable privacy for any single update. To accomplish this, we employ differential privacy constraints to rigorously quantify the privacy loss from update leakage. Rather than fixing one privacy guarantee, our model allows adjustable privacy, giving systems built on top of it the flexibility to navigate tradeoffs.
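As a rough illustration of quantifying update leakage with DP, here is a sketch of one above-noisy-threshold-style synchronization strategy (parameters and accounting are simplified, and the names are invented, not DP-Sync's actual interface):

```python
import numpy as np

def noisy_threshold_sync(updates_per_tick, threshold, epsilon, rng=None):
    """Sketch: buffer local updates and flush them to the untrusted server
    only when a noisy count of buffered records crosses a noisy threshold,
    so the observable flush times reveal only DP-protected information."""
    rng = rng or np.random.default_rng()
    noisy_thresh = threshold + rng.laplace(scale=2.0 / epsilon)
    buffer, flushes = [], []
    for t, updates in enumerate(updates_per_tick):
        buffer.extend(updates)
        if len(buffer) + rng.laplace(scale=4.0 / epsilon) >= noisy_thresh:
            flushes.append((t, list(buffer)))    # sync everything buffered
            buffer.clear()
            noisy_thresh = threshold + rng.laplace(scale=2.0 / epsilon)
    return flushes
```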
Next, we expand the general design of DP-Sync to accommodate more complex functionality, namely private materialization and view-based secure query processing. These enhanced features enable untrusted platforms to securely maintain materialized views and process queries against the computed views rather than the underlying data. To achieve this, we propose a novel framework, IncShrink, that (i) supports view maintenance using incremental Multi-Party Computation (MPC) operators, eliminating the need for trusted third parties, and (ii) guarantees that the privacy loss from view maintenance leakage satisfies DP definitions.
Lastly, we demonstrate that tunable privacy is not limited to traditional databases: it also applies to private decentralized data management systems, such as private Proof-of-Stake (PoS) blockchains. This showcases the potential of adjustable privacy mechanisms in distributed and decentralized settings, opening up avenues for further research in encrypted data management. Specifically, we first present a new type of attack, the stake inference attack, which exploits permissible leakages in state-of-the-art (SOTA) private PoS blockchains to infer the precise stake values owned by each honest participant. We then use DP stake distortion to achieve privacy in PoS blockchains, formulating the corresponding privacy requirements and designing two stake distortion mechanisms that any PoS protocol can use.
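The stake distortion idea, in a minimal sketch (illustrative only; the dissertation's two mechanisms bound error and preserve protocol guarantees more carefully than this):

```python
import numpy as np

def distort_stakes(stakes, epsilon, rng=None):
    """Sketch: publish Laplace-noised stake values for use as leader-election
    weights, so observable election behavior does not reveal exact stakes.
    Clamping at zero keeps the distorted stakes valid as weights."""
    rng = rng or np.random.default_rng()
    return [max(0.0, s + rng.laplace(scale=1.0 / epsilon)) for s in stakes]
```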
Item Open Access Fairness in Differentially Private Data Release (2022) Pujol, David Anthony
Data privacy has received increased attention in the past decade. Large-scale data collection has become the norm for both scientific and commercial uses, and private information about individuals is often leaked through data releases, aggregate statistics, and machine learning models. Differential privacy has become the gold standard for limiting private data leakage. Differentially private mechanisms work by infusing noise into private data releases, obfuscating the contribution of any individual. It is unclear, however, how the noise introduced by differential privacy affects the utility experienced by different population groups. Recent work has shown that typical uses of such private data, such as machine learning and allocation tasks, can yield different utility across population groups, and it is unknown how differential privacy interacts with these existing inequities.
We investigate the effects of differential privacy on the downstream utility experienced by different population groups. First, we study the downstream effects of naive applications of differential privacy on well-known census tasks. We show that differential privacy can magnify the impact of existing inequities as well as introduce inequities that were not previously present; conversely, a carefully constructed mechanism with knowledge of the task at hand can reduce such inequities. We propose two frameworks for acknowledging and addressing these possible inequities: one for query answering and one for synthetic data. For query answering, we propose a multi-analyst approach in which representatives of different groups share resources to maximize utility while ensuring a fair distribution of utility across groups. For synthetic data, we propose a framework for ensuring fair and private synthetic data: our approach creates a private synthetic dataset that preserves statistics from the original dataset while limiting the influence of known protected classes in classification tasks.
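A small, self-contained illustration of one such inequity (not taken from the dissertation): under uniform Laplace noise, small groups suffer far larger relative error than large ones.

```python
import numpy as np

# Same absolute noise, very different relative error across group sizes.
epsilon = 0.1
rng = np.random.default_rng(0)
for true_count in [50, 500, 50_000]:            # small, medium, large group
    noisy = true_count + rng.laplace(scale=1.0 / epsilon)
    rel_err = abs(noisy - true_count) / true_count
    print(f"group of {true_count:>6}: relative error ~ {rel_err:.2%}")
```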
Item Open Access Policy Driven Data Sharing with Provable Privacy Guarantees (2018) He, Xi
Companies such as Google and Facebook collect substantial amounts of data about their users to provide useful services. Releasing these datasets for general use could enable numerous innovative applications and scientific research. However, such data contains sensitive information about users, and simple anonymization techniques have been shown to be ineffective at ensuring users' privacy. These privacy concerns have motivated the development of algorithms that share data with provable privacy guarantees, including differential privacy. However, differentially private algorithm design has focused on simplified problem settings. Real-world applications must satisfy complex privacy policies (beyond whether an individual is in or out of the dataset) and adhere to complex constraints, which hinders the deployment of differentially private algorithms.
This dissertation presents a novel policy-driven approach to designing provable privacy guarantees for complex settings. The approach results in a useful class of provable privacy definitions, named Blowfish privacy, that (a) generalizes differential privacy to handle complex privacy preferences and constraints, (b) unifies several variants of differential privacy that are used in practice, and (c) allows the creation of new, well-founded privacy definitions with flexible trade-offs between privacy, accuracy, and performance, based on the application's requirements. The usefulness of this approach is shown in two data sharing use cases: (1) analyzing location data, which involves complex data types and privacy preferences, and (2) scaling private record linkage, which involves secure computation between multiple parties. The work concludes with directions for future research in private data analysis.
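In rough outline (a simplified sketch, not the dissertation's formal development), the policy-driven idea is that the pairs of databases a mechanism must keep indistinguishable come from the policy, and noise is calibrated accordingly:

```python
import numpy as np

def policy_sensitivity(query, neighbor_pairs):
    """Max query change over database pairs the policy declares
    indistinguishable (standard DP is the special case where the pairs
    differ in one individual's record)."""
    return max(abs(query(d1) - query(d2)) for d1, d2 in neighbor_pairs)

def policy_private_answer(query, database, neighbor_pairs, epsilon, rng=None):
    """Coarser policies yield smaller sensitivity, hence less noise."""
    rng = rng or np.random.default_rng()
    scale = policy_sensitivity(query, neighbor_pairs) / epsilon
    return query(database) + rng.laplace(scale=scale)
```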
Item Open Access Query Answering in Multi-Relational Databases Under Differential Privacy (2019) Kotsogiannis, Ios
Data collection has become a staple of both our digital and "off-line" activities. Government agencies, medical institutions, Internet companies, and academic institutions are among the main actors that collect and store users' data. Analysis and sharing of this data is paramount in our increasingly data-driven world.
Data sharing provides a large positive societal value; however, it does not come cost-free: data sharing is at fundamental odds with individuals' privacy.
As a result, data privacy has become a major research area, with differential privacy emerging as the de facto data privacy framework.
To mask the presence of any individual in the database, differentially private algorithms usually add noise to data releases. This noise is calibrated by the so-called "privacy budget", a parameter that quantifies the allowed privacy loss.
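For concreteness, the textbook Laplace mechanism underlies this calibration (a minimal sketch; PrivSQL's actual machinery is far more involved):

```python
import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon, rng=None):
    """Adding Laplace noise of scale sensitivity/epsilon to a numeric query
    satisfies epsilon-DP; a smaller privacy budget means more noise."""
    rng = rng or np.random.default_rng()
    return true_answer + rng.laplace(scale=sensitivity / epsilon)
```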
One major shortcoming of both the definition and the supporting literature is that they apply to flat tables; extensions to multi-relational schemas are non-trivial. In the multi-relational setting, the privacy semantics are not well defined, since an individual might affect multiple relations, each to a different degree.
Moreover, there is no system that permits accurate differentially private answering of SQL queries while imposing a fixed privacy loss across all queries posed by the analyst.
In this thesis, we present PrivSQL, a first-of-its-kind end-to-end differentially private relational database system. PrivSQL allows analysts to query a standard relational database using a rich class of SQL queries, enables the data owner to flexibly define the privacy semantics over the schema, and provides a fixed privacy loss across all queries submitted by the analyst.
PrivSQL works by carefully selecting a set of views over the database schema, generating a set of private synopses over those views, and lastly answering incoming analyst queries based on the synopses.
Additionally, PrivSQL employs a variety of novel techniques like view selection for differential privacy, policy-aware view rewriting, and view truncation. These techniques allow PrivSQL to offer automatic support for custom-tailored privacy semantics and permit low error in query answering.
Item Open Access Student Paths in CS1: Case Studies of Initial Poor Performers (2019) Kim, Ji Yeon
With the large influx of computer science enrollment at universities in the last decade, improving pedagogy in the field carries increasing value and wide-reaching effects. Such improvement is especially useful in introductory computer science courses (CS1), as students' experience in their first programming course is known to heavily influence their desire to stay in the field.
We present a set of case studies of students enrolled in COMPSCI 101, an introductory computer science course at Duke University, in fall 2018. All of the students in these case studies performed poorly at the start of the course. We then consider each student's overall grade and help-seeking behavior, in terms of what kind of help was sought and how often help was sought throughout their time in the course. We found no consistencies across any of the case studies.
Item Open Access Synthesizing Linked Data and Detecting Per-Query Gaps Under Differential Privacy (2023) Patwa, Shweta Jayant
The abundance of data, often containing private and sensitive information, coupled with ever-growing interest in using the data for research and for driving business value at scale, has raised concerns about privacy protections. Formal policies have made access to such data heavily regulated, often leaving users waiting months or years before they can even start analyzing the data to determine its fit for their tasks. Recently, synthetic data generation has emerged as an alternative. A synthetic dataset is a collection of records designed to capture structural and statistical properties of the sensitive dataset while ensuring that private properties of individuals are not revealed. Synthetic data generation, especially under differential privacy, is gaining popularity for producing copies of sensitive data for testing, benchmarking, data analysis, model training, and privacy preservation.
This dissertation tackles two challenges in synthetic data generation: (1) generating synthetic data for multi-relational data while preserving the links between relations, subject to cardinality constraints (CCs) and integrity constraints (ICs), and (2) estimating per-query error, so that a user can know whether their queries are answered reliably on the synthetic data. More specifically, the former problem concerns imputing a missing foreign key column, FK, in a relation with respect to CCs on the join view and ICs that restrict which tuples can share an FK value. We give a novel framework based on declarative CCs and ICs, prove NP-hardness, and propose a novel two-phase solution satisfying the ICs. Next, we study how to judge the quality of aggregate statistics on synthetic data, given that standard data generation systems do not provide "per-query" guarantees. We present a novel framework, DP-PQD, that privately detects whether query answers on the private and synthetic data are within a user-specified threshold of each other. We introduce the notion of effectiveness to capture the range of thresholds for which a per-query decider is expected to do well. We give a suite of private algorithms, analyze their properties, and evaluate them experimentally, illustrating and explaining why accuracy may not be monotonic in the privacy budget.
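A minimal sketch of the per-query idea (illustrative only; DP-PQD comprises a suite of algorithms with a careful analysis): since the synthetic answer is already public, only the private answer consumes budget, and a noisy test of the gap suffices.

```python
import numpy as np

def per_query_decider(private_answer, synthetic_answer, threshold,
                      epsilon, sensitivity=1.0, rng=None):
    """Sketch: privately decide whether the synthetic answer is within
    `threshold` of the private one. The gap inherits the query's
    sensitivity (1 for a counting query), so Laplace noise of scale
    sensitivity/epsilon makes the decision epsilon-DP."""
    rng = rng or np.random.default_rng()
    gap = abs(private_answer - synthetic_answer)
    return gap + rng.laplace(scale=sensitivity / epsilon) <= threshold
```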