Reiter, Jerome PLin, Tong2023-06-082023-06-082023https://hdl.handle.net/10161/27803<p>Survey sampling is a popular technique used in various fields for making inferences about populations from samples. However, the release of survey data can raise confidentiality concerns because the data may contain sensitive information about individuals. To mitigate this issue, data stewards can generate synthetic data that reflects the statistical features of the confidential data while obscuring sensitive variables. Synthetic data can be released for public use as a substitute for the confidential data. However, the quality of synthetic data may impact the accuracy of inferences drawn from it. Therefore, assessing the quality of inferences derived from synthetic data is essential. Researchers have proposed a verification procedure that allows analysts to submit queries regarding their inferences and evaluate their accuracy by comparing results from synthetic data with those from confidential data. This approach enables the protection of individual privacy while facilitating the public use of confidential data.</p><p>This thesis proposes a differentially private verification measure for synthetic data in the context of complex survey designs. To ensure differential privacy, we use the subsample-and-aggregate method. We split the confidential data into disjoint partitions and compute survey-weighted estimates of the statistics of interest in each partition. Analysts can set a tolerance interval reflecting the level of accuracy they require of estimates from the synthetic data. Because each partition is smaller than the full data set, its estimates have higher variance, so we suggest using a wider tolerance interval for the partitions. We refer to a tolerance interval that does not account for this higher variance as a fixed tolerance interval, and to one that is inflated to account for it as a varying tolerance interval. We define an indicator for whether the estimate from each partition falls within the tolerance interval, and compute the sum of these indicators over all partitions. 
To satisfy differential privacy, we add noise from the Laplace mechanism to this sum. Bayesian post-processing is then applied to improve interpretability, and summary statistics of the posterior distribution of the sum are released.</p><p>The proposed measure generalizes the application of privacy-preserving techniques and enables analysts to validate the quality of their inferences based on synthetic data in the context of complex survey data sets.</p>StatisticsDifferentially Private Verification with Survey WeightsMaster's thesis
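<p>The counting step described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation, and it rests on several assumptions: the statistic of interest is taken to be a survey-weighted mean, the tolerance interval is symmetric around the synthetic-data estimate, records are assigned to partitions at random, and changing one record is assumed to change at most one indicator (sensitivity 1). The function names <code>weighted_mean</code>, <code>partition_count</code>, and <code>private_count</code> are hypothetical, and the Bayesian post-processing step is omitted.</p>

```python
import random


def weighted_mean(values, weights):
    """Survey-weighted mean (ratio of weighted total to sum of weights)."""
    return sum(w * y for y, w in zip(values, weights)) / sum(weights)


def partition_count(values, weights, synthetic_estimate, tolerance,
                    n_partitions, rng):
    """Count the partitions whose weighted estimate lies within
    synthetic_estimate +/- tolerance.

    Assumes len(values) >= n_partitions so that no partition is empty.
    """
    order = list(range(len(values)))
    rng.shuffle(order)  # random assignment of records to partitions
    count = 0
    for m in range(n_partitions):
        part = order[m::n_partitions]  # disjoint: each record is used once
        estimate = weighted_mean([values[i] for i in part],
                                 [weights[i] for i in part])
        if abs(estimate - synthetic_estimate) <= tolerance:
            count += 1
    return count


def private_count(values, weights, synthetic_estimate, tolerance,
                  n_partitions, epsilon, seed=None):
    """Add Laplace noise with scale 1/epsilon to the partition count.

    Assumption: sensitivity 1, because one record belongs to exactly one
    partition and so can flip at most one indicator.
    """
    rng = random.Random(seed)  # illustration only; not a secure RNG
    count = partition_count(values, weights, synthetic_estimate, tolerance,
                            n_partitions, rng)
    # The difference of two independent Exponential(rate=epsilon) draws
    # follows a Laplace(0, 1/epsilon) distribution.
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return count + noise
```

<p>In this sketch a fixed tolerance interval corresponds to passing the analyst's tolerance unchanged, while a varying interval would pass a wider value to reflect the higher variance within partitions. How much wider is not specified here, so it is left to the caller.</p>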