A Comparison of Strategies for Generating Synthetic Data for Complex Survey

Chen, Min

Date

2024

Authors

Chen, Min

Advisors

Reiter, Jerome

Abstract

Synthetic data generation is a method for protecting data privacy. To disseminate confidential data for public use, some statistical agencies release fully synthetic datasets, a practice that has been applied to census and administrative records. Many research datasets, however, come from surveys with complex sampling designs, and these designs cannot be ignored when constructing synthetic data. This thesis compares three strategies for generating synthetic data, each of which follows a different procedure. Two are based on bootstrap methods: one uses the Bayesian bootstrap and the other the regular bootstrap. The third is based on posterior inference with a pseudo-likelihood. Using simulation studies with probability proportional to size sampling, we show that all three methods can yield accurate estimates of the mean of a finite population. However, when estimating the variance of the sampling statistic, only the method based on the Bayesian bootstrap provides an approximately unbiased estimate in these simulations.
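To make the setting in the abstract concrete, the following is a minimal sketch of why a probability proportional to size (PPS) design cannot be ignored, and of one simple Bayesian-bootstrap-style way to express uncertainty about a weighted mean. It is not the procedure used in the thesis. The population, the Poisson form of PPS sampling, the Hájek (weight-normalized) estimator, the function names, and the choice to multiply Dirichlet(1, ..., 1) draws by the design weights are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical finite population (an assumption, not data from the thesis):
# a size measure x and an outcome y that is positively related to x.
N = 5000
x = rng.gamma(shape=2.0, scale=5.0, size=N) + 1.0
y = 10.0 + 2.0 * x + rng.normal(0.0, 5.0, size=N)


def pps_sample(x, n, rng):
    """Poisson PPS sampling: unit i is included independently with
    probability pi_i = min(1, n * x_i / sum(x)), so the sample size is random
    with expectation close to n."""
    pi = np.minimum(1.0, n * x / x.sum())
    take = rng.random(x.size) < pi
    return np.flatnonzero(take), pi[take]


def hajek_mean(y_s, w):
    """Weight-normalized (Hajek) estimator of the finite population mean."""
    return np.sum(w * y_s) / np.sum(w)


def bayesian_bootstrap_means(y_s, w, draws, rng):
    """Illustrative reweighting: Dirichlet(1, ..., 1) draws over the sampled
    units are multiplied by the design weights, and the weighted mean is
    recomputed for each draw."""
    g = rng.dirichlet(np.ones(y_s.size), size=draws)  # shape (draws, n)
    gw = g * w
    return (gw @ y_s) / gw.sum(axis=1)


idx, pi_s = pps_sample(x, n=500, rng=rng)
w = 1.0 / pi_s                      # design weights = inverse inclusion probabilities
est = hajek_mean(y[idx], w)         # design-weighted estimate of the mean
naive = y[idx].mean()               # ignores the design; favors large-x units
bb = bayesian_bootstrap_means(y[idx], w, draws=2000, rng=rng)

print(f"population mean:        {y.mean():.2f}")
print(f"unweighted sample mean: {naive:.2f}")
print(f"weighted (Hajek) mean:  {est:.2f}")
print(f"spread of reweighted means (sd): {bb.std():.2f}")
```

Because units with large x are more likely to be sampled and y increases with x, the unweighted sample mean tends to sit above the population mean, while the weighted estimate tends to sit close to it. The spread of the reweighted means is only a rough indication of uncertainty under these assumptions; whether such a spread is an unbiased estimate of the sampling variance is exactly the kind of question the thesis examines by simulation.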

Type

Master's thesis

Department

Statistical Science

Subjects

Statistics

Permalink

https://hdl.handle.net/10161/31056

Rights

https://creativecommons.org/licenses/by-nc-nd/4.0/

Citation

Chen, Min (2024). A Comparison of Strategies for Generating Synthetic Data for Complex Survey. Master's thesis, Duke University. Retrieved from https://hdl.handle.net/10161/31056.

Collections

Masters Theses

Except where otherwise noted, student scholarship that was shared on DukeSpace after 2009 is made available to the public under a Creative Commons Attribution / Non-commercial / No derivatives (CC-BY-NC-ND) license. All rights in student work shared on DukeSpace before 2009 remain with the author and/or their designee, whose permission may be required for reuse.
