A Comparison of Strategies for Generating Synthetic Data for Complex Survey
Date
2024
Abstract
Generating synthetic data is one method for protecting data privacy. When disseminating confidential data for public use, some statistical agencies release fully synthetic datasets, a practice that has been applied to census and administrative records. Many research datasets, however, come from surveys with complex sampling designs, and the design cannot be ignored when constructing synthetic data. This thesis compares three strategies for generating synthetic data from complex surveys, each with a different generation procedure. Two are based on the bootstrap: one uses the Bayesian bootstrap and the other the regular bootstrap. The third is based on posterior inference with a pseudo-likelihood. Using simulation studies with probability proportional to size sampling, we show that all three methods can yield accurate estimates of the mean of a finite population. However, when estimating the variance of the sampling statistic, only the method based on the Bayesian bootstrap provides an approximately unbiased estimate in these simulations.
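To illustrate why the sampling design is not ignorable, the following minimal sketch draws a probability proportional to size (PPS) sample and compares an unweighted sample mean with a design-weighted one. It is not the thesis's simulation: the population, the with-replacement PPS draw, the Hansen-Hurwitz estimator, and the names `pps_sample` and `hansen_hurwitz_mean` are all assumptions made for this example.

```python
import random


def pps_sample(sizes, n, rng):
    """Draw n unit indices with replacement, with probability proportional to size.

    Assumption: with-replacement PPS is used here for simplicity.
    """
    total = sum(sizes)
    probs = [s / total for s in sizes]  # per-draw selection probabilities
    idx = rng.choices(range(len(sizes)), weights=sizes, k=n)
    return idx, probs


def hansen_hurwitz_mean(y, idx, probs):
    """Design-weighted estimate of the population mean under PPS with replacement."""
    n_pop = len(y)
    n = len(idx)
    return sum(y[i] / probs[i] for i in idx) / (n * n_pop)


# Hypothetical finite population in which the outcome is correlated with size.
rng = random.Random(0)
n_pop = 5000
sizes = [rng.uniform(1.0, 10.0) for _ in range(n_pop)]
y = [2.0 * s + rng.gauss(0.0, 1.0) for s in sizes]
true_mean = sum(y) / n_pop

idx, probs = pps_sample(sizes, 500, rng)
unweighted = sum(y[i] for i in idx) / len(idx)  # ignores the design
weighted = hansen_hurwitz_mean(y, idx, probs)   # accounts for the design

print(f"true mean:       {true_mean:.3f}")
print(f"unweighted mean: {unweighted:.3f}")
print(f"weighted mean:   {weighted:.3f}")
```

Because larger units are both more likely to be sampled and have larger outcomes in this toy population, the unweighted mean tends to overestimate the population mean, while the weighted estimate stays close to it. A synthetic data generator that ignored the selection probabilities would inherit the same distortion.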
Citation
Chen, Min (2024). A Comparison of Strategies for Generating Synthetic Data for Complex Survey. Master's thesis, Duke University. Retrieved from https://hdl.handle.net/10161/31056.
Except where otherwise noted, student scholarship that was shared on DukeSpace after 2009 is made available to the public under a Creative Commons Attribution / Non-commercial / No derivatives (CC-BY-NC-ND) license. All rights in student work shared on DukeSpace before 2009 remain with the author and/or their designee, whose permission may be required for reuse.