Methods for Imputing Missing Values and Synthesizing Confidential Values for Continuous and Magnitude Data

Wei, Lan

Date

2016

Authors

Wei, Lan

Advisors

Reiter, Jerome P

Abstract

Continuous variables are one of the major data types collected by survey organizations. They can be incomplete, so that data collectors need to fill in the missing values, or they can contain sensitive information that needs protection from re-identification. One approach to protecting continuous microdata is to sum the values within cells defined by different features. In this thesis, I present novel methods of multiple imputation (MI) that can be applied to impute missing values and synthesize confidential values for continuous and magnitude data.

The first method is for limiting the disclosure risk of the continuous microdata whose marginal sums are fixed. The motivation for developing such a method comes from the magnitude tables of non-negative integer values in economic surveys. I present approaches based on a mixture of Poisson distributions to describe the multivariate distribution so that the marginals of the synthetic data are guaranteed to sum to the original totals. At the same time, I present methods for assessing disclosure risks in releasing such synthetic magnitude microdata. The illustration on a survey of manufacturing establishments shows that the disclosure risks are low while the information loss is acceptable.
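The sum-preserving idea can be illustrated with a standard fact: independent Poisson counts, conditioned on their total, follow a multinomial distribution with probabilities proportional to the Poisson rates. The sketch below uses only that fact; it is not the mixture model developed in the thesis, and the function name, the rates, and the total are hypothetical values chosen for illustration.

```python
import random
from collections import Counter


def synthesize_fixed_total(rates, total, rng):
    """Draw one non-negative integer per rate so that the draws sum to `total`.

    Independent Poisson(rate_i) counts conditioned on their sum are
    multinomial with probabilities rate_i / sum(rates), so allocating
    `total` units one at a time with those weights gives such a draw.
    """
    if total < 0:
        raise ValueError("total must be non-negative")
    if not rates or any(r <= 0 for r in rates):
        raise ValueError("rates must be a non-empty list of positive numbers")
    # Assign each of the `total` units to a cell with probability
    # proportional to that cell's rate.
    assignments = rng.choices(range(len(rates)), weights=rates, k=total)
    tally = Counter(assignments)
    return [tally.get(i, 0) for i in range(len(rates))]


# Hypothetical example: three establishments sharing a published total of 100.
rng = random.Random(0)
synthetic = synthesize_fixed_total([2.0, 5.0, 13.0], total=100, rng=rng)
```

Whatever values are drawn, the synthetic counts add up to the fixed total by construction, which is the property the abstract describes for the marginal sums.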

The second method is for releasing synthetic continuous microdata by a nonstandard MI method. Traditionally, MI fits a model to the confidential values and then generates multiple synthetic datasets from this model. The resulting disclosure risk tends to be high, especially when the original data contain extreme values. I present a nonstandard MI approach that conditions on protective intervals. The basic idea is to estimate the model parameters from these intervals rather than from the confidential values. The encouraging results of simple simulation studies suggest the potential of this new approach for limiting the posterior disclosure risk.
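As a rough illustration of estimating parameters from intervals rather than exact values, the sketch below fits the mean of a normal model by maximizing an interval-censored likelihood. It is a minimal sketch under strong assumptions, not the approach developed in the thesis: the standard deviation is treated as known, the protective intervals are hypothetical unit-width bins, and all function names are invented for this example.

```python
import math
import random


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of Normal(mu, sigma)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def interval_log_likelihood(mu, sigma, intervals):
    """Log-likelihood of mu when each record is known only to lie in (lo, hi)."""
    total = 0.0
    for lo, hi in intervals:
        p = normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)
        total += math.log(max(p, 1e-300))  # guard against log(0)
    return total


def estimate_mean_from_intervals(intervals, sigma, iterations=100):
    """Maximize the interval log-likelihood over mu by ternary search.

    The search relies on the log-likelihood being unimodal in mu, which
    holds for this normal model with a fixed sigma.
    """
    lo = min(a for a, _ in intervals)
    hi = max(b for _, b in intervals)
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if interval_log_likelihood(m1, sigma, intervals) < interval_log_likelihood(m2, sigma, intervals):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2.0


# Hypothetical data: the exact values are used only to build the intervals.
rng = random.Random(1)
confidential = [rng.gauss(10.0, 2.0) for _ in range(500)]
intervals = [(math.floor(x), math.floor(x) + 1) for x in confidential]
mu_hat = estimate_mean_from_intervals(intervals, sigma=2.0)
```

The estimation step reads only `intervals`, so the fitted parameter never depends on the exact confidential values. Synthetic values could then be drawn from the fitted model.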

The third method is for imputing missing values in continuous and categorical variables. It extends a hierarchically coupled mixture model with local dependence. Unlike that model, the new method separates the variables into non-focused (e.g., almost fully observed) and focused (e.g., with many missing values) ones. The sub-model structure for the focused variables is more complex than that for the non-focused ones. The cluster indicators of the focused and non-focused variables are linked by tensor factorization, and the focused continuous variables depend locally on non-focused values. The model properties suggest that moving strongly associated non-focused variables to the focused side can help improve estimation accuracy, which is examined in several simulation studies. The method is also applied to data from the American Community Survey.

Type

Dissertation

Department

Statistical Science

Subjects

Statistics

Permalink

https://hdl.handle.net/10161/12897

Citation

Wei, Lan (2016). Methods for Imputing Missing Values and Synthesizing Confidential Values for Continuous and Magnitude Data. Dissertation, Duke University. Retrieved from https://hdl.handle.net/10161/12897.

Collections

Dissertations

Duke's student scholarship is made available to the public using a Creative Commons Attribution / Non-commercial / No derivative (CC-BY-NC-ND) license.
