Nonparametric Bayesian Methods for Multiple Imputation of Large Scale Incomplete Categorical Data in Panel Studies

Thumbnail Image




Si, Yajuan


Reiter, Jerome P

Journal Title

Journal ISSN

Volume Title

Repository Usage Stats



The thesis develops nonparametric Bayesian models to handle incomplete categorical variables in data sets with high dimension using the framework of multiple imputation. It presents methods for ignorable missing data in cross-sectional studies, and potentially non-ignorable missing data in panel studies with refreshment samples.

The first contribution is a fully Bayesian, joint modeling approach of multiple imputation for categorical data based on Dirichlet process mixtures of multinomial distributions. The approach automatically models complex dependencies while being computationally expedient.

I illustrate repeated sampling properties of the approach

using simulated data. This approach offers better performance than default chained equations methods, which are often used in such settings. I apply the methodology to impute missing background data in the 2007 Trends in International Mathematics and Science Study.

For the second contribution, I extend the nonparametric Bayesian imputation engine to consider a mix of potentially non-ignorable attrition and ignorable item nonresponse in multiple wave panel studies. Ignoring the attrition in models for panel data can result in biased inference if the reason for attrition is systematic and related to the missing values. Panel data alone cannot estimate the attrition effect without untestable assumptions about the missing data mechanism. Refreshment samples offer an extra data source that can be utilized to estimate the attrition effect while reducing reliance on strong assumptions of the missing data mechanism.

I consider two novel Bayesian approaches to handle the attrition and item non-response simultaneously under multiple imputation in a two wave panel with one refreshment sample when the variables involved are categorical and high dimensional.

First, I present a semi-parametric selection model that includes an additive non-ignorable attrition model with main effects of all variables, including demographic variables and outcome measures in wave 1 and wave 2. The survey variables are modeled jointly using Bayesian mixture of multinomial distributions. I develop the posterior computation algorithms for the semi-parametric selection model under different prior choices for the regression coefficients in the attrition model.

Second, I propose two Bayesian pattern mixture models for this scenario that use latent classes to model the dependency among the variables and the attrition. I develop a dependent Bayesian latent pattern mixture model for which variables are modeled via latent classes and attrition is treated as a covariate in the class allocation weights. And, I develop a joint Bayesian latent pattern mixture model, for which attrition and variables are modeled jointly via latent classes.

I show via simulation studies that the pattern mixture models can recover true parameter estimates, even when inferences based on the panel alone are biased from attrition.

I apply both the selection and pattern mixture models to data from the 2007-2008 Associated Press/Yahoo News election panel study.





Si, Yajuan (2012). Nonparametric Bayesian Methods for Multiple Imputation of Large Scale Incomplete Categorical Data in Panel Studies. Dissertation, Duke University. Retrieved from


Dukes student scholarship is made available to the public using a Creative Commons Attribution / Non-commercial / No derivative (CC-BY-NC-ND) license.