An Empirical Comparison of Multiple Imputation Methods for Categorical Data

dc.contributor.author

Akande, O

dc.contributor.author

Li, F

dc.contributor.author

Reiter, J

dc.date.accessioned

2018-09-22T16:16:54Z

dc.date.available

2018-09-22T16:16:54Z

dc.date.issued

2017-04-03

dc.date.updated

2018-09-22T16:16:52Z

dc.description.abstract

© 2017 American Statistical Association. Multiple imputation is a common approach for dealing with missing values in statistical databases. The imputer fills in missing values with draws from predictive models estimated from the observed data, resulting in multiple, completed versions of the database. Researchers have developed a variety of default routines to implement multiple imputation; however, there has been limited research comparing the performance of these methods, particularly for categorical data. We use simulation studies to compare repeated sampling properties of three default multiple imputation methods for categorical data, including chained equations using generalized linear models, chained equations using classification and regression trees, and a fully Bayesian joint distribution based on Dirichlet process mixture models. We base the simulations on categorical data from the American Community Survey. In the circumstances of this study, the results suggest that default chained equations approaches based on generalized linear models are dominated by the default regression tree and Bayesian mixture model approaches. They also suggest competing advantages for the regression tree and Bayesian mixture model approaches, making both reasonable default engines for multiple imputation of categorical data. Supplementary material for this article is available online.

dc.identifier.issn

0003-1305

dc.identifier.issn

1537-2731

dc.identifier.uri

https://hdl.handle.net/10161/17536

dc.language

English

dc.publisher

Informa UK Limited

dc.relation.ispartof

The American Statistician

dc.relation.isversionof

10.1080/00031305.2016.1277158

dc.subject

Science & Technology

dc.subject

Physical Sciences

dc.subject

Statistics & Probability

dc.subject

Mathematics

dc.subject

Latent

dc.subject

Missing

dc.subject

Mixture

dc.subject

Nonresponse

dc.subject

Tree

dc.subject

FULLY CONDITIONAL SPECIFICATION

dc.subject

MULTIVARIATE IMPUTATION

dc.subject

CHAINED EQUATIONS

dc.subject

MISSING DATA

dc.subject

IMPLEMENTATION

dc.title

An Empirical Comparison of Multiple Imputation Methods for Categorical Data

dc.type

Journal article

duke.contributor.orcid

Li, F|0000-0002-0390-3673

duke.contributor.orcid

Reiter, J|0000-0002-8374-3832

pubs.begin-page

162

pubs.end-page

170

pubs.issue

2

pubs.organisational-group

Trinity College of Arts & Sciences

pubs.organisational-group

Duke

pubs.organisational-group

Statistical Science

pubs.organisational-group

Biostatistics & Bioinformatics

pubs.organisational-group

Basic Science Departments

pubs.organisational-group

School of Medicine

pubs.organisational-group

Duke Population Research Institute

pubs.organisational-group

Sanford School of Public Policy

pubs.organisational-group

Center for Child and Family Policy

pubs.organisational-group

Duke Population Research Center

pubs.organisational-group

Student

pubs.publication-status

Published

pubs.volume

71

Files

Original bundle

Name: An Empirical Comparison of Multiple Imputation Methods for Categorical Data_PrePrint.pdf
Size: 343.21 KB
Format: Adobe Portable Document Format
Description: Accepted version