Browsing by Subject "Multiple imputation"
Item Open Access A Comparison Of Multiple Imputation Methods For Categorical Data (2015) Akande, Olanrewaju Michael. This thesis evaluates the performance of several multiple imputation methods for categorical data: multiple imputation by chained equations using generalized linear models, multiple imputation by chained equations using classification and regression trees, and non-parametric Bayesian multiple imputation for categorical data (using the Dirichlet process mixture of products of multinomial distributions model). The performance of each method is evaluated with repeated sampling studies using housing unit data from the 2012 American Community Survey. These data afford exploration of practical problems such as multicollinearity and large dimensions. The thesis highlights advantages and limitations of each method relative to the others, and concludes with suggestions on which method should be preferred and the conditions under which those suggestions hold.
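A minimal sketch of one of the compared approaches, chained equations with classification trees, may help fix ideas. It assumes integer-coded categorical data with NaN marking missing cells and uses scikit-learn's DecisionTreeClassifier as a stand-in for the CART engine; it illustrates the general technique, not the thesis's implementation.

```python
# Hedged sketch of MICE with classification trees for categorical data.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2015)

def mice_cart(X, n_iter=10):
    """X: (n, p) integer-coded categorical array with np.nan for missing cells."""
    X = X.astype(float)                     # work on a copy
    miss = np.isnan(X)
    n, p = X.shape
    for j in range(p):                      # initialize from observed margins
        obs = X[~miss[:, j], j]
        X[miss[:, j], j] = rng.choice(obs, size=miss[:, j].sum())
    for _ in range(n_iter):                 # chained-equations sweeps
        for j in range(p):
            if not miss[:, j].any():
                continue
            others = np.delete(np.arange(p), j)
            tree = DecisionTreeClassifier(min_samples_leaf=5)
            tree.fit(X[~miss[:, j]][:, others], X[~miss[:, j], j])
            # Draw each imputation from the tree's leaf-level class probabilities.
            proba = tree.predict_proba(X[miss[:, j]][:, others])
            X[miss[:, j], j] = [rng.choice(tree.classes_, p=pr) for pr in proba]
    return X

# m completed data sets: rerun the (stochastic) chain m times, e.g.
# imputations = [mice_cart(X_incomplete) for _ in range(5)]
```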
Item Open Access Bayesian Models for Combining Information from Multiple Sources (2022) Tang, Jiurui. This dissertation develops Bayesian methods for combining information from multiple sources. I focus on developing Bayesian bipartite modeling for simultaneous regression and record linkage, as well as on leveraging auxiliary information on marginal distributions for handling item and unit nonresponse and accounting for survey weights.
The first contribution is a Bayesian hierarchical model that allows analysts to perform simultaneous linear regression and probabilistic record linkage. This model allows analysts to leverage relationships among the variables to improve linkage quality. It also potentially offers more accurate estimates of regression parameters compared to approaches that use a two-step process, i.e., linking the records first and then estimating the linear regression on the linked data. I propose and evaluate three Markov chain Monte Carlo algorithms for implementing the Bayesian model.
The second contribution examines the performance of an approach for generating multiple imputation data sets for item nonresponse. The method allows analysts to use auxiliary information. I examine the approach via simulation studies with Poisson sampling, and give suggestions on parameter tuning.
The third contribution is a model-based imputation approach that can handle both item and unit nonresponse while accounting for auxiliary margins and survey weights. This approach includes an innovative combination of a pattern mixture model for unit nonresponse and a selection model for item nonresponse. Both unit and item nonresponse can be nonignorable. I demonstrate the model's performance with simulation studies in situations where the design weights for unit respondents are known and where they are not. I show that the model can generate multiple imputation data sets that both retain the relationships among survey variables and yield design-based estimates that agree with the auxiliary margins. I use the model to analyze voter turnout overall and across subgroups in North Carolina, with data from the 2018 Current Population Survey.
Item Open Access Bayesian Models for Imputing Missing Data and Editing Erroneous Responses in Surveys (2019) Akande, Olanrewaju Michael. This thesis develops Bayesian methods for handling unit nonresponse, item nonresponse, and erroneous responses in large scale surveys and censuses containing categorical data. I focus on applications to nested household data where individuals are nested within households and certain combinations of the variables are not allowed, such as the U.S. Decennial Census, as well as surveys subject to both unit and item nonresponse, such as the Current Population Survey.
The first contribution is a Bayesian model for imputing plausible values for item nonresponse in data nested within households, in the presence of impossible combinations. The imputation is done using a nested data Dirichlet process mixture of products of multinomial distributions model, truncated so that impossible household configurations have zero probability in the model. I show how to generate imputations from the Markov chain Monte Carlo sampler, and describe strategies for improving the computational efficiency of the model estimation. I illustrate the performance of the approach with data that mimic the variables collected in the U.S. Decennial Census. The results indicate that my approach can generate high quality imputations in such nested data.
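The truncation that assigns zero probability to impossible configurations can be mimicked by rejection: draw candidate imputations from an unrestricted model and keep only households that satisfy the edit rules. The sketch below is a hedged illustration; the model draw and the edit rule are hypothetical stand-ins, not the thesis's nested Dirichlet process engine.

```python
# Sketch of imputation under structural-zero constraints: redraw candidate
# household imputations until no edit rule is violated, which corresponds
# to sampling from the truncated model.
import copy
import numpy as np

rng = np.random.default_rng(7)

def draw_candidate(household):
    """Hypothetical stand-in for a draw from the untruncated imputation model."""
    filled = copy.deepcopy(household)
    for person in filled:
        if person["age"] is None:
            person["age"] = int(rng.integers(0, 95))
    return filled

def is_possible(household):
    """Example edit rule: a spouse of the household head must be an adult."""
    return all(p["age"] >= 16 for p in household if p["rel"] == "spouse")

def impute_household(household, max_tries=1000):
    for _ in range(max_tries):
        cand = draw_candidate(household)
        if is_possible(cand):          # keep only configurations with nonzero
            return cand                # probability under the truncated model
    raise RuntimeError("no feasible configuration found")

hh = [{"rel": "head", "age": 40}, {"rel": "spouse", "age": None}]
print(impute_household(hh))
```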
The second contribution extends the imputation engine in the first contribution to allow for the editing and imputation of household data containing faulty values. The approach relies on a Bayesian hierarchical model that uses the nested data Dirichlet process mixture of products of multinomial distributions as a model for the true unobserved data, but also includes a model for the location of errors, and a reporting model for the observed responses in error. I illustrate the performance of the edit and imputation engine using data from the 2012 American Community Survey. I show that my approach can simultaneously estimate multivariate relationships in the data accurately, adjust for measurement errors, and respect impossible combinations in estimation and imputation.
The third contribution is a framework for using auxiliary information to specify nonignorable models that can handle both item and unit nonresponse simultaneously. My approach focuses on how to leverage auxiliary information from external data sources in nonresponse adjustments. The framework lets users posit distinct missingness mechanisms for different blocks of variables, for example, a nonignorable model for variables with auxiliary marginal information and an ignorable model for variables exclusive to the survey. I illustrate the framework using data on voter turnout in the Current Population Survey.
The final contribution extends the framework in the third contribution to complex surveys, handling nonresponse in a way that still leverages auxiliary data while respecting the survey design through survey weights. Using several simulations, I illustrate the performance of my approach when the sample is generated primarily through stratified sampling.
Item Open Access Creating linked datasets for SME energy-assessment evidence-building: Results from the U.S. Industrial Assessment Center Program (Energy Policy, 2017-12-01) Dalzell, NM; Boyd, GA; Reiter, JP. Lack of information is commonly cited as a market failure resulting in an energy-efficiency gap. Government information policies to fill this gap may enable improvements in energy efficiency and social welfare because of the externalities of energy use. The U.S. Department of Energy Industrial Assessment Center (IAC) program is one such policy intervention, providing no-cost assessments to small and medium enterprises (SMEs). The IAC program has assembled a wealth of data on these assessments, but the database does not include information about participants after the assessment or about non-participants. This study addresses that gap by creating a new linked dataset using the public IAC database and non-public data at the Census Bureau. The IAC database excludes the detail needed for an exact match, so the study developed a linking methodology that accounts for uncertainty in the matching process. Based on the linking approach, a difference-in-differences analysis was conducted for SMEs that received an assessment; plants that received an assessment improved their performance over time relative to industry peers that did not. This new linked dataset is likely to shed even more light on the impact of the IAC and similar programs in advancing energy efficiency.
Item Open Access Modeling Missing Data In Panel Studies With Multiple Refreshment Samples (2012) Deng, Yiting. Most panel surveys are subject to missing data problems caused by panel attrition. The Additive Non-ignorable (AN) model proposed by Hirano et al. (2001) utilizes refreshment samples in panel surveys to impute missing data, and offers flexibility in modeling the missing data mechanism to incorporate both ignorable and non-ignorable models. We extend the AN model to settings with three waves and two refreshment samples. We address identification and estimation issues related to the proposed model under four different types of survey design, characterized by whether the missingness is monotone and whether subjects in the refreshment samples are followed up in subsequent waves of the survey. We apply this approach and multiple imputation techniques to the 2007-2008 Associated Press-Yahoo! News Poll (APYN) panel dataset to analyze factors affecting people's political interest. We find that, when attrition bias is not accounted for, the carry-on effects of past political interest on current political interest are underestimated. This highlights the importance of dealing with attrition bias and the potential of refreshment samples for doing so.
Item Open Access Multiple Imputation Inferences for Count Data (2021) Liu, Bo. Multiple imputation is frequently used for inference with missing data. When the population quantity of interest is an integer, the original methods for inference need to be modified, as point estimates based on the average are generally not integers. In this thesis, I propose a modification to the original combining rules that takes the point estimate to be the median of the quantities from the imputed datasets; the point estimate is therefore integer-valued when the number of imputed datasets is odd. I derive an estimator of the variance of this modified estimator, as well as a method for obtaining confidence intervals. I compare this method to ad-hoc alternatives, such as rounding the original point estimate. Simulations show that the two approaches provide similar results, although the novel method has a slightly larger mean absolute error. The coverage rates of both methods are close to the nominal coverage of 95%. The correct derivation of the variance is important: simulations show that using the median as the point estimate without correcting the variance yields systematically lower coverage.
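For concreteness, the standard combining rules and the median-based point estimate look roughly as follows. The numbers are made up, and the variance shown is Rubin's usual total variance; the corrected variance the thesis derives for the median estimator is not reproduced here.

```python
# Combining rules across m imputed data sets: Rubin's mean-based point
# estimate versus an integer-valued median point estimate.
import numpy as np

q = np.array([112, 108, 115, 110, 113])   # point estimates from m = 5 imputations
u = np.array([9.0, 8.5, 9.8, 9.1, 9.4])   # within-imputation variances

m = len(q)
q_bar = q.mean()                # Rubin's combined point estimate (not an integer)
b = q.var(ddof=1)               # between-imputation variance
u_bar = u.mean()                # average within-imputation variance
T = u_bar + (1 + 1 / m) * b     # Rubin's total variance

q_med = np.median(q)            # integer-valued whenever m is odd
print(q_bar, q_med, T)
```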
Item Open Access Multiple Imputation Methods for Nonignorable Nonresponse, Adaptive Survey Design, and Dissemination of Synthetic Geographies (2014) Paiva, Thais Viana. This thesis presents methods for multiple imputation that can be applied to missing data and to data with confidential variables. Imputation is useful for missing data because it results in a data set that can be analyzed with complete-data statistical methods. The missing data are filled in by values generated from a model fit to the observed data. The model specification depends on the observed data pattern and the missing data mechanism. For example, when the reason the data are missing is related to the outcome of interest, that is, nonignorable missingness, we need to alter the model fit to the observed data so as to generate the imputed values from a different distribution. Imputation is also used for generating synthetic values for data sets with disclosure restrictions. Since the synthetic values are not actual observations, they can be released for statistical analysis. The interest is in fitting a model that approximates the relationships in the original data well, preserving the utility of the synthetic data while protecting the confidentiality of the original data. We consider applications of these methods to data from the social sciences and epidemiology.
The first method is for imputation of multivariate continuous data with nonignorable missingness. Regular imputation methods have been used to deal with nonresponse in several types of survey data. However, in some of these studies, the assumption of missing at random is not valid, since the probability of missingness depends on the response variable. We propose an imputation method for multivariate data sets with nonignorable missingness. We fit a truncated Dirichlet process mixture of multivariate normals to the observed data under a Bayesian framework to provide flexibility. With the posterior samples from the mixture model, an analyst can alter the estimated distribution to obtain imputed data under different scenarios. To facilitate that, I developed an R application that allows the user to alter the values of the mixture parameters and visualize the imputation results automatically. I demonstrate this process of sensitivity analysis with an application to the Colombian Annual Manufacturing Survey. I also include a simulation study showing that the correct complete-data distribution can be recovered if the true missing data mechanism is known, validating that the method can be meaningfully interpreted for sensitivity analysis.
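The sensitivity-analysis idea, shifting the estimated mixture before imputing so that nonrespondents are drawn from an altered distribution, can be sketched as follows. scikit-learn's GaussianMixture stands in for posterior draws of the truncated Dirichlet process mixture, and the offset delta is a hypothetical sensitivity parameter.

```python
# Fit a mixture to the observed cases, then shift the component means by a
# user-chosen delta before drawing imputations, so imputed values come from
# an altered (nonignorable) distribution.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
y_obs = np.concatenate([rng.normal(0, 1, 400), rng.normal(4, 1, 100)])

gm = GaussianMixture(n_components=2, random_state=0).fit(y_obs.reshape(-1, 1))

def impute(n_miss, delta=0.0):
    """Draw imputations after shifting every component mean by delta."""
    comps = rng.choice(gm.n_components, size=n_miss, p=gm.weights_)
    means = gm.means_.ravel() + delta
    sds = np.sqrt(gm.covariances_.ravel())
    return rng.normal(means[comps], sds[comps])

for delta in (0.0, 0.5, 1.0):   # increasing departure from ignorability
    print(delta, impute(200, delta).mean().round(2))
```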
The second method uses the imputation techniques for nonignorable missingness to implement a procedure for adaptive design in surveys. Specifically, I develop a procedure that agencies can use to evaluate whether or not it is effective to stop data collection. This decision is based on utility measures to compare the data collected so far with potential follow-up samples. The options are assessed by imputation of the nonrespondents under different missingness scenarios considered by the analyst. The variation in the utility measures is compared to the cost induced by the follow-up sample sizes. We apply the proposed method to the 2007 U.S. Census of Manufactures.
The third method is for imputation of confidential data sets with spatial locations using disease mapping models. We consider data that include fine geographic information, such as census tract or street block identifiers. This type of data can be difficult to release as public use files, since fine geography provides information that ill-intentioned data users can use to identify individuals. We propose to release data with simulated geographies, so as to enable spatial analyses while reducing disclosure risks. We fit disease mapping models that predict areal-level counts from attributes in the file, and sample new locations based on the estimated models. I illustrate this approach using data on causes of death in North Carolina, including evaluations of the disclosure risks and analytic validity that can result from releasing synthetic geographies.
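A minimal sketch of the release step: given areal-level counts predicted by a fitted disease-mapping model, each record's synthetic location is drawn from the implied multinomial over areas. The tract identifiers and counts below are made up, and the real method predicts counts from record attributes rather than from a single marginal distribution.

```python
# Draw synthetic small-area locations proportional to model-predicted counts.
import numpy as np

rng = np.random.default_rng(1)

tracts = ["37063.0001", "37063.0002", "37063.0003"]   # hypothetical tract IDs
predicted_counts = np.array([120.0, 45.0, 85.0])      # from the areal model
probs = predicted_counts / predicted_counts.sum()

n_records = 10
synthetic_tract = rng.choice(tracts, size=n_records, p=probs)
print(synthetic_tract)
```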
Item Open Access Multiple Imputation on Missing Values in Time Series Data (2015) Oh, Sohae. Financial stock market data frequently contain missing values, for various reasons. One reason is that, because markets close for holidays, daily stock prices are not always observed. This creates gaps in information, making it difficult to predict the following day's stock prices. In this situation, information during the holiday can be "borrowed" from other countries' stock markets, since global stock prices tend to show similar movements and are in fact highly correlated. The main goal of this study is to combine stock index data from various markets around the world and develop an algorithm to impute the missing values in an individual stock index using "information sharing" between different time series. To develop an imputation algorithm that accommodates time-series-specific features, we take a multiple imputation approach using dynamic linear models for time series and panel data. The algorithm assumes an ignorable missing data mechanism, which is appropriate when missingness is due to holidays. The posterior distributions of the parameters, including the missing values, are simulated using Markov chain Monte Carlo (MCMC) methods, and estimates from the sets of draws are then combined using Rubin's combination rule, yielding the final inference for the data set. Specifically, we use the Gibbs sampler and Forward Filtering and Backward Sampling (FFBS) to simulate the joint posterior distribution and the posterior predictive distribution of the latent variables and other parameters. A simulation study is conducted to check the validity and performance of the algorithm using two error-based measurements: Root Mean Square Error (RMSE) and Normalized Root Mean Square Error (NRMSE). We compare the overall trend of the imputed time series with the complete data set, and inspect the in-sample predictability of the algorithm using the Last Value Carried Forward (LVCF) method as a benchmark. The algorithm is applied to real stock price index data from the US, Japan, Hong Kong, the UK, and Germany. From both the simulation and the application, we conclude that the imputation algorithm performs well enough to achieve our original goal of predicting the opening price after a holiday, outperforming the benchmark method. We believe this multiple imputation algorithm can be used in many applications that deal with time series with missing values, such as financial, economic, and biomedical data.
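The FFBS mechanism for a single series can be sketched with a univariate local-level model, where holiday observations enter as NaN and are skipped in the Kalman update; repeated backward-sampling passes then yield posterior draws of the latent prices on the missing days. This assumes known variances V and W and is far simpler than the multivariate panel DLM used in the study.

```python
# Forward-filtering backward-sampling for a local-level model:
#   y_t = x_t + v_t,  x_t = x_{t-1} + w_t,  v ~ N(0, V), w ~ N(0, W).
import numpy as np

rng = np.random.default_rng(3)

def ffbs(y, V=1.0, W=0.5, m0=0.0, C0=10.0):
    n = len(y)
    m, C = np.empty(n), np.empty(n)
    for t in range(n):                          # forward Kalman filter
        a = m0 if t == 0 else m[t - 1]          # one-step-ahead state mean
        R = (C0 if t == 0 else C[t - 1]) + W    # one-step-ahead state variance
        if np.isnan(y[t]):                      # holiday: skip the update step
            m[t], C[t] = a, R
        else:
            K = R / (R + V)                     # Kalman gain
            m[t] = a + K * (y[t] - a)
            C[t] = (1 - K) * R
    x = np.empty(n)                             # backward sampling of states
    x[-1] = rng.normal(m[-1], np.sqrt(C[-1]))
    for t in range(n - 2, -1, -1):
        B = C[t] / (C[t] + W)
        x[t] = rng.normal(m[t] + B * (x[t + 1] - m[t]), np.sqrt((1 - B) * C[t]))
    return x

y = np.array([1.0, 1.2, np.nan, np.nan, 1.6, 1.5])  # nan = market holiday
draws = np.array([ffbs(y) for _ in range(5)])       # one state draw per imputation
print(draws[:, 2:4])    # posterior draws of the latent level on the holidays
```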
Item Open Access Nonparametric Bayesian Methods for Multiple Imputation of Large Scale Incomplete Categorical Data in Panel Studies (2012) Si, Yajuan. This thesis develops nonparametric Bayesian models to handle incomplete categorical variables in high-dimensional data sets within the framework of multiple imputation. It presents methods for ignorable missing data in cross-sectional studies, and for potentially non-ignorable missing data in panel studies with refreshment samples.
The first contribution is a fully Bayesian, joint modeling approach to multiple imputation for categorical data based on Dirichlet process mixtures of multinomial distributions. The approach automatically models complex dependencies while being computationally expedient. I illustrate the repeated sampling properties of the approach using simulated data, and show that it offers better performance than the default chained equations methods often used in such settings. I apply the methodology to impute missing background data in the 2007 Trends in International Mathematics and Science Study.
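A compact Gibbs sketch of a truncated Dirichlet process mixture of products of multinomials (DPMPM), the kind of imputation engine described above: stick-breaking weights, class-conditional multinomials with Dirichlet priors, latent class assignments, and imputation of missing cells from the class-conditional distributions. Data are integer-coded with -1 marking missing cells; the truncation level and priors are illustrative choices, not those of the thesis.

```python
# Truncated stick-breaking Gibbs sampler for a DPMPM with missing cells.
import numpy as np

rng = np.random.default_rng(42)

def dpmpm_gibbs(X, n_levels, K=20, alpha=1.0, n_iter=200):
    """X: (n, p) integer-coded categories, -1 = missing; n_levels[j] = #categories."""
    n, p = X.shape
    miss = X < 0
    X = X.copy()
    for j in range(p):                      # initialize missing cells uniformly
        X[miss[:, j], j] = rng.integers(0, n_levels[j], miss[:, j].sum())
    z = rng.integers(0, K, n)               # initial latent class assignments
    for _ in range(n_iter):
        # Class-conditional category probabilities, phi[k][j] ~ Dirichlet posterior.
        phi = [[rng.dirichlet(1 + np.bincount(X[z == k, j], minlength=n_levels[j]))
                for j in range(p)] for k in range(K)]
        # Truncated stick-breaking weights given the class counts.
        nk = np.bincount(z, minlength=K)
        later = nk[::-1].cumsum()[::-1] - nk        # counts in classes after k
        v = rng.beta(1 + nk, alpha + later)
        v[-1] = 1.0
        pi = v * np.concatenate(([1.0], np.cumprod(1 - v[:-1])))
        # Latent class assignments given the (currently imputed) data.
        logp = np.log(pi) + np.zeros((n, K))
        for k in range(K):
            for j in range(p):
                logp[:, k] += np.log(phi[k][j][X[:, j]])
        prob = np.exp(logp - logp.max(axis=1, keepdims=True))
        prob /= prob.sum(axis=1, keepdims=True)
        z = (prob.cumsum(axis=1) > rng.random((n, 1))).argmax(axis=1)
        # Impute missing cells from their class-conditional multinomials.
        for j in range(p):
            for i in np.where(miss[:, j])[0]:
                X[i, j] = rng.choice(n_levels[j], p=phi[z[i]][j])
    return X
```

Saving the completed X at well-spaced iterations of a longer run would yield the m data sets used for multiple imputation.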
For the second contribution, I extend the nonparametric Bayesian imputation engine to consider a mix of potentially non-ignorable attrition and ignorable item nonresponse in multiple-wave panel studies. Ignoring attrition in models for panel data can result in biased inference if the reason for attrition is systematic and related to the missing values. Panel data alone cannot estimate the attrition effect without untestable assumptions about the missing data mechanism. Refreshment samples offer an extra data source that can be used to estimate the attrition effect while reducing reliance on strong assumptions about the missing data mechanism.
I consider two novel Bayesian approaches to handle attrition and item nonresponse simultaneously under multiple imputation in a two-wave panel with one refreshment sample, when the variables involved are categorical and high dimensional.
First, I present a semi-parametric selection model that includes an additive non-ignorable attrition model with main effects of all variables, including demographic variables and outcome measures in wave 1 and wave 2. The survey variables are modeled jointly using a Bayesian mixture of multinomial distributions. I develop posterior computation algorithms for the semi-parametric selection model under different prior choices for the regression coefficients in the attrition model.
Second, I propose two Bayesian pattern mixture models for this scenario that use latent classes to model the dependency among the variables and the attrition. I develop a dependent Bayesian latent pattern mixture model, in which the survey variables are modeled via latent classes and attrition is treated as a covariate in the class allocation weights. I also develop a joint Bayesian latent pattern mixture model, in which attrition and the survey variables are modeled jointly via latent classes.
I show via simulation studies that the pattern mixture models can recover the true parameter values, even when inferences based on the panel alone are biased by attrition. I apply both the selection and pattern mixture models to data from the 2007-2008 Associated Press/Yahoo News election panel study.
Item Open Access Some Recent Advances in Non- and Semiparametric Bayesian Modeling with Copulas, Mixtures, and Latent Variables (2013) Murray, Jared. This thesis develops flexible non- and semiparametric Bayesian models for mixed continuous, ordered and unordered categorical data. These methods have a range of possible applications; the applications considered in this thesis are drawn primarily from the social sciences, where multivariate, heterogeneous datasets with complex dependence and missing observations are the norm.
The first contribution is an extension of the Gaussian factor model to Gaussian copula factor models, which accommodate continuous and ordinal data with unspecified marginal distributions. I describe how this model is the most natural extension of the Gaussian factor model, preserving its essential dependence structure and the interpretability of the factor loadings and latent variables. I adopt an approximate likelihood for posterior inference and prove that, if the Gaussian copula model is true, the approximate posterior distribution of the copula correlation matrix converges asymptotically to the correct parameter values under nearly any marginal distributions. I demonstrate with simulations that the method is both robust and efficient, and illustrate its use in an application from political science.
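The copula view can be previewed with the deterministic normal-scores transform: each margin is mapped through its empirical CDF to latent Gaussian scores, on which correlation and factor structure are then modeled. The thesis instead works with an approximate likelihood over ranks; the sketch below, with made-up data, only conveys the underlying idea.

```python
# Rank-based normal-scores transform behind the Gaussian copula view.
import numpy as np
from scipy.stats import norm, rankdata

rng = np.random.default_rng(5)
# Made-up non-Gaussian data with dependence across three margins.
y = rng.exponential(size=(500, 3)) @ np.array(
    [[1.0, 0.5, 0.0], [0.0, 1.0, 0.5], [0.0, 0.0, 1.0]])

ranks = np.apply_along_axis(rankdata, 0, y)
z = norm.ppf(ranks / (y.shape[0] + 1))    # latent normal scores per margin
print(np.corrcoef(z, rowvar=False).round(2))   # correlation on the Gaussian scale
```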
The second contribution is a novel nonparametric hierarchical mixture model for continuous, ordered and unordered categorical data. The model includes a hierarchical prior used to couple the component indices of two separate models, which are also linked by local multivariate regressions. This structure effectively overcomes the limitations of existing mixture models for mixed data, namely their overly strong local independence assumptions. In the proposed model, local independence is replaced by local conditional independence, so that the induced model is able to more readily adapt to structure in the data. I demonstrate the utility of this model as a default engine for multiple imputation of mixed data in a large repeated-sampling study using data from the Survey of Income and Program Participation. I show that it improves substantially on its most popular competitor, multiple imputation by chained equations (MICE), while enjoying certain theoretical properties that MICE lacks.
The third contribution is a latent variable model for density regression. Most existing density regression models are quite flexible but somewhat cumbersome to specify and fit, particularly when the regressors are a combination of continuous and categorical variables. The majority of these methods rely on extensions of infinite discrete mixture models to incorporate covariate dependence in mixture weights, atoms or both. I take a fundamentally different approach, introducing a continuous latent variable which depends on covariates through a parametric regression. In turn, the observed response depends on the latent variable through an unknown function. I demonstrate that a spline prior for the unknown function is quite effective relative to Dirichlet process mixture models in density estimation settings (i.e., without covariates), even though these Dirichlet process mixtures have better asymptotic theoretical properties. The spline formulation enjoys a number of computational advantages over more flexible priors on functions. Finally, I demonstrate the utility of this model in regression applications using a dataset on U.S. wages from the Census Bureau, where I estimate the return to schooling as a smooth function of the quantile index.