Browsing by Subject "Missing data"
Item Open Access A Comparison Of Multiple Imputation Methods For Categorical Data (2015) Akande, Olanrewaju Michael. This thesis evaluates the performance of several multiple imputation methods for categorical data, including multiple imputation by chained equations using generalized linear models, multiple imputation by chained equations using classification and regression trees, and nonparametric Bayesian multiple imputation for categorical data (using the Dirichlet process mixture of products of multinomial distributions model). The performance of each method is evaluated with repeated sampling studies using housing unit data from the 2012 American Community Survey. These data afford exploration of practical problems such as multicollinearity and large dimensions. This thesis highlights some advantages and limitations of each method compared to the others. Finally, it provides suggestions on which method should be preferred, and the conditions under which those suggestions hold.
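As a rough illustration of the chained-equations-with-CART idea evaluated above, the following sketch imputes a toy categorical dataset. The data, dimensions, number of sweeps, and tree settings are all illustrative assumptions, not taken from the thesis.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Toy categorical data: 500 units, 4 variables with levels {0, 1, 2},
# and roughly 20% of entries set to missing.
X = rng.integers(0, 3, size=(500, 4)).astype(float)
X[rng.random(X.shape) < 0.2] = np.nan

def mice_cart(X, n_sweeps=5, rng=rng):
    """Chained equations with CART as each conditional imputation model."""
    X = X.copy()
    miss = np.isnan(X)
    # Initialize missing cells by sampling from each column's observed values.
    for j in range(X.shape[1]):
        obs = X[~miss[:, j], j]
        X[miss[:, j], j] = rng.choice(obs, size=miss[:, j].sum())
    for _ in range(n_sweeps):
        for j in range(X.shape[1]):
            if not miss[:, j].any():
                continue
            others = np.delete(np.arange(X.shape[1]), j)
            tree = DecisionTreeClassifier(min_samples_leaf=5, random_state=0)
            tree.fit(X[~miss[:, j]][:, others], X[~miss[:, j], j])
            # Draw from each leaf's class distribution rather than taking
            # the mode, so imputation uncertainty is propagated.
            probs = tree.predict_proba(X[miss[:, j]][:, others])
            X[miss[:, j], j] = [rng.choice(tree.classes_, p=p / p.sum())
                                for p in probs]
    return X

completed = mice_cart(X)
```

Running this loop several times with different random seeds would yield the multiply imputed datasets that the repeated sampling studies compare.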
Item Open Access An optimal Wilcoxon-Mann-Whitney test of mortality and a continuous outcome. (Stat Methods Med Res, 2016-01-01) Matsouaka, Roland A; Singhal, Aneesh B; Betensky, Rebecca A. We consider a two-group randomized clinical trial, where mortality affects the assessment of a follow-up continuous outcome. Using the worst-rank composite endpoint, we develop a weighted Wilcoxon-Mann-Whitney test statistic to analyze the data. We determine the optimal weights for the Wilcoxon-Mann-Whitney test statistic that maximize its power. We derive a formula for its power and demonstrate its accuracy in simulations. Finally, we apply the method to data from an acute ischemic stroke clinical trial of normobaric oxygen therapy.

Item Open Access Bayesian Models for Combining Information from Multiple Sources (2022) Tang, Jiurui. This dissertation develops Bayesian methods for combining information from multiple sources. I focus on developing Bayesian bipartite modeling for simultaneous regression and record linkage, as well as on leveraging auxiliary information on marginal distributions for handling item and unit nonresponse and accounting for survey weights.
The first contribution is a Bayesian hierarchical model that allows analysts to perform simultaneous linear regression and probabilistic record linkage. This model allows analysts to leverage relationships among the variables to improve linkage quality. It also potentially offers more accurate estimates of regression parameters compared to approaches that use a two-step process, i.e., link the records first, then estimate the linear regression on the linked data. I propose and evaluate three Markov chain Monte Carlo algorithms for implementing the Bayesian model.
The second contribution examines the performance of an approach for generating multiple imputation data sets under item nonresponse. The method allows analysts to use auxiliary information. I examine the approach via simulation studies with Poisson sampling, and give suggestions on parameter tuning.
The third contribution is a model-based imputation approach that can handle both item and unit nonresponse while accounting for auxiliary margins and survey weights. This approach includes an innovative combination of a pattern mixture model for unit nonresponse and a selection model for item nonresponse; both can be nonignorable. I demonstrate the model's performance with simulation studies in settings where the design weights for unit respondents are known and in settings where they are not. I show that the model can generate multiple imputation data sets that both retain the relationships among survey variables and yield design-based estimates that agree with the auxiliary margins. I use the model to analyze voter turnout overall and across subgroups in North Carolina, with data from the 2018 Current Population Survey.
Item Open Access Bayesian Models for Imputing Missing Data and Editing Erroneous Responses in Surveys (2019) Akande, Olanrewaju Michael. This thesis develops Bayesian methods for handling unit nonresponse, item nonresponse, and erroneous responses in large scale surveys and censuses containing categorical data. I focus on applications to nested household data where individuals are nested within households and certain combinations of the variables are not allowed, such as the U.S. Decennial Census, as well as surveys subject to both unit and item nonresponse, such as the Current Population Survey.
The first contribution is a Bayesian model for imputing plausible values for item nonresponse in data nested within households, in the presence of impossible combinations. The imputation is done using a nested data Dirichlet process mixture of products of multinomial distributions model, truncated so that impossible household configurations have zero probability in the model. I show how to generate imputations from the Markov Chain Monte Carlo sampler, and describe strategies for improving the computational efficiency of the model estimation. I illustrate the performance of the approach with data that mimic the variables collected in the U.S. Decennial Census. The results indicate that my approach can generate high quality imputations in such nested data.
The second contribution extends the imputation engine in the first contribution to allow for the editing and imputation of household data containing faulty values. The approach relies on a Bayesian hierarchical model that uses the nested data Dirichlet process mixture of products of multinomial distributions as a model for the true unobserved data, but also includes a model for the location of errors, and a reporting model for the observed responses in error. I illustrate the performance of the edit and imputation engine using data from the 2012 American Community Survey. I show that my approach can simultaneously estimate multivariate relationships in the data accurately, adjust for measurement errors, and respect impossible combinations in estimation and imputation.
The third contribution is a framework for using auxiliary information to specify nonignorable models that can handle both item and unit nonresponse simultaneously. My approach focuses on how to leverage auxiliary information from external data sources in nonresponse adjustments. This method is developed for specifying imputation models so that users can posit distinct specifications of missingness mechanisms for different blocks of variables, for example, a nonignorable model for variables with auxiliary marginal information and an ignorable model for the variables exclusive to the survey.
I illustrate the framework using data on voter turnout in the Current Population Survey.
The final contribution extends the framework in the third contribution to complex surveys, handling nonresponse in a way that still leverages auxiliary data while respecting the survey design through survey weights. Using several simulations, I illustrate the performance of my approach when the sample is generated primarily through stratified sampling.
Item Open Access Comparison of regression imputation methods of baseline covariates that predict survival outcomes. (Journal of Clinical and Translational Science, 2020-09) Solomon, Nicole; Lokhnygina, Yuliya; Halabi, Susan

Introduction
Missing data are inevitable in medical research, and appropriate handling of missing data is critical for statistical estimation and inference. Imputation is often employed to maximize the amount of data available for statistical analysis and is preferred over the typically biased output of complete case analysis. This article examines several types of regression imputation of missing covariates in the prediction of time-to-event outcomes subject to right censoring.

Methods
We evaluated the performance of five regression methods for imputing missing covariates in the proportional hazards model via summary statistics, including proportional bias and proportional mean squared error. The primary objective was to determine which of the parametric methods, generalized linear models (GLM) and the least absolute shrinkage and selection operator (LASSO), and the nonparametric methods, multivariate adaptive regression splines (MARS), support vector machines (SVM), and random forests (RF), provides the "best" imputation model for missing baseline covariates in predicting a survival outcome.

Results
On average, LASSO exhibited the smallest bias, mean squared error, mean squared prediction error, and median absolute deviation (MAD) of the final analysis model's parameters among the five methods considered. SVM performed second best, while GLM and MARS exhibited the lowest relative performance.

Conclusion
LASSO and SVM outperform GLM, MARS, and RF in the context of regression imputation for prediction of a time-to-event outcome.

Item Open Access Modeling Missing Data In Panel Studies With Multiple Refreshment Samples (2012) Deng, Yiting. Most panel surveys are subject to missing data caused by panel attrition. The Additive Non-ignorable (AN) model proposed by Hirano et al. (2001) utilizes refreshment samples in panel surveys to impute missing data, and offers flexibility in modeling the missing data mechanism to incorporate both ignorable and non-ignorable models. We extend the AN model to settings with three waves and two refreshment samples. We address identification and estimation issues related to the proposed model under four different types of survey design, characterized by whether the missingness is monotone and whether subjects in the refreshment samples are followed up in subsequent waves of the survey. We apply this approach and multiple imputation techniques to the 2007-2008 Associated Press-Yahoo! News Poll (APYN) panel dataset to analyze factors affecting people's political interest. We find that, when attrition bias is not accounted for, the carry-over effects of past political interest on current political interest are underestimated. This highlights the importance of dealing with attrition bias and the potential of refreshment samples for doing so.
Item Open Access Multiple Imputation Inferences for Count Data (2021) Liu, Bo. Multiple imputation is frequently used for inference with missing data. When the population quantity of interest is required to be an integer, the original methods for inference need to be modified, as point estimates based on the average are generally not integers. In this thesis, I propose a modification of the original combining rules that takes the point estimate to be the median of the quantities computed from the imputed datasets; this estimate is integer-valued whenever the number of imputed datasets is odd. I derive an estimator of the variance of this modified estimator, as well as a method for obtaining confidence intervals. I compare this method to ad-hoc alternatives, such as rounding the original point estimate. Simulations show that the two approaches provide similar results, although the novel method has slightly larger mean absolute error. The coverage rates of both methods are close to the nominal 95%. The correct derivation of the variance is important: simulations show that if one uses the median as the point estimate without correcting the variance, the coverage rate is systematically lower.
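The median-based combining rule described in this abstract can be illustrated with a toy example. All numbers below are invented, and the total variance shown is the standard Rubin formula for the mean-based estimate; the thesis's corrected variance for the median-based estimator is not reproduced here.

```python
import numpy as np

# Point estimates of an integer-valued population quantity from m = 5
# imputed datasets, with their within-imputation variances.
q = np.array([12.0, 11.0, 13.0, 12.0, 14.0])   # one estimate per dataset
u = np.array([0.30, 0.28, 0.33, 0.29, 0.31])   # within-imputation variances

m = len(q)
q_bar = q.mean()               # standard combining rule: 12.4, not an integer
q_med = np.median(q)           # median-based estimate: 12, integer since m is odd
u_bar = u.mean()               # average within-imputation variance
b = q.var(ddof=1)              # between-imputation variance
t = u_bar + (1 + 1 / m) * b    # Rubin's total variance for the mean-based rule
```

The point of the thesis is that pairing `q_med` with `t` as computed above is not enough; the variance must be rederived for the median-based estimator, or coverage falls below nominal.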
Item Open Access Multiple Imputation Methods for Nonignorable Nonresponse, Adaptive Survey Design, and Dissemination of Synthetic Geographies (2014) Paiva, Thais Viana. This thesis presents methods for multiple imputation that can be applied to missing data and to data with confidential variables. Imputation is useful for missing data because it results in a data set that can be analyzed with complete-data statistical methods. The missing data are filled in by values generated from a model fit to the observed data. The model specification depends on the observed data pattern and the missing data mechanism. For example, when the reason the data are missing is related to the outcome of interest, that is, when the missingness is nonignorable, we need to alter the model fit to the observed data so that the imputed values are generated from a different distribution. Imputation is also used for generating synthetic values for data sets with disclosure restrictions. Since the synthetic values are not actual observations, they can be released for statistical analysis. The interest is in fitting a model that approximates well the relationships in the original data, preserving the utility of the synthetic data while protecting the confidentiality of the original data. We consider applications of these methods to data from the social sciences and epidemiology.
The first method is for imputation of multivariate continuous data with nonignorable missingness. Regular imputation methods have been used to deal with nonresponse in several types of survey data. However, in some of these studies, the missing at random assumption is not valid, since the probability of missingness depends on the response variable. We propose an imputation method for multivariate data sets when there is nonignorable missingness. We fit a truncated Dirichlet process mixture of multivariate normals to the observed data under a Bayesian framework to provide flexibility. With the posterior samples from the mixture model, an analyst can alter the estimated distribution to obtain imputed data under different scenarios. To facilitate that, I developed an R application that allows the user to alter the values of the mixture parameters and visualize the imputation results automatically. I demonstrate this process of sensitivity analysis with an application to the Colombian Annual Manufacturing Survey. I also include a simulation study to show that the correct complete data distribution can be recovered if the true missing data mechanism is known, thus validating that the method can be meaningfully interpreted to do sensitivity analysis.
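A minimal sketch of this sensitivity-analysis idea follows, using a finite Gaussian mixture from scikit-learn as a stand-in for the truncated Dirichlet process mixture fit in the thesis. The data, the missingness mechanism, and the shift `delta` are all invented for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# Toy bivariate data; whole rows go missing with probability increasing
# in the second coordinate, a nonignorable mechanism.
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 1.0]], size=1000)
miss = rng.random(1000) < 1.0 / (1.0 + np.exp(2.0 - X[:, 1]))

# Fit a mixture to the observed rows.
gmm = GaussianMixture(n_components=5, random_state=0).fit(X[~miss])

# Sensitivity analysis: shift the component means of the second variable
# upward before imputing, encoding a belief that nonrespondents tend to
# have larger values of it.
delta = 1.0
shifted_means = gmm.means_ + np.array([0.0, delta])

def impute(n, gmm, means, rng):
    """Sample n imputed rows from the (possibly altered) mixture."""
    w = gmm.weights_ / gmm.weights_.sum()
    comps = rng.choice(len(w), size=n, p=w)
    return np.array([rng.multivariate_normal(means[k], gmm.covariances_[k])
                     for k in comps])

imputed = impute(int(miss.sum()), gmm, shifted_means, rng)
```

Repeating the imputation under several values of `delta` and comparing the resulting estimates is the essence of the sensitivity analysis the R application automates.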
The second method uses the imputation techniques for nonignorable missingness to implement a procedure for adaptive design in surveys. Specifically, I develop a procedure that agencies can use to evaluate whether or not it is effective to stop data collection. This decision is based on utility measures to compare the data collected so far with potential follow-up samples. The options are assessed by imputation of the nonrespondents under different missingness scenarios considered by the analyst. The variation in the utility measures is compared to the cost induced by the follow-up sample sizes. We apply the proposed method to the 2007 U.S. Census of Manufactures.
The third method is for imputation of confidential data sets with spatial locations using disease mapping models. We consider data that include fine geographic information, such as census tract or street block identifiers. This type of data can be difficult to release as public use files, since fine geography provides information that ill-intentioned data users can use to identify individuals. We propose to release data with simulated geographies, so as to enable spatial analyses while reducing disclosure risks. We fit disease mapping models that predict areal-level counts from attributes in the file, and sample new locations based on the estimated models. I illustrate this approach using data on causes of death in North Carolina, including evaluations of the disclosure risks and analytic validity that can result from releasing synthetic geographies.
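The release step for synthetic geographies can be caricatured as sampling a new area for each record in proportion to a model's predicted rate. The per-area rates below are invented stand-ins for the fitted disease-mapping model; none of these numbers come from the North Carolina application.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative stand-in for a fitted disease-mapping model: predicted
# event rates for K areas at each of three attribute levels.
K = 10
rates = rng.gamma(2.0, 1.0, size=(K, 3))

def sample_synthetic_area(attr_level, rates, rng):
    """Draw a synthetic area in proportion to the predicted rate."""
    p = rates[:, attr_level] / rates[:, attr_level].sum()
    return rng.choice(len(p), p=p)

# Replace each record's true area with a sampled synthetic one.
attrs = rng.integers(0, 3, size=100)
synthetic_areas = [sample_synthetic_area(a, rates, rng) for a in attrs]
```

Because each synthetic location is drawn from the model rather than copied from the record, spatial analyses remain possible while the link between a record and its true fine geography is broken.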
Item Open Access Stochastic Latent Domain Approaches to the Recovery and Prediction of High Dimensional Missing Data (2023) Cannella, Christopher Brian. This work presents novel techniques for approaching missing data using generative models. The main focus of these techniques is on leveraging the latent spaces of generative models, both to improve inference performance and to overcome many of the architectural challenges that missing data pose for current generative models. This work includes both methodologies that are broadly applicable regardless of model architecture and model-specific techniques.
The first half of this work is dedicated to model-agnostic techniques. Here, we present our Linearized-Marginal Restricted Boltzmann Machine (LM-RBM), a method for directly approximating the conditional and marginal distributions of RBMs used to infer missing data. We also present our Semi-Empirical Ab Initio objective functions for Markov chain Monte Carlo (MCMC) proposal optimization, which are objective functions of a restricted functional class fit to recover analytically known optimal proposals. These objective functions are shown to avoid failures exhibited by current objective functions for MCMC proposal optimization with highly expressive neural proposals, and enable more confident optimization of deep generative architectures for MCMC techniques.
The second half of this work is dedicated to techniques applicable to specific generative architectures. We present Projected-Latent Markov Chain Monte Carlo (PL-MCMC), a technique for performing asymptotically exact conditional inference of missing data using normalizing flows. We evaluate the performance of PL-MCMC on tasks of training from and inferring missing data. We also present our Perceiver Attentional Copula for Time Series (PrACTiS), which utilizes attention with learned latent vectors to significantly improve the computational efficiency of attention-based modeling, in light of the additional challenges that time series data pose for missing data inference.