Browsing by Author "Reiter, Jerome P"
Item Open Access 2014 Data Expeditions Call for Proposals (2015-11-30) Reiter, Jerome P; Calderbank, Robert

Item Open Access 2015 Call for Proposals (2015) Bendich, Paul L; Calderbank, Robert; Reiter, Jerome P

Item Open Access A Comparison Of Multiple Imputation Methods For Categorical Data (2015) Akande, Olanrewaju Michael
This thesis evaluates the performance of several multiple imputation methods for categorical data, including multiple imputation by chained equations using generalized linear models, multiple imputation by chained equations using classification and regression trees, and non-parametric Bayesian multiple imputation for categorical data (using the Dirichlet process mixture of products of multinomial distributions model). The performance of each method is evaluated with repeated sampling studies using housing unit data from the American Community Survey 2012. These data afford exploration of practical problems such as multicollinearity and large dimensions. This thesis highlights some advantages and limitations of each method compared to others. Finally, it provides suggestions on which method should be preferred, and conditions under which the suggestions hold.
Item Open Access A Differentially Private Bayesian Approach to Replication Analysis (2022) Yang, Chengxin
Replication analysis is widely used in many fields of study. Once a study is published, other researchers may conduct analyses to assess the reliability of the published findings. However, what if the data are confidential? In particular, if the data sets used in the studies are confidential, we cannot release the results of replication analyses to any entity without permission to access the data sets; otherwise, we risk privacy leakage, especially when the published study and the replication studies use similar or common data sets. In this paper, we present two methods for replication analysis. We illustrate the properties of our methods with a combination of theoretical analysis and simulation.
Item Open Access Bayesian Approaches to File Linking with Faulty Data (2017) Dalzell, Nicole M
File linking allows analysts to combine information from two or more sources, creating linked databases. Applications range from linking school records to track student progress across years, to official statistics that link patient health files. Linked databases allow analysts to use existing sources of information to perform rich statistical analyses. However, the quality of this inference depends on the accuracy of the data linkage. In this dissertation, we present methods for file linking and for performing inference on linked data when the variables used to estimate matches are believed to be inconsistent, incorrect, or missing.
In Chapter 2, we present BLASE, a Bayesian file matching methodology designed to estimate regression models and match records simultaneously when categorical matching variables may not agree for some matched pairs. The method relies on a hierarchical model that includes (1) the regression of interest involving variables from the two files given a vector indicating the links, (2) a model for the linking vector given the true values of the matching variables, (3) a measurement error model for reported values of the matching variables given their true values, and (4) a model for the true values of the matching variables. We also describe algorithms for sampling from the posterior distribution of the model and illustrate the methodology using artificial data and data from education records in the state of North Carolina.
In Chapter 3, we present LFCMV, a Bayesian file linking methodology designed to link records using continuous matching variables in situations where we do not expect values of these matching variables to agree exactly across matched pairs. The method involves a linking model for the distance between the matching variables of records in one file and the matching variables of their linked records in the second. This linking model is conditional on a vector indicating the links. We specify a mixture model for the distance component of the linking model, as this latent structure allows the distance between matching variables in linked pairs to vary across types of linked pairs. Finally, we specify a model for the linking vector. We describe the Gibbs sampling algorithm for sampling from the posterior distribution of this linkage model and use artificial data to illustrate model performance. We also introduce a linking application using public survey information and data from the U.S. Census Bureau's Census of Manufactures and use LFCMV to link the records.
The linkage techniques in Chapters 2 and 3 assume that all data involved in the linking are complete, i.e., contain no missing data. However, file linking applications can involve files prone to item nonresponse. The linking application in Chapter 3, for instance, involves a file that has been completed by imputing missing data. In Chapter 4, we use simulated data to examine the impact of imputations in a file linking scenario. Specifically, we frame linking multiply imputed data sets as a two-stage imputation scenario and, within this context, conduct a simulation study in which we introduce missing data, impute the missing values, and compare the accuracy of estimating matches on the imputed versus the fully observed data sets. We also consider the process of performing inference and discuss a Bayesian technique for posterior estimation on multiply imputed linked data sets. We apply this technique to the simulated data and examine the effect of the imputations upon posterior estimates.
Item Open Access Bayesian Inference on Ratios Subject to Differentially Private Noise (2021) Li, Linlin
Data privacy is a long-standing issue for data sharing, especially with health-related data. Agencies need to find a balance between data utility and the privacy of respondents. Differential privacy provides a solution by ensuring that the statistic of interest remains essentially the same regardless of whether any individual is included or excluded. Because of this differentially private perturbation, analysts need to infer the true value from the released value. In this thesis, I propose Bayesian inference methods to infer the posterior distribution of a ratio of two counts given the released values. I illustrate the Bayesian inference methods under several scenarios, with two commonly used differentially private mechanisms and prior distributions. The Bayesian inference method not only provides a point estimate but also posterior intervals. Simulation studies show that the Bayesian inference method can generate accurate inferences with close to nominal coverage rates and small bias and mean squared error.
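The inferential step described above can be sketched roughly as follows. This is a minimal illustration rather than the thesis's model: it assumes independent Poisson priors on the two true counts, the Laplace mechanism with a known privacy parameter, and a simple grid approximation to the posterior; the function name, prior means, and grid bound are hypothetical.

```python
# Sketch: posterior of a ratio of two counts released under the Laplace mechanism,
# assuming independent Poisson priors and a grid over plausible integer values.
import numpy as np
from scipy import stats

def ratio_posterior(a_released, b_released, eps, prior_mean_a, prior_mean_b, grid_max=500):
    scale = 1.0 / eps  # Laplace scale for a count query of sensitivity 1
    grid = np.arange(0, grid_max + 1)
    # Unnormalized posteriors factorize because the two counts are perturbed independently.
    post_a = stats.poisson.pmf(grid, prior_mean_a) * stats.laplace.pdf(a_released, loc=grid, scale=scale)
    post_b = stats.poisson.pmf(grid, prior_mean_b) * stats.laplace.pdf(b_released, loc=grid, scale=scale)
    post_a /= post_a.sum()
    post_b /= post_b.sum()
    # Monte Carlo draws of the ratio a / b, conditioning on b > 0.
    rng = np.random.default_rng(0)
    a_draws = rng.choice(grid, size=10000, p=post_a)
    b_draws = rng.choice(grid, size=10000, p=post_b)
    keep = b_draws > 0
    ratios = a_draws[keep] / b_draws[keep]
    return np.mean(ratios), np.percentile(ratios, [2.5, 97.5])

# Hypothetical noisy releases and priors:
point, interval = ratio_posterior(42.7, 118.3, eps=1.0, prior_mean_a=40, prior_mean_b=120)
```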
Item Open Access Bayesian Inference Via Partitioning Under Differential Privacy (2018) Amitai, Gilad
In this thesis, I develop differentially private methods to report posterior probabilities and posterior quantiles of linear regression coefficients. I accomplish this by randomly partitioning the data, taking an intermediate outcome of the data within each partition, aggregating the intermediate outcomes so that they approximate the statistic of interest, and adding Laplace noise to ensure differential privacy. I find the posterior probability by assuming that the variance of the posterior distribution given data from one partition is proportional to the variance of the posterior distribution given the full dataset. The mean posterior probability of the data within each partition is found as the intermediate outcome. The posterior probabilities given the data within one partition are averaged and the variance is rescaled so that the averaged probability approximates the posterior probability given the full dataset. Added noise ensures that the released quantity satisfies differential privacy. I find the posterior quantile by fitting the Bayesian model on the data within each partition where the likelihood has been inflated to rescale the posterior variance so that it approximates the posterior variance given the full dataset. The posterior quantile of the data within each partition is found as an intermediate outcome, and averaged to approximate the posterior quantile given the whole dataset. I add noise to ensure the released quantity satisfies differential privacy. Simulations show that both the partitioning methods and the noise mechanism can return accurate estimates of the statistics they are perturbing.
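The partition-average-perturb idea can be sketched roughly as below, assuming the per-partition quantity is a posterior probability bounded in [0, 1], so that the average over k partitions has sensitivity 1/k. The variance rescaling described in the abstract is omitted, and all names are illustrative.

```python
# Sketch of subsample-and-aggregate style release of an averaged per-partition quantity.
import numpy as np

def dp_partition_average(data, k, per_partition_estimate, eps, rng=None):
    rng = rng or np.random.default_rng()
    idx = rng.permutation(len(data))
    partitions = np.array_split(idx, k)
    # Intermediate outcome computed within each disjoint partition.
    estimates = [per_partition_estimate(data[part]) for part in partitions]
    avg = np.mean(estimates)
    # Changing one record affects one partition's estimate by at most 1 (it lies in [0, 1]),
    # so the average changes by at most 1/k.
    noisy = avg + rng.laplace(scale=(1.0 / k) / eps)
    return min(max(noisy, 0.0), 1.0)  # clip back to a valid probability
```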
Item Unknown Bayesian Models for Combining Information from Multiple Sources (2022) Tang, Jiurui
This dissertation develops Bayesian methods for combining information from multiple sources. I focus on developing Bayesian bipartite modeling for simultaneous regression and record linkage, as well as leveraging auxiliary information on marginal distributions for handling item and unit nonresponse and accounting for survey weights.
The first contribution is a Bayesian hierarchical model that allows analysts to perform simultaneous linear regression and probabilistic record linkage. This model allows analysts to leverage relationships among the variables to improve linkage quality. It also potentially offers more accurate estimates of regression parameters compared to approaches that use a two-step process, i.e., link the records first, then estimate the linear regression on the linked data. I propose and evaluate three Markov chain Monte Carlo algorithms for implementing the Bayesian model.
The second contribution is an examination of the performance of an approach for generating multiple imputation data sets for item nonresponse. The method allows analysts to use auxiliary information. I examine the approach via simulation studies with Poisson sampling. I also give suggestions on parameter tuning.
The third contribution is a model-based imputation approach that can handle both item and unit nonresponse while accounting for auxiliary margins and survey weights. This approach includes an innovative combination of a pattern mixture model for unit nonresponse and a selection model for item nonresponse. Both unit and item nonresponse can be nonignorable. I demonstrate the model performance with simulation studies under the situations when the design weights for unit respondents are known and when they are not. I show that the model can generate multiple imputation data sets that both retain the relationship among survey variables and yield design-based estimates that agree with auxiliary margins. I use the model to analyze voter turnout overall and across subgroups in North Carolina, with data from the 2018 Current Population Survey.
Item Unknown Bayesian Models for Imputing Missing Data and Editing Erroneous Responses in Surveys (2019) Akande, Olanrewaju Michael
This thesis develops Bayesian methods for handling unit nonresponse, item nonresponse, and erroneous responses in large scale surveys and censuses containing categorical data. I focus on applications to nested household data where individuals are nested within households and certain combinations of the variables are not allowed, such as the U.S. Decennial Census, as well as surveys subject to both unit and item nonresponse, such as the Current Population Survey.
The first contribution is a Bayesian model for imputing plausible values for item nonresponse in data nested within households, in the presence of impossible combinations. The imputation is done using a nested data Dirichlet process mixture of products of multinomial distributions model, truncated so that impossible household configurations have zero probability in the model. I show how to generate imputations from the Markov Chain Monte Carlo sampler, and describe strategies for improving the computational efficiency of the model estimation. I illustrate the performance of the approach with data that mimic the variables collected in the U.S. Decennial Census. The results indicate that my approach can generate high quality imputations in such nested data.
The second contribution extends the imputation engine in the first contribution to allow for the editing and imputation of household data containing faulty values. The approach relies on a Bayesian hierarchical model that uses the nested data Dirichlet process mixture of products of multinomial distributions as a model for the true unobserved data, but also includes a model for the location of errors, and a reporting model for the observed responses in error. I illustrate the performance of the edit and imputation engine using data from the 2012 American Community Survey. I show that my approach can simultaneously estimate multivariate relationships in the data accurately, adjust for measurement errors, and respect impossible combinations in estimation and imputation.
The third contribution is a framework for using auxiliary information to specify nonignorable models that can handle both item and unit nonresponse simultaneously. My approach focuses on how to leverage auxiliary information from external data sources in nonresponse adjustments. This method is developed for specifying imputation models so that users can posit distinct specifications of missingness mechanisms for different blocks of variables, for example, a nonignorable model for variables with auxiliary marginal information and an ignorable model for the variables exclusive to the survey.
I illustrate the framework using data on voter turnout in the Current Population Survey.
The final contribution extends the framework in the third contribution to complex surveys, specifically, handling nonresponse in complex surveys, such that we can still leverage auxiliary data while respecting the survey design through survey weights. Using several simulations, I illustrate the performance of my approach when the sample is generated primarily through stratified sampling.
Item Unknown Combining Information from Multiple Sources in Bayesian Modeling (2016) Schifeling, Tracy Anne
Surveys can collect important data that inform policy decisions and drive social science research. Large government surveys collect information from the U.S. population on a wide range of topics, including demographics, education, employment, and lifestyle. Analysis of survey data presents unique challenges. In particular, one needs to account for missing data, for complex sampling designs, and for measurement error. Conceptually, a survey organization could spend substantial resources getting high-quality responses from a simple random sample, resulting in survey data that are easy to analyze. However, this scenario often is not realistic. To address these practical issues, survey organizations can leverage the information available from other sources of data. For example, in longitudinal studies that suffer from attrition, they can use the information from refreshment samples to correct for potential attrition bias. They can use information from known marginal distributions or the survey design to improve inferences. They can use information from gold standard sources to correct for measurement error.
This thesis presents novel approaches to combining information from multiple sources that address the three problems described above.
The first method addresses nonignorable unit nonresponse and attrition in a panel survey with a refreshment sample. Panel surveys typically suffer from attrition, which can lead to biased inference when basing analysis only on cases that complete all waves of the panel. Unfortunately, the panel data alone cannot inform the extent of the bias due to attrition, so analysts must make strong and untestable assumptions about the missing data mechanism. Many panel studies also include refreshment samples, which are data collected from a random sample of new individuals during some later wave of the panel. Refreshment samples offer information that can be utilized to correct for biases induced by nonignorable attrition while reducing reliance on strong assumptions about the attrition process. To date, these bias correction methods have not dealt with two key practical issues in panel studies: unit nonresponse in the initial wave of the panel and in the refreshment sample itself. As we illustrate, nonignorable unit nonresponse can significantly compromise the analyst's ability to use the refreshment samples for attrition bias correction. Thus, it is crucial for analysts to assess how sensitive their inferences, corrected for panel attrition, are to different assumptions about the nature of the unit nonresponse. We present an approach that facilitates such sensitivity analyses, both for suspected nonignorable unit nonresponse in the initial wave and in the refreshment sample. We illustrate the approach using simulation studies and an analysis of data from the 2007-2008 Associated Press/Yahoo News election panel study.
The second method incorporates informative prior beliefs about marginal probabilities into Bayesian latent class models for categorical data. The basic idea is to append synthetic observations to the original data such that (i) the empirical distributions of the desired margins match those of the prior beliefs, and (ii) the values of the remaining variables are left missing. The degree of prior uncertainty is controlled by the number of augmented records. Posterior inferences can be obtained via typical MCMC algorithms for latent class models, tailored to deal efficiently with the missing values in the concatenated data. We illustrate the approach using a variety of simulations based on data from the American Community Survey, including an example of how augmented records can be used to fit latent class models to data from stratified samples.
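The augmentation idea behind this second method can be sketched as follows, assuming a single categorical margin variable. The variable names, the prior probabilities, and the use of pandas missing values are illustrative assumptions, not the authors' implementation.

```python
# Sketch: append synthetic records whose margin variable reflects the prior marginal
# probabilities while all other variables are left missing. The number of augmented
# records controls the strength of the prior.
import numpy as np
import pandas as pd

def augment_with_margin(df, margin_var, prior_probs, n_aug):
    levels = list(prior_probs.keys())
    counts = np.round(np.array([prior_probs[l] for l in levels]) * n_aug).astype(int)
    # Synthetic rows: everything missing except the margin variable.
    synth = pd.DataFrame({col: pd.NA for col in df.columns}, index=range(counts.sum()))
    synth[margin_var] = np.repeat(levels, counts)
    synth["augmented"] = True
    return pd.concat([df.assign(augmented=False), synth], ignore_index=True)

# e.g., a hypothetical prior that 52% of respondents are female:
# augmented = augment_with_margin(acs, "sex", {"female": 0.52, "male": 0.48}, n_aug=1000)
```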
The third method leverages the information from a gold standard survey to model reporting error. Survey data are subject to reporting error when respondents misunderstand the question or accidentally select the wrong response. Sometimes survey respondents knowingly select the wrong response, for example, by reporting a higher level of education than they actually have attained. We present an approach that allows an analyst to model reporting error by incorporating information from a gold standard survey. The analyst can specify various reporting error models and assess how sensitive their conclusions are to different assumptions about the reporting error process. We illustrate the approach using simulations based on data from the 1993 National Survey of College Graduates. We use the method to impute error-corrected educational attainments in the 2010 American Community Survey using the 2010 National Survey of College Graduates as the gold standard survey.
Item Unknown Community assessment project: understanding the built environment within a neighborhood health context (2009-04-24) Kroeger, Gretchen
Purpose: Research shows evidence of associations between the built environment (BE)—housing, commercial buildings, community resources, and infrastructure—and health outcomes. However, there is less research describing the spatial variation of BE conditions. This master's project demonstrates the impact of this variation with a database describing the BE within a neighborhood health context.
Hypothesis: The hypothesis tested is two-fold: 1) the assessment tool enables the quantification of BE conditions, and 2) the data generated offer a comprehensive index for relating the BE to public health.
Methods: Trained assessors canvassed over 17,000 tax parcels in Central Durham, NC using a standardized visual assessment of 40 distinct BE variables. Data were summed into 8 indices—housing damage, property damage, security level, tenure, vacancy, crime incidents, amenities, and nuisances. Census blocks were assigned an index based on the summary score of primarily and secondarily adjacent blocks.
Results: The indices describe the spatial distribution of both community assets and BE conditions that are likely to affect the health of residents. Housing damage, property damage, security level, vacancy, crime incidents, and nuisances all had higher scores for blocks located in areas characterized by high minority and low socioeconomic status. Similarly, a low tenure score described those same blocks, indicating that the majority of residential properties within those blocks are renter-occupied.
Conclusions: The community assessment tool offers a comprehensive inventory of the BE, facilitating the generation of indices measuring neighborhood health. These resulting data are useful to community members, researchers, and government leaders.

Item Unknown Differentially Private Counts with Additive Constraints (2021) Wang, Ziang
Differential privacy is a rigorous mathematical definition of privacy, which guarantees that a data analysis should not be able to distinguish whether any individual is in or out of the dataset. To reduce disclosure risks, statistical agencies and other organizations can release noisy counts that satisfy differential privacy. In some contexts, the released counts need to satisfy additive constraints; for example, the released value of a total should equal the sum of the released values of its components. In this thesis, I present a simple post-processing procedure for satisfying such additive constraints. The basic idea is to (i) compute approximate posterior modes or obtain posterior samples of the true counts given the noisy counts, (ii) construct a multinomial distribution with trial size equal to the posterior mode or posterior draw of the total and probability vector equal to fractions derived from the posterior modes or posterior draws of the components, and (iii) find and release a mode or samples of this multinomial distribution. I also present an approach for making Bayesian inferences about the true counts given these post-processed, differentially private counts. I illustrate these methods using simulations.
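Steps (i) through (iii) of the post-processing procedure might look roughly like the sketch below, which substitutes a crude rounding-and-clipping step for the posterior modes described in the abstract; all names are hypothetical.

```python
# Sketch: restore additivity for one total and its components after noisy release.
import numpy as np

def additive_dp_counts(noisy_total, noisy_components, rng=None):
    rng = rng or np.random.default_rng()
    # Crude stand-in for the posterior modes: clip negatives and round to integers.
    est_total = max(int(round(noisy_total)), 0)
    est_comp = np.clip(np.round(noisy_components), 0, None)
    if est_comp.sum() == 0:
        probs = np.full(len(est_comp), 1.0 / len(est_comp))
    else:
        probs = est_comp / est_comp.sum()
    # Draw the components from a multinomial with the estimated total as trial size,
    # so the released components add up to the released total by construction.
    released = rng.multinomial(est_total, probs)
    return est_total, released
```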
Item Open Access Differentially Private Verification of Predictions from Synthetic Data (2017) Yu, Haoyang
When data are confidential, one approach for releasing publicly available files is to create synthetic data, i.e., data simulated from statistical models estimated on the confidential data. Given access only to synthetic data, users have no way to tell whether the synthetic data preserve the adequacy of their analyses. Thus, I present methods that can help users make such assessments automatically while controlling the disclosure risks for the confidential data. Three verification methods are presented in this thesis: differentially private prediction tolerance intervals, differentially private prediction histograms, and a differentially private Kolmogorov-Smirnov test. I use simulations to illustrate these prediction verification methods.
Item Open Access Differentially Private Verification with Survey Weights (2023) Lin, Tong
Survey sampling is a popular technique used in various fields for making inferences about populations from samples. However, the release of survey data can lead to confidentiality concerns due to the presence of sensitive information about individuals. To mitigate this issue, data stewards generate synthetic data that reflect the statistical features of the confidential data while obscuring sensitive variables. Synthetic data can be released for public use as a substitute for confidential data. However, the quality of synthetic data may affect the accuracy of inferences drawn from it. Therefore, assessing the quality of inferences derived from synthetic data is essential. Researchers have proposed a verification procedure that allows analysts to submit queries regarding their inferences and evaluate their accuracy by comparing results from synthetic data with those from confidential data. This approach enables the protection of individual privacy while facilitating the public use of confidential data.
This thesis proposes a differentially private verification measure for synthetic data in the context of complex survey designs. To ensure differential privacy, we use the sub-sample and aggregate method. We partition the confidential data into disjoint partitions and compute survey-weighted estimates of the statistics of interest. Analysts can set a tolerance interval reflecting their desired level of accuracy for estimates from synthetic data. Since smaller partitions have higher variance in their estimates, we suggest using a wider tolerance interval for the partitions. We refer to a tolerance interval that does not account for this higher variance as a fixed tolerance interval, and to a tolerance interval with such inflation as a varying one. We define an indicator of whether the estimate from each partition falls within the tolerance interval and compute the sum of the indicators across all partitions. To satisfy differential privacy, we add noise from the Laplace mechanism to this metric. Bayesian post-processing is then applied to improve interpretability, and summary statistics of the posterior distribution of the metric are released.
The proposed measure generalizes the application of privacy-preserving techniques and enables analysts to validate the quality of their inferences based on synthetic data in the context of complex survey data sets.
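A bare-bones version of the verification count, before the Bayesian post-processing step, might look like the sketch below. The sensitivity-1 argument assumes each record appears in exactly one partition, and all names and the tolerance interval are illustrative.

```python
# Sketch: sub-sample and aggregate verification count with Laplace noise.
import numpy as np

def dp_verification_count(values, weights, n_partitions, tol_low, tol_high, eps, rng=None):
    rng = rng or np.random.default_rng()
    idx = rng.permutation(len(values))
    parts = np.array_split(idx, n_partitions)
    # Survey-weighted estimate within each disjoint partition.
    estimates = [np.average(values[p], weights=weights[p]) for p in parts]
    # Count how many partition estimates fall inside the analyst's tolerance interval.
    inside = sum(tol_low <= est <= tol_high for est in estimates)
    # Changing one record moves the count by at most 1, so Laplace(1/eps) noise suffices.
    return inside + rng.laplace(scale=1.0 / eps)
```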
Item Open Access Dirichlet Process Mixture Models for Nested Categorical Data (2015) Hu, Jingchen
This thesis develops Bayesian latent class models for nested categorical data, e.g., people nested in households. The applications focus on generating synthetic microdata for public release and imputing missing data for household surveys, such as the 2010 U.S. Decennial Census.
The first contribution is methods for evaluating disclosure risks in fully synthetic categorical data. I quantify disclosure risks by computing Bayesian posterior probabilities that intruders can learn confidential values given the released data and assumptions about their prior knowledge. I demonstrate the methodology on a subset of data from the American Community Survey (ACS). The methods can be adapted to synthesizers for nested data, as demonstrated in later chapters of the thesis.
The second contribution is a novel two-level latent class model for nested categorical data. Here, I assume that all configurations of groups and units are theoretically possible. I use a nested Dirichlet Process prior distribution for the class membership probabilities. The nested structure facilitates simultaneous modeling of variables at both group and unit levels. I illustrate the modeling by generating synthetic data and imputing missing data for a subset of data from the 2012 ACS household data. I show that the model can capture within group relationships more effectively than standard one-level latent class models.
The third contribution is a version of the nested latent class model adapted for theoretically impossible combinations, e.g. a household with two household heads or a child older than her biological father. This version assigns zero probability to those impossible groups and units. I present a proof that the Markov Chain Monte Carlo (MCMC) sampling strategy estimates the desired target distribution. I illustrate this model by generating synthetic data and imputing missing data for a subset of data from the 2011 ACS household data. The results indicate that this version can estimate the joint distribution more effectively than the previous version.
Item Open Access Methods for Imputing Missing Values and Synthesizing Confidential Values for Continuous and Magnitude Data (2016) Wei, Lan
Continuous variables are one of the major data types collected by survey organizations. They can be incomplete, so that data collectors need to fill in the missing values, or they can contain sensitive information that needs protection from re-identification. One approach to protecting continuous microdata is to sum the values within cells defined by different features. In this thesis, I present novel methods of multiple imputation (MI) that can be applied to impute missing values and synthesize confidential values for continuous and magnitude data.
The first method is for limiting the disclosure risk of the continuous microdata whose marginal sums are fixed. The motivation for developing such a method comes from the magnitude tables of non-negative integer values in economic surveys. I present approaches based on a mixture of Poisson distributions to describe the multivariate distribution so that the marginals of the synthetic data are guaranteed to sum to the original totals. At the same time, I present methods for assessing disclosure risks in releasing such synthetic magnitude microdata. The illustration on a survey of manufacturing establishments shows that the disclosure risks are low while the information loss is acceptable.
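One way to see how synthetic components can be forced to respect a fixed total: independent Poisson counts conditioned on their sum follow a multinomial distribution, so a sketch like the one below, with placeholder rates standing in for the fitted mixture components, always reproduces the original total.

```python
# Sketch: synthesize non-negative integer components whose sum equals a fixed total.
import numpy as np

def synthesize_with_fixed_total(total, poisson_rates, rng=None):
    rng = rng or np.random.default_rng()
    probs = np.asarray(poisson_rates, dtype=float)
    probs /= probs.sum()  # conditioning Poissons on their sum gives multinomial probabilities
    return rng.multinomial(total, probs)  # always sums to `total`

# e.g., with hypothetical rates for three components:
# synthesize_with_fixed_total(1250, poisson_rates=[300.0, 550.0, 400.0])
```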
The second method is for releasing synthetic continuous microdata by a nonstandard MI method. Traditionally, MI fits a model on the confidential values and then generates multiple synthetic datasets from this model. Its disclosure risk tends to be high, especially when the original data contain extreme values. I present a nonstandard MI approach conditioned on protective intervals. Its basic idea is to estimate the model parameters from these intervals rather than from the confidential values. The encouraging results of simple simulation studies suggest the potential of this new approach for limiting the posterior disclosure risk.
The third method is for imputing missing values in continuous and categorical variables. It extends a hierarchically coupled mixture model with local dependence. However, the new method separates the variables into non-focused (e.g., almost fully observed) and focused (e.g., subject to substantial missingness) variables. The sub-model structure for the focused variables is more complex than that for the non-focused ones. At the same time, their cluster indicators are linked together by tensor factorization, and the focused continuous variables depend locally on non-focused values. The model properties suggest that moving strongly associated non-focused variables to the focused side can help improve estimation accuracy, which is examined in several simulation studies. This method is applied to data from the American Community Survey.
Item Open Access Missing Data Imputation for Voter Turnout Using Auxiliary Margins (2020) Ren, Yangfan
Missing data are one of the essential problems in data analysis. Typically, researchers handle them by making strong assumptions or imposing constraints, such as missing at random. However, these assumptions are sometimes unverifiable and can introduce severe bias when data are missing systematically. Under such circumstances, it is desirable to consider nonignorable missingness mechanisms. Since the missing values are inaccessible from the observed data alone, the missing data with auxiliary margins (MD-AM) framework proposed by Akande et al. (2019) provides a flexible method for specifying nonignorable missingness models by incorporating auxiliary margins. Previous research applied the MD-AM framework to CPS voter turnout data with few variables. In this thesis, I extend the MD-AM framework to turnout data with nine primary variables. By changing the assumptions about how voting affects missingness, I specify two chain models for the joint distribution of the primary variables, their associated missingness indicator models, and the unit nonresponse model. Furthermore, I conduct sensitivity checks for the two models and compare the results.
Item Open Access Model Selection and Multivariate Inference Using Data Multiply Imputed for Disclosure Limitation and Nonresponse (2007-12-07) Kinney, Satkartar K
This thesis proposes some inferential methods for use with multiple imputation for missing data and statistical disclosure limitation, and describes an application of multiple imputation to protect data confidentiality. A third component concerns model selection in random effects models. The use of multiple imputation to generate partially synthetic public release files for confidential datasets has the potential to limit unauthorized disclosure while allowing valid inferences to be made. When confidential datasets contain missing values, it is natural to use multiple imputation to handle the missing data simultaneously with the generation of synthetic data. This is done in a two-stage process so that the variability may be estimated properly. The combining rules for data multiply imputed in this fashion differ from those developed for multiple imputation in a single stage. Combining rules for scalar estimands have been derived previously; here, hypothesis tests for multivariate components are derived. Longitudinal business data are widely desired by researchers but difficult to make available to the public because of confidentiality constraints. An application of partially synthetic data to the U.S. Census Longitudinal Business Database is described. This is a large, complex economic census for which nearly the entire database must be imputed in order for it to be considered for public release. The methods used are described, and analytical results for synthetic data generated for a subgroup are presented. Modifications to the multiple imputation combining rules for population data are also developed. Model selection is an area in which few methods have been developed for use with multiply imputed data. Careful consideration is given to how Bayesian model selection can be conducted with multiply imputed data. The usual assumption of correspondence between the imputation and analyst models is not amenable to model selection procedures. Hence, the model selection procedure developed incorporates the imputation model and assumes that the imputation model is known to the analyst. Lastly, a model selection problem outside the multiple imputation context is addressed. A fully Bayesian approach for selecting fixed and random effects in linear and logistic models is developed, utilizing a parameter-expanded stochastic search Gibbs sampling algorithm to estimate the exact model-averaged posterior distribution. This approach automatically identifies subsets of predictors having nonzero fixed coefficients or nonzero random effects variance, while allowing uncertainty in the model selection process.

Item Open Access Modeling Missing Data In Panel Studies With Multiple Refreshment Samples (2012) Deng, Yiting
Most panel surveys are subject to missing data problems caused by panel attrition. The Additive Non-ignorable (AN) model proposed by Hirano et al. (2001) utilizes refreshment samples in panel surveys to impute missing data, and offers flexibility in modeling the missing data mechanism to incorporate both ignorable and non-ignorable models. We extend the AN model to settings with three waves and two refreshment samples.
We address identification and estimation issues related to the proposed model under four different types of survey design, characterized by whether the missingness is monotone and whether subjects in the refreshment samples are followed up in subsequent waves of the survey. We apply this approach and multiple imputation techniques to the 2007-2008 Associated Press-Yahoo! News Poll (APYN) panel dataset to analyze factors affecting people's political interest. We find that, when attrition bias is not accounted for, the carry-on effects of past political interest on current political interest are underestimated. This highlights the importance of dealing with attrition bias and the potential of refreshment samples for doing so.
Item Open Access Multiple Imputation Inferences for Count Data (2021) Liu, Bo
Multiple imputation is frequently used for inference with missing data. In cases where the population quantity of interest is desired to be an integer, the original methods for inference need to be modified, as point estimates based on the average are generally not integers. In this thesis, I propose a modification to the original combining rules, which takes the point estimate to be the median of the quantities from the imputed datasets. Thus, the point estimate of the population quantity of interest is integer-valued when the number of imputed datasets is odd. I derive an estimator of the variance of this modified estimator, as well as a method for obtaining confidence intervals. I compare this method to other ad hoc methods, such as rounding the original point estimate. Simulations show that these two methods provide similar results, although the novel method has slightly larger mean absolute error. The coverage rates of both methods are close to the nominal coverage of 95%. The correct derivation of the variance is important, and simulations show that if one uses the median as the point estimate without correcting the variance, the coverage rate is systematically lower.
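A minimal sketch of the median-based point estimate is below. The point estimate follows the abstract; the variance shown is ordinary Rubin-style combining (within plus between components) as a stand-in, since the thesis's derived variance estimator is not reproduced here, and the function name is hypothetical.

```python
# Sketch: median-based point estimate for an integer-valued quantity across
# multiply imputed datasets, with Rubin-style combining as a placeholder variance.
import numpy as np

def combine_count_estimates(point_estimates, within_variances):
    q = np.asarray(point_estimates, dtype=float)
    u = np.asarray(within_variances, dtype=float)
    m = len(q)
    point = np.median(q)      # integer-valued when m is odd and the estimates are integers
    u_bar = u.mean()          # average within-imputation variance
    b = q.var(ddof=1)         # between-imputation variance
    total_var = u_bar + (1 + 1 / m) * b
    return point, total_var
```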