Bayesian Approaches to File Linking with Faulty Data

Dalzell, Nicole M.


Reiter, Jerome P


File linking allows analysts to combine information from two or more sources, creating linked databases. In applications ranging from school records that track student progress across years to official statistics and patient health files, linked databases allow analysts to use existing sources of information to perform rich statistical analyses. However, the quality of the resulting inference depends on the accuracy of the data linkage. In this dissertation, we present methods for file linking and for performing inference on linked data when the variables used to estimate matches are believed to be inconsistent, incorrect, or missing.

In Chapter 2, we present BLASE, a Bayesian file matching methodology designed to estimate regression models and match records simultaneously when categorical matching variables may not agree for some matched pairs. The method relies on a hierarchical model that includes (1) the regression of interest involving variables from the two files given a vector indicating the links, (2) a model for the linking vector given the true values of the matching variables, (3) a measurement error model for reported values of the matching variables given their true values, and (4) a model for the true values of the matching variables. We also describe algorithms for sampling from the posterior distribution of the model and illustrate the methodology using artificial data and data from education records in the state of North Carolina.
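
The four-part hierarchy described above can be written as a factorization of the joint distribution. The following is only a sketch: the symbols are assumptions introduced here for illustration and are not taken from the dissertation. Let $D$ denote the regression variables drawn from the two files, $C$ the linking vector, $B$ the reported values of the matching variables, $\tilde{B}$ their true values, and $\theta_{\text{reg}}$, $\theta_{\text{err}}$, $\theta_{\text{true}}$ the parameters of the respective sub-models.

```latex
% Assumed notation (not the dissertation's): D, C, B, \tilde{B}, \theta_{*}
\begin{equation*}
p(D, B, \tilde{B}, C \mid \theta)
  = \underbrace{p(D \mid C, \theta_{\text{reg}})}_{\text{(1) regression given links}}
    \;\underbrace{p(C \mid \tilde{B})}_{\text{(2) linking vector}}
    \;\underbrace{p(B \mid \tilde{B}, \theta_{\text{err}})}_{\text{(3) measurement error}}
    \;\underbrace{p(\tilde{B} \mid \theta_{\text{true}})}_{\text{(4) true values}}
\end{equation*}
```

Under this reading, a posterior sampler would alternate between updating the regression parameters, the linking vector, and the latent true values, each conditional on the others.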

In Chapter 3, we present LFCMV, a Bayesian file linking methodology designed to link records using continuous matching variables in situations where we do not expect values of these matching variables to agree exactly across matched pairs. The method involves a linking model for the distance between the matching variables of records in one file and the matching variables of their linked records in the second file. This linking model is conditional on a vector indicating the links. We specify a mixture model for the distance component of the linking model, as this latent structure allows the distance between matching variables in linked pairs to vary across types of linked pairs. Finally, we specify a model for the linking vector. We describe the Gibbs sampling algorithm for sampling from the posterior distribution of this linkage model and use artificial data to illustrate model performance. We also introduce a linking application using public survey information and data from the U.S. Census Bureau's Census of Manufactures and use LFCMV to link the records.
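
To make the idea of a mixture model on linked-pair distances concrete, here is a minimal Python sketch. It is not the LFCMV model itself: the single continuous matching variable, the zero-mean Gaussian components, the two "types" of linked pairs, and the names `linked_differences` and `mixture_loglik` are all assumptions chosen for illustration. It only shows that, conditional on a candidate linking vector, a mixture likelihood on the differences scores a correct linking higher than an arbitrary one.

```python
import numpy as np

def linked_differences(file_a, file_b, links):
    """Difference between each file_a record and the file_b record it is linked to.

    links[i] is the index in file_b linked to record i of file_a (hypothetical encoding).
    """
    return file_a - file_b[links]

def mixture_loglik(diffs, weights, scales):
    """Log-likelihood of 1-D differences under a zero-mean Gaussian scale mixture."""
    diffs = np.asarray(diffs, dtype=float)[:, None]      # shape (n, 1)
    weights = np.asarray(weights, dtype=float)[None, :]  # shape (1, K)
    scales = np.asarray(scales, dtype=float)[None, :]    # shape (1, K)
    log_comp = (np.log(weights) - 0.5 * np.log(2 * np.pi)
                - np.log(scales) - 0.5 * (diffs / scales) ** 2)
    # Log-sum-exp over components for numerical stability.
    m = log_comp.max(axis=1, keepdims=True)
    return float(np.sum(m[:, 0] + np.log(np.exp(log_comp - m).sum(axis=1))))

# Artificial data: two assumed types of linked pairs (precise and noisy).
rng = np.random.default_rng(0)
n = 200
file_b = rng.normal(0.0, 10.0, size=n)
true_links = rng.permutation(n)
is_noisy = rng.random(n) < 0.2
noise = np.where(is_noisy, rng.normal(0.0, 2.0, n), rng.normal(0.0, 0.1, n))
file_a = file_b[true_links] + noise

weights, scales = [0.8, 0.2], [0.1, 2.0]
ll_true = mixture_loglik(linked_differences(file_a, file_b, true_links), weights, scales)
ll_wrong = mixture_loglik(linked_differences(file_a, file_b, rng.permutation(n)), weights, scales)
print(ll_true > ll_wrong)
```

In a Gibbs sampler, a likelihood of this kind would be combined with a prior on the linking vector and with updates for the mixture weights and scales, which are held fixed here.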

The linkage techniques in Chapters 2 and 3 assume that all data involved in the linking are complete, i.e., contain no missing values. However, file linking applications can involve files prone to item nonresponse. The linking application in Chapter 3, for instance, involves a file that has been completed by imputing missing data. In Chapter 4, we use simulated data to examine the impact of imputations in a file linking scenario. Specifically, we frame linking multiply imputed data sets as a two-stage imputation scenario and, within this context, conduct a simulation study in which we introduce missing data, impute the missing values, and compare the accuracy of estimating matches on the imputed versus the fully observed data sets. We also consider the process of performing inference and discuss a Bayesian technique for posterior estimation on multiply imputed linked data sets. We apply this technique to the simulated data and examine the effect of the imputations on posterior estimates.
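
As a rough illustration of posterior estimation on multiply imputed data, the Python sketch below pools posterior draws across completed data sets. This is one common approach and is not necessarily the technique used in the dissertation. The toy model (a normal mean with known standard deviation and a flat prior), the deliberately simplistic imputation step, and the names `posterior_draws_mean` and `pool_draws` are assumptions made for illustration; no file linking is performed here.

```python
import numpy as np

def posterior_draws_mean(y, sigma, n_draws, rng):
    """Posterior draws for a normal mean with known sigma and a flat prior."""
    y = np.asarray(y, dtype=float)
    return rng.normal(y.mean(), sigma / np.sqrt(y.size), size=n_draws)

def pool_draws(completed_datasets, sigma, n_draws, rng):
    """Concatenate posterior draws obtained from each completed data set."""
    return np.concatenate(
        [posterior_draws_mean(y, sigma, n_draws, rng) for y in completed_datasets]
    )

rng = np.random.default_rng(1)
sigma = 1.0
y_full = rng.normal(5.0, sigma, size=100)

# Introduce missing values completely at random (assumed mechanism).
missing = rng.random(y_full.size) < 0.3

# Create m completed data sets with a simplistic, hypothetical imputation model:
# draw each missing value around the observed mean.
m = 5
completed = []
for _ in range(m):
    y_imp = y_full.copy()
    y_imp[missing] = rng.normal(y_full[~missing].mean(), sigma, size=missing.sum())
    completed.append(y_imp)

pooled = pool_draws(completed, sigma, n_draws=1000, rng=rng)
print(pooled.mean(), pooled.std())
```

Comparing `pooled` with draws from `posterior_draws_mean(y_full, ...)` gives a simple check of how much the imputations shift or widen the posterior relative to the fully observed data.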






Dalzell, Nicole M. (2017). Bayesian Approaches to File Linking with Faulty Data. Dissertation, Duke University. Retrieved from


Duke's student scholarship is made available to the public using a Creative Commons Attribution / Non-commercial / No derivative (CC-BY-NC-ND) license.