Duke University Libraries
DukeSpace Scholarship by Duke Authors

Bayesian Approaches to File Linking with Faulty Data

Date
2017
Author
Dalzell, Nicole M.
Advisor
Reiter, Jerome P
Abstract

File linking allows analysts to combine information from two or more sources, creating linked databases. From linking school records to track student progress across years, to linking files in official statistics, to linking patient health files, linked databases allow analysts to use existing sources of information to perform rich statistical analyses. However, the quality of the resulting inference depends on the accuracy of the data linkage. In this dissertation, we present methods for linking files and performing inference on linked data when the variables used to estimate matches are believed to be inconsistent, incorrect, or missing.

In Chapter 2, we present BLASE, a Bayesian file matching methodology designed to estimate regression models and match records simultaneously when categorical matching variables may not agree for some matched pairs. The method relies on a hierarchical model that includes (1) the regression of interest involving variables from the two files given a vector indicating the links, (2) a model for the linking vector given the true values of the matching variables, (3) a measurement error model for reported values of the matching variables given their true values, and (4) a model for the true values of the matching variables. We also describe algorithms for sampling from the posterior distribution of the model and illustrate the methodology using artificial data and data from education records in the state of North Carolina.
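As a rough sketch of how the four components listed above could combine, the joint distribution might factor as shown below. The notation is an assumption introduced here for illustration and is not taken from the dissertation: $Y$ is the regression response, $X$ the covariates from the other file, $C$ the linking vector, $B^{\text{rep}}$ and $B^{\text{true}}$ the reported and true matching variables, and $\theta$, $\gamma$, $\phi$ the parameters of the regression, the measurement error model, and the true-value model.

```latex
p(Y, C, B^{\text{rep}}, B^{\text{true}} \mid X, \theta, \gamma, \phi)
  = \underbrace{p(Y \mid X, C, \theta)}_{(1)\ \text{regression}}
    \;\underbrace{p(C \mid B^{\text{true}})}_{(2)\ \text{linking vector}}
    \;\underbrace{p(B^{\text{rep}} \mid B^{\text{true}}, \gamma)}_{(3)\ \text{measurement error}}
    \;\underbrace{p(B^{\text{true}} \mid \phi)}_{(4)\ \text{true values}}
```

The exact conditioning sets in the dissertation may differ; the point of the sketch is only that each numbered component contributes one factor.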

In Chapter 3, we present LFCMV, a Bayesian file linking methodology designed to link records using continuous matching variables in situations where we do not expect the values of these matching variables to agree exactly across matched pairs. The method involves a linking model for the distance between the matching variables of records in one file and the matching variables of their linked records in the second file. This linking model is conditional on a vector indicating the links. We specify a mixture model for the distance component of the linking model, as this latent structure allows the distance between matching variables in linked pairs to vary across types of linked pairs. Finally, we specify a model for the linking vector. We describe the Gibbs sampling algorithm for sampling from the posterior distribution of this linkage model and use artificial data to illustrate model performance. We also introduce a linking application using public survey information and data from the U.S. Census Bureau's Census of Manufactures and use LFCMV to link the records.
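To illustrate why a mixture over distances lets linked pairs fall into different "types", here is a minimal Python sketch. It is not the LFCMV model: it assumes a scalar distance, two zero-mean normal components, and hypothetical weights and standard deviations chosen only for the example. The function names are likewise invented here.

```python
import math

def normal_pdf(x, sd):
    """Density of a zero-mean normal with standard deviation sd, evaluated at x."""
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def component_responsibilities(distance, weights, sds):
    """Posterior probability of each mixture component given one linked pair's distance."""
    joint = [w * normal_pdf(distance, sd) for w, sd in zip(weights, sds)]
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical settings: 70% of linked pairs agree closely (sd 0.1),
# 30% are noisier (sd 1.0). These values are assumptions, not estimates.
weights = [0.7, 0.3]
sds = [0.1, 1.0]

close_pair = component_responsibilities(0.05, weights, sds)
noisy_pair = component_responsibilities(0.80, weights, sds)
```

Under these assumed settings, a pair whose matching variables differ by 0.05 is assigned mostly to the tight component, while a pair differing by 0.80 is assigned almost entirely to the noisy one. In a full Gibbs sampler, such component memberships would be drawn alongside the linking vector rather than computed for fixed links.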

The linkage techniques in Chapters 2 and 3 assume that all data involved in the linking is complete, i.e., contains no missing data. However, file linking applications can involve files prone to item non-response. The linking application in Chapter 3, for instance, involves a file that has been completed by imputing missing data. In Chapter 4, we use simulated data to examine the impact of imputations in a file linking scenario. Specifically, we frame linking multiply imputed data sets as a two-stage imputation scenario, and, within this context, conduct a simulation study in which we introduce missing data, impute the missing values, and compare the accuracy of estimating matches on the imputed versus the fully observed data sets. We also consider the process of performing inference and discuss a Bayesian technique for posterior estimation on multiply-imputed linked data sets. We apply this technique to the simulated data and examine the effect of the imputations upon posterior estimates.
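One common way to summarize a posterior across multiply imputed data sets is to run the sampler separately on each completed data set and then pool the draws. The abstract does not say whether the dissertation's technique takes this form, so the Python sketch below is only an assumption-laden illustration; the function name and the simulated draws are hypothetical.

```python
import random
import statistics

def pool_posterior_draws(draws_per_dataset):
    """Concatenate posterior draws from each completed data set into one pooled sample."""
    pooled = []
    for draws in draws_per_dataset:
        pooled.extend(draws)
    return pooled

# Hypothetical posterior draws of a single regression coefficient obtained
# from m = 3 imputed data sets; the centers differ because the imputations differ.
rng = random.Random(0)
draws_per_dataset = [
    [rng.gauss(center, 0.1) for _ in range(1000)]
    for center in (0.9, 1.0, 1.1)
]

pooled = pool_posterior_draws(draws_per_dataset)
pooled_mean = statistics.mean(pooled)
pooled_sd = statistics.stdev(pooled)
```

The pooled standard deviation is larger than the within-data-set value of 0.1 used to simulate the draws, because it also reflects the variation between imputations.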

Type
Dissertation
Department
Statistical Science
Subject
Statistics
Permalink
https://hdl.handle.net/10161/14533
Citation
Dalzell, Nicole M. (2017). Bayesian Approaches to File Linking with Faulty Data. Dissertation, Duke University. Retrieved from https://hdl.handle.net/10161/14533.
Creative Commons License
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
