A Bayesian Approach to Graphical Record Linkage and Deduplication


Date

2016-10-01

Abstract

© 2016 American Statistical Association. We propose an unsupervised approach for linking records across arbitrarily many files, while simultaneously detecting duplicate records within files. Our key innovation involves the representation of the pattern of links between records as a bipartite graph, in which records are directly linked to latent true individuals, and only indirectly linked to other records. This flexible representation of the linkage structure naturally allows us to estimate the attributes of the unique observable people in the population, calculate transitive linkage probabilities across records (and represent this visually), and propagate the uncertainty of record linkage into later analyses. Our method makes it particularly easy to integrate record linkage with post-processing procedures such as logistic regression, capture–recapture, etc. Our linkage structure lends itself to an efficient, linear-time, hybrid Markov chain Monte Carlo algorithm, which overcomes many obstacles encountered by previous record linkage approaches, despite the high-dimensional parameter space. We illustrate our method using longitudinal data from the National Long Term Care Survey and with data from the Italian Survey on Household and Wealth, where we assess the accuracy of our method and show it to be better in terms of error rates and empirical scalability than other approaches in the literature. Supplementary materials for this article are available online.
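The bipartite representation described in the abstract can be illustrated with a small sketch. It is not the authors' implementation and does not perform any sampling: it assumes that posterior draws of the linkage structure are already available, each draw assigning every record to a latent individual label. The function names and the toy draws below are hypothetical and chosen only for illustration. Two records count as linked in a draw when they point to the same latent individual, so a pairwise linkage probability can be estimated as the fraction of draws in which that happens, and the number of distinct labels in each draw gives one posterior draw of the number of unique individuals.

```python
from itertools import combinations


def pairwise_link_probabilities(samples):
    """Estimate P(record i and record j refer to the same individual).

    samples[s][r] is the (hypothetical) latent individual label assigned
    to record r in posterior draw s.
    """
    n_records = len(samples[0])
    counts = {pair: 0 for pair in combinations(range(n_records), 2)}
    for draw in samples:
        for i, j in counts:
            # Records are linked only indirectly, through a shared individual.
            if draw[i] == draw[j]:
                counts[(i, j)] += 1
    return {pair: c / len(samples) for pair, c in counts.items()}


def population_size_draws(samples):
    """Number of distinct latent individuals in each draw."""
    return [len(set(draw)) for draw in samples]


# Toy draws for four records (made up for illustration, not real output).
samples = [
    [0, 0, 1, 2],
    [0, 0, 1, 1],
    [0, 1, 1, 2],
    [0, 0, 2, 2],
]

probs = pairwise_link_probabilities(samples)
sizes = population_size_draws(samples)

print(probs[(0, 1)])  # records 0 and 1 share an individual in 3 of 4 draws
print(sizes)          # one population-size value per draw
```

Because every quantity is computed per draw, downstream summaries inherit the linkage uncertainty instead of conditioning on a single fixed set of matches.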

Published Version (Please cite this version)

10.1080/01621459.2015.1105807

Publication Info

Steorts, RC, R Hall and SE Fienberg (2016). A Bayesian Approach to Graphical Record Linkage and Deduplication. Journal of the American Statistical Association, 111(516), pp. 1660–1672. doi:10.1080/01621459.2015.1105807. Retrieved from https://hdl.handle.net/10161/11817.

This citation is constructed from limited available data and may be imprecise. To cite this article, please review and use the official citation provided by the journal.

Scholars@Duke

Rebecca Carter Steorts

Associate Professor of Statistical Science

You can find more information about my research group and work at:

https://resteorts.github.io/

Recent papers of mine can be found at:

https://arxiv.org/search/?query=steorts&searchtype=all&source=header


Unless otherwise indicated, scholarly articles published by Duke faculty members are made available here with a CC-BY-NC (Creative Commons Attribution Non-Commercial) license, as enabled by the Duke Open Access Policy. If you wish to use the materials in ways not already permitted under CC-BY-NC, please consult the copyright owner. Other materials are made available here through the author’s grant of a non-exclusive license to make their work openly accessible.