A Bayesian Approach to Graphical Record Linkage and Deduplication

dc.contributor.author

Steorts, RC

dc.contributor.author

Hall, R

dc.contributor.author

Fienberg, SE

dc.date.accessioned

2016-04-14T20:02:46Z

dc.date.issued

2016-10-01

dc.description.abstract

© 2016 American Statistical Association. We propose an unsupervised approach for linking records across arbitrarily many files, while simultaneously detecting duplicate records within files. Our key innovation involves the representation of the pattern of links between records as a bipartite graph, in which records are directly linked to latent true individuals, and only indirectly linked to other records. This flexible representation of the linkage structure naturally allows us to estimate the attributes of the unique observable people in the population, calculate transitive linkage probabilities across records (and represent this visually), and propagate the uncertainty of record linkage into later analyses. Our method makes it particularly easy to integrate record linkage with post-processing procedures such as logistic regression, capture–recapture, etc. Our linkage structure lends itself to an efficient, linear-time, hybrid Markov chain Monte Carlo algorithm, which overcomes many obstacles encountered by previous record linkage approaches, despite the high-dimensional parameter space. We illustrate our method using longitudinal data from the National Long Term Care Survey and with data from the Italian Survey on Household and Wealth, where we assess the accuracy of our method and show it to be better in terms of error rates and empirical scalability than other approaches in the literature. Supplementary materials for this article are available online.
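
The bipartite representation described in the abstract can be illustrated with a small sketch. It assumes, purely for illustration, that the linkage structure is stored as a list whose entry for each record is the label of the latent individual that record points to, and that a sampler has already produced several such lists. The function names and the sample values are hypothetical and are not taken from the article or its supplementary materials.

```python
from collections import defaultdict
from itertools import combinations


def clusters_from_linkage(lam):
    """Group record indices by the latent individual each record points to."""
    groups = defaultdict(list)
    for record, individual in enumerate(lam):
        groups[individual].append(record)
    return sorted(groups.values())


def pairwise_match_probabilities(samples):
    """Fraction of draws in which each pair of records shares a latent individual."""
    n = len(samples[0])
    counts = defaultdict(int)
    for lam in samples:
        for i, j in combinations(range(n), 2):
            if lam[i] == lam[j]:
                counts[(i, j)] += 1
    return {pair: c / len(samples) for pair, c in counts.items()}


# Hypothetical posterior draws for four records (assumed, not real output).
# Each value is the label of a latent individual; labels are only compared
# within a single draw.
samples = [
    [0, 0, 1, 2],
    [0, 0, 1, 1],
    [0, 0, 2, 1],
    [3, 3, 1, 1],
]

clusters = clusters_from_linkage(samples[0])
probs = pairwise_match_probabilities(samples)
```

Two records are never linked to each other directly in this sketch; they count as a match only when they share a latent individual within a draw, which makes the linkage transitive by construction. Averaging over draws gives an estimate of pairwise linkage probabilities that later analyses could reuse.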

dc.identifier.eissn

1537-274X

dc.identifier.issn

0162-1459

dc.identifier.uri

https://hdl.handle.net/10161/11817

dc.publisher

Informa UK Limited

dc.relation.ispartof

Journal of the American Statistical Association

dc.relation.isversionof

10.1080/01621459.2015.1105807

dc.title

A Bayesian Approach to Graphical Record Linkage and Deduplication

dc.type

Journal article

pubs.begin-page

1660

pubs.end-page

1672

pubs.issue

516

pubs.organisational-group

Basic Science Departments

pubs.organisational-group

Biostatistics & Bioinformatics

pubs.organisational-group

Computer Science

pubs.organisational-group

Duke

pubs.organisational-group

School of Medicine

pubs.organisational-group

Statistical Science

pubs.organisational-group

Trinity College of Arts & Sciences

pubs.publication-status

Published

pubs.volume

111

Files

Original bundle

Name:
1312.4645v4.pdf
Size:
766.4 KB
Format:
Adobe Portable Document Format
Description:
Accepted version