dc.contributor.author |
Steorts, RC |
|
dc.contributor.author |
Hall, R |
|
dc.contributor.author |
Fienberg, SE |
|
dc.date.accessioned |
2016-04-14T20:02:46Z |
|
dc.date.issued |
2016-10-01 |
|
dc.identifier.issn |
0162-1459 |
|
dc.identifier.uri |
https://hdl.handle.net/10161/11817 |
|
dc.description.abstract |
© 2016 American Statistical Association. We propose an unsupervised approach for linking
records across arbitrarily many files, while simultaneously detecting duplicate records
within files. Our key innovation involves the representation of the pattern of links
between records as a bipartite graph, in which records are directly linked to latent
true individuals, and only indirectly linked to other records. This flexible representation
of the linkage structure naturally allows us to estimate the attributes of the unique
observable people in the population, calculate transitive linkage probabilities across
records (and represent this visually), and propagate the uncertainty of record linkage
into later analyses. Our method makes it particularly easy to integrate record linkage
with post-processing procedures such as logistic regression, capture–recapture, etc.
Our linkage structure lends itself to an efficient, linear-time, hybrid Markov chain
Monte Carlo algorithm, which overcomes many obstacles encountered by previous record
linkage approaches, despite the high-dimensional parameter space. We illustrate our
method using longitudinal data from the National Long Term Care Survey and with data
from the Italian Survey on Household and Wealth, where we assess the accuracy of our
method and show it to be better in terms of error rates and empirical scalability
than other approaches in the literature. Supplementary materials for this article
are available online.
|
|
dc.publisher |
Informa UK Limited |
|
dc.relation.ispartof |
Journal of the American Statistical Association |
|
dc.relation.isversionof |
10.1080/01621459.2015.1105807 |
|
dc.title |
A Bayesian Approach to Graphical Record Linkage and Deduplication |
|
dc.type |
Journal article |
|
duke.contributor.id |
Steorts, RC|0682018 |
|
pubs.begin-page |
1660 |
|
pubs.end-page |
1672 |
|
pubs.issue |
516 |
|
pubs.organisational-group |
Basic Science Departments |
|
pubs.organisational-group |
Biostatistics & Bioinformatics |
|
pubs.organisational-group |
Computer Science |
|
pubs.organisational-group |
Duke |
|
pubs.organisational-group |
School of Medicine |
|
pubs.organisational-group |
Statistical Science |
|
pubs.organisational-group |
Trinity College of Arts & Sciences |
|
pubs.publication-status |
Published |
|
pubs.volume |
111 |
|
dc.identifier.eissn |
1537-274X |
|