Advances in Efficient and Scalable Bayesian Record Linkage
Abstract
Record linkage (or entity resolution) is the task of identifying unique entities within and across data files. This task is straightforward when data files have reliable information, but becomes difficult when that information is subject to errors or changes over time. Probabilistic record linkage techniques are effective in this domain, but are challenging to implement due to the massive scale of modern datasets. In this dissertation, we offer several advancements in Bayesian record linkage to meet these challenges. We introduce the Fast Beta Linkage (fabl) framework for record linkage, a Bayesian model for record linkage coupled with a dimension reduction technique for efficient posterior inference and enhanced scalability. We then provide Variational Beta Linkage (vabl), and show this model can be estimated through variational inference much more quickly than through MCMC sampling. Lastly, we introduce Dirichlet Record Linkage (DRL), a generalization of the Fast Beta Linkage approach that allows one record in one file to match to multiple records in another file. Together, these advancements offer practitioners much more flexibility in conducting record linkage as they balance speed and accuracy according to organizational needs.
Citation
Kundinger, Brian Alexander (2024). Advances in Efficient and Scalable Bayesian Record Linkage. Dissertation, Duke University. Retrieved from https://hdl.handle.net/10161/32585.
Except where otherwise noted, student scholarship that was shared on DukeSpace after 2009 is made available to the public under a Creative Commons Attribution / Non-commercial / No derivatives (CC-BY-NC-ND) license. All rights in student work shared on DukeSpace before 2009 remain with the author and/or their designee, whose permission may be required for reuse.