Variational Bayes for Merging Noisy Databases

Loading...
Thumbnail Image

Journal Title

Journal ISSN

Volume Title

Repository Usage Stats

187
views
95
downloads

Abstract

Bayesian entity resolution merges together multiple, noisy databases and returns the minimal collection of unique individuals represented, together with their true, latent record values. Bayesian methods allow flexible generative models that share power across databases as well as principled quantification of uncertainty for queries of the final, resolved database. However, existing Bayesian methods for entity resolution use Markov monte Carlo method (MCMC) approximations and are too slow to run on modern databases containing millions or billions of records. Instead, we propose applying variational approximations to allow scalable Bayesian inference in these models. We derive a coordinate-ascent approximation for mean-field variational Bayes, qualitatively compare our algorithm to existing methods, note unique challenges for inference that arise from the expected distribution of cluster sizes in entity resolution, and discuss directions for future work in this domain.

Department

Description

Provenance

Citation

Scholars@Duke

Steorts

Rebecca Carter Steorts

Associate Professor of Statistical Science

You can find more information about my research group and work at:

https://resteorts.github.io/

Recent papers of mine can be found at 

https://arxiv.org/search/?query=steorts&searchtype=all&source=header


Unless otherwise indicated, scholarly articles published by Duke faculty members are made available here with a CC-BY-NC (Creative Commons Attribution Non-Commercial) license, as enabled by the Duke Open Access Policy. If you wish to use the materials in ways not already permitted under CC-BY-NC, please consult the copyright owner. Other materials are made available here through the author’s grant of a non-exclusive license to make their work openly accessible.