Extending Probabilistic Record Linkage

Probabilistic record linkage is the task of combining multiple data sources for statistical analysis by identifying records pertaining to the same individual in different databases. The need to perform probabilistic record linkage arises in comparative effectiveness research and other clinical research scenarios when records in different databases do not share an error-free unique patient identifier. This dissertation seeks to develop new methodology for probabilistic record linkage to address two highly practical and recurring challenges: how to implement record linkage in a manner that optimizes downstream statistical analyses of the linked data, and how to efficiently link databases having a clustered or multi-level data structure.

In Chapter 2 we propose a new framework for balancing the tradeoff between false positive and false negative linkage errors when linked data are analyzed in a generalized linear model framework and non-linked records lead to missing data for the study outcome variable. Our method seeks to maximize the probability that the point estimate of the parameter of interest will have the correct sign and that the confidence interval around this estimate will correctly exclude the null value of zero. Using large sample approximations and a model for linkage errors, we derive expressions relating bias and hypothesis testing power to the user's choice of threshold that determines how many records will be linked. We use these results to propose three data-driven threshold selection rules. Under one set of simplifying assumptions we prove that maximizing asymptotic power requires that the threshold be relaxed at least until the point where all pairs with >50% probability of being a true match are linked.
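The threshold tradeoff described above can be illustrated with a minimal sketch. It is not the dissertation's selection rule: it assumes each candidate record pair already has an estimated match probability, and the function name `expected_linkage_errors` and the example probabilities are hypothetical. Relaxing the threshold links more pairs, which raises the expected number of false positive links and lowers the expected number of false negatives.

```python
from typing import List, Tuple


def expected_linkage_errors(match_probs: List[float], threshold: float) -> Tuple[int, float, float]:
    """Return (pairs linked, expected false positives, expected false negatives).

    Hypothetical helper: pairs whose estimated match probability exceeds
    `threshold` are linked; all other pairs are left unlinked.
    """
    linked = [p for p in match_probs if p > threshold]
    unlinked = [p for p in match_probs if p <= threshold]
    # A linked pair is a false positive with probability 1 - p.
    expected_fp = sum(1.0 - p for p in linked)
    # An unlinked pair is a false negative with probability p.
    expected_fn = sum(unlinked)
    return len(linked), expected_fp, expected_fn


# Illustrative (made-up) match probabilities for five candidate pairs.
probs = [0.99, 0.90, 0.60, 0.40, 0.05]

strict = expected_linkage_errors(probs, threshold=0.95)
relaxed = expected_linkage_errors(probs, threshold=0.50)
print("strict :", strict)
print("relaxed:", relaxed)
```

With the threshold at 0.50, every pair that is more likely a match than a non-match is linked, which is the point the abstract's asymptotic power result refers to under its simplifying assumptions.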

In Chapter 3 we explore the consequences of linkage errors when the study outcome variable is determined by linkage status and so linkage errors may cause outcome misclassification. This scenario arises when the outcome is disease status and those linked are classified as having the disease while those not linked are classified as disease-free. We assume the parameter of interest can be expressed as a linear combination of binomial proportions having mean zero under the null hypothesis. We derive an expression for the asymptotic relative efficiency of a Wald test calculated with a misclassified outcome compared to an error-free outcome using a linkage error model and large sample approximations. We use this expression to generate insights for planning and implementing studies using record linkage.
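As a rough illustration of how outcome misclassification erodes efficiency, the sketch below uses the standard textbook approximation for comparing two proportions under nondifferential misclassification. It is not the expression derived in the dissertation. It assumes that missed links act like imperfect sensitivity, that false links act like imperfect specificity, and that both groups share outcome prevalence `p` under the null; the function name is hypothetical.

```python
def are_misclassified_outcome(p: float, sensitivity: float, specificity: float) -> float:
    """Approximate asymptotic relative efficiency of a two-group test of
    proportions using a misclassified binary outcome versus the true outcome.

    Assumes nondifferential misclassification and a common prevalence `p`
    under the null hypothesis (textbook approximation, not the thesis result).
    """
    # Apparent prevalence once misclassification is applied.
    p_star = sensitivity * p + (1.0 - specificity) * (1.0 - p)
    # The effect is attenuated by (Se + Sp - 1); the variance changes from
    # p(1 - p) to p_star(1 - p_star).
    attenuation = sensitivity + specificity - 1.0
    return attenuation ** 2 * p * (1.0 - p) / (p_star * (1.0 - p_star))


# Made-up scenario: 20% prevalence, 20% of true links missed, no false links.
print(are_misclassified_outcome(p=0.2, sensitivity=0.8, specificity=1.0))
```

Under this approximation, an efficiency of roughly 0.76 would mean the study needs about 1/0.76 times as many subjects to match the power available with an error-free outcome.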

In Chapter 4 we develop a modeling framework for linking files with a clustered data structure. Linking such clustered data is especially challenging when error-free identifiers are unavailable for both individual-level and cluster-level units. The proposed approach improves on current methodology by modeling inter-pair dependencies in clustered data and producing collective link decisions. It is novel in that it models both record attributes and record relationships, and resolves match statuses for individual-level and cluster-level units simultaneously. We show that linkage probabilities can be estimated without labeled training data under assumptions that are less restrictive than those of existing record linkage models. Using Monte Carlo simulations based on real study data, we demonstrate the advantages of the proposed approach over the current standard method.





Solomon, Nicole Chanel (2020). Extending Probabilistic Record Linkage. Dissertation, Duke University. Retrieved from https://hdl.handle.net/10161/20891.


Duke's student scholarship is made available to the public using a Creative Commons Attribution / Non-commercial / No derivative (CC-BY-NC-ND) license.