dc.description.abstract |
<p>Probabilistic record linkage is the task of combining multiple data sources for
statistical analysis by identifying records pertaining to the same individual in different
databases. The need to perform probabilistic record linkage arises in comparative
effectiveness research and other clinical research scenarios when records in different
databases do not share an error-free unique patient identifier. This dissertation
seeks to develop new methodology for probabilistic record linkage to address two highly
practical and recurring challenges: how to implement record linkage in a manner that
optimizes downstream statistical analyses of the linked data, and how to efficiently
link databases having a clustered or multi-level data structure.</p><p>In Chapter
2 we propose a new framework for balancing the tradeoff between false positive and
false negative linkage errors when linked data are analyzed in a generalized linear
model framework and non-linked records lead to missing data for the study outcome
variable. Our method seeks to maximize the probability that the point estimate of
the parameter of interest will have the correct sign and that the confidence interval
around this estimate will correctly exclude the null value of zero. Using large-sample
approximations and a model for linkage errors, we derive expressions relating bias
and hypothesis testing power to the user's choice of threshold that determines how
many records will be linked. We use these results to propose three data-driven threshold
selection rules. Under one set of simplifying assumptions, we prove that maximizing
asymptotic power requires relaxing the threshold at least until the point where
all pairs with a greater than 50% probability of being a true match are linked.</p><p>In Chapter
3 we explore the consequences of linkage errors when the study outcome variable is
determined by linkage status, so that linkage errors may cause outcome misclassification.
This scenario arises when the outcome is disease status and those linked are classified
as having the disease while those not linked are classified as disease-free. We assume
the parameter of interest can be expressed as a linear combination of binomial proportions
having mean zero under the null hypothesis. Using a linkage error model and large-sample
approximations, we derive an expression for the asymptotic relative efficiency of a Wald
test calculated with a misclassified outcome relative to one calculated with an error-free outcome.
We use this expression to generate insights for planning and implementing studies
using record linkage.</p><p>In Chapter 4 we develop a modeling framework for linking
files with a clustered data structure. Linking such clustered data is especially challenging
when error-free identifiers are unavailable for both individual-level and cluster-level
units. The proposed approach improves on current methodology by modeling inter-pair
dependencies in clustered data and producing collective link decisions. It is novel
in that it models both record attributes and record relationships, and resolves match
statuses for individual-level and cluster-level units simultaneously. We show that
linkage probabilities can be estimated without labeled training data under assumptions
that are less restrictive than those of existing record linkage models. Using Monte
Carlo simulations based on real study data, we demonstrate the advantages of the proposed
approach over the current standard method.</p>
|
|