Automated linkage of patient records from disparate sources

Research output: Contribution to journal › Article

Abstract

We introduce an automated method of record linkage that has two key features: automated selection of the match field interactions to include in the model for estimation, and automated threshold determination for classifying record pairs as matches or non-matches. We applied our method to two real-world examples. The first example demonstrated results consistent with our earlier work: when data quality is adequate and the match field discriminating power is high, matching algorithms exhibit similar performance. The second example demonstrated that our method yields a lower false positive rate and a higher positive predictive value than the Fellegi-Sunter model in the face of low data quality. Simulation studies suggest that, compared with the Fellegi-Sunter model, our method exhibits better overall performance, as indicated by a higher area under the curve, and less biased estimates of both the match prevalence rate and the m- and u-probabilities over a range of data scenarios, especially when the match prevalence is extreme. Computationally, our method is as efficient as the Fellegi-Sunter model. We recommend this method in situations where an unsupervised linking algorithm is needed.
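
For orientation, the baseline the abstract compares against can be sketched in a few lines. The following is a minimal illustration of Fellegi-Sunter match weights under the usual conditional-independence assumption; it is not the authors' method, which additionally selects field interactions and determines the threshold automatically. The field names, the m- and u-probability values, and the threshold are all hypothetical, chosen only to make the arithmetic concrete.

```python
import math

# Hypothetical m- and u-probabilities for three match fields (illustrative only).
# m = P(field agrees | record pair is a true match)
# u = P(field agrees | record pair is a true non-match)
M_U = {
    "last_name": (0.95, 0.01),
    "birth_year": (0.90, 0.05),
    "zip_code": (0.85, 0.10),
}


def match_weight(agreement):
    """Sum per-field log2 likelihood ratios, assuming fields are
    conditionally independent given match status."""
    total = 0.0
    for field, agrees in agreement.items():
        m, u = M_U[field]
        if agrees:
            total += math.log2(m / u)              # agreement weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight
    return total


def classify(agreement, threshold):
    """Label a record pair by comparing its weight to a threshold.
    The threshold here is hand-picked, which is an assumption of this sketch."""
    return "match" if match_weight(agreement) >= threshold else "non-match"


all_agree = {"last_name": True, "birth_year": True, "zip_code": True}
all_disagree = {"last_name": False, "birth_year": False, "zip_code": False}

print(classify(all_agree, threshold=0.0))
print(classify(all_disagree, threshold=0.0))
```

In practice the m- and u-probabilities are not known in advance and must be estimated from the unlabeled record pairs, which is where the bias comparisons mentioned in the abstract come in.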

Original language: English (US)
Pages (from-to): 172-184
Number of pages: 13
Journal: Statistical Methods in Medical Research
Volume: 27
Issue number: 1
DOI: 10.1177/0962280215626180
State: Published - Jan 1 2018

Fingerprint

Linkage
Data Quality
Record Linkage
Matching Algorithm
False Positive
Model
Linking
Area Under Curve
Biased
Extremes
Simulation Study
Scenarios
Curve
Interaction
Estimate
Range of data
Data Accuracy

Keywords

  • Diagnostic tests
  • Fellegi-Sunter model
  • Latent class model
  • Log-linear model
  • Patient matching
  • Record linkage

ASJC Scopus subject areas

  • Epidemiology
  • Statistics and Probability
  • Health Information Management

Cite this

Automated linkage of patient records from disparate sources. / Li, Xiaochun; Xu, Huiping; Shen, Changyu; Grannis, Shaun.

In: Statistical Methods in Medical Research, Vol. 27, No. 1, 01.01.2018, p. 172-184.

@article{9245633d739447d7b91f25f7d5a1ffcb,
title = "Automated linkage of patient records from disparate sources",
abstract = "We introduce an automated method of record linkage that has two key features, automated selection of match field interactions to include in the model for estimation and automated threshold determination for classifying record pairs to matches or non-matches. We applied our method to two real-world examples. The first example demonstrated results consistent with our earlier work: When data quality is adequate and the match field discriminating power is high, matching algorithms exhibit similar performance. The second example demonstrated that our method yields a lower false positive rate and higher positive predictive value than the Fellegi-Sunter model in the face of low data quality. When compared to the Fellegi-Sunter model, simulation studies suggest that our method exhibits better overall performance as indicated by higher area under the curve, and less biased estimates for both the match prevalence rate and the m- and u-probabilities over a range of data scenarios, especially when the match prevalence is extreme. Computationally, our method is as efficient as the Fellegi-Sunter model. We recommend this method in situations that an unsupervised linking algorithm is needed.",
keywords = "Diagnostic tests, Fellegi-Sunter model, Latent class model, Log-linear model, Patient matching, Record linkage",
author = "Xiaochun Li and Huiping Xu and Changyu Shen and Shaun Grannis",
year = "2018",
month = "1",
day = "1",
doi = "10.1177/0962280215626180",
language = "English (US)",
volume = "27",
pages = "172--184",
journal = "Statistical Methods in Medical Research",
issn = "0962-2802",
publisher = "SAGE Publications Ltd",
number = "1",
}

TY  - JOUR
T1  - Automated linkage of patient records from disparate sources
AU  - Li, Xiaochun
AU  - Xu, Huiping
AU  - Shen, Changyu
AU  - Grannis, Shaun
PY  - 2018/1/1
Y1  - 2018/1/1
N2  - We introduce an automated method of record linkage that has two key features, automated selection of match field interactions to include in the model for estimation and automated threshold determination for classifying record pairs to matches or non-matches. We applied our method to two real-world examples. The first example demonstrated results consistent with our earlier work: When data quality is adequate and the match field discriminating power is high, matching algorithms exhibit similar performance. The second example demonstrated that our method yields a lower false positive rate and higher positive predictive value than the Fellegi-Sunter model in the face of low data quality. When compared to the Fellegi-Sunter model, simulation studies suggest that our method exhibits better overall performance as indicated by higher area under the curve, and less biased estimates for both the match prevalence rate and the m- and u-probabilities over a range of data scenarios, especially when the match prevalence is extreme. Computationally, our method is as efficient as the Fellegi-Sunter model. We recommend this method in situations that an unsupervised linking algorithm is needed.
AB  - We introduce an automated method of record linkage that has two key features, automated selection of match field interactions to include in the model for estimation and automated threshold determination for classifying record pairs to matches or non-matches. We applied our method to two real-world examples. The first example demonstrated results consistent with our earlier work: When data quality is adequate and the match field discriminating power is high, matching algorithms exhibit similar performance. The second example demonstrated that our method yields a lower false positive rate and higher positive predictive value than the Fellegi-Sunter model in the face of low data quality. When compared to the Fellegi-Sunter model, simulation studies suggest that our method exhibits better overall performance as indicated by higher area under the curve, and less biased estimates for both the match prevalence rate and the m- and u-probabilities over a range of data scenarios, especially when the match prevalence is extreme. Computationally, our method is as efficient as the Fellegi-Sunter model. We recommend this method in situations that an unsupervised linking algorithm is needed.
KW  - Diagnostic tests
KW  - Fellegi-Sunter model
KW  - Latent class model
KW  - Log-linear model
KW  - Patient matching
KW  - Record linkage
UR  - http://www.scopus.com/inward/record.url?scp=85041379645&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=85041379645&partnerID=8YFLogxK
U2  - 10.1177/0962280215626180
DO  - 10.1177/0962280215626180
M3  - Article
VL  - 27
SP  - 172
EP  - 184
JO  - Statistical Methods in Medical Research
JF  - Statistical Methods in Medical Research
SN  - 0962-2802
IS  - 1
ER  -