A study on the probabilistic record linkage and its application

  • Choi, Yeonok (Statistics Korea, and Department of Information and Statistics, Chungnam National University)
  • Lee, Sangin (Department of Information and Statistics, Chungnam National University)
  • 최연옥 (통계청, 충남대학교 정보통계학과)
  • 이상인 (충남대학교 정보통계학과)
  • Received : 2021.06.24
  • Accepted : 2021.07.23
  • Published : 2021.10.31

Abstract

This paper introduces the basic concepts and statistical framework of probabilistic record linkage and describes, step by step, how it is carried out using a real example from Statistics Korea. First, we briefly describe deterministic record linkage and compare it with probabilistic record linkage. We then introduce the Fellegi-Sunter model framework and its related parameters and components: the m-probability, the u-probability, the match weight, and the decision rule. Finally, we show the detailed linkage process under the Fellegi-Sunter framework and evaluate the linkage results, using sample data from the register-based census and the Population and Housing Census survey of Statistics Korea.
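
As a rough illustration of how the quantities named in the abstract fit together, the following Python sketch computes a Fellegi-Sunter style match weight for one record pair and applies a two-threshold decision rule. It is not the authors' implementation: the field names, the m- and u-probabilities, and both thresholds are hypothetical values chosen only for the example, and the weights are summed under the usual assumption that fields agree or disagree independently given the true match status.

```python
import math

# Hypothetical per-field parameters (illustrative values, not from the paper).
# m = P(field agrees | the two records are a true match)
# u = P(field agrees | the two records are not a match)
FIELD_PARAMS = {
    "name": (0.95, 0.01),
    "birth_date": (0.97, 0.003),
    "sex": (0.99, 0.5),
}


def field_weight(m, u, agrees):
    """Log2 likelihood-ratio weight contributed by a single field."""
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


def match_weight(agreement):
    """Sum the field weights, assuming conditional independence of fields."""
    return sum(
        field_weight(*FIELD_PARAMS[field], agrees)
        for field, agrees in agreement.items()
    )


def classify(weight, lower=0.0, upper=10.0):
    """Two-threshold decision rule; the thresholds here are hypothetical."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "possible link"


# A pair that agrees on name and sex but disagrees on birth date.
pair = {"name": True, "birth_date": False, "sex": True}
w = match_weight(pair)
print(round(w, 2), classify(w))
```

In practice the m- and u-probabilities are estimated from the data (for example with the EM algorithm), and the thresholds are set from tolerated error rates rather than fixed by hand as they are in this sketch.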

Acknowledgement

This research was supported by the National Research Foundation of Korea (No. 2020R1I1A3071646).
