Secure Blocking + Secure Matching = Secure Record Linkage

  • Received : 2011.02.01
  • Accepted : 2011.03.20
  • Published : 2011.09.30

Abstract

Performing approximate data matching has always been an intriguing problem for both industry and academia. The task becomes even more challenging when data privacy must also be preserved. In this paper, we propose a novel technique for efficient privacy-preserving approximate record linkage. The secure framework we propose consists of two basic components. First, a secure blocking component uses phonetic algorithms, statistically enhanced to improve security. Second, a secure matching component performs the actual approximate matching using a novel private variant of the Levenshtein distance algorithm. Our goal is to combine the speed of private blocking with the increased accuracy of approximate secure matching.
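
The two-stage structure described in the abstract (block first, then match within blocks) can be illustrated with a minimal, non-private sketch. It is not the paper's protocol: it works on plaintext names and omits both the statistical enhancement of the blocking step and the private computation of the edit distance. The choice of Soundex as the phonetic algorithm, the function names, and the `max_distance` threshold are assumptions made for illustration only.

```python
from collections import defaultdict

# Standard American Soundex digit classes (assumed phonetic code for this sketch).
_SOUNDEX_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _SOUNDEX_CODES[_ch] = _digit


def soundex(name):
    """Return the 4-character Soundex code of a name ('' if it has no ASCII letters)."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    prev = _SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _SOUNDEX_CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "HW":  # H and W do not separate letters sharing a code
            prev = digit
    return (code + "000")[:4]


def levenshtein(a, b):
    """Plain dynamic-programming edit distance (insert, delete, substitute)."""
    prev_row = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        row = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            row.append(min(prev_row[j] + 1,          # deletion
                           row[j - 1] + 1,           # insertion
                           prev_row[j - 1] + cost))  # substitution or match
        prev_row = row
    return prev_row[-1]


def link(left, right, max_distance=1):
    """Blocking: group `right` by phonetic code. Matching: compare within blocks only."""
    blocks = defaultdict(list)
    for name in right:
        blocks[soundex(name)].append(name)
    matches = []
    for name in left:
        for candidate in blocks.get(soundex(name), []):
            if levenshtein(name.lower(), candidate.lower()) <= max_distance:
                matches.append((name, candidate))
    return matches


if __name__ == "__main__":
    print(link(["Smith", "Jones"], ["Smyth", "Johnson", "Smithe"]))
```

In this sketch the blocking step restricts edit-distance computations to pairs that share a phonetic code, which reflects the trade-off the abstract names: blocking supplies speed, and approximate matching within each block supplies accuracy.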
