Analysis of k Value from k-anonymity Model Based on Re-identification Time

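  • 김채운 (Graduate School of Information Security, Korea University)
  • 오준형 (Graduate School of Information Security, Korea University)
  • 이경호 (Graduate School of Information Security, Korea University)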
  • Received : 2020.11.20
  • Accepted : 2020.12.10
  • Published : 2020.12.31

Abstract

As data technology has developed, data is stored and shared more widely, and privacy invasions have followed. De-identification techniques were introduced to address this problem, but it has been shown repeatedly that individuals can still be identified from de-identified data. De-identification cannot guarantee complete safety, yet a sufficient level of it remains necessary. Current laws and regulations, however, do not specify quantitatively how much de-identification is required. In this paper, we propose a de-identification criterion based on the time required for re-identification. Among the various privacy models, we focus on the k-anonymity model. Using a re-identification method based on linkability, we analyzed how the time taken to re-identify data changes with the k value, and from that analysis we determined which k value is appropriate. If these results can be generalized into a model, that model could be used to define the appropriate level of de-identification in laws and regulations.
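As a rough illustration of the two ideas the abstract combines, the sketch below computes the k value of a small table and runs a simple linkage lookup against it. It is not the authors' method or data: the records, the quasi-identifier columns (`age`, `zip`), and the helper names `k_of` and `link_candidates` are all assumptions made up for this example.

```python
from collections import Counter

def k_of(records, quasi_identifiers):
    """Return the smallest equivalence-class size over the quasi-identifiers."""
    classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(classes.values())

def link_candidates(target, records, quasi_identifiers):
    """Return released records whose quasi-identifier values match the target's."""
    key = tuple(target[q] for q in quasi_identifiers)
    return [r for r in records if tuple(r[q] for q in quasi_identifiers) == key]

# Hypothetical released table, already generalized (age bands, masked zip codes).
released = [
    {"age": "30-39", "zip": "021**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "021**", "diagnosis": "asthma"},
    {"age": "40-49", "zip": "021**", "diagnosis": "flu"},
    {"age": "40-49", "zip": "021**", "diagnosis": "diabetes"},
]
qi = ["age", "zip"]

k = k_of(released, qi)

# Hypothetical background knowledge an attacker holds about one person.
known = {"age": "30-39", "zip": "021**"}
candidates = link_candidates(known, released, qi)

print(k, len(candidates))
```

In a k-anonymous table, every linkage lookup of this kind returns at least k candidate records, so an attacker needs additional effort to single out one individual. That extra effort is the intuition behind relating k to re-identification time.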

Acknowledgement

This work was supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2020-2015-0-00403) supervised by the IITP (Institute for Information & communications Technology Planning & Evaluation).