De-identifying Unstructured Medical Text and Attribute-based Utility Measurement

  • Ro, Gun (Department of Computer Engineering, Myongji University)
  • Chun, Jonghoon (Department of Data Technology, School of Software Convergence, Myongji University)
  • Received : 2019.02.13
  • Accepted : 2019.02.25
  • Published : 2019.02.28

Abstract

De-identification removes personal information from a data set so that the remaining information cannot be linked to a specific individual. It thereby lowers the risk of exposing personal information during the collection, processing, storage, and distribution of data. Although de-identification algorithms and protection models have been studied extensively, most of that work is limited to structured data, and relatively little attention has been paid to unstructured data. In the medical field in particular, where unstructured text is used frequently, practitioners often simply remove all personally identifiable information to lower the exposure risk, accepting the resulting loss of data utility. This study proposes a method for de-identifying unstructured medical text by applying the k-anonymity protection model; the medical field is targeted because de-identification is mandatory there and privacy protection is more critical than in other fields. The study also proposes a new utility metric that lets users judge the utility of a de-identified data set intuitively. If the results are applied to the various industrial fields that use unstructured text, we expect the utility of unstructured text containing personal information to increase.
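
As a rough illustration of the k-anonymity protection model mentioned above, the following Python sketch checks whether a set of records is k-anonymous with respect to chosen quasi-identifiers, before and after a simple generalization step. It is not the method proposed in this study: the record fields, the `generalize_age` helper, and the sample values are hypothetical, and the quasi-identifiers are assumed to have already been extracted from the unstructured text.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    that occurs in `records` occurs in at least k records."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())


def generalize_age(age, width=10):
    """Generalize an exact age to a range label such as '30-39' (hypothetical helper)."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


# Hypothetical quasi-identifiers, assumed to be already extracted from clinical notes.
raw = [
    {"age": 34, "sex": "F"},
    {"age": 37, "sex": "F"},
    {"age": 52, "sex": "M"},
    {"age": 58, "sex": "M"},
]

# Replace each exact age with a ten-year range; keep sex unchanged.
generalized = [{"age": generalize_age(r["age"]), "sex": r["sex"]} for r in raw]

raw_ok = is_k_anonymous(raw, ["age", "sex"], k=2)
generalized_ok = is_k_anonymous(generalized, ["age", "sex"], k=2)
```

In this sketch every raw record is unique, so the raw data fails the check for k = 2, while the generalized records form two groups of two and pass it. Generalizing instead of deleting the age keeps part of the attribute's information, which is the trade-off between exposure risk and utility that the abstract describes.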
