A Study on Measuring the Risk of Re-identification of Personal Information in Conversational Text Data using AI

  • Received : 2024.08.05
  • Accepted : 2024.10.07
  • Published : 2024.10.31

Abstract

Recent advances in artificial intelligence have produced a wide range of chatbots that efficiently handle everyday tasks such as hotel booking, news checking, and legal consultation. Generative chatbots such as ChatGPT, in particular, are broadening this range of uses by producing their own content in education, research, and the arts. Training these AI chatbots requires vast amounts of conversational text data, such as customer service conversation records, but training on unrefined conversational text data has led to privacy infringement cases involving AI chatbots both domestically and abroad. This study proposes a methodology for quantitatively measuring the re-identification risk of the personal information items contained in the conversational text data used to train AI chatbots. To validate the proposed methodology, we generated synthetic conversational data and ran our own empirical test on it, and we surveyed 220 external experts; the results confirmed that the proposed methodology is meaningful.

Acknowledgement

This study was funded by the Personal Information Protection Commission of the Republic of Korea and the Korea Internet & Security Agency (KISA), grant number 1781000017.
