A Personal Information Security System using Form Recognition and Optical Character Recognition in Electronic Documents

전자문서에서 서식인식과 광학문자인식을 이용한 개인정보 탐지 및 보호 시스템

  • Baek, Jong-Kyung (백종경), Division of Computer, Graduate School of Soongsil University
  • Jee, Yoon-Seok (지윤석), Department of IT Policy Management, Graduate School of Soongsil University
  • Park, Jae-Pyo (박재표), Graduate School of Information Science, Soongsil University
  • Received : 2020.02.04
  • Accepted : 2020.05.08
  • Published : 2020.05.31

Abstract

Form recognition and optical character recognition (OCR) are widely used to detect and protect personal information in electronic documents. However, because OCR engines have low recognition rates, personal information often goes undetected or false positives occur frequently, and analyzing a large number of electronic documents takes a long time. In this paper, we improve on existing methods to raise the image analysis speed for electronic documents, the character recognition rate of the OCR engine, and the detection rate of personal information. Form recognition is used to increase the analysis speed, and image correction improves both the analysis speed and the character recognition rate of the OCR engine. We also propose an algorithm for analyzing personal information in images to increase its detection rate. In experiments on 1755 image form recognition samples, analysis took 0.24 seconds on average, 0.5 seconds faster than the form recognition method of the conventional PAID system, and the image form recognition rate averaged 99%. The proposed method can serve as a system for protecting personal information in electronic documents in various fields such as the public sector, telecommunications, finance, tourism, and security.
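The abstract does not spell out the personal information analysis algorithm, so the following is only an illustrative sketch of one common approach: a regular-expression pass over OCR output, followed by a check-digit filter to reduce false positives. The choice of the Korean resident registration number as the target, the pattern, and the helper names `find_rrns` and `mask_rrns` are all assumptions made for illustration, not the authors' method.

```python
import re
from typing import List

# Assumed target: Korean resident registration numbers in OCR text,
# written as 6 digits, an optional hyphen or space, then 7 digits.
RRN_PATTERN = re.compile(r"(?<!\d)(\d{6})[-\s]?(\d{7})(?!\d)")

# Weights of the commonly documented check-digit rule. Treating this rule
# as a filter is an assumption; it may not hold for every issued number.
RRN_WEIGHTS = (2, 3, 4, 5, 6, 7, 8, 9, 2, 3, 4, 5)


def rrn_checksum_ok(digits: str) -> bool:
    """Return True if a 13-digit string passes the check-digit rule."""
    total = sum(int(d) * w for d, w in zip(digits[:12], RRN_WEIGHTS))
    return (11 - total % 11) % 10 == int(digits[12])


def find_rrns(ocr_text: str) -> List[str]:
    """Return checksum-valid candidates, normalized to NNNNNN-NNNNNNN."""
    hits = []
    for match in RRN_PATTERN.finditer(ocr_text):
        digits = match.group(1) + match.group(2)
        if rrn_checksum_ok(digits):
            hits.append(f"{digits[:6]}-{digits[6:]}")
    return hits


def mask_rrns(ocr_text: str) -> str:
    """Mask the last six digits of each checksum-valid candidate."""
    def _mask(match: "re.Match") -> str:
        digits = match.group(1) + match.group(2)
        if not rrn_checksum_ok(digits):
            return match.group(0)  # likely a false positive: leave as-is
        return f"{digits[:6]}-{digits[6]}******"

    return RRN_PATTERN.sub(_mask, ocr_text)


if __name__ == "__main__":
    # Synthetic sample text; neither number belongs to a real person.
    sample = "Name: Hong  RRN: 900101-1234568  Order no: 9001011234567"
    print(find_rrns(sample))
    print(mask_rrns(sample))
```

In this sketch the second 13-digit string in the sample fails the check-digit test and is left unmasked, which illustrates how a validation step after pattern matching can cut down on false positives from digit runs that merely look like personal information.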
