Classification and Analysis of Error Types for Deep Learning-Based Korean Spelling Correction

  • Koo, Seonmin (Konkuk University, Computer Science) ;
  • Park, Chanjun (Department of Computer Science and Engineering, Korea University) ;
  • So, Aram (Human-inspired Computing Research Center, Korea University) ;
  • Lim, Heuiseok (Department of Computer Science and Engineering, Korea University)
  • Received : 2021.10.29
  • Accepted : 2021.12.20
  • Published : 2021.12.28

Abstract

Recently, research on Korean spelling correction based on machine translation techniques and automatic noise generation has been actively conducted. These methods generate noise and use it for both the training set and the test set. The limitation is that noise other than the noise used for training is unlikely to appear in the test set, which makes accurate performance measurement difficult. In addition, there is no practical standard for classifying error types, so the error types used differ from study to study, which makes qualitative analysis difficult. To address these problems, this paper proposes a new 'error type classification scheme' for deep learning-based Korean spelling correction research and, based on it, performs an error analysis of existing commercial Korean spelling correctors (System A, System B, and System C). The analysis shows that the three correction systems corrected spacing errors but did not perform well on the other error types presented in this paper, and that they hardly recognized word order or tense errors.
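The evaluation concern raised in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the two noise functions, the names `make_pairs` and `unseen_error_types`, and the sample sentences are hypothetical and are not taken from the paper or from any of the evaluated systems.

```python
import random

# Hypothetical noise functions for illustration only; the paper does not specify them.
def drop_space(sentence, rng):
    """Spacing noise: delete one randomly chosen space."""
    positions = [i for i, ch in enumerate(sentence) if ch == " "]
    if not positions:
        return sentence
    i = rng.choice(positions)
    return sentence[:i] + sentence[i + 1:]

def swap_adjacent_words(sentence, rng):
    """Word-order noise: swap two adjacent words."""
    words = sentence.split(" ")
    if len(words) < 2:
        return sentence
    i = rng.randrange(len(words) - 1)
    words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)

NOISE_FUNCTIONS = {"spacing": drop_space, "word_order": swap_adjacent_words}

def make_pairs(clean_sentences, noise_types, seed=0):
    """Build (noisy, clean, error_type) triples from clean sentences."""
    rng = random.Random(seed)
    pairs = []
    for sentence in clean_sentences:
        error_type = rng.choice(noise_types)
        noisy = NOISE_FUNCTIONS[error_type](sentence, rng)
        pairs.append((noisy, sentence, error_type))
    return pairs

def unseen_error_types(train_pairs, test_pairs):
    """Error types that occur in the test set but never in the training set."""
    train_types = {t for _, _, t in train_pairs}
    test_types = {t for _, _, t in test_pairs}
    return test_types - train_types

train_sentences = ["나는 학교에 간다", "오늘 날씨가 좋다"]
test_sentences = ["나는 밥을 먹었다"]

# Same generator for both splits: the test set holds no unseen error type.
train = make_pairs(train_sentences, ["spacing"])
same_noise_test = make_pairs(test_sentences, ["spacing"], seed=1)
# A test set built from a different error type exposes the gap.
other_noise_test = make_pairs(test_sentences, ["word_order"], seed=1)
```

When the training and test splits come from the same generator, `unseen_error_types(train, same_noise_test)` is empty, so a score on that test set says little about error types the model never saw. A shared error type taxonomy makes such gaps visible and comparable across studies.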

Funding information

This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2018-0-01405) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation), and by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2021R1A6A1A03045425).
