Context-sensitive Spelling Error Correction using Eojeol N-gram

어절 N-gram을 이용한 문맥의존 철자오류 교정

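  • Minho Kim (김민호), Department of Electrical and Computer Engineering, Pusan National University
  • Hyuk-Chul Kwon (권혁철), School of Computer Science and Engineering, Pusan National University
  • Sungki Choi (최성기), Department of Electrical and Computer Engineering, Pusan National University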
  • Received : 2014.08.07
  • Accepted : 2014.09.24
  • Published : 2014.12.15

Abstract

Methods for correcting context-sensitive spelling errors fall broadly into rule-based methods and methods based on statistical information, and research has concentrated on the latter. Statistical methods treat context-sensitive spelling error correction as a word-sense disambiguation problem: a correction word pair, consisting of a correction target word and a replacement candidate word, is classified according to its context. This paper proposes combining an eojeol (word-phrase) n-gram model with the existing model in order to improve the performance of the probabilistic model based on correction word pairs, which resulted from this research team's earlier work. Two combined models are proposed: one interpolates the sentence probabilities computed by each model, and the other applies the two models one after the other. Both combined models achieve higher accuracy and recall than either the existing model or a model that uses only eojeol n-grams.
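The two ways of combining the models described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the interpolation weight `lam`, the decisiveness threshold `min_ratio`, and the toy probability tables are all assumptions made for illustration, and the abstract does not say how the sequential model decides when to hand over to the second model.

```python
# Toy sentence probabilities standing in for the two real models.
# Both tables are invented for illustration only.
PAIR_SCORES = {"a piece of cake": 0.4, "a peace of cake": 0.6}
NGRAM_SCORES = {"a piece of cake": 0.9, "a peace of cake": 0.1}


def pair_model(sentence):
    """Hypothetical stand-in for the correction-word-pair probability model."""
    return PAIR_SCORES[sentence]


def ngram_model(sentence):
    """Hypothetical stand-in for the eojeol n-gram sentence probability."""
    return NGRAM_SCORES[sentence]


def interpolate(p_pair, p_ngram, lam=0.5):
    """Linearly interpolate two probabilities; lam is an assumed weight."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    return lam * p_pair + (1.0 - lam) * p_ngram


def choose_interpolated(candidates, first_model, second_model, lam=0.5):
    """Pick the candidate sentence with the highest interpolated probability."""
    return max(
        candidates,
        key=lambda s: interpolate(first_model(s), second_model(s), lam),
    )


def choose_sequential(candidates, first_model, second_model, min_ratio=2.0):
    """Apply first_model; defer to second_model if it is not decisive.

    "Decisive" here means the best score is at least min_ratio times the
    runner-up score. This criterion is an assumption, not taken from the paper.
    """
    if len(candidates) == 1:
        return candidates[0]
    ranked = sorted(candidates, key=first_model, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    if first_model(best) >= min_ratio * first_model(runner_up):
        return best
    return max(candidates, key=second_model)


candidates = ["a piece of cake", "a peace of cake"]
interpolated_choice = choose_interpolated(candidates, pair_model, ngram_model)
sequential_choice = choose_sequential(candidates, pair_model, ngram_model)
```

In this toy setup the pair model alone slightly prefers the wrong sentence, but it is not decisive, so the sequential variant defers to the n-gram model. The interpolated variant reaches the same choice because the n-gram probability dominates the weighted sum at `lam=0.5`.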

Acknowledgement

Supported by: Pusan National University
