A Method for Detection and Correction of Pseudo-Semantic Errors Due to Typographical Errors

철자오류에 기인한 가의미 오류의 검출 및 교정 방법

  • Received : 2013.10.03
  • Accepted : 2013.10.21
  • Published : 2013.10.31

Abstract

Typographical errors introduced while drafting electronic documents are far more common than any other type of error. Many of these mistyped words remain plain spelling errors, but a considerable number turn into grammatical or semantic errors. Among them, pseudo-semantic errors (semantic errors that originate from typographical errors) clash with the meanings of the surrounding words in a sentence much more visibly than pure semantic errors do. Because of this prominent contextual discrepancy, they can be detected and corrected by a simple algorithm based on co-occurrence frequency. This paper proposes such a method for detecting and correcting semantic errors caused by typographical errors. In the proposed method, co-occurrence frequency is counted only for words that stand in a direct dependency relation in the dependency structure, and cosine similarity is used to decide whether a word is a pseudo-semantic error. Based on the experimental results presented, the proposed method is expected to improve the detection rate of an overall proofreading system (spelling checker) by about 2-3%.
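The abstract names two ingredients: co-occurrence counts restricted to directly dependent word pairs, and a cosine similarity test. The following is a minimal sketch of how they might fit together, not the paper's actual algorithm. The function names, the toy dependency pairs, the candidate list, and the threshold of 0.1 are all assumptions made for illustration; the paper's own scoring details are not given here.

```python
from collections import Counter, defaultdict
from math import sqrt


def build_profiles(dependency_pairs):
    """Count co-occurrences only for word pairs linked by a direct dependency."""
    profiles = defaultdict(Counter)
    for head, dependent in dependency_pairs:
        profiles[head][dependent] += 1
        profiles[dependent][head] += 1
    return profiles


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def context_fit(word, neighbors, profiles):
    """Compare a word's corpus profile with the neighbors seen in the sentence."""
    return cosine(profiles.get(word, Counter()), Counter(neighbors))


def detect_and_correct(word, neighbors, candidates, profiles, threshold=0.1):
    """Return (word, score); the word is replaced only if it fits poorly
    and some candidate fits the context strictly better.

    `threshold` is an assumed value, not one taken from the paper.
    """
    score = context_fit(word, neighbors, profiles)
    if score >= threshold:
        return word, score  # fits its context: not flagged
    best, best_score = word, score
    for candidate in candidates:
        candidate_score = context_fit(candidate, neighbors, profiles)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score


# Toy (head, dependent) pairs standing in for a dependency-parsed corpus.
pairs = [
    ("write", "letter"), ("write", "letter"), ("read", "letter"),
    ("see", "later"), ("drink", "water"),
]
profiles = build_profiles(pairs)

# "later" typed where "letter" was intended, directly dependent on "write".
corrected, fit = detect_and_correct("later", ["write"], ["letter", "latter"], profiles)
print(corrected, round(fit, 3))
```

In this sketch the candidate list is supplied by the caller; in a real proofreading system it would presumably come from a spelling-candidate generator such as an edit-distance lookup, which is outside the scope of the example.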
