Automatic Word Spacing Using Raw Corpus and a Morphological Analyzer

  • Received : 2014.08.28
  • Reviewed : 2014.10.29
  • Published : 2015.01.15

Abstract

This paper proposes a method for the automatic word spacing of Korean text: it takes a string with no spaces at all as input and restores the spacing using eojeol information extracted from a raw corpus. Eojeol monograms are used for word spacing, as opposed to the syllable n-grams used in previous studies. A Korean morphological analyzer is also used, but only for the limited purpose of correcting typical word spacing errors. For evaluation, 5.85 million pure-hangul eojeols were extracted from the 10-million-eojeol Sejong corpus, divided into 10 sets, and 10-fold cross-validated, giving a 98.06% syllable accuracy and a 94.15% eojeol recall. The method is also fast: it processes 250K eojeols, or 1.8 MB of text, per second on a typical personal computer. Because syllable accuracy and eojeol recall depend on the size of the eojeol dictionary, better performance is expected when the dictionary is built from a bigger corpus.
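The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea, under stated assumptions: an eojeol dictionary with counts is built from a correctly spaced corpus, and an unspaced string is segmented by dynamic programming so that the sum of eojeol log-probabilities is maximal. The function names, the `max_len` and `unknown_penalty` parameters, and the fallback for unseen syllables are all hypothetical, and the morphological-analyzer correction step is omitted. The two metric functions use common definitions of syllable accuracy and eojeol recall, which may differ in detail from those used in the paper.

```python
import math
from collections import Counter


def build_eojeol_counts(spaced_sentences):
    """Count eojeols (space-delimited units) in a correctly spaced corpus."""
    counts = Counter()
    for sentence in spaced_sentences:
        counts.update(sentence.split())
    return counts


def space_text(text, counts, max_len=10, unknown_penalty=-20.0):
    """Segment unspaced text, maximizing the sum of eojeol log-probabilities.

    max_len and unknown_penalty are illustrative assumptions, not values
    taken from the paper.
    """
    total = sum(counts.values())
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)          # back[i]: start of the last eojeol in text[:i]
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in counts:
                score = math.log(counts[piece] / total)
            elif end - start == 1:
                score = unknown_penalty  # fallback for an unseen single syllable
            else:
                continue
            if best[start] + score > best[end]:
                best[end] = best[start] + score
                back[end] = start
    # Walk the back-pointers from the end to recover the eojeols.
    pieces = []
    end = n
    while end > 0:
        start = back[end]
        pieces.append(text[start:end])
        end = start
    return " ".join(reversed(pieces))


def space_tags(spaced):
    """One boolean per syllable: True if an eojeol boundary follows it."""
    tags = []
    for eojeol in spaced.split():
        tags.extend([False] * (len(eojeol) - 1) + [True])
    return tags


def syllable_accuracy(reference, hypothesis):
    """Fraction of syllables whose boundary tag matches the reference."""
    ref, hyp = space_tags(reference), space_tags(hypothesis)
    return sum(r == h for r, h in zip(ref, hyp)) / len(ref)


def eojeol_spans(spaced):
    """Set of (start, end) syllable offsets, one per eojeol."""
    spans, pos = set(), 0
    for eojeol in spaced.split():
        spans.add((pos, pos + len(eojeol)))
        pos += len(eojeol)
    return spans


def eojeol_recall(reference, hypothesis):
    """Fraction of reference eojeols reproduced with identical boundaries."""
    ref = eojeol_spans(reference)
    return len(ref & eojeol_spans(hypothesis)) / len(ref)


corpus = ["나는 학교에 간다", "나는 집에 간다", "학교에 간다"]
counts = build_eojeol_counts(corpus)
print(space_text("나는학교에간다", counts))
```

In a sketch like this, every eojeol that is missing from the dictionary degrades into single syllables, which illustrates why the abstract expects accuracy and recall to improve as the eojeol dictionary grows.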

Project Information

Research project host institution : Sungshin Women's University
