Cloning of Korean Morphological Analyzers using Pre-analyzed Eojeol Dictionary and Syllable-based Probabilistic Model

  • Received : 2015.08.18
  • Accepted : 2016.01.06
  • Published : 2016.03.15

Abstract

This study examines whether a Korean morphological analyzer built from a pre-analyzed eojeol dictionary and a syllable-based probabilistic model is practical. To test this, two existing Korean morphological analyzers, MACH and KLT2000, were cloned using such a dictionary and model, and the output of each clone was compared with that of its original in terms of precision and recall. The 10-million-eojeol Sejong corpus was divided into 10 sets for 10-fold cross-validation. Taking MACH's output as the answer set, the MACH clone achieved a precision of 97.16% and a recall of 98.31%; the KLT2000 clone, evaluated against KLT2000's output, achieved a precision of 96.80% and a recall of 99.03%. The MACH clone analyzed 308,000 eojeols per second, and the KLT2000 clone analyzed 436,000 eojeols per second. These results indicate that a Korean morphological analyzer built from a pre-analyzed eojeol dictionary and a syllable-based probabilistic model performs well enough for practical applications.
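The cloning and evaluation procedure outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that each analyzer returns a set of candidate analyses per eojeol and that precision and recall are counted over those candidates; the abstract does not spell out either detail. The function names, the toy analyses, and the tag strings are hypothetical, and `fallback` is only a placeholder for the syllable-based probabilistic model, which is not reproduced here.

```python
from collections import defaultdict


def build_dictionary(training_pairs):
    """Record, for each eojeol, every analysis the reference analyzer produced."""
    dictionary = defaultdict(set)
    for eojeol, analyses in training_pairs:
        dictionary[eojeol].update(analyses)
    return dict(dictionary)


def analyze(eojeol, dictionary, fallback):
    """Use the pre-analyzed dictionary when possible, else the fallback model."""
    if eojeol in dictionary:
        return set(dictionary[eojeol])
    # Assumption: the paper uses a syllable-based probabilistic model here.
    return set(fallback(eojeol))


def precision_recall(clone_outputs, reference_outputs):
    """Count matching analyses over all eojeols (assumed evaluation scheme)."""
    produced = expected = correct = 0
    for clone, reference in zip(clone_outputs, reference_outputs):
        produced += len(clone)
        expected += len(reference)
        correct += len(clone & reference)
    precision = correct / produced if produced else 0.0
    recall = correct / expected if expected else 0.0
    return precision, recall


# Toy data (hypothetical): analyses produced by the reference analyzer.
training_pairs = [
    ("나는", {"나/NP+는/JX", "날/VV+는/ETM"}),
    ("학교에", {"학교/NNG+에/JKB"}),
]
dictionary = build_dictionary(training_pairs)


def fallback(eojeol):
    # Placeholder only: tags the whole unseen eojeol as a noun.
    return {eojeol + "/NNG"}


test_eojeols = ["나는", "간다"]
reference_outputs = [{"나/NP+는/JX", "날/VV+는/ETM"}, {"가/VV+ㄴ다/EF"}]
clone_outputs = [analyze(e, dictionary, fallback) for e in test_eojeols]

precision, recall = precision_recall(clone_outputs, reference_outputs)
print(f"precision={precision:.4f} recall={recall:.4f}")
```

In a 10-fold setting, the dictionary would be built from nine sets and the clone evaluated on the held-out set, with the scores averaged over the ten runs.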

Acknowledgement

Supported by: Sungshin Women's University
