A Machine Learning Approach to Korean Language Stemming

  • Published : 2001.12.01

Abstract

Morphological analysis and POS tagging require a dictionary for the language at hand. With this approach, however, it is impossible to analyze a language without a dictionary. Difficulties also arise when a significant portion of the vocabulary is new or unknown. This paper explores the possibility of learning the morphology of an agglutinative language, in particular Korean, without any prior lexical knowledge of the language. We use unsupervised learning, in that there is no instructor to guide the outcome of the learner, nor any tagged corpus. The main characteristics of the approach are as follows. First, we use only a raw corpus, without any attached tags or any dictionary. Second, unlike many heuristics that are theoretically ungrounded, this method is based on widely accepted statistical methods. The method is currently applied only to Korean, but since it is essentially language-neutral, it can easily be adapted to other agglutinative languages.
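
The abstract does not spell out the statistical procedure, so the following is only a minimal illustrative sketch of dictionary-free stemming from a raw corpus, not the paper's method. It swaps in a simple suffix-frequency heuristic: count how many distinct word types end with each candidate suffix, then strip the suffix with the highest score. The function names `suffix_counts` and `stem`, the toy corpus, the `max_len` and `min_types` thresholds, and the "type count times suffix length" score are all assumptions made for illustration.

```python
from collections import Counter


def suffix_counts(words, max_len=3):
    """Count how many distinct word types end with each candidate suffix."""
    counts = Counter()
    for w in set(words):
        # Always leave at least one character for the stem.
        for k in range(1, min(max_len, len(w) - 1) + 1):
            counts[w[-k:]] += 1
    return counts


def stem(word, counts, min_types=2, max_len=3):
    """Strip the best-scoring suffix, or return the word unchanged.

    Score (an assumption of this sketch): type count * suffix length,
    so a longer suffix wins when it is as frequent as a shorter one.
    """
    best, best_score = "", 0
    for k in range(1, min(max_len, len(word) - 1) + 1):
        suffix = word[-k:]
        n = counts.get(suffix, 0)
        score = n * k
        if n >= min_types and score > best_score:
            best, best_score = suffix, score
    return word[: len(word) - len(best)] if best else word


# Toy raw corpus: untagged, whitespace-separated Korean word forms.
corpus = "학교에서 집에서 학교는 친구는 도서관에서 친구를 학교를".split()
counts = suffix_counts(corpus)

for w in ["학교에서", "친구는", "도서관에서", "나무"]:
    print(w, "->", stem(w, counts))
```

On this toy corpus the particles 에서, 는, and 를 recur across several word types and are stripped, while a word with no recurring ending is left unchanged. A real system would need far more data and a principled statistical criterion for accepting a suffix.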
