자질집합선택 기반의 기계학습을 통한 한국어 기본구 인식의 성능향상

Improving the Performance of Korean Text Chunking by Machine Learning Approaches Based on Feature Set Selection

  • Published: 2002.10.01

Abstract

In this paper, we present an empirical study on improving Korean text chunking with machine learning and feature set selection. We focus on two issues: selecting a feature set suitable for Korean chunking, and alleviating data sparseness. To select a proper feature set, we use a heuristic search through the space of feature sets, using the performance estimated by a machine learning algorithm as a measure of the "incremental usefulness" of a particular feature set. In addition, to smooth data sparseness, we propose using a generalized part-of-speech tag set and selective lexical information, taking the characteristics of Korean into account. Experimental results showed that chunk tags and lexical information within the given context window are important features, while spacing-unit information is less important than the others; these findings are independent of the machine learning technique used. Furthermore, using selective lexical information not only gives a smoothing effect but also reduces the feature space compared with using all lexical information. With the selected feature set, Korean text chunking based on memory-based learning and decision tree learning achieved precision/recall of 90.99%/92.52% and 93.39%/93.41%, respectively.

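The abstract describes the feature-set search only at a high level. The sketch below is a minimal illustration of one plausible reading: a greedy forward search that keeps a candidate feature whenever the learner's estimated performance improves. The `evaluate` callback, the toy scorer, and the feature names (`chunk_tag[-1]`, `lexical[0]`, etc.) are hypothetical stand-ins, not the paper's actual procedure or tag set; in the real experiments the scorer would correspond to training the memory-based or decision-tree chunker on the chosen features and measuring precision/recall on held-out data.

```python
from typing import Callable, FrozenSet, Iterable, Tuple


def greedy_feature_selection(
    candidate_features: Iterable[str],
    evaluate: Callable[[FrozenSet[str]], float],
) -> Tuple[FrozenSet[str], float]:
    """Greedy forward search over feature sets.

    At each step the feature whose addition yields the largest gain in the
    estimated chunking performance (its "incremental usefulness") is kept;
    the search stops when no remaining feature improves the score.
    """
    selected: FrozenSet[str] = frozenset()
    best_score = evaluate(selected)
    remaining = set(candidate_features)

    while remaining:
        # Score every one-feature extension of the current set.
        trials = {f: evaluate(selected | {f}) for f in remaining}
        best_feature, score = max(trials.items(), key=lambda kv: kv[1])
        if score <= best_score:  # no remaining feature adds value
            break
        selected = selected | {best_feature}
        best_score = score
        remaining.remove(best_feature)

    return selected, best_score


if __name__ == "__main__":
    # Toy scorer standing in for "train a chunker on this feature set and
    # measure its held-out performance"; the scores below are made up.
    USEFULNESS = {
        "chunk_tag[-1]": 4.0,   # previous chunk tag
        "lexical[0]": 3.0,      # lexical form of the current word
        "pos[0]": 2.0,          # POS tag of the current word
        "spacing_unit": 0.0,    # spacing-unit information
    }

    def toy_evaluate(features: FrozenSet[str]) -> float:
        return sum(USEFULNESS.get(f, 0.0) for f in features)

    chosen, score = greedy_feature_selection(USEFULNESS, toy_evaluate)
    print(sorted(chosen), score)
```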