Improving the Performance of Korean Text Chunking by Machine Learning Approaches based on Feature Set Selection

Hwang, Young-Sook (Dept. of Computer Science, Korea University)
Chung, Hoo-jung (Dept. of Computer Science, Korea University)
Park, So-Young (Dept. of Computer Science, Korea University)
Kwak, Young-Jae (Dept. of Computer Science, Korea University)
Rim, Hae-Chang (Dept. of Computer Science, Korea University)
Abstract
In this paper, we present an empirical study on improving Korean text chunking through machine learning and feature set selection. We focus on two issues: selecting a feature set for Korean chunking, and alleviating data sparseness. To select a proper feature set, we use a heuristic method that searches the space of feature sets, using the performance estimated by a machine learning algorithm as a measure of the "incremental usefulness" of a particular feature set. To smooth sparse data, we propose using a general part-of-speech tag set together with selective lexical information, taking the characteristics of the Korean language into account. Experimental results show that chunk tags and lexical information within a given context window are important features, whereas spacing unit information is less important than the others; these findings are independent of the machine learning technique used. Furthermore, using selective lexical information not only provides a smoothing effect but also reduces the feature space compared with using all lexical information. With the selected feature space, Korean text chunking achieved precision/recall of 90.99%/92.52% with memory-based learning and 93.39%/93.41% with decision tree learning.
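The heuristic search over feature sets described above can be illustrated with a minimal sketch of greedy forward selection, in which a feature is added only if it raises the performance estimated by a learner. The abstract does not give the exact search procedure, so the stopping rule, the feature names (pos, prev_chunk, spacing), the chunk tags, the toy data, and the simple 1-nearest-neighbour learner with an overlap metric are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# One instance: (feature name -> value, chunk tag). All names are hypothetical.
Instance = Tuple[Dict[str, str], str]


def overlap_1nn_accuracy(train: Sequence[Instance],
                         test: Sequence[Instance],
                         features: Sequence[str]) -> float:
    """Accuracy of a 1-nearest-neighbour classifier that counts matching
    feature values (overlap metric), restricted to `features`."""
    correct = 0
    for x, gold in test:
        # max() returns the first training instance with the highest overlap.
        nearest = max(train, key=lambda inst: sum(inst[0][f] == x[f] for f in features))
        if nearest[1] == gold:
            correct += 1
    return correct / len(test)


def greedy_forward_selection(candidates: Sequence[str],
                             score: Callable[[List[str]], float],
                             min_gain: float = 1e-9) -> Tuple[List[str], float]:
    """Add, one at a time, the feature with the largest estimated gain;
    stop when no remaining feature improves the score (assumed stopping rule)."""
    selected: List[str] = []
    best = score(selected)
    remaining = list(candidates)
    while remaining:
        trials = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(trials, key=lambda t: t[0])
        if top_score - best <= min_gain:  # no "incremental usefulness" left
            break
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected, best


# Toy data, invented for this sketch: POS tag, previous chunk tag, spacing flag.
TRAIN: List[Instance] = [
    ({"pos": "N", "prev_chunk": "O", "spacing": "0"}, "B-NP"),
    ({"pos": "N", "prev_chunk": "B-NP", "spacing": "1"}, "I-NP"),
    ({"pos": "N", "prev_chunk": "I-NP", "spacing": "0"}, "I-NP"),
    ({"pos": "V", "prev_chunk": "I-NP", "spacing": "0"}, "B-VP"),
    ({"pos": "V", "prev_chunk": "B-VP", "spacing": "1"}, "I-VP"),
    ({"pos": "V", "prev_chunk": "O", "spacing": "1"}, "B-VP"),
]
TEST: List[Instance] = [
    ({"pos": "N", "prev_chunk": "O", "spacing": "1"}, "B-NP"),
    ({"pos": "N", "prev_chunk": "B-NP", "spacing": "0"}, "I-NP"),
    ({"pos": "V", "prev_chunk": "I-NP", "spacing": "1"}, "B-VP"),
    ({"pos": "V", "prev_chunk": "B-VP", "spacing": "0"}, "I-VP"),
    ({"pos": "V", "prev_chunk": "O", "spacing": "0"}, "B-VP"),
]

selected, best = greedy_forward_selection(
    ["pos", "prev_chunk", "spacing"],
    lambda feats: overlap_1nn_accuracy(TRAIN, TEST, feats),
)
print(selected, best)
```

On this toy data the search keeps pos and prev_chunk and leaves out spacing, because adding the spacing flag lowers the estimated accuracy. A realistic setup would estimate performance by cross-validation on annotated data rather than on a single small held-out set.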
Keywords
Korean Base Phrase Recognition; Machine Learning; Decision Tree; Feature Set Selection; Memory-based Learning