자질집합선택 기반의 기계학습을 통한 한국어 기본구 인식의 성능향상

Improving the Performance of Korean Text Chunking by Machine Learning Approaches Based on Feature Set Selection

  • Published: 2002.10.01

Abstract

In this paper, we present an empirical study on improving Korean text chunking with machine learning and feature set selection. We focus on two issues: selecting a feature set suitable for Korean chunking, and alleviating data sparseness. To select a proper feature set, we use a heuristic search through the space of feature sets, using the performance estimated by a machine learning algorithm as a measure of the "incremental usefulness" of a particular feature set. In addition, to smooth data sparseness, we propose using a generalized part-of-speech tag set and selective lexical information, taking the characteristics of Korean into account. Experimental results showed that chunk tags and lexical information within the given context window are important features, while spacing-unit information is less important than the others; these findings are independent of the machine learning technique used. Furthermore, using selective lexical information not only gives a smoothing effect but also reduces the feature space compared with using all lexical information. With the selected feature set, Korean text chunking based on memory-based learning and decision tree learning achieved precision/recall of 90.99%/92.52% and 93.39%/93.41%, respectively.

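The abstract describes the feature-set search only at a high level. The sketch below is a minimal illustration of one plausible reading: a greedy forward search that keeps a candidate feature whenever the learner's estimated performance improves. The `evaluate` callback, the toy scorer, and the feature names (`chunk_tag[-1]`, `lexical[0]`, etc.) are hypothetical stand-ins, not the paper's actual procedure or tag set; in the real experiments the scorer would correspond to training the memory-based or decision-tree chunker on the chosen features and measuring precision/recall on held-out data.

```python
from typing import Callable, FrozenSet, Iterable, Tuple


def greedy_feature_selection(
    candidate_features: Iterable[str],
    evaluate: Callable[[FrozenSet[str]], float],
) -> Tuple[FrozenSet[str], float]:
    """Greedy forward search over feature sets.

    At each step the feature whose addition yields the largest gain in the
    estimated chunking performance (its "incremental usefulness") is kept;
    the search stops when no remaining feature improves the score.
    """
    selected: FrozenSet[str] = frozenset()
    best_score = evaluate(selected)
    remaining = set(candidate_features)

    while remaining:
        # Score every one-feature extension of the current set.
        trials = {f: evaluate(selected | {f}) for f in remaining}
        best_feature, score = max(trials.items(), key=lambda kv: kv[1])
        if score <= best_score:  # no remaining feature adds value
            break
        selected = selected | {best_feature}
        best_score = score
        remaining.remove(best_feature)

    return selected, best_score


if __name__ == "__main__":
    # Toy scorer standing in for "train a chunker on this feature set and
    # measure its held-out performance"; the scores below are made up.
    USEFULNESS = {
        "chunk_tag[-1]": 4.0,   # previous chunk tag
        "lexical[0]": 3.0,      # lexical form of the current word
        "pos[0]": 2.0,          # POS tag of the current word
        "spacing_unit": 0.0,    # spacing-unit information
    }

    def toy_evaluate(features: FrozenSet[str]) -> float:
        return sum(USEFULNESS.get(f, 0.0) for f in features)

    chosen, score = greedy_feature_selection(USEFULNESS, toy_evaluate)
    print(sorted(chosen), score)
```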