http://dx.doi.org/10.3745/KIPSTB.2003.10B.1.047

Segmenting and Classifying Korean Words based on Syllables Using Instance-Based Learning  

Kim, Jae-Hoon (Department of Computer Engineering, Korea Maritime University)
Lee, Kong-Joo (Department of Computer Engineering, Ewha Womans University)
Abstract
Korean, like English, delimits words with white space, but Korean words differ somewhat in structure from English ones. A white-space-delimited unit in English generally consists of a single word, whereas in Korean it consists of one or more words and/or morphemes. Because of this difference, a unit between white spaces is called an Eojeol in Korean. We propose a method for segmenting and classifying Korean words and/or morphemes at the syllable level using instance-based learning. The feature set for instance-based learning comprises the previous syllable, the current syllable, the next two syllables, the final consonant of the current syllable, and the two previous categories. On the ETRI and KAIST corpora, our method achieves an F-measure of more than 97% for word segmentation.
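The feature set described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The padding symbol, the tag names (`B-N`, `I-N`, `B-J`), and the function names are hypothetical. The classifier is a plain 1-nearest-neighbour lookup with an unweighted overlap metric, chosen here only as the simplest form of instance-based learning. The final consonant is obtained from the standard Unicode layout of precomposed Hangul syllables.

```python
from typing import List, Sequence, Tuple

PAD = "_"  # hypothetical padding symbol for positions outside the sentence

# Final consonants (jongseong) in Unicode order; index 0 means "no final consonant".
JONGSEONG = ["", "ㄱ", "ㄲ", "ㄳ", "ㄴ", "ㄵ", "ㄶ", "ㄷ", "ㄹ", "ㄺ", "ㄻ", "ㄼ", "ㄽ", "ㄾ",
             "ㄿ", "ㅀ", "ㅁ", "ㅂ", "ㅄ", "ㅅ", "ㅆ", "ㅇ", "ㅈ", "ㅊ", "ㅋ", "ㅌ", "ㅍ", "ㅎ"]

Features = Tuple[str, ...]
Instance = Tuple[Features, str]


def final_consonant(syllable: str) -> str:
    """Return the final consonant of a precomposed Hangul syllable, or "" if none."""
    offset = ord(syllable) - 0xAC00
    if 0 <= offset <= 0xD7A3 - 0xAC00:
        return JONGSEONG[offset % 28]
    return ""


def extract_features(syllables: Sequence[str], i: int, prev_tags: Sequence[str]) -> Features:
    """Previous syllable, current syllable, next two syllables,
    final consonant of the current syllable, and the two previous categories."""
    def at(k: int) -> str:
        return syllables[k] if 0 <= k < len(syllables) else PAD

    tags = [PAD, PAD] + list(prev_tags)
    return (at(i - 1), at(i), at(i + 1), at(i + 2),
            final_consonant(at(i)), tags[-2], tags[-1])


def build_memory(tagged: Sequence[Tuple[Sequence[str], Sequence[str]]]) -> List[Instance]:
    """Store every training syllable as a (features, category) instance."""
    memory: List[Instance] = []
    for syllables, tags in tagged:
        for i in range(len(syllables)):
            memory.append((extract_features(syllables, i, tags[:i]), tags[i]))
    return memory


def classify(features: Features, memory: Sequence[Instance]) -> str:
    """Return the category of the stored instance sharing the most feature values."""
    def overlap(a: Features, b: Features) -> int:
        return sum(x == y for x, y in zip(a, b))

    return max(memory, key=lambda inst: overlap(features, inst[0]))[1]


def tag(syllables: Sequence[str], memory: Sequence[Instance]) -> List[str]:
    """Tag left to right, feeding predicted categories back in as features."""
    tags: List[str] = []
    for i in range(len(syllables)):
        tags.append(classify(extract_features(syllables, i, tags), memory))
    return tags


# Toy training data with hypothetical tags: B-x begins a unit of category x, I-x continues it.
memory = build_memory([(list("학교에"), ["B-N", "I-N", "B-J"])])
print(tag(list("학교에"), memory))  # ['B-N', 'I-N', 'B-J']
```

In this sketch, a segment boundary falls wherever a `B-` tag is predicted, so segmentation and classification come from the same per-syllable decision. Feature weighting and neighbourhood size are left out and would need tuning on real corpora.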
Keywords
Word segmentation; Instance-based learning