Segmenting and Classifying Korean Words based on Syllables Using Instance-Based Learning

Kim, Jae-Hoon;Lee, Kong-Joo;

doi:10.3745/KIPSTB.2003.10B.1.047

The KIPS Transactions:PartB (정보처리학회논문지B)

Volume 10B, Issue 1, pp. 47-56, 2003, pISSN 1598-284X

Korea Information Processing Society (한국정보처리학회)


사례기반 학습을 이용한 음절기반 한국어 단어 분리 및 범주 결정

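Kim, Jae-Hoon (Department of Computer Engineering, Korea Maritime University);
Lee, Kong-Joo (Department of Computer Engineering, Ewha Womans University)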

Published : 2003.02.01

https://doi.org/10.3745/KIPSTB.2003.10B.1.047

Abstract

Like English, Korean delimits words with white space, but the structure of a Korean word differs somewhat from that of an English word. In English, a unit delimited by white space generally consists of a single word, whereas in Korean it may contain one or more words and/or morphemes. Because of this difference, a unit between white spaces is called an Eojeol in Korean. We propose a method for segmenting and classifying the Korean words and/or morphemes within an Eojeol, syllable by syllable, using instance-based learning. The feature set for the instance-based learner consists of the previous syllable, the current syllable, the next two syllables, the final consonant of the current syllable, and the two previous categories. Our method achieves an F-measure of more than 97% for word segmentation on both the ETRI corpus and the KAIST corpus.
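The feature set listed in the abstract can be illustrated with a short Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the padding symbol, the B-/I- style category labels, the helper names, and the plain overlap-distance nearest-neighbour classifier are all hypothetical choices made here. Only the seven features themselves come from the abstract. The final consonant is read from the Unicode code point of a precomposed Hangul syllable, where the index modulo 28 identifies the final consonant and 0 means none.

```python
from typing import List, Sequence, Tuple

PAD = "_"  # hypothetical padding symbol for positions outside the Eojeol

Features = Tuple[str, str, str, str, int, str, str]
Instance = Tuple[Features, str]


def final_consonant_index(syllable: str) -> int:
    """Index (0-27) of the final consonant; 0 = none, -1 = not Hangul."""
    code = ord(syllable)
    if 0xAC00 <= code <= 0xD7A3:  # precomposed Hangul syllables block
        return (code - 0xAC00) % 28
    return -1


def extract_features(eojeol: str, i: int, prev_tags: Sequence[str]) -> Features:
    """Features for syllable i: previous syllable, current syllable,
    next two syllables, final consonant of the current syllable,
    and the two previous categories."""
    n = len(eojeol)
    padded_tags = [PAD, PAD] + list(prev_tags)
    return (
        eojeol[i - 1] if i > 0 else PAD,
        eojeol[i],
        eojeol[i + 1] if i + 1 < n else PAD,
        eojeol[i + 2] if i + 2 < n else PAD,
        final_consonant_index(eojeol[i]),
        padded_tags[-2],
        padded_tags[-1],
    )


def build_memory(tagged: Sequence[Tuple[str, List[str]]]) -> List[Instance]:
    """Store one (features, category) instance per training syllable."""
    memory = []
    for eojeol, tags in tagged:
        for i in range(len(eojeol)):
            memory.append((extract_features(eojeol, i, tags[:i]), tags[i]))
    return memory


def overlap_distance(a: Features, b: Features) -> int:
    """Number of feature positions on which two instances disagree."""
    return sum(x != y for x, y in zip(a, b))


def tag(memory: Sequence[Instance], eojeol: str) -> List[str]:
    """Tag left to right, feeding predicted categories back as features."""
    tags: List[str] = []
    for i in range(len(eojeol)):
        feats = extract_features(eojeol, i, tags)
        nearest = min(memory, key=lambda inst: overlap_distance(inst[0], feats))
        tags.append(nearest[1])
    return tags


# Hypothetical training example: 학교 (noun) + 에 (particle).
memory = build_memory([("학교에", ["B-N", "I-N", "B-J"])])
print(tag(memory, "학교에"))
```

A label that begins with B marks the first syllable of a word, so segmentation and classification are both recovered from one label sequence. Real instance-based learners typically add feature weighting and use more than one neighbour, which this sketch omits.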

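Korean, like English, uses white space as a word boundary, but the structure of its words differs somewhat from that of English. In English, a single word generally appears between white spaces, whereas in Korean several words or morphemes may appear there. Because of this difference, a unit delimited by white space is generally called an Eojeol in Korean. This paper proposes a method that separates the words contained in an Eojeol and determines the appropriate category of each separated word. The method uses instance-based machine learning and segments words at the syllable level. The feature set used for instance-based learning consists of the previous syllable, the current syllable, the next two syllables, the final-consonant (batchim) information of the current syllable, and the two previous categories. The proposed system was evaluated on the ETRI corpus and the KAIST corpus, and it performed relatively well, with a word-segmentation F-measure above 97% on both corpora.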
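The abstract reports an F-measure for word segmentation without saying how it is computed. The Python sketch below shows one common word-level definition, assumed here for illustration: a predicted word counts as correct only if both of its boundaries match the gold segmentation. The function names are hypothetical.

```python
from typing import List, Set, Tuple


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Convert a segmentation into a set of (start, end) syllable offsets."""
    spans = set()
    start = 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def segmentation_f_measure(gold: List[str], predicted: List[str]) -> float:
    """Harmonic mean of word-level precision and recall."""
    gold_spans = word_spans(gold)
    pred_spans = word_spans(predicted)
    correct = len(gold_spans & pred_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


# Gold: 학교 + 에. Over-segmenting 학교 leaves only 에 correct.
print(segmentation_f_measure(["학교", "에"], ["학", "교", "에"]))
```

In this example one of three predicted words and one of two gold words match, so precision is 1/3 and recall is 1/2.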



Cited by

Syllable-based Korean POS Tagging Based on Combining a Pre-analyzed Dictionary with Machine Learning vol.43, pp.3, 2016, https://doi.org/10.5626/JOK.2016.43.3.362
