A Korean Language Stemmer based on Unsupervised Learning

Jo, Se-Hyeong

The KIPS Transactions: Part B (정보처리학회논문지B)

Volume 8B, Issue 6, Pages 675-684, 2001, pISSN 1598-284X

Korea Information Processing Society (한국정보처리학회)

Korean title: 자율 학습에 의한 실질 형태소와 형식 형태소의 분리 (Separation of Lexical and Functional Morphemes by Unsupervised Learning)

Jo, Se-Hyeong (조세형), Myongji University (명지대학교)

Published: 2001.01.01

Abstract

This paper describes a method for stemming Korean using unsupervised learning from a raw corpus. The technique requires no lexicon or other language-specific knowledge. Because learning is unsupervised, the time and effort required for training are negligible. Unlike heuristic approaches, which lack theoretical grounding, the method rests on widely accepted statistical methods and can therefore be extended easily. It has so far been applied only to Korean, but because it is not language-dependent it can readily be adapted to other agglutinative languages.

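This paper describes a technique that applies unsupervised learning to a plain, untagged corpus in order to separate the lexical (content) morphemes of Korean from its functional (grammatical) morphemes, for uses such as extracting index terms for information retrieval. The technique requires no language-related knowledge such as a dictionary; it needs only a plain corpus. Because learning is unsupervised, no human intervention is needed, so training takes almost no time or effort. Because the approach relies on well-established statistical methodology, it has a firm theoretical basis, unlike typical heuristics, and is easy to extend and develop. The results were first applied to Korean, but since the method is not specific to Korean, it should be readily applicable to other agglutinative languages.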
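The abstract does not say which statistics the method relies on, so the following is only a generic illustration of lexicon-free stem/ending separation and should not be read as the paper's algorithm. As a minimal sketch, it assumes a word list taken from a raw corpus and scores each candidate split by how many distinct word types share the left part and how many share the right part. The function names `train` and `split` and the toy corpus are hypothetical.

```python
from collections import Counter


def train(words):
    """Count how many distinct word types share each prefix and each suffix."""
    prefix_counts, suffix_counts = Counter(), Counter()
    for w in set(words):  # count types, not tokens
        for i in range(1, len(w)):
            prefix_counts[w[:i]] += 1
            suffix_counts[w[i:]] += 1
    return prefix_counts, suffix_counts


def split(word, prefix_counts, suffix_counts):
    """Return (stem, ending) at the split whose two parts recur most often.

    A word with no recurring parts is returned whole with an empty ending.
    """
    best, best_score = (word, ""), 0
    for i in range(1, len(word)):
        stem, ending = word[:i], word[i:]
        # Illustrative score only: product of the two type counts.
        score = prefix_counts[stem] * suffix_counts[ending]
        if score > best_score:
            best, best_score = (stem, ending), score
    return best


# Toy corpus (assumed for illustration): two nouns with three particles each.
corpus = ["학교가", "학교를", "학교에", "친구가", "친구를", "친구에"]
pc, sc = train(corpus)
print(split("학교를", pc, sc))  # ('학교', '를')
print(split("바다", pc, sc))    # ('바다', '') because neither part was seen
```

This naive score favors short left parts as the corpus grows, since short prefixes are shared by many words; a practical system would need a better-founded statistic, which is what the paper says its approach provides.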
