http://dx.doi.org/10.6109/JKIICE.2009.13.9.1898

Phase-based Model Using Web Documents for Korean Unknown Word Recognition  

Park, So-Young (Division of Digital Media, Sangmyung University)
Abstract
Recently, real-world documents such as newspapers and blogs have come to include newly coined words such as "Wikipedia". However, most previous information processing technologies cannot handle these words because their dictionaries are built from materials collected during system development. In this paper, we propose a model that automatically recognizes Korean unknown words, that is, words missing from the previously constructed dictionary. The proposed model consists of an unknown noun recognition phase based on full text analysis, an unknown verb recognition phase based on web document frequency, and an unknown noun recognition phase based on web document frequency. Through full text analysis, the model can accurately recognize unknown words that occur repeatedly in a document. By using web documents, it can also recognize, with broad coverage, unknown words that occur only once in the document. In addition, the proposed model can recognize both Korean unknown verbs, whose syllables can change from the base form through inflection, and Korean unknown nouns, whose syllables do not change in any eojeol. Experimental results show that the proposed model improves precision by 1.01% and recall by 8.50% compared with a previous model.
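The division of labor between full text analysis and web document frequency can be illustrated with a minimal Python sketch. It is not the paper's implementation: the particle list, the function names, the `min_freq` threshold, and the `web_doc_freq` lookup table are all hypothetical, and the verb phase is omitted because handling inflection would need a morphological model that the abstract does not describe. The sketch only relies on the property stated above, that a noun's syllables stay the same in every eojeol.

```python
from collections import defaultdict

# Hypothetical, tiny particle (josa) list; a real system would use a full lexicon.
PARTICLES = ("에서", "은", "는", "이", "가", "을", "를", "의", "에")


def split_candidates(eojeol):
    """Return (stem, particle) pairs; an empty particle means the bare eojeol."""
    pairs = [(eojeol, "")]
    for p in PARTICLES:
        if eojeol.endswith(p) and len(eojeol) > len(p):
            pairs.append((eojeol[: -len(p)], p))
    return pairs


def full_text_phase(eojeols, dictionary):
    """Assumed rule: accept an out-of-dictionary stem that appears with
    at least two distinct particles somewhere in the same document."""
    contexts = defaultdict(set)
    for e in eojeols:
        for stem, particle in split_candidates(e):
            if stem not in dictionary:
                contexts[stem].add(particle)
    return {s for s, ps in contexts.items() if len(ps) >= 2}


def web_frequency_phase(eojeols, dictionary, already_found, web_doc_freq, min_freq=100):
    """Assumed rule: for stems not settled by the full text phase, accept the
    candidate whose (hypothetical) web document frequency reaches min_freq."""
    found = set()
    for e in eojeols:
        stems = [s for s, p in split_candidates(e)
                 if p and s not in dictionary and s not in already_found]
        if not stems:
            continue
        best = max(stems, key=lambda s: web_doc_freq.get(s, 0))
        if web_doc_freq.get(best, 0) >= min_freq:
            found.add(best)
    return found


# Toy document: "위키피디아" occurs twice, "블로그" only once.
dictionary = {"문서"}
eojeols = "위키피디아는 위키피디아를 블로그를 문서".split()
repeated = full_text_phase(eojeols, dictionary)
single = web_frequency_phase(eojeols, dictionary, repeated, {"블로그": 5000})
print(repeated, single)
```

In this toy run the repeated word is caught from the document alone, while the word that occurs once is accepted only because the assumed web frequency table supports it, which mirrors the accuracy and coverage roles the abstract assigns to the two kinds of phase.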
Keywords
Unknown word recognition; Korean language processing; web-based approach; full text analysis-based approach