
A Stochastic Word-Spacing System Based on Word Category-Pattern  

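Kang, Mi-Young (Department of Computer Engineering, Pusan National University)
Jung, Sung-Won (Department of Computer Engineering, Pusan National University)
Kwon, Hyuk-Chul (Department of Computer Engineering, Pusan National University)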
Abstract
This paper implements an automatic Korean word-spacing system based on word recognition, using morpheme unigrams and the pattern formed by the categories of those morphemes within a candidate word. Previous Korean word-spacing models based on word unigrams are easy to construct and time-efficient, but they still suffer from data sparseness and large memory requirements, both of which arise from the morpho-typological characteristics of Korean. To cope with both problems, our implementation uses the stochastic information of morpheme unigrams and their category patterns instead of word unigrams. A word's probability in a sentence is obtained from the probabilities of its morphemes and the weight of each morpheme's category within the category pattern of the candidate word. The category weights are trained to minimize the mean error between the observed probabilities of words and the probabilities estimated from each word's individual morphemes, with each morpheme's probability weighted according to the power of its category within the word's category pattern.
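The estimation and training steps described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so everything below is an assumption made for illustration: the word's log-probability is modeled as a weighted sum of its morphemes' log-probabilities, one weight is kept per (category pattern, position), and the weights are fitted by plain gradient descent on the mean squared error in log space. The function names, the category tags `N` and `J`, and the toy probabilities are all hypothetical.

```python
import math


def estimate_log_prob(morphemes, weights, morph_prob):
    """Assumed form: log P(word) ~ sum_i w[(pattern, i)] * log P(morpheme_i).

    `morphemes` is a list of (morpheme, category) pairs; the tuple of
    categories is the word's category pattern. Unseen weights default to 1.0.
    """
    pattern = tuple(cat for _, cat in morphemes)
    return sum(
        weights.get((pattern, i), 1.0) * math.log(morph_prob[m])
        for i, (m, _) in enumerate(morphemes)
    )


def train_weights(samples, morph_prob, lr=0.01, epochs=500):
    """Fit category weights by gradient descent on the mean squared error
    between observed and estimated word log-probabilities (an assumed loss)."""
    weights = {}
    n = len(samples)
    for _ in range(epochs):
        grads = {}
        for morphemes, observed_prob in samples:
            pattern = tuple(cat for _, cat in morphemes)
            err = estimate_log_prob(morphemes, weights, morph_prob) - math.log(observed_prob)
            for i, (m, _) in enumerate(morphemes):
                key = (pattern, i)
                grads[key] = grads.get(key, 0.0) + 2 * err * math.log(morph_prob[m]) / n
        for key, g in grads.items():
            weights[key] = weights.get(key, 1.0) - lr * g
    return weights


# Toy, made-up data: morpheme unigram probabilities and observed word probabilities.
toy_morph_prob = {"학교": 0.01, "집": 0.02, "에": 0.05}
toy_samples = [
    ([("학교", "N"), ("에", "J")], 0.002),
    ([("집", "N"), ("에", "J")], 0.003),
]
toy_weights = train_weights(toy_samples, toy_morph_prob)
print(toy_weights)
```

Because the weights attach to categories within a pattern rather than to whole words, such a model needs only morpheme unigram counts plus a small weight table, which is the memory and sparseness argument the abstract makes against word unigrams.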
Keywords
Korean word-spacing; word-unigram; morpheme-unigram; category pattern; stochastic information