[KSCI] Korea Science Citation Index Service

Two Statistical Models for Automatic Word Spacing of Korean Sentences

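Do-Gil Lee (이도길; Department of Computer Science, Korea University)
Sang-Zoo Lee (이상주; NLP Solution Co., Ltd.)
Heui-Seok Lim (임희석; Division of Information and Communication, Cheonan University)
Hae-Chang Rim (임해창; Department of Computer Science, Korea University)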

Publication Information

Journal of KIISE: Software and Applications / v.30, no.3_4, 2003, pp. 358-371

Abstract

Automatic word spacing is the process of deciding the correct boundaries between words in a sentence that contains spacing errors. It is important for improving readability and for conveying the accurate meaning of a text to the reader. Previous statistical approaches to automatic word spacing do not consider the previous spacing state, and thus cannot avoid estimating inaccurate probabilities. In this paper, we propose two statistical word spacing models that address this problem. The proposed models are based on the observation that automatic word spacing can be regarded as a classification problem, like POS tagging. By generalizing hidden Markov models, they can consider a broader context and estimate more accurate probabilities. We evaluated the proposed models under a wide range of experimental conditions in order to compare them with the current state of the art, and we also provide a detailed error analysis of our models. The experimental results show that the proposed models achieve a syllable-unit accuracy of 98.33% and an Eojeol-unit precision of 93.06% under the evaluation method that takes compound nouns into account.
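
To make the classification view concrete, the following is a minimal sketch of word spacing as syllable tagging with a plain first-order hidden Markov model. It is not the authors' method: the paper's two models generalize HMMs to use broader context, whereas this sketch only shows the basic idea of letting each spacing decision depend on the previous spacing state. The tag set (1 = a space follows the syllable, 0 = no space follows), the add-one smoothing, the function names `to_tagged`, `train`, and `space`, and the two toy training sentences are all assumptions made for illustration.

```python
import math
from collections import Counter


def to_tagged(sentence):
    """Turn a correctly spaced sentence into (syllable, tag) pairs.
    Assumed tag set: 1 = a space follows the syllable, 0 = no space follows."""
    chars = list(sentence.strip())
    pairs = []
    for i, ch in enumerate(chars):
        if ch == " ":
            continue
        follows_space = i + 1 < len(chars) and chars[i + 1] == " "
        pairs.append((ch, 1 if follows_space else 0))
    return pairs


def train(sentences):
    """Count tag transitions and syllable emissions from spaced sentences."""
    trans = {0: Counter(), 1: Counter()}  # trans[previous_tag][tag]
    emit = {0: Counter(), 1: Counter()}   # emit[tag][syllable]
    vocab = set()
    for sentence in sentences:
        prev = 1  # assumption: a sentence start behaves like "after a space"
        for syllable, tag in to_tagged(sentence):
            trans[prev][tag] += 1
            emit[tag][syllable] += 1
            vocab.add(syllable)
            prev = tag
    return trans, emit, vocab


def space(text, model):
    """Insert spaces into text with Viterbi decoding over the two tags."""
    trans, emit, vocab = model
    syllables = [ch for ch in text if ch != " "]
    if not syllables:
        return ""
    tags = (0, 1)

    def log_t(prev, tag):  # add-one smoothed log P(tag | previous tag)
        return math.log((trans[prev][tag] + 1) / (sum(trans[prev].values()) + 2))

    def log_e(tag, syllable):  # add-one smoothed log P(syllable | tag)
        total = sum(emit[tag].values())
        return math.log((emit[tag][syllable] + 1) / (total + len(vocab) + 1))

    best = {t: log_t(1, t) + log_e(t, syllables[0]) for t in tags}
    backpointers = []
    for syllable in syllables[1:]:
        step, pointer = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: best[p] + log_t(p, t))
            step[t] = best[prev] + log_t(prev, t) + log_e(t, syllable)
            pointer[t] = prev
        best = step
        backpointers.append(pointer)

    # Follow the back-pointers from the best final tag to recover the path.
    path = [max(tags, key=lambda t: best[t])]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    path.reverse()
    spaced = "".join(s + (" " if t == 1 else "") for s, t in zip(syllables, path))
    return spaced.strip()


# Toy usage with two hypothetical training sentences.
model = train(["나는 학교에 간다", "나는 집에 간다"])
print(space("나는학교에간다", model))
```

Because the transition term conditions each tag on the previous one, a decision to insert a space changes the probability of the next decision, which is the dependency the abstract says earlier statistical approaches ignored. A realistic system would need a large corpus and richer context than a single syllable per emission.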

Keywords

Automatic word spacing; Probabilistic model; Hidden Markov model
