http://dx.doi.org/10.3745/KIPSTB.2009.16-B.4.333

An Automatic Korean Word Spacing System for Devices with Low Computing Power  

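Song, Yeong-Kil (Computer and Information Communication Engineering, Kangwon National University)
Kim, Hark-Soo (Computer and Information Communication Engineering, Kangwon National University)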
Abstract
Most previous automatic word spacing systems are unsuitable for mobile devices with relatively low computing power because they require substantial system resources. We propose an automatic word spacing system for such devices that needs only modest memory and simple numerical computations. The proposed system is a two-step model consisting of a statistical system followed by a rule-based system. To reduce memory usage, the statistical system first corrects word spacing errors using a modified hidden Markov model based on character unigrams. Then, to increase accuracy, the rule-based system re-corrects the word spaces that the first step miscorrected, using lexical rules based on character bigrams or longer character n-grams. In the experiments, the proposed system achieved a relatively high accuracy of 94.14% while using only about 1 MB of memory.
Keywords
Automatic Word Spacing; Low Computing Power; Mobile Device; Two Step Model
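The two-step idea described in the abstract can be illustrated with a small sketch. This is not the authors' system: the abstract does not specify how the hidden Markov model is modified or what the lexical rules look like, so the code below assumes a plain first-order HMM with add-one smoothing over two tags (1 = a space follows the character, 0 = no space) and a hypothetical rule format that maps a character n-gram (n >= 2) to the spacing tags forced on its first n-1 characters. All names (`UnigramHmmSpacer`, `apply_rules`, `render`) are invented for illustration.

```python
import math
from collections import defaultdict


def to_chars_and_tags(spaced):
    """Split a correctly spaced sentence into characters and tags (1 = space follows)."""
    chars, tags = [], []
    for ch in spaced.strip():
        if ch == " ":
            if tags:
                tags[-1] = 1
        else:
            chars.append(ch)
            tags.append(0)
    return chars, tags


class UnigramHmmSpacer:
    """Step 1 (assumed form): first-order HMM with character-unigram emissions."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                    # add-alpha smoothing constant
        self.trans = defaultdict(float)       # (prev_tag, tag) -> count
        self.emit = defaultdict(float)        # (tag, char) -> count
        self.tag_count = defaultdict(float)   # tag -> count
        self.vocab = set()

    def train(self, sentences):
        for sentence in sentences:
            chars, tags = to_chars_and_tags(sentence)
            prev = 1  # treat the sentence start as if it followed a space
            for ch, t in zip(chars, tags):
                self.trans[(prev, t)] += 1
                self.emit[(t, ch)] += 1
                self.tag_count[t] += 1
                self.vocab.add(ch)
                prev = t

    def _log_trans(self, prev, t):
        total = self.trans[(prev, 0)] + self.trans[(prev, 1)]
        return math.log((self.trans[(prev, t)] + self.alpha) / (total + 2 * self.alpha))

    def _log_emit(self, t, ch):
        size = len(self.vocab) + 1  # one extra slot for unseen characters
        return math.log((self.emit[(t, ch)] + self.alpha) / (self.tag_count[t] + self.alpha * size))

    def tag(self, chars):
        """Viterbi decoding in log space; returns one 0/1 tag per character."""
        if not chars:
            return []
        score = [{t: self._log_trans(1, t) + self._log_emit(t, chars[0]) for t in (0, 1)}]
        back = []
        for ch in chars[1:]:
            row, ptr = {}, {}
            for t in (0, 1):
                best = max((0, 1), key=lambda p: score[-1][p] + self._log_trans(p, t))
                row[t] = score[-1][best] + self._log_trans(best, t) + self._log_emit(t, ch)
                ptr[t] = best
            score.append(row)
            back.append(ptr)
        t = max((0, 1), key=lambda k: score[-1][k])
        tags = [t]
        for ptr in reversed(back):
            t = ptr[t]
            tags.append(t)
        tags.reverse()
        return tags


def apply_rules(chars, tags, rules):
    """Step 2 (assumed form): overwrite tags wherever a lexical n-gram rule matches."""
    fixed = list(tags)
    text = "".join(chars)
    for pattern, inner in rules.items():
        start = text.find(pattern)
        while start != -1:
            fixed[start:start + len(inner)] = inner
            start = text.find(pattern, start + 1)
    return fixed


def render(chars, tags):
    """Rebuild the spaced string from characters and tags."""
    out = []
    for ch, t in zip(chars, tags):
        out.append(ch)
        if t == 1:
            out.append(" ")
    return "".join(out).strip()
```

In use, one would train `UnigramHmmSpacer` on correctly spaced sentences, call `tag` on the characters of an unspaced input, pass the result through `apply_rules`, and then `render` it. Keeping only unigram statistics in the first step is what keeps the model small; the rule table in the second step then patches the cases where unigram evidence alone is too weak.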