
Korean Word Segmentation and Compound-noun Decomposition Using Markov Chain and Syllable N-gram  

O.W. Kwon (Brain Science Research Center, Korea Advanced Institute of Science and Technology)
Abstract
Word segmentation errors introduced during text preprocessing often insert incorrect words into the recognition vocabulary and degrade the language models used for Korean large-vocabulary continuous speech recognition. We propose an automatic word segmentation algorithm that uses Markov chains and syllable-based n-gram language models to correct word segmentation errors in text corpora. We assume that a sentence is generated from a Markov chain: spaces are generated on self-transitions of the chain, and non-space characters are generated on the other transitions. The word segmentation of a sentence is then obtained by finding the maximum-likelihood path using syllable n-gram scores. In experiments, the algorithm achieved 91.58% word accuracy and 96.69% syllable accuracy when segmenting newspaper columns of 254 sentences from which all spaces had been removed. For correcting word segmentation at line breaks, the algorithm improved word accuracy from 91.00% to 96.27%, and for compound-noun decomposition it yielded a decomposition accuracy of 96.22%.
Keywords
Word segmentation; N-gram language model; Text preprocessing;
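
The maximum-likelihood search described in the abstract can be illustrated with a small sketch. It is not the paper's implementation: it assumes a syllable trigram model with add-k smoothing (the smoothing method, the constant `k`, and all function names are hypothetical choices made for this example), and it reduces the search to a two-state Viterbi pass in which each syllable is either attached to the previous syllable or preceded by a space.

```python
import math
from collections import Counter

SPACE, BOS, EOS = " ", "<s>", "</s>"

def train_trigram(sentences):
    """Count syllable trigrams (a space is a token too) in correctly spaced text."""
    tri, ctx, vocab = Counter(), Counter(), {EOS}
    for sent in sentences:
        toks = [BOS, BOS] + list(sent) + [EOS]
        vocab.update(sent)
        for a, b, c in zip(toks, toks[1:], toks[2:]):
            tri[(a, b, c)] += 1
            ctx[(a, b)] += 1
    return tri, ctx, len(vocab)

def logp(model, a, b, c, k=0.01):
    """Add-k smoothed log P(c | a, b); k is an assumed constant."""
    tri, ctx, v = model
    return math.log((tri[(a, b, c)] + k) / (ctx[(a, b)] + k * v))

def segment(text, model):
    """Viterbi search with two states per syllable:
    0 = attached to the previous syllable, 1 = preceded by a space."""
    syl = [ch for ch in text if ch != SPACE]
    if not syl:
        return ""
    neg = float("-inf")

    def left(i, state):
        """Token immediately before syllable i, given the state of syllable i."""
        if state == 1:
            return SPACE
        return syl[i - 1] if i > 0 else BOS

    score = [logp(model, BOS, BOS, syl[0]), neg]
    back = []  # back[i - 1][state] = best predecessor state for syllable i
    for i in range(1, len(syl)):
        new, ptr = [neg, neg], [0, 0]
        for s in (0, 1):
            if score[s] == neg:
                continue
            a, b = left(i - 1, s), syl[i - 1]
            join = score[s] + logp(model, a, b, syl[i])
            split = (score[s] + logp(model, a, b, SPACE)
                     + logp(model, b, SPACE, syl[i]))
            for state, cand in ((0, join), (1, split)):
                if cand > new[state]:
                    new[state], ptr[state] = cand, s
        score = new
        back.append(ptr)

    # Close the sentence, then trace the best path backwards.
    last = len(syl) - 1
    final = [score[s] + logp(model, left(last, s), syl[last], EOS)
             if score[s] > neg else neg for s in (0, 1)]
    state = 0 if final[0] >= final[1] else 1
    out = []
    for i in range(last, 0, -1):
        out.append(syl[i])
        if state == 1:
            out.append(SPACE)
        state = back[i - 1][state]
    out.append(syl[0])
    return "".join(reversed(out))

# Toy usage with Latin letters standing in for Korean syllables.
toy_model = train_trigram(["ab cd", "ab cd", "cd ab"])
print(segment("abcd", toy_model))
```

A trigram is used here because with a plain bigram each gap between syllables could be decided independently, so no path search would be needed. With a trigram, the choice made at one gap changes the context for the next, which is what the dynamic programming resolves.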