Automatic Generation of Concatenate Morphemes for Korean LVCSR  

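박영희 (Speech and Language Processing Laboratory, Department of Computer Science, Sogang University)
정민화 (Speech and Language Processing Laboratory, Department of Computer Science, Sogang University)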
Abstract
In this paper, we present a method that automatically generates concatenate-morpheme-based language models to improve the performance of Korean large vocabulary continuous speech recognition (LVCSR). The focus is on reducing recognition errors for monosyllabic morphemes, which account for 54% of the training text corpus and are misrecognized more often than other morphemes. The knowledge-based method using part-of-speech (POS) patterns has two disadvantages: the rules are difficult to write, and it produces many low-frequency concatenate morphemes. The proposed method automatically selects morpheme pairs from the training text based on measures such as frequency, mutual information, and unigram log likelihood. Experiments were performed using a 7M-morpheme text corpus and a 20K-morpheme lexicon. The frequency measure, with a constraint on the number of morphemes used for concatenation, produced the best result, reducing the share of monosyllables from 54% to 30%, bigram perplexity from 117.9 to 97.3, and MER from 21.3% to 17.6%.
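The pair-selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy romanized corpus, the "+" joiner, and the single-pass merge are all assumptions made for illustration. It scores adjacent morpheme pairs by raw frequency and by pointwise mutual information, and it leaves out the unigram log likelihood measure. Merging only two original morphemes in one pass is a crude stand-in for the constraint on the number of morphemes per concatenation.

```python
import math
from collections import Counter


def score_pairs(sentences):
    """Score adjacent morpheme pairs by frequency and pointwise mutual information."""
    unigrams = Counter()
    bigrams = Counter()
    for morphemes in sentences:
        unigrams.update(morphemes)
        bigrams.update(zip(morphemes, morphemes[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        scores[(a, b)] = {"freq": count, "pmi": math.log2(p_ab / (p_a * p_b))}
    return scores


def select_pairs(scores, measure="freq", top_k=1):
    """Return the top_k pairs under the chosen measure (ties broken alphabetically)."""
    ranked = sorted(scores, key=lambda pair: (-scores[pair][measure], pair))
    return ranked[:top_k]


def concatenate(morphemes, pairs, joiner="+"):
    """Merge selected adjacent pairs, left to right, in a single pass."""
    pairs = set(pairs)
    merged = []
    i = 0
    while i < len(morphemes):
        if i + 1 < len(morphemes) and (morphemes[i], morphemes[i + 1]) in pairs:
            merged.append(morphemes[i] + joiner + morphemes[i + 1])
            i += 2  # both morphemes consumed, so a merged unit never exceeds two
        else:
            merged.append(morphemes[i])
            i += 1
    return merged


# Hypothetical toy corpus of morpheme-segmented sentences (romanized for readability).
corpus = [
    ["hak", "gyo", "e", "ga", "n", "da"],
    ["hak", "gyo", "e", "seo", "o", "n", "da"],
    ["jip", "e", "ga", "n", "da"],
]
scores = score_pairs(corpus)
selected = select_pairs(scores, measure="freq", top_k=1)
rewritten = [concatenate(sentence, selected) for sentence in corpus]
```

In a full pipeline, the rewritten corpus would be used to retrain the n-gram language model and to extend the recognition lexicon with the merged units. Passing `measure="pmi"` ranks the same candidates by mutual information instead of frequency.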
Keywords
Concatenate morpheme models; Large vocabulary continuous speech recognition; Automatic concatenate morphemes generation; Language models