http://dx.doi.org/10.7776/ASK.2008.27.8.435

Two-Path Language Modeling Considering Word Order Structure of Korean  

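Shin, Joong-Hwi (Department of Computer Science, Korea University)
Park, Jae-Hyun (Department of Computer Science, Korea University)
Lee, Jung-Tae (Department of Computer and Radio Communications Engineering, Korea University)
Rim, Hae-Chang (Department of Computer and Radio Communications Engineering, Korea University)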
Abstract
The n-gram model is well suited to languages such as English, in which word order is grammatically rigid. It is less suitable for Korean, in which word order is relatively free. Previous work proposed a twoply HMM that reflects the characteristics of Korean but fails to capture the word-order structure among words. In this paper, we define a new segment unit that combines two words in order to capture the word-order characteristics of adjacent words that appear in verbal morphemes. We also propose a two-path language model that estimates probabilities according to the context, based on the proposed segment unit. Experimental results show that the proposed two-path language model improves perplexity by 25.68% compared with previous Korean language models, and reduces perplexity by 94.03% for the prediction of verbal morphemes where words are combined.
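To make the evaluation metric concrete, the following is a minimal sketch of how perplexity is computed for a plain bigram model. It is not the two-path model from the paper: the add-one smoothing, the function names, and the toy tokens are all assumptions chosen for illustration, and `relative_reduction` assumes that the reported percentages are relative reductions against a baseline perplexity.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count histories and bigrams over tokenized sentences (hypothetical helper)."""
    histories, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        histories.update(padded[:-1])            # every token that acts as a history
        bigrams.update(zip(padded, padded[1:]))  # adjacent (previous, current) pairs
    # Predictable events: every training token plus the end-of-sentence marker.
    vocab = {t for s in sentences for t in s} | {"</s>"}
    return histories, bigrams, vocab


def bigram_prob(prev, word, histories, bigrams, vocab):
    """P(word | prev) with add-one smoothing (an assumption, not the paper's estimator)."""
    return (bigrams[(prev, word)] + 1) / (histories[prev] + len(vocab))


def perplexity(sentences, histories, bigrams, vocab):
    """exp of the negative mean log-probability per predicted token."""
    log_sum, n = 0.0, 0
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        for prev, word in zip(padded, padded[1:]):
            log_sum += math.log(bigram_prob(prev, word, histories, bigrams, vocab))
            n += 1
    return math.exp(-log_sum / n)


def relative_reduction(baseline_ppl, new_ppl):
    """Percentage by which new_ppl is lower than baseline_ppl."""
    return 100.0 * (baseline_ppl - new_ppl) / baseline_ppl


if __name__ == "__main__":
    train = [["a", "b"], ["a", "c"]]  # toy tokens, not real Korean morphemes
    model = train_bigram(train)
    print(perplexity([["a", "b"]], *model))
    print(relative_reduction(100.0, 74.32))
```

Lower perplexity means the model assigns higher probability to the held-out text. Test tokens that never occur in training still receive a smoothed probability here, but the distribution is then no longer properly normalized, so a real evaluation would map such tokens to an unknown-word symbol first.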
Keywords
Korean; Language modeling; Verbal morpheme; Word order; Segment unit
References
1 L. Rabiner and B. Juang, "An introduction to hidden Markov models", IEEE ASSP Magazine, 3(1), 4-16, 1986
2 E. Arisoy and M. Saraclar, "Lattice extension and rescoring based approaches for LVCSR of Turkish", in INTERSPEECH, 1025-1028, 2006
3 J. Gao, H. Suzuki, and Y. Wen, "Exploiting headword dependency and predictive clustering for language modeling", in EMNLP-2002, 248-256, 2002
4 S. M. Katz, "Estimation of probabilities from sparse data for the language model component of a speech recognizer", IEEE Transactions on Acoustics, Speech and Signal Processing, 35, 400-401, 1987
5 M. Creutz, T. Hirsimaki, M. Kurimo, A. Puurula, J. Pylkkonen, V. Siivola, M. Varjokallio, E. Arisoy, M. Saraclar, and A. Stolcke, "Morph-based speech recognition and modeling of out-of-vocabulary words across languages", ACM TSLP, 5(1), 2007
6 D. Jurafsky and J. H. Martin, Speech and Language Processing (Prentice Hall, 2007), Chap.4, pp.83-121
7 P. F. Brown, S. A. Della Pietra, V. J. Della Pietra, F. Jelinek, J. D. Lafferty, R. L. Mercer, and P. S. Roossin, "A statistical approach to machine translation", Computational Linguistics, 16(2), 79-85, 1990
8 F. Jelinek, "Self-organized language modeling for speech recognition", Readings in Speech Recognition, A. Waibel and K. F. Lee, eds., Morgan Kaufmann, 450-506, 1990
9 N. Hideki, Corpus-based approaches to sentence structures (John Benjamins Pub Co., 2005), Chap.3, pp.51-76
10 O. Kwon and J. Park, "Korean large vocabulary continuous speech recognition with morpheme-based recognition units", Speech Communication, 39(3-4), 287-300, 2003
11 J.-D. Kim, H.-S. Lim, and H.-C. Rim, "Twoply HMM: A morpheme-unit part-of-speech tagging model considering the characteristics of Korean" (in Korean), Journal of the Korea Information Science Society (B), 24(12), 1502-1512, 1997
12 A. Stolcke, "SRILM-an extensible language modeling toolkit", in ICSLP-2002, 901-904, 2002