[KSCI] Korea Science Citation Index Service

http://dx.doi.org/10.7776/ASK.2008.27.8.435

Two-Path Language Modeling Considering Word Order Structure of Korean

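Shin, Joong-Hwi (Department of Computer Science, Korea University)
Park, Jae-Hyun (Department of Computer Science, Korea University)
Lee, Jung-Tae (Department of Computer and Radio Communications Engineering, Korea University)
Rim, Hae-Chang (Department of Computer and Radio Communications Engineering, Korea University)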

Publication Information

The Journal of the Acoustical Society of Korea, v.27, no.8, 2008, pp. 435-442

Abstract

The n-gram model is appropriate for languages such as English, in which word order is grammatically rigid. However, it is not suitable for Korean, in which word order is relatively free. Previous work proposed a twoply HMM that reflected the characteristics of Korean but failed to capture the word-order structure among words. In this paper, we define a new segment unit that combines two words in order to reflect the word-order characteristics of adjacent words that appear in verbal morphemes. We also propose a two-path language model that estimates probabilities according to context, based on the proposed segment unit. Experimental results show that the proposed two-path language model improves perplexity by 25.68% compared with previous Korean language models and reduces perplexity by 94.03% for the prediction of verbal morphemes where words are combined.
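
For readers unfamiliar with the metric, the following is a minimal sketch of how perplexity and a relative perplexity reduction are commonly computed for a bigram model. It is not the authors' two-path model or their segment unit. The add-one smoothing, the toy tokens, and the function names (`train_bigram`, `bigram_prob`, `perplexity`, `relative_reduction`) are all assumptions made for illustration, and reading the reported percentages as relative reductions is likewise an assumption.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count contexts and bigrams over tokenized sentences (hypothetical helper)."""
    contexts, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        contexts.update(tokens[:-1])                  # every token that acts as a history
        bigrams.update(zip(tokens[:-1], tokens[1:]))  # (history, next token) pairs
    # Vocabulary of predictable tokens: all words plus the end marker.
    vocab = {tok for sent in sentences for tok in sent} | {"</s>"}
    return contexts, bigrams, vocab


def bigram_prob(prev, word, contexts, bigrams, vocab_size):
    """Add-one smoothed estimate of P(word | prev); the smoothing choice is an assumption."""
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size)


def perplexity(sentences, contexts, bigrams, vocab):
    """exp of the average negative log-probability per predicted token."""
    log_sum, n = 0.0, 0
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for prev, word in zip(tokens[:-1], tokens[1:]):
            log_sum += math.log(bigram_prob(prev, word, contexts, bigrams, len(vocab)))
            n += 1
    return math.exp(-log_sum / n)


def relative_reduction(baseline_ppl, new_ppl):
    """Relative perplexity reduction in percent."""
    return 100.0 * (baseline_ppl - new_ppl) / baseline_ppl


if __name__ == "__main__":
    train = [["a", "b"]]  # toy corpus, not Korean data
    contexts, bigrams, vocab = train_bigram(train)
    print(perplexity(train, contexts, bigrams, vocab))
    print(relative_reduction(100.0, 74.32))
```

Under this reading, a drop from a baseline perplexity of 100.0 to 74.32 would correspond to a 25.68% reduction. The sketch assumes every test token was seen in training; handling unseen words would need an explicit unknown-word token.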

Keywords

Korean; Language modeling; Verbal morpheme; Word order; Segment unit;

References

1	L. Rabiner and B. Juang, "An introduction to hidden Markov models", IEEE ASSP Magazine, 3(1), 4-16, 1986
2	E. Arisoy and M. Saraclar, "Lattice extension and rescoring based approaches for LVCSR of Turkish", in INTERSPEECH, 1025-1028, 2006
3	J. Gao, H. Suzuki, and Y. Wen, "Exploiting headword dependency and predictive clustering for language modeling", in EMNLP-2002, 248-256, 2002
4	S. M. Katz, "Estimation of probabilities from sparse data for the language model component of a speech recognizer", IEEE Transactions on Acoustics, Speech and Signal Processing, 35, 400-401, 1987
5	M. Creutz, T. Hirsimaki, M. Kurimo, A. Puurula, J. Pylkkonen, V. Siivola, M. Varjokallio, E. Arisoy, M. Saraclar, and A. Stolcke, "Morph-based speech recognition and modeling of out-of-vocabulary words across languages", ACM TSLP, 5(1), 2007
6	D. Jurafsky and J. H. Martin, Speech and Language Processing (Prentice Hall, 2007) Chap.4, pp.83-121
7	P. F. Brown, S. A. Della Pietra, V. J. Della Pietra, F. Jelinek, J. D. Lafferty, R. L. Mercer, and P. S. Roossin, "A statistical approach to machine translation", Computational Linguistics, 16(2), 79-85, 1990
8	F. Jelinek, "Self-organized language modeling for speech recognition", Readings in Speech Recognition, A. Waibel and K. F. Lee, eds., Morgan Kaufmann, 450-506, 1990
9	N. Hideki, Corpus-based approaches to sentence structures (John Benjamins Pub Co., 2005), Chap.3, pp.51-76
10	O. Kwon and J. Park, "Korean large vocabulary continuous speech recognition with morpheme-based recognition units", Speech Communication, 39(3-4), 287-300, 2003
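11	J.-D. Kim, H.-S. Lim, and H.-C. Rim, "Twoply HMM: A morpheme-unit part-of-speech tagging model considering the characteristics of Korean" (in Korean), Journal of the Korea Information Science Society (B), 24(12), 1502-1512, 1997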
12	A. Stolcke, "SRILM-an extensible language modeling toolkit", in ICSLP-2002, 901-904, 2002

Original title (Korean): 한국어의 어순 구조를 고려한 Two-Path 언어모델링