
Clustering-based Statistical Machine Translation Using Syntactic Structure and Word Similarity  

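Kim, Han-Kyong (Department of Information Processing, Pohang University of Science and Technology)
Na, Hwi-Dong (Department of Computer Science and Engineering, Pohang University of Science and Technology)
Li, Jin-Ji (Department of Computer Science and Engineering, Pohang University of Science and Technology)
Lee, Jong-Hyeok (Department of Computer Science and Engineering, Pohang University of Science and Technology)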
Abstract
Clustering by sentence type or document genre is a technique for improving the translation quality of statistical machine translation (SMT) through domain-specific translation. However, no previous research has used sentence type and document genre information simultaneously. In this paper, we propose an integrated clustering method that classifies sentence type by syntactic structure similarity and document genre by word similarity. We interpolate the domain-specific models trained on the clusters with general models to improve the translation quality of the SMT system. A kernel function and the cosine measure are used to compute structural similarity and word similarity, respectively. Using these similarities, we cluster the data with machine learning algorithms similar to K-means. On a Japanese-English patent translation corpus, we obtained a 2.5% point relative improvement in translation quality in the best case.
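As an illustration only, the word-similarity half of such a pipeline can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`cosine`, `cluster`, `interpolate`), the bag-of-words representation, the farthest-first seeding, and the linear interpolation weight `lam` are all hypothetical choices, and the kernel-based syntactic similarity is omitted entirely.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(docs, k, iters=10):
    """K-means-style clustering of whitespace-tokenized texts by cosine similarity.

    Assumes 1 <= k <= len(docs). Returns one cluster label per document.
    """
    vecs = [Counter(d.split()) for d in docs]

    # Farthest-first seeding (an assumption): start from the first document,
    # then repeatedly add the document least similar to the seeds chosen so far.
    seeds = [0]
    while len(seeds) < k:
        nxt = min(
            (i for i in range(len(vecs)) if i not in seeds),
            key=lambda i: max(cosine(vecs[i], vecs[s]) for s in seeds),
        )
        seeds.append(nxt)
    centroids = [Counter(vecs[s]) for s in seeds]

    labels = [0] * len(vecs)
    for _ in range(iters):
        # Assignment step: each document goes to its most similar centroid.
        labels = [
            max(range(k), key=lambda c: cosine(v, centroids[c])) for v in vecs
        ]
        # Update step: a centroid is the summed word counts of its members
        # (cosine similarity ignores scale, so no averaging is needed).
        new_centroids = []
        for c in range(k):
            total = Counter()
            for v, lab in zip(vecs, labels):
                if lab == c:
                    total.update(v)
            new_centroids.append(total if total else centroids[c])
        centroids = new_centroids
    return labels


def interpolate(p_domain: float, p_general: float, lam: float = 0.5) -> float:
    """Linear interpolation of a domain-specific and a general model probability.

    The linear form and the weight `lam` are assumptions made for illustration.
    """
    return lam * p_domain + (1.0 - lam) * p_general


docs = [
    "engine valve piston engine",
    "piston engine valve",
    "polymer resin compound",
    "resin polymer compound compound",
]
labels = cluster(docs, k=2)
```

In a full system, each cluster would be used to train its own domain-specific translation and language models, and a score such as `interpolate(p_domain, p_general, lam)` would stand in for however those models are combined with the general ones; the abstract does not specify the interpolation scheme.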
Keywords
SMT (Statistical Machine Translation); Clustering; Domain-specific model; Syntactic structural similarity; Word similarity