Clustering-based Statistical Machine Translation Using Syntactic Structure and Word Similarity

Clustering-Based Statistical Machine Translation Using Sentence-Structure Similarity and Word Similarity

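  • 김한경 (Department of Information Processing, Pohang University of Science and Technology) ;
  • 나휘동 (Department of Computer Science and Engineering, Pohang University of Science and Technology) ;
  • 이금희 (Department of Computer Science and Engineering, Pohang University of Science and Technology) ;
  • 이종혁 (Department of Computer Science and Engineering, Pohang University of Science and Technology)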
  • Received : 2009.11.12
  • Accepted : 2010.02.02
  • Published : 2010.04.15

Abstract

Clustering by sentence type or document genre is a technique for improving the translation quality of statistical machine translation (SMT) through domain-specific translation. However, no previous work has used sentence-type and document-genre information simultaneously. In this paper, we propose an integrated clustering method that classifies sentence type by syntactic structure similarity and document genre by word similarity. We interpolate the domain-specific models built from each cluster with general models to improve the translation quality of the SMT system. A kernel function and the cosine measure are used to compute structural similarity and word similarity, respectively. Using these similarities, we cluster the corpus with a machine learning algorithm similar to K-means. On a Japanese-English patent translation corpus, we obtained a relative improvement of about 2.5% in translation quality in the best case.

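In statistical machine translation, one way to improve translation performance is to cluster sentences by type or genre and attempt domain-specific translation. However, no existing study has used sentence-type information and genre information at the same time. This paper integrates the two existing approaches by applying a type classification technique based on the similarity of each sentence's grammatical structure together with a genre classification method based on word similarity. We improved the performance of statistical machine translation by interpolating the domain-specific models extracted from the classified corpus with the models extracted from the whole corpus. A kernel and cosine similarity were used to compute sentence-structure similarity and word similarity, respectively, and a machine learning technique similar to the K-Means algorithm was used to classify the corpus with the two similarities. In experiments on Japanese-English patent documents, we obtained a relative performance improvement of about 2.5% in the best case.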
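The pipeline summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several parts are assumptions made for illustration only: the structural kernel is replaced by a normalized tag n-gram kernel (the abstract does not specify the kernel), the two similarities are mixed with a hypothetical weight `alpha`, the K-means-like step uses medoids as cluster representatives, and the interpolation is assumed to be linear with a hypothetical weight `lam`. All function names and the toy sentences are invented for the example.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between bag-of-words count vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def structure_similarity(tags_a, tags_b, n=2):
    """Assumed stand-in for a structural kernel: normalized count of
    shared tag n-grams (not the kernel used in the paper)."""
    def ngrams(tags):
        return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))
    a, b = ngrams(tags_a), ngrams(tags_b)
    k_ab = sum(a[g] * b[g] for g in a.keys() & b.keys())
    denom = math.sqrt(sum(v * v for v in a.values()) *
                      sum(v * v for v in b.values()))
    return k_ab / denom if denom else 0.0


def combined_similarity(s, t, alpha=0.5):
    """Mix structural and word similarity; alpha is a hypothetical weight."""
    return (alpha * structure_similarity(s["tags"], t["tags"])
            + (1 - alpha) * cosine_similarity(s["words"], t["words"]))


def cluster(sentences, k, iterations=10, alpha=0.5):
    """K-means-like loop that uses medoids as cluster representatives."""
    medoids = list(range(k))  # deterministic init: the first k sentences
    assignment = []
    for _ in range(iterations):
        # Assignment step: attach each sentence to its most similar medoid.
        assignment = [
            max(range(k),
                key=lambda c: combined_similarity(s, sentences[medoids[c]], alpha))
            for s in sentences
        ]
        # Update step: the new medoid is the member most similar to its cluster.
        new_medoids = []
        for c in range(k):
            members = [i for i, a in enumerate(assignment) if a == c]
            if not members:
                new_medoids.append(medoids[c])
                continue
            new_medoids.append(max(
                members,
                key=lambda i: sum(combined_similarity(sentences[i], sentences[j], alpha)
                                  for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return assignment


def interpolate(p_domain, p_general, lam=0.7):
    """Linear interpolation of a cluster-specific and a general model probability."""
    return lam * p_domain + (1 - lam) * p_general


# Toy data: two claim-like sentences and two instruction-like sentences.
sentences = [
    {"words": "the device comprises a sensor".split(), "tags": "DT NN VBZ DT NN".split()},
    {"words": "heat the mixture slowly".split(), "tags": "VB DT NN RB".split()},
    {"words": "the device comprises a motor".split(), "tags": "DT NN VBZ DT NN".split()},
    {"words": "stir the mixture slowly".split(), "tags": "VB DT NN RB".split()},
]
print(cluster(sentences, k=2))
print(interpolate(0.2, 0.1))
```

In a real system the interpolated quantities would be language-model or translation-model probabilities estimated per cluster and on the whole corpus, and the weights would be tuned on held-out data; the scalar version here only shows the arithmetic.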
