Optimization of a Transitive Verb-Object Collocation Dictionary Based on k-Nearest Neighbor Learning

k-최근점 학습에 기반한 타동사-목적어 연어 사전의 최적화

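  • Yu-Seop Kim (김유섭; Department of Computer Engineering, Seoul National University);
  • Byoung-Tak Zhang (장병탁; Department of Computer Engineering, Seoul National University);
  • Yung Taek Kim (김영택; Department of Computer Engineering, Seoul National University)
  • Published: 2000.03.15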

Abstract

In English-Korean machine translation, transitive verb-object collocations are used to translate English verb phrases accurately into Korean. This paper presents an algorithm that selects the correct verb translation using k-nearest neighbor learning, with a semantic distance defined over WordNet. We also present algorithms for automatically optimizing the collocation dictionary. These algorithms extract transitive verb-object pairs from large corpora as training examples and then minimize the example set, taking into account the tradeoff between translation accuracy and the number of examples. In experiments on the verb 'build', the algorithms optimized the collocation dictionary while keeping translation accuracy at about 90%.
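
The verb selection step can be illustrated with a minimal sketch. The abstract does not give the exact distance formula, so the sketch assumes a simple edge-counting distance through the lowest common ancestor. The hypernym links, the training examples, and the names `HYPERNYM`, `EXAMPLES`, `semantic_distance`, and `translate_verb` are all hypothetical; a small hand-made hierarchy stands in for WordNet, and the romanized labels stand for three Korean renderings of 'build' (jitda 짓다, geonseolhada 건설하다, ssata 쌓다).

```python
from collections import Counter

# Hypothetical toy hypernym links (child -> parent) standing in for WordNet.
HYPERNYM = {
    "cottage": "house",
    "house": "building",
    "building": "structure",
    "bridge": "structure",
    "structure": "artifact",
    "artifact": "entity",
    "rapport": "feeling",
    "trust": "feeling",
    "confidence": "feeling",
    "feeling": "state",
    "state": "entity",
}

# Hypothetical training examples for the verb 'build':
# (object noun, romanized Korean translation of the verb).
EXAMPLES = [
    ("house", "jitda"),
    ("building", "jitda"),
    ("bridge", "geonseolhada"),
    ("trust", "ssata"),
    ("confidence", "ssata"),
]


def path_to_root(word):
    """Return the chain of hypernyms from the word up to the root."""
    path = [word]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path


def semantic_distance(a, b):
    """Assumed metric: edges from a to b through their lowest common ancestor."""
    depth_a = {node: i for i, node in enumerate(path_to_root(a))}
    for j, node in enumerate(path_to_root(b)):
        if node in depth_a:
            return depth_a[node] + j
    return float("inf")  # no shared ancestor in the toy hierarchy


def translate_verb(obj, examples, k=3):
    """Majority vote among the k training objects closest to obj."""
    nearest = sorted(examples, key=lambda ex: semantic_distance(obj, ex[0]))[:k]
    votes = Counter(translation for _, translation in nearest)
    return votes.most_common(1)[0][0]


print(translate_verb("cottage", EXAMPLES))
print(translate_verb("rapport", EXAMPLES))
```

Neither "cottage" nor "rapport" appears among the training objects; each is translated by the vote of its nearest neighbors in the hierarchy, which is the generalization that a plain lookup in a collocation dictionary cannot provide.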

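In English-Korean machine translation, the collocation between a transitive verb and its object is generally used to translate the verb phrase of an English sentence accurately into Korean. This paper presents an algorithm that selects a verb translation by applying k-nearest neighbor learning to collocations; for k-nearest neighbor learning, a semantic distance is defined over WordNet. To construct a dictionary for use in a real-time translation system, we also present an algorithm that extracts transitive verb-object pairs from a corpus to build training examples and optimizes the size of this example set in relation to translation accuracy. Using these algorithms, we optimized the size of the dictionary while keeping the translation accuracy for the verb 'build' at about 90%.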
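
The tradeoff between dictionary size and translation accuracy can be sketched as a greedy pruning loop. The abstract does not describe the paper's actual minimization procedure, so this is an assumed stand-in: an example is dropped whenever accuracy on a held-out set stays at or above a threshold. The data, the root-to-word paths in `PATHS`, and the names `distance`, `knn_translate`, `accuracy`, and `prune_dictionary` are hypothetical.

```python
from collections import Counter

# Hypothetical root-to-word hypernym paths standing in for WordNet lookups.
PATHS = {
    "house": ("entity", "artifact", "structure", "building", "house"),
    "building": ("entity", "artifact", "structure", "building"),
    "cottage": ("entity", "artifact", "structure", "building", "house", "cottage"),
    "hut": ("entity", "artifact", "structure", "building", "house", "hut"),
    "bridge": ("entity", "artifact", "structure", "bridge"),
    "dam": ("entity", "artifact", "structure", "dam"),
    "viaduct": ("entity", "artifact", "structure", "bridge", "viaduct"),
    "trust": ("entity", "state", "feeling", "trust"),
    "confidence": ("entity", "state", "feeling", "confidence"),
    "rapport": ("entity", "state", "feeling", "rapport"),
}

# Hypothetical (object, translation of 'build') pairs; labels are romanized.
EXAMPLES = [
    ("house", "jitda"), ("building", "jitda"), ("cottage", "jitda"),
    ("bridge", "geonseolhada"), ("dam", "geonseolhada"),
    ("trust", "ssata"), ("confidence", "ssata"),
]
VALIDATION = [("hut", "jitda"), ("viaduct", "geonseolhada"), ("rapport", "ssata")]


def distance(a, b):
    """Assumed metric: edges between a and b via their deepest shared ancestor."""
    pa, pb = PATHS[a], PATHS[b]
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return len(pa) + len(pb) - 2 * common


def knn_translate(obj, examples, k=1):
    """Majority vote among the k stored examples closest to obj."""
    nearest = sorted(examples, key=lambda ex: distance(obj, ex[0]))[:k]
    return Counter(t for _, t in nearest).most_common(1)[0][0]


def accuracy(examples, validation, k=1):
    """Fraction of validation pairs translated correctly."""
    hits = sum(knn_translate(o, examples, k) == t for o, t in validation)
    return hits / len(validation)


def prune_dictionary(examples, validation, min_accuracy=0.9, k=1):
    """Greedily drop examples while validation accuracy stays >= min_accuracy."""
    kept = list(examples)
    for ex in list(examples):
        if len(kept) <= k:  # keep at least k examples for the vote
            break
        trial = [e for e in kept if e != ex]
        if accuracy(trial, validation, k) >= min_accuracy:
            kept = trial
    return kept


pruned = prune_dictionary(EXAMPLES, VALIDATION)
print(len(EXAMPLES), "->", len(pruned), "examples;",
      "accuracy", accuracy(pruned, VALIDATION))
```

On this toy data the seven examples shrink to three, one per translation, without lowering validation accuracy. Lowering `min_accuracy` trades accuracy for a smaller dictionary, which matters when the dictionary is consulted by a real-time translation system.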
