Automatic Extraction of English-Chinese Transliteration Pairs using Dynamic Window and Tokenizer

동적 윈도우와 토크나이저를 이용한 영-중 음차표기 대역쌍 자동 추출

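  • 김성국 (Information Research Laboratories, Pohang University of Science and Technology)
  • 나승훈 (Department of Computer Science and Engineering, Pohang University of Science and Technology)
  • 김동일 (Department of Computer Science and Engineering, Yanbian University of Science and Technology, China)
  • 이종혁 (Department of Computer Science and Engineering, Pohang University of Science and Technology)
  • Published: 2007.11.15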

Abstract

Recently, many studies have focused on extracting transliteration pairs from bilingual texts, and most of them rely on a statistical transliteration model. This paper discusses the limitations of these earlier approaches and proposes two techniques, dynamic window and tokenizer, to overcome them. Experimental results show average word and character precision of 99.0% and 99.78%, respectively.

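As the growth of the Internet has made it possible to build large collections of bilingual documents, research on extracting transliteration pairs from such language resources has become active. Most of this research is based on statistical transliteration models. This paper analyzes the problems of existing statistical transliteration models and proposes the dynamic window and tokenizer techniques, which achieve a word precision of about 99%, an improvement of roughly 23% over existing statistical transliteration models.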
