Building a Korean-English Parallel Corpus by Measuring Sentence Similarities Using Sequential Matching of Language Resources and Topic Modeling

  • Juryong Cheon (Dept. of Computer Engineering, Dong-A University) ;
  • Youngjoong Ko (Dept. of Computer Engineering, Dong-A University)
  • Received : 2015.03.17
  • Accepted : 2015.05.20
  • Published : 2015.07.15

Abstract

In this paper, we propose a method for building a Korean-English parallel corpus from Wikipedia by measuring sentence similarity through sequential matching of language resources and topic modeling. First, the language resources (a Wiki-dictionary constructed from Wikipedia article titles, numbers, and the Daum online dictionary) are applied sequentially for word matching. To exploit the characteristics of Wikipedia, translation probabilities estimated from the Wiki-dictionary are additionally used in word matching. Furthermore, word distributions extracted from a topic model are incorporated into the similarity computation to improve accuracy. In our experiments, a previous method that linearly combines only the language resources achieved an F1-score of 48.4%, and combining it with a topic model over all word distributions achieved 51.6%. In contrast, our proposed sequential matching with translation probabilities added to the language resources achieved 58.3%, a 9.9% improvement over the previous study, and further applying a topic model restricted to important word distributions achieved 59.1%, a 7.5% improvement.
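The abstract above describes the core idea: try a cascade of language resources in a fixed order when matching words, weight Wiki-dictionary hits by an estimated translation probability, and interpolate the resulting lexical score with a topic-model similarity. The following Python sketch illustrates that sequential-matching scheme under stated assumptions; the resource contents, function names, and the interpolation weight `alpha` are all hypothetical, not the paper's actual implementation.

```python
def match_word(ko_word, en_words, wiki_dict, online_dict):
    """Score one Korean word against an English word set, trying
    resources sequentially and stopping at the first hit."""
    # 1) Wiki-dictionary: candidate translations with estimated
    #    translation probabilities (from Wikipedia titles).
    for en_word, prob in wiki_dict.get(ko_word, []):
        if en_word in en_words:
            return prob
    # 2) Numbers match directly across the two languages.
    if ko_word.isdigit() and ko_word in en_words:
        return 1.0
    # 3) Fall back to a general online dictionary (no probabilities).
    for en_word in online_dict.get(ko_word, []):
        if en_word in en_words:
            return 1.0
    return 0.0

def sentence_similarity(ko_tokens, en_tokens, wiki_dict, online_dict,
                        topic_sim=0.0, alpha=0.8):
    """Interpolate lexical match coverage with a topic-model
    similarity score (alpha is an illustrative weight)."""
    en_set = set(en_tokens)
    matched = sum(match_word(w, en_set, wiki_dict, online_dict)
                  for w in ko_tokens)
    lexical = matched / max(len(ko_tokens), 1)
    return alpha * lexical + (1 - alpha) * topic_sim
```

In this sketch the cascade order (Wiki-dictionary, numbers, online dictionary) follows the abstract; how the paper actually computes the topic-model similarity term and tunes the combination is not specified here, so `topic_sim` is left as an input.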


Funding Information

Funding agency : National Research Foundation of Korea

References

  1. Wolfgang Teubert, "Comparable or parallel corpora?," International Journal of Lexicography, Vol. 9, No. 3, pp. 238-264, 1996. https://doi.org/10.1093/ijl/9.3.238
  2. Sunghyun Kim, Seon Yang and Youngjoong Ko, "Extracting Korean-English Parallel Sentences from Wikipedia," Journal of KIISE: Software and Applications, pp. 580-585, 2014.
  3. Dragos Stefan Munteanu and Daniel Marcu, "Improving machine translation performance by exploiting non-parallel corpora," Computational Linguistics, Vol. 31, No. 4, pp. 477-504, 2005.
  4. Tao Tao and ChengXiang Zhai, "Mining comparable bilingual text corpora for cross-language information integration," Proc. of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining (KDD 2005), pp. 691-696, 2005.
  5. Jessica C. Ramirez and Yuji Matsumoto, "A Rule-Based Approach for Aligning Japanese-Spanish Sentences from a Comparable Corpora," arXiv preprint arXiv:1211.4488, 2012.
  6. Masao Utiyama and Hitoshi Isahara, "Reliable measures for aligning Japanese-English news articles and sentences," Proc. of ACL '03, pp. 72-79, 2003.
  7. Sisay Fissaha Adafre and Maarten de Rijke, "Finding similar sentences across multiple languages in Wikipedia," Proc. of ACL '06, pp. 62-69, 2006.
  8. David M. Blei, Andrew Y. Ng and Michael I. Jordan, "Latent Dirichlet allocation," The Journal of Machine Learning Research, Vol. 3, pp. 993-1022, 2003.
  9. Zede Zhu, Miao Li, Lei Chen and Zhenxin Yang, "Building Comparable Corpora Based on Bilingual LDA Model," Proc. of ACL '13, pp. 278-282, 2013.
  10. Ferhan Ture and Jimmy Lin, "Why not grab a free lunch?: mining large corpora for parallel sentences to improve translation modeling," Proc. of NAACL-HLT 2012, pp. 626-630, 2012.
  11. Mallet toolkit, [Online]. Available: http://mallet.cs.umass.edu/download.php
  12. GIZA++ statistical translation models toolkit, [Online]. Available: http://code.google.com/p/giza-pp/