[KSCI] Korea Science Citation Index Service

Extracting Korean-English Parallel Sentences from Wikipedia

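Kim, Sung-Hyun (Department of Computer Engineering, Dong-A University)
Yang, Seon (Department of Computer Engineering, Dong-A University)
Ko, Youngjoong (Department of Computer Engineering, Dong-A University)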

Publication Information

Journal of KIISE: Software and Applications / v.41, no.8, 2014, pp. 580-585

Abstract

This paper reports a range of experiments on extracting Korean-English parallel sentences from Wikipedia data, drawing on methods previously proposed for other languages. We use two approaches. The first uses translation probabilities extracted from existing resources such as the Sejong parallel corpus. The second uses dictionaries such as a Wiki dictionary consisting of Wikipedia titles, and MRDs (machine-readable dictionaries). Experimental results show that a system using Wikipedia data improves significantly over one using only the existing resources, reaching an F1-score of 57.6%. We additionally conduct experiments using a topic model. Although this approach yields a lower F1-score of 51.6%, it appears to merit further study.
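The abstract does not spell out how candidate sentence pairs are scored, so the following is only a minimal sketch of what a dictionary-based approach could look like, together with the F1 measure used to report results. The function names, the toy dictionary, and the 0.5 threshold are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Set, Tuple


def dictionary_overlap(ko_tokens: List[str],
                       en_tokens: List[str],
                       bilingual_dict: Dict[str, Set[str]]) -> float:
    """Fraction of Korean tokens that have at least one dictionary
    translation present in the English sentence (hypothetical score)."""
    if not ko_tokens:
        return 0.0
    en_set = set(en_tokens)
    hits = sum(1 for tok in ko_tokens
               if bilingual_dict.get(tok, set()) & en_set)
    return hits / len(ko_tokens)


def f1_score(predicted: Set[Tuple[int, int]],
             gold: Set[Tuple[int, int]]) -> float:
    """Harmonic mean of precision and recall over extracted sentence pairs."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy dictionary, standing in for entries derived from Wikipedia titles (assumed).
toy_dict = {"서울": {"seoul"}, "한국": {"korea"}, "수도": {"capital"}}

ko_sentence = ["서울", "은", "한국", "의", "수도"]
en_sentence = ["seoul", "is", "the", "capital", "of", "korea"]

score = dictionary_overlap(ko_sentence, en_sentence, toy_dict)
is_parallel = score >= 0.5  # threshold chosen arbitrarily for illustration

# Pairs are (Korean sentence index, English sentence index).
example_f1 = f1_score({(0, 0), (1, 2)}, {(0, 0), (1, 1)})
```

In this sketch a pair is accepted when enough Korean tokens find a dictionary translation on the English side. A translation-probability variant would replace the set lookup with a sum or product of per-word probabilities.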

Keywords

Wikipedia; parallel sentence; comparable corpus; wiki dictionary; translation probability; topic model

