http://dx.doi.org/10.14400/JDC.2020.18.6.271

A Study on the Performance Improvement of Machine Translation Using Public Korean-English Parallel Corpus  

Park, Chanjun (Department of Computer Science and Engineering, Korea University)
Lim, Heuiseok (Department of Computer Science and Engineering, Korea University)
Publication Information
Journal of Digital Convergence, v.18, no.6, 2020, pp. 271-277
Abstract
Machine translation refers to software that translates a source language into a target language. Research in the field has progressed from rule-based and statistical machine translation to Neural Machine Translation, which is now being actively studied. One of the important factors in Neural Machine Translation is obtaining a high-quality parallel corpus, yet high-quality parallel corpora for Korean language pairs have been difficult to find. Recently, the AI HUB of the National Information Society Agency (NIA) released a high-quality Korean-English parallel corpus of 1.6 million sentence pairs. This paper verifies the quality of this data through a performance comparison between the data published by AI Hub and OpenSubtitles, the most widely used Korean-English parallel corpus. For test data, objectivity was secured by using the test set published by IWSLT, the official test set for Korean-English machine translation. Experimental results show better performance than existing papers tested on the same test set, demonstrating the importance of high-quality data.
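The comparison described in the abstract can be outlined with standard open-source tools. The sketch below is an illustrative assumption, not the paper's actual pipeline (the toolkit, file names, and hyperparameters are not specified in this record): it tokenizes a Korean-English parallel corpus into subword units with SentencePiece (reference 4) and scores detokenized system output against the IWSLT references with BLEU (reference 5). File names such as aihub.ko and iwslt_test.en are hypothetical placeholders.

```python
# Hypothetical sketch of the experimental pipeline: subword tokenization with
# SentencePiece and BLEU scoring with sacrebleu. All paths and settings are
# placeholders, not the configuration used in the paper.
import sentencepiece as spm
import sacrebleu

# Train a shared subword model on the parallel corpus (e.g. AI Hub or OpenSubtitles).
spm.SentencePieceTrainer.train(
    input="aihub.ko,aihub.en",       # hypothetical training files
    model_prefix="spm_koen",
    vocab_size=32000,
    character_coverage=0.9995,       # high character coverage for Korean script
)

sp = spm.SentencePieceProcessor(model_file="spm_koen.model")
print(sp.encode("안녕하세요", out_type=str))  # subword pieces for a sample sentence

# After training a Transformer NMT model on the tokenized corpus (not shown),
# compare its detokenized hypotheses against the IWSLT Korean-English test set.
with open("hypotheses.en") as f:     # hypothetical system output
    hyps = [line.strip() for line in f]
with open("iwslt_test.en") as f:     # hypothetical reference translations
    refs = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(hyps, [refs])
print(f"BLEU = {bleu.score:.2f}")
```

Using a shared subword vocabulary and corpus-level BLEU on a fixed public test set is one common way to keep the corpus comparison objective, which is the role the IWSLT test set plays in this study.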
Keywords
Machine Translation; Public Data; Parallel Corpus; Transformer; Neural Machine Translation;
Reference
1 Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
2 Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 9.
3 Sennrich, R., Haddow, B., & Birch, A. (2015). Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.
4 Kudo, T., & Richardson, J. (2018). Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226.
5 Papineni, K., Roukos, S., Ward, T., & Zhu, W. J. (2002, July). BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics (pp. 311-318). Association for Computational Linguistics.
6 Edunov, S., Ott, M., Auli, M., & Grangier, D. (2018). Understanding back-translation at scale. arXiv preprint arXiv:1808.09381.
7 Currey, A., Miceli-Barone, A. V., & Heafield, K. (2017, September). Copied monolingual data improves low-resource neural machine translation. In Proceedings of the Second Conference on Machine Translation (pp. 148-156).
8 Koehn, P., Och, F. J., & Marcu, D. (2003, May). Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1 (pp. 48-54). Association for Computational Linguistics.
9 Cho, K., Van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
10 Yamada, K., & Knight, K. (2001, July). A syntax-based statistical translation model. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (pp. 523-530).
11 Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
12 Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in neural information processing systems (pp. 3104-3112).
13 Kalchbrenner, N., Espeholt, L., Simonyan, K., Oord, A. V. D., Graves, A., & Kavukcuoglu, K. (2016). Neural machine translation in linear time. arXiv preprint arXiv:1610.10099.
14 Liu, Y., Gu, J., Goyal, N., Li, X., Edunov, S., Ghazvininejad, M., ... & Zettlemoyer, L. (2020). Multilingual denoising pre-training for neural machine translation. arXiv preprint arXiv:2001.08210.
15 Gehring, J., Auli, M., Grangier, D., Yarats, D., & Dauphin, Y. N. (2017, August). Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70 (pp. 1243-1252). JMLR. org.
16 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998-6008).
17 Luong, M. T., Pham, H., & Manning, C. D. (2015). Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.
18 Lample, G., & Conneau, A. (2019). Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291.
19 Song, K., Tan, X., Qin, T., Lu, J., & Liu, T. Y. (2019). Mass: Masked sequence to sequence pre-training for language generation. arXiv preprint arXiv:1905.02450.
20 Jeong, Y. J., Park, C. E., Lee, C. K., & Kim, J. S. (2019). English-Korean Neural Machine Translation using MASS. The 31st Annual Conference on Human & Cognitive Language Technology.
21 Xu, G., Ko, Y., & Seo, J. (2019). Improving Low-resource Machine Translation by Utilizing Multilingual, Out-domain Resources. KIISE, 46(1), pp. 649-651.
22 Lee, J. H., Kim, B. S., Xu, G., Ko, Y., & Seo, J. Y. (2018). English-Korean Neural Machine Translation using Subword Units. KIISE 2018, 586-588.
23 Xu, G., Ko, Y., & Seo, J. (2018). Expanding Korean/English Parallel Corpora using Back-translation for Neural Machine Translation. Annual Conference on Human and Language Technology.