http://dx.doi.org/10.15207/JKCS.2021.12.5.001

Filter-mBART Based Neural Machine Translation Using Parallel Corpus Filtering  

Moon, Hyeonseok (Department of Computer Science and Engineering, Korea University)
Park, Chanjun (Department of Computer Science and Engineering, Korea University)
Eo, Sugyeong (Department of Computer Science and Engineering, Korea University)
Park, JeongBae (Department of Human Inspired AI Research, Korea University)
Lim, Heuiseok (Department of Computer Science and Engineering, Korea University)
Publication Information
Journal of the Korea Convergence Society, vol. 12, no. 5, 2021, pp. 1-7
Abstract
In the current trend of machine translation research, a model is pretrained on a large monolingual corpus and then finetuned on a parallel corpus. Although many studies increase the amount of data used in the pretraining stage, simply adding more data does not necessarily improve machine translation performance. In this study, through experiments on the mBART model with parallel corpus filtering, we show that high-quality data can yield better machine translation performance even when a smaller amount of data is used. We argue that data quality matters more than data quantity, and that this finding can serve as a guideline for building a training corpus.
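The abstract describes filtering a parallel corpus before finetuning, but does not list the exact filtering criteria here. Below is a minimal, illustrative sketch of rule-based parallel corpus filtering in Python; the heuristics (duplicate removal, a maximum sentence length, and a source/target length-ratio threshold) and the parameter values are assumptions for illustration, not the paper's actual pipeline.

```python
# Illustrative rule-based parallel corpus filtering (assumed heuristics,
# not the paper's exact criteria).
def filter_parallel_corpus(pairs, max_len=175, max_ratio=1.5):
    """Keep only sentence pairs that pass simple quality heuristics."""
    seen = set()
    kept = []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        # Drop empty or duplicated pairs.
        if not src or not tgt or (src, tgt) in seen:
            continue
        seen.add((src, tgt))
        src_len, tgt_len = len(src.split()), len(tgt.split())
        # Drop overly long sentences.
        if src_len > max_len or tgt_len > max_len:
            continue
        # Drop pairs with an implausible source/target length ratio.
        if max(src_len, tgt_len) / max(1, min(src_len, tgt_len)) > max_ratio:
            continue
        kept.append((src, tgt))
    return kept


if __name__ == "__main__":
    sample = [
        ("I ate an apple .", "나는 사과를 먹었다 ."),          # kept
        ("", "빈 문장"),                                      # dropped: empty source
        ("Hello", "이것은 매우 매우 매우 긴 번역입니다 정말로 길어요"),  # dropped: length ratio
    ]
    print(filter_parallel_corpus(sample))
```

In practice, the filtered corpus would then be used to finetune a pretrained multilingual model such as mBART (e.g., via fairseq), which is the setup the paper evaluates.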
Keywords
Deep Learning; Natural Language Processing; Machine Translation; Parallel Corpus Filtering; Pretrained Model