http://dx.doi.org/10.15207/JKCS.2021.12.5.001

Filter-mBART Based Neural Machine Translation Using Parallel Corpus Filtering  

Moon, Hyeonseok (Department of Computer Science and Engineering, Korea University)
Park, Chanjun (Department of Computer Science and Engineering, Korea University)
Eo, Sugyeong (Department of Computer Science and Engineering, Korea University)
Park, JeongBae (Department of Human Inspired AI Research, Korea University)
Lim, Heuiseok (Department of Computer Science and Engineering, Korea University)
Publication Information
Journal of the Korea Convergence Society, vol. 12, no. 5, 2021, pp. 1-7
Abstract
In the current trend of machine translation research, a model is pretrained on a large monolingual corpus and then finetuned on a parallel corpus. Although many studies increase the amount of data used in the pretraining stage, simply adding more data does not necessarily improve machine translation performance. In this study, through experiments on the mBART model with parallel corpus filtering, we show that high-quality data can yield better machine translation performance even when a smaller amount of data is used. We argue that data quality matters more than data quantity, and that this finding can serve as a guideline for building a training corpus.
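The abstract describes filtering a parallel corpus before finetuning, but does not list the exact filtering criteria here. Below is a minimal, illustrative sketch of rule-based parallel corpus filtering in Python; the heuristics (duplicate removal, a maximum sentence length, and a source/target length-ratio threshold) and the parameter values are assumptions for illustration, not the paper's actual pipeline.

```python
# Illustrative rule-based parallel corpus filtering (assumed heuristics,
# not the paper's exact criteria).
def filter_parallel_corpus(pairs, max_len=175, max_ratio=1.5):
    """Keep only sentence pairs that pass simple quality heuristics."""
    seen = set()
    kept = []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        # Drop empty or duplicated pairs.
        if not src or not tgt or (src, tgt) in seen:
            continue
        seen.add((src, tgt))
        src_len, tgt_len = len(src.split()), len(tgt.split())
        # Drop overly long sentences.
        if src_len > max_len or tgt_len > max_len:
            continue
        # Drop pairs with an implausible source/target length ratio.
        if max(src_len, tgt_len) / max(1, min(src_len, tgt_len)) > max_ratio:
            continue
        kept.append((src, tgt))
    return kept


if __name__ == "__main__":
    sample = [
        ("I ate an apple .", "나는 사과를 먹었다 ."),          # kept
        ("", "빈 문장"),                                      # dropped: empty source
        ("Hello", "이것은 매우 매우 매우 긴 번역입니다 정말로 길어요"),  # dropped: length ratio
    ]
    print(filter_parallel_corpus(sample))
```

In practice, the filtered corpus would then be used to finetune a pretrained multilingual model such as mBART (e.g., via fairseq), which is the setup the paper evaluates.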
Keywords
Deep Learning; Natural Language Processing; Machine Translation; Parallel Corpus Filtering; Pretrained Model