Acknowledgments
This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2018-0-01405) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation), and by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education (NRF-2021R1A6A1A03045425).
References
- Y. Liu et al. (2019). RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
- J. Devlin, M. W. Chang, K. Lee & K. Toutanova. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- R. Sennrich, B. Haddow & A. Birch. (2015). Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909. DOI: 10.18653/v1/P16-1162
- Y. Wu et al. (2016). Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
- T. Kudo & J. Richardson. (2018). SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226.
- M. Kim, Y. Kim, Y. Lim & E. N. Huh. (2019, July). Advanced subword segmentation and interdependent regularization mechanisms for Korean language understanding. In 2019 Third World Conference on Smart Trends in Systems Security and Sustainability (WorldS4) (pp. 221-227). London, UK. DOI: 10.1109/WorldS4.2019.8903977
- O. Kwon, D. Kim, S. R. Lee, J. Choi & S. Lee. (2021, April). Handling Out-Of-Vocabulary Problem in Hangeul Word Embeddings. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume (pp. 3213-3221). DOI: 10.18653/v1/2021.eacl-main.280
- S. Park, J. Byun, S. Baek, Y. Cho & A. Oh. (2018, July). Subword-level word vector representations for Korean. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 2429-2438). DOI: 10.18653/v1/P18-1226
- S. Lee, H. Jang, Y. Baik, S. Park & H. Shin. (2020). KR-BERT: A small-scale Korean-specific language model. arXiv preprint arXiv:2008.03979. DOI: 10.5626/jok.2020.47.7.682
- A. Matteson, C. Lee, Y. Kim & H. S. Lim. (2018, August). Rich character-level information for Korean morphological analysis and part-of-speech tagging. In Proceedings of the 27th International Conference on Computational Linguistics (pp. 2482-2492).
- S. Moon & N. Okazaki. (2020, May). Jamo Pair Encoding: Subcharacter Representation-based Extreme Korean Vocabulary Compression for Efficient Subword Tokenization. In Proceedings of the 12th Language Resources and Evaluation Conference (pp. 3490-3497). Marseille, France.
- D. B. Cho, H. Y. Lee & S. S. Kang. (2021). An Empirical Study of Korean Sentence Representation with Various Tokenizations. Electronics, 10(7), 845. DOI: 10.3390/electronics10070845
- K. Park, J. Lee, S. Jang & D. Jung. (2020). An Empirical Study of Tokenization Strategies for Various Korean NLP Tasks. arXiv preprint arXiv:2010.02534.
- T. Kudo. (2018). Subword regularization: Improving neural network translation models with multiple subword candidates. arXiv preprint arXiv:1804.10959. DOI: 10.18653/v1/P18-1007
- E. F. Sang & F. De Meulder. (2003). Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. arXiv preprint cs/0306050.
- S. Park et al. (2021). KLUE: Korean Language Understanding Evaluation. arXiv preprint arXiv:2105.09680.
- J. Ham, Y. J. Choe, K. Park, I. Choi & H. Soh. (2020). KorNLI and KorSTS: New benchmark datasets for Korean natural language understanding. arXiv preprint arXiv:2004.03289. DOI: 10.18653/v1/2020.findings-emnlp.39
- M. Ott et al. (2019). fairseq: A fast, extensible toolkit for sequence modeling. arXiv preprint arXiv:1904.01038. DOI: 10.18653/v1/N19-4009