Data augmentation methods for classifying Korean texts

  • Jihyun Jeon (2nd Credit Bureau Division, NICE Information Service) ;
  • Yoonsuh Jung (Department of Statistics, Korea University)
  • Received : 2024.04.13
  • Accepted : 2024.07.17
  • Published : 2024.10.31

Abstract

Data augmentation is widely adopted in computer vision. In contrast, research on data augmentation in natural language processing has been limited. We propose several data augmentation methods to support the classification of Korean texts. These methods adopt and adjust existing data augmentation techniques for English texts, increasing the size and diversity of text data in ways specifically tailored to Korean. With them, we could improve classification accuracy and, in some cases, regularize the natural language models to reduce overfitting. Our contribution to data augmentation for Korean texts consists of three parts: 1) data augmentation with spelling correction, 2) easy data augmentation based on part-of-speech tagging, and 3) data augmentation with conditional masked language modeling. Our experiments show that classification accuracy can be improved with the aid of the proposed methods. Due to limited computing facilities, we consider only rather small-scale Korean texts.
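
To make the first of these concrete, a minimal sketch of the spelling-correction augmentation (DA-SC) is given below. It is an illustration under our own assumptions rather than the authors' implementation: it assumes the third-party py-hanspell package for Korean spell checking and simply adds the corrected version of each training sentence as an additional example with the same label whenever the correction changes the text.

```python
# Minimal DA-SC sketch: augment training data with spell-corrected copies.
# Assumes the py-hanspell package is available; any Korean spell checker
# with a similar interface could be substituted.
from hanspell import spell_checker

def augment_with_spelling_correction(sentences, labels):
    """Return original (sentence, label) pairs plus corrected copies that differ."""
    aug_sentences, aug_labels = list(sentences), list(labels)
    for text, label in zip(sentences, labels):
        corrected = spell_checker.check(text).checked  # corrected sentence string
        if corrected != text:                          # keep only genuinely new variants
            aug_sentences.append(corrected)
            aug_labels.append(label)                   # correction preserves the label
    return aug_sentences, aug_labels

# Hypothetical usage with a small labeled sample:
# texts, ys = ["영화 진짜 재밌엇다", "기대햇는데 별로엿다"], [1, 0]
# aug_texts, aug_ys = augment_with_spelling_correction(texts, ys)
```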

Data augmentation increases the size and diversity of the training data by transforming it and is used as a means of regularizing against overfitting. Unlike computer vision, where it is actively studied, research on data augmentation in natural language processing remains rather limited, and studies on Korean data in particular are extremely scarce. This paper proposes augmentation methods to improve classification performance on small-scale Korean text data. We propose three approaches: 1) data augmentation via spelling correction (DA-SC), 2) easy data augmentation based on morphological analysis (EDA-POS), and 3) data augmentation based on a conditional masked language model (DA-cMLM). Through analyses of real data, we show that applying the proposed augmentation methods can improve classification performance.
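
The remaining two methods can be sketched in a similarly simplified fashion. For EDA-POS, the sketch below uses KoNLPy's Okt morphological analyzer (an assumption; any Korean tagger would do) and restricts the easy-data-augmentation operations by part of speech, e.g. random deletion only of particles and random swap only of nouns, so that the class label of the sentence is unlikely to change.

```python
# Simplified EDA-POS sketch: EDA operations guided by part-of-speech tags.
# The tagger (KoNLPy Okt) and the choice of deletable/swappable tags are
# illustrative assumptions, not the paper's exact configuration.
import random
from konlpy.tag import Okt

okt = Okt()

def eda_pos(sentence, p_delete=0.1, n_swaps=1):
    tokens = okt.pos(sentence)  # list of (morpheme, POS-tag) pairs

    # Random deletion: drop only particles (Josa), which rarely alter the meaning.
    kept = [(w, t) for (w, t) in tokens
            if t != "Josa" or random.random() > p_delete]

    # Random swap: exchange the positions of two noun morphemes.
    noun_idx = [i for i, (_, t) in enumerate(kept) if t == "Noun"]
    for _ in range(n_swaps):
        if len(noun_idx) >= 2:
            i, j = random.sample(noun_idx, 2)
            kept[i], kept[j] = kept[j], kept[i]

    # Rejoin morphemes; original spacing is not restored in this sketch.
    return " ".join(w for w, _ in kept)
```

For DA-cMLM, some tokens of a training sentence are masked and a pretrained Korean masked language model fills them in; the conditional variant additionally conditions the model on the class label so that the substitutions remain label-consistent. The sketch below omits the label conditioning for brevity and uses Hugging Face's fill-mask pipeline with a Korean BERT checkpoint as an assumed stand-in.

```python
# Simplified (unconditional) mask-and-fill sketch of DA-cMLM.
# The checkpoint name is an assumption; the paper's variant additionally
# conditions the masked language model on the class label.
import random
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="klue/bert-base")
MASK = fill_mask.tokenizer.mask_token  # "[MASK]" for BERT-style models

def mlm_augment(sentence, p_mask=0.15):
    words = sentence.split()
    out = []
    for i, w in enumerate(words):
        if random.random() < p_mask:
            masked = " ".join(words[:i] + [MASK] + words[i + 1:])
            # Take the highest-scoring replacement; subword artifacts are ignored here.
            w = fill_mask(masked)[0]["token_str"].strip()
        out.append(w)
    return " ".join(out)
```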

Acknowledgement

Jung's work has been partially supported by a National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (2022R1F1A1071126) and by a Korea University Grant (K2305251).
