A Text Augmentation Framework through Generation and Selection

TAGS: Text Augmentation with Generation and Selection

  • 김경민 (Department of Computer Engineering, Korea University of Technology and Education) ;
  • 김동환 (School of Computer Engineering, Korea University of Technology and Education) ;
  • 조성웅 (Department of Computer Engineering, Korea University of Technology and Education) ;
  • 오흥선 (School of Computer Engineering, Korea University of Technology and Education) ;
  • 황명하 (Digital Solution Laboratory, KEPCO Research Institute, Korea Electric Power Corporation)
  • Submitted : 2023.06.16
  • Reviewed : 2023.08.22
  • Published : 2023.10.31

Abstract

Text augmentation is a methodology that creates new augmented texts by transforming or generating original texts in order to improve the performance of natural language processing (NLP) models. However, existing text augmentation techniques have limitations such as a lack of expressive diversity, semantic distortion, and a limited amount of augmented text. Text augmentation using large language models and few-shot learning can overcome these limitations, but it carries the risk of introducing noise through incorrect generations. In this paper, we propose TAGS, a text augmentation method that generates multiple candidate texts and selects the suitable ones as augmented texts. TAGS produces diverse expressions through few-shot learning on the original texts, while contrastive learning and similarity comparison allow it to select suitable data effectively even when little original text is available. We applied this method to task-oriented chatbot data, for which text augmentation is essential, and achieved a more than sixty-fold quantitative increase. Analyzing the generated texts, we confirmed that they are semantically and expressively more diverse than the original texts. Moreover, we trained and evaluated a classification model on the augmented texts and observed a performance improvement of more than 0.1915, confirming that the augmentation helps improve actual NLP model performance.
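The generate-then-select pipeline the abstract describes (produce many candidate texts with a few-shot-prompted large language model, then keep only those an embedding model judges close enough to the original) can be sketched roughly as below. This is an illustration, not the authors' implementation: generate_candidates() is a hypothetical stub for the LLM step, the off-the-shelf SentenceTransformer encoder stands in for the contrastively trained selection model described in the paper, and the 0.7 threshold is an arbitrary choice.

```python
# Minimal sketch of a generate-then-select augmentation loop.
# All names, the encoder checkpoint, and the threshold are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util


def generate_candidates(original: str) -> list[str]:
    """Stub for the few-shot LLM generation step.

    In TAGS, a large language model is prompted with a handful of original
    texts and asked to produce many candidate paraphrases; here we return
    hard-coded examples so the sketch runs end to end.
    """
    return [
        "How can I reset the password for my work account?",
        "I forgot my corporate login password, what should I do?",
        "Please tell me today's cafeteria menu.",  # off-topic candidate
    ]


def select_augmentations(original: str, candidates: list[str],
                         encoder: SentenceTransformer,
                         threshold: float = 0.7) -> list[str]:
    """Keep only candidates whose embedding is close enough to the original text."""
    emb_orig = encoder.encode(original, convert_to_tensor=True)
    emb_cand = encoder.encode(candidates, convert_to_tensor=True)
    sims = util.cos_sim(emb_orig, emb_cand)[0]  # one cosine similarity per candidate
    return [cand for cand, sim in zip(candidates, sims) if float(sim) >= threshold]


if __name__ == "__main__":
    encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    seed = "How do I reset my corporate account password?"
    augmented = select_augmentations(seed, generate_candidates(seed), encoder)
    print(augmented)  # candidates scoring below the threshold are discarded
```

In the paper's setting the selector is trained with contrastive learning so that faithful paraphrases score higher than noisy generations; the plain pretrained encoder above is only a stand-in for that component.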

Keywords

Acknowledgments

This work was funded by the Korea Electric Power Corporation (KEPCO) (R22XO02-30) and the "Regional Innovation Strategy (RIS)" through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (MOE) (2021RIS-004).
