A Study on Fine-Tuning and Transfer Learning to Construct Binary Sentiment Classification Model in Korean Text

  • Jong-Su Kim (Corporate Research Institute, Goldbridge Co., Ltd.)
  • Received : 2023.06.26
  • Accepted : 2023.10.10
  • Published : 2023.10.30

Abstract

Recently, generative models based on the Transformer architecture, such as ChatGPT, have attracted significant attention. The Transformer architecture has been applied to a variety of neural network models, including Google's BERT (Bidirectional Encoder Representations from Transformers) sentence generation model. In this paper, a method is proposed for building a binary text classification model that determines whether a comment on a Korean movie review is positive or negative. To this end, a publicly released, pre-trained multilingual BERT sentence generation model is fine-tuned and then transfer-learned on a new Korean training dataset. Specifically, a pre-trained multilingual BERT-Base model covering 104 languages, with 12 layers, a hidden size of 768, 12 attention heads, and 110M parameters, is used. To convert the pre-trained BERT-Base model into a text classification model, its input and output layers were fine-tuned, producing a new model with 178M parameters. Using the fine-tuned model with a maximum input length of 128 words, a batch size of 16, and 5 epochs, transfer learning is performed on 10,000 training samples and 5,000 test samples. The result is a binary sentiment classification model for Korean movie review comments with an accuracy of 0.9582, a loss of 0.1177, and an F1 score of 0.81. Repeating the transfer learning with a dataset five times larger yields a model with an accuracy of 0.9562, a loss of 0.1202, and an F1 score of 0.86.
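
The fine-tuning and transfer-learning procedure summarized above can be illustrated with a short sketch. The Python code below is not the authors' implementation: it assumes the Hugging Face Transformers library with PyTorch and the publicly released bert-base-multilingual-cased checkpoint (104 languages, 12 layers, hidden size 768, 12 attention heads), and it reuses only the hyperparameters reported in the abstract (maximum length 128, batch size 16, 5 epochs); the training data shown is a placeholder for the 10,000 training and 5,000 test comments used in the paper.

    # Minimal sketch (not the authors' code): fine-tuning multilingual BERT-Base
    # for binary sentiment classification of Korean movie-review comments.
    import torch
    from torch.utils.data import DataLoader, TensorDataset
    from transformers import BertTokenizerFast, BertForSequenceClassification

    MAX_LEN, BATCH_SIZE, EPOCHS = 128, 16, 5   # settings reported in the abstract

    # Load the pre-trained 104-language BERT-Base checkpoint and replace its
    # pre-training head with a new 2-way classification (output) layer.
    tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")
    model = BertForSequenceClassification.from_pretrained(
        "bert-base-multilingual-cased", num_labels=2)

    def make_loader(texts, labels, shuffle):
        """Tokenize Korean comments to at most MAX_LEN tokens and batch them."""
        enc = tokenizer(texts, truncation=True, padding="max_length",
                        max_length=MAX_LEN, return_tensors="pt")
        ds = TensorDataset(enc["input_ids"], enc["attention_mask"],
                           torch.tensor(labels))
        return DataLoader(ds, batch_size=BATCH_SIZE, shuffle=shuffle)

    # Placeholder data; the paper trains on 10,000 and tests on 5,000 comments.
    train_loader = make_loader(["재미있어요", "최악의 영화"], [1, 0], shuffle=True)

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

    # Transfer learning: all weights are updated on the new Korean dataset.
    model.train()
    for epoch in range(EPOCHS):
        for input_ids, attention_mask, labels in train_loader:
            optimizer.zero_grad()
            out = model(input_ids=input_ids.to(device),
                        attention_mask=attention_mask.to(device),
                        labels=labels.to(device))
            out.loss.backward()   # cross-entropy loss on the 2-class output head
            optimizer.step()

After training, evaluating the model on the held-out test comments would yield the accuracy, loss, and F1 figures of the kind reported in the abstract.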
