
Generating Sponsored Blog Texts through Fine-Tuning of Korean LLMs

  • Bo Kyeong Kim (Department of AI, Daegu University) ;
  • Jae Yeon Byun (Department of AI, Daegu University) ;
  • Kyung-Ae Cha (Department of AI, Daegu University)
  • Received : 2024.03.15
  • Accepted : 2024.04.26
  • Published : 2024.06.30

Abstract

In this paper, we fine-tuned KoAlpaca, a large-scale Korean language model, and implemented a blog text generation system based on it. Blogs on social media platforms are widely used as a marketing tool by businesses. We constructed a training dataset of positive reviews by collecting sponsored blog texts and refining them through sentiment analysis, and applied QLoRA for lightweight fine-tuning of KoAlpaca. QLoRA is a fine-tuning approach that greatly reduces the memory required for training; in our experimental environment with a 12.8B-parameter model, it lowered memory usage by up to 58.8% compared to LoRA. To evaluate the generative performance of the fine-tuned model, we generated texts from 100 inputs not included in the training data: the fine-tuned model produced, on average, more than twice as many words as the pre-trained model, and texts with positive sentiment appeared more than twice as often. In a survey conducted for qualitative evaluation, respondents judged the fine-tuned model's outputs to be more relevant to the given topics in 77.5% of cases on average. These results show that the proposed language model for generating positive reviews of sponsored content can make content creation more time-efficient and deliver consistent marketing effects. However, because elements of the pre-trained model occasionally yield content that falls outside the category of positive reviews, we plan to continue fine-tuning with augmented training data to reduce such outputs.
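
For context, the QLoRA setup summarized above (4-bit quantization of a 12.8B-parameter Korean model with low-rank adapters) is typically configured with the Hugging Face transformers, peft, and bitsandbytes libraries. The following is a minimal illustrative sketch only: the checkpoint name beomi/KoAlpaca-Polyglot-12.8B, the target module, and all hyperparameters are assumptions, not the configuration reported in the paper.

```python
# Illustrative QLoRA sketch (assumed setup, not the authors' exact configuration).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Assumed public 12.8B-parameter KoAlpaca checkpoint.
model_id = "beomi/KoAlpaca-Polyglot-12.8B"

# 4-bit NF4 quantization: the QLoRA ingredient that cuts training memory.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Low-rank adapters are trained on top of the frozen, quantized base weights.
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query_key_value"],  # GPT-NeoX-style attention projection
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of the 12.8B weights is updated
```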

Keywords

References

  1. Ahn, H. J. and Ha, Y. (2017). Analysis of the Relationship between the Type of Experience and Blog Texts. The Journal of Korean Institute of Information Technology, 15(2), 131-140.
  2. Dettmers, T., Pagnoni, A., Holtzman, A. and Zettlemoyer, L. (2023). Qlora: Efficient Finetuning of Quantized LLMs, arXiv preprint arXiv:2305.14314.
  3. Friedl, J. E. F. (2006). Mastering Regular Expressions. O'Reilly Media, the United States of America.
  4. Goldberg, Y. and Levy, O. (2014). Word2vec Explained: Deriving Mikolov et al.'s Negative-Sampling Word-Embedding Method, arXiv preprint arXiv:1402.3722.
  5. Han, Y., Kim, H. and Lee, S. (2017). Experience Transfer in Social Media: Impact of Indirect Experience from Blog Posts, Journal of Channel and Retailing, 22(1), 39-50.
  6. Kang, H. and Cheon, H.J. (2020). Sponsorship Disclosures in Influencer Marketing: Focusing on Characteristics of Influencer, Viewing Satisfaction, and Attitudes toward Sponsorship, Journal of Communication Research in Korea, 19(3), 215-244.
  7. Kang, S.Y., Lee, Y.J., Jung, H.A., Cho, S.A. and Lee, H.G. (2024). An User-Friendly Kiosk System Based on Deep Learning, Journal of Korea Society of Industrial Information Systems, 29(1), 1-13.
  8. Kim, H. and Oh, Y. (2023). Design of a Mirror for Fragrance Recommendation Based on Personal Emotion Analysis, Journal of Korea Society of Industrial Information Systems, 28(4), 11-19.
  9. Kim, J. (2023). A Study on Fine-Tuning and Transfer Learning to Construct Binary Sentiment Classification Model in Korean Text, Journal of Korea Society of Industrial Information Systems, 28(5), 15-30.
  10. Kim, S., Shin, J.B., Yun, H.G., Lee, J., Cho, H.J., Choi, J. and Han, J.H. (2023). Technology Trends of Large Language Models in the Age of Generative AI, Communications of the Korean Institute of Information Scientists and Engineers, 41(11), 25-33.
  11. Ko, H., Yang, K., Ryu, M., Choi, T., Yang, S., Hyun, J., Park, S. and Park, K. (2023). A Technical Report for Polyglot-Ko: Open-Source Large-Scale Korean Language Models. arXiv preprint arXiv:2306.02254.
  12. KoAlpaca (2023). GitHub Repository. https://github.com/Beomi/KoAlpaca (Accessed on Jan. 10th, 2024).
  13. McInnes, L., Healy, J. and Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction, arXiv preprint arXiv:1802.03426.
  14. Nah, F. F.-H., Zheng, R., Cai, J., Siau, K. and Chen, L. (2023). Generative AI and ChatGPT: Applications, Challenges, and AI-human Collaboration, Journal of Information Technology Case and Application Research, 25(3), 277-304.
  15. Oh, C., Kim, C. and Park, K. (2023). Building Robust Korean Speech Recognition Model by Fine-tuning Large Pretrained Model, Phonetics and Speech Sciences, 15(3), 75-82.
  16. OpenAI. (2021). GPT-3.5 (Turbo) - API Documentation. https://platform.openai.com/docs/models/gpt-3-5 (Accessed on Jan. 10th, 2024).
  17. Rathore, B. (2023). Future of AI & Generation Alpha: ChatGPT Beyond Boundaries, Eduzone: International Peer Reviewed/Refereed Multidisciplinary Journal, 12(1), 63-68.
  18. Hochreiter, S. and Schmidhuber, J. (1997). Long Short-Term Memory, Neural Computation, 9(8), 1735-1780. https://doi.org/10.1162/neco.1997.9.8.1735.
  19. Soh, H. (2012). Examining the Effects of Message Sidedness and Rewarded Consumer Referral in the Context of Blog Product Reviews, Journal of Practical Research in Advertising and Public Relations, 5(2), 112-143.
  20. Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P. and Hashimoto, T. B. (2023). Stanford Alpaca: An Instruction-following LLaMA Model. https://github.com/tatsu-lab/stanford_alpaca
  21. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M. A., Lacroix, T., Roziere, B., Hambro, E., Azhar, F., Rodriguez, A., Grave, E. and Lample, G. (2023). LLaMA: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971.
  22. Vasarhelyi, M. A., Moffitt, K. C., Stewart, T. and Sunderland, D. (2023). Large Language Models: An Emerging Technology in Accounting. Journal of Emerging Technologies in Accounting, 20(2), 1-10.
  23. Yeu, M., Lee, D.H. and Jeong, J.E. (2020). How Sponsorship Type Affects the Review Adoption of Blog Reviews: Focusing on Moderating Effect of Self-Control. Journal of Product Research, 38(2), 63-70.