
A Proposal of Evaluation of Large Language Models Built Based on Research Data

A Proposal of Quality Evaluation Criteria for Large Language Models from the Perspective of Research Data

  • Han, Na-Eun (Korea Institute of Science and Technology Information) ;
  • Seo, Su-Jung (Korea Institute of Science and Technology Information) ;
  • Um, Jung-Ho (Korea Institute of Science and Technology Information)
  • Received : 2023.08.16
  • Accepted : 2023.09.18
  • Published : 2023.09.30

Abstract

Large Language Models (LLMs) have become the major trend in the natural language processing field. Many of these models were built on research data, yet information such as the types, limitations, and risks of using research data remains largely unknown. This study presents how to analyze and evaluate, from the perspective of research data, LLMs built with research data: LLaMA and LLaMA-based models such as Alpaca from Stanford and Vicuna from the Large Model Systems Organization (LMSYS), as well as ChatGPT from OpenAI. The quality evaluation focuses on the validity, functionality, and reliability factors of Data Quality Management (DQM). Furthermore, we adopted the Holistic Evaluation of Language Models (HELM) framework to examine its evaluation criteria and then discussed its limitations. This study presents quality evaluation criteria for LLMs built with research data and discusses future development directions.

Among the large language models proposed to date, this study focuses on the data quality of models that use research data as their primary pre-training data, such as LLaMA and LLaMA-based models, analyzes the current evaluation criteria, and proposes quality evaluation criteria from the perspective of research data. To this end, quality evaluation was discussed with a focus on validity, functionality, and reliability among the data quality evaluation factors, and the LLaMA, Alpaca, Vicuna, and ChatGPT models were compared to understand the characteristics and limitations of large language models. To analyze the evaluation criteria applied to the large language models currently in wide use, the criteria of the Holistic Evaluation of Language Models (HELM) were examined and their limitations discussed. On this basis, the study presents quality evaluation criteria for large language models that use research data as their primary pre-training data and discusses future development directions, which is meaningful in that it provides a knowledge base for the advancement of large language models.

Keywords

Acknowledgement

This work was supported by Korea Institute of Science and Technology Information (KISTI) research projects (Project Nos. K-23-L01-C03-S01 and K-23-L03-C02-S01).

References

  1. An, Seong-Won, Yu, Jae-Hong, Jo, Won-Young, No, Jae-Won, & Son, Ho-Hyun (2023). Rise of Hyper-scale LLM (Large Language Model) and Issues. Gyeonggi: Software Policy Research Institute.
  2. Azuma, Yukinaga (2018). Deep Learning that is Tangible, Practical Programming from the Basics. Tokyo: SB Creative.
  3. Han, Na-Eun (2023). Proposal of process model for research data quality management. Journal of the Korean Society for Information Management, 40(1), 51-71. https://doi.org/10.3743/KOSIM.2023.40.1.051
  4. Jo, Tae-Ho (2022). Deep Learning for Everyone - Deep Learning that Anyone can Easily Understand. Seoul: Gilbut. 
  5. Kim, Hyung-Sub (2020). A study on the data quality management evaluation model. Journal of the Korea Convergence Society, 11(7), 217-222. https://doi.org/10.15207/JKCS.2020.11.7.217 
  6. Kim, Seon-Tae, Lee, Jeong-Hoon, & Jeong, Han-Min (2017). Understanding and Managing Research Data. Daejeon: Korea Institute of Science and Technology Information. 
  7. Korea Data Agency (2006). Data Quality Management Guidelines (Ver 2.1). 
  8. Lee, Gi-Chang (2021). (Do it!) Learning Natural Language Processing with BERT and GPT: Transformer Core Principles and How to Use the Hugging Face Package. Seoul: Easyspublishing. 
  9. Lee, Kyong-Nim & Ho, Eun-Kyoung (2023). AI dialogue interface based on large language models: the state of the art AI dialogue models and seeking linguistic research topics. The Society of Korean Linguistics, 105, 345-374. https://doi.org/10.15811/jkl.2023..105.010
  10. Lee, Su-Hyeon & Jeon, Sang-Hong (2023). ChatGPT State of the Technology Industry Report. Korea Copyright Commission. 
  11. Ministry of Security and Public Administration (2014). Government Data Management Guidelines. No. 2014-13. 
  12. National Research and Development Information Processing Standards, Ministry of Science and ICT Notice No. 2020-102. 
  13. National Research Council of Science and Technology (2019). Research Data Management Guidelines (2019-07). 
  14. Park, Hyung-Kyung (2020). A study on the use of copyrightable works in machine learning. The Korean Association of Sports and Entertainment Law, 23(1), 129-152. http://doi.org/10.19051/kasel.2020.23.1.129 
  15. Park, Seong-Ho (2020). A study on whether collecting and using other people's copyrighted works for the purpose of text and data mining falls under the copyright limitations: focusing on the use of big data in artificial intelligence. Human Rights and Justice, 494, 39-69. http://doi.org/10.22999/hraj..494.202012.003 
  16. 我妻 幸長 (2018). はじめてのディープラーニング -Pythonで学ぶニューラルネットワークとバックプロパゲーション- (Machine Learning). Translated by Choi, Jae-Won (2019). Deep Learning that is Tangible, Practical Programming from the Basics. Seoul: Chaekman.
  17. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., & Amodei, D. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877-1901. 
  18. Buchanan, B., Lohn, A., Musser, M., & Sedova, K. (2021). Truth, lies, and automation. Center for Security and Emerging Technology, 1(1), 2. 
  19. Chiang, W. L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J. E., Stoica, I., & Xing, E. P. (2023). Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality. Available: https://lmsys.org/blog/2023-03-30-vicuna/
  20. Chomsky, N. (1957). Logical structure in language. Journal of the American Society for Information Science, 8(4), 284. 
  21. Dwivedi, Y. K., Kshetri, N., Hughes, L., Slade, E. L., Jeyaraj, A., Kar, A. K., Baabdullah, A. M., Koohang, A., Raghavan, V., Ahuja, M., Albanna, H., Albashrawi, M. A., Al-Busaidi, A. S., Balakrishnan, J., Barlette, Y., Basu, S., Bose, I., Brooks, L., Buhalis, D., Carter, L., & Wright, R. (2023). "So what if ChatGPT wrote it?" multidisciplinary perspectives on opportunities, challenges and implications of generative conversational ai for research, practice and policy. International Journal of Information Management, 71, 102642. 
  22. English, L. P. (2009). Information Quality Applied: Best Practices for Improving Business Information, Processes and Systems. New Jersey: Wiley. 
  23. Gehman, S., Gururangan, S., Sap, M., Choi, Y., & Smith, N. A. (2020). RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. https://doi.org/10.48550/arXiv.2009.11462
  24. Hale, J. (2001). A Probabilistic Earley Parser as a Psycholinguistic Model. In Second Meeting of the North American Chapter of the Association for Computational Linguistics. 
  25. International Organization for Standardization (2015). ISO/IEC 25024: 2015: Systems and Software Engineering-Systems and Software Quality Requirements and Evaluation (SQuaRE)-Measurement of Data Quality. ISO/IEC. 
  26. Jurafsky, D. & Martin, J. H. (2021). Speech and Language Processing (3rd ed.). California: Stanford University.
  27. Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020). Scaling Laws for Neural Language Models. https://doi.org/10.48550/arXiv.2001.08361 
  28. Kindling, M. & Strecker, D. (2022). Data Quality Assurance at Research Data Repositories. Data Science Journal, 21(1). http://doi.org/10.5334/dsj-2022-018 
  29. Lee, P., Goldberg, C., & Kohane, I. (2023). The AI Revolution in Medicine: GPT-4 and beyond. London: Pearson. 
  30. Lemley, M. A. & Casey, B. (2020). Fair learning. Texas Law Review, 99(4), 743-785.
  31. Levy, R. (2008). Expectation-based syntactic comprehension. Cognition, 106(3), 1126-1177. https://doi.org/10.1016/j.cognition.2007.05.006 
  32. Lewkowycz, A., Andreassen, A., Dohan, D., Dyer, E., Michalewski, H., Ramasesh, V., Slone, A., Anil, C., Schlag, I., Gutman-Solo, T., Wu, Y., Neyshabur, B., Gur-Ari, G., & Misra, V. (2022). Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems, 35, 3843-3857. 
  33. Lin, S., Hilton, J., & Evans, O. (2021). TruthfulQA: Measuring How Models Mimic Human Falsehoods. https://doi.org/10.48550/arXiv.2109.07958
  34. OpenAI (2023). GPT-4 Technical Report. https://doi.org/10.48550/arXiv.2303.08774
  35. Peng, B., Li, C., He, P., Galley, M., & Gao, J. (2023). Instruction Tuning with GPT-4. https://doi.org/10.48550/arXiv.2304.03277
  36. Pennycook, G., Epstein, Z., Mosleh, M., Arechar, A. A., Eckles, D., & Rand, D. G. (2021). Shifting attention to accuracy can reduce misinformation online. Nature, 592(7855), 590-595.  https://doi.org/10.1038/s41586-021-03344-2
  37. Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., Zhang, Y., Narayanan, D., Wu, Y., Kumar, A., Newman, B., Yuan, B., Yan, B., Zhang, C., Cosgrove, C., Manning, C. D., Re, C., Acosta-Navas, D., Hudson, D. A., Zelikman, E., Durmus, E., Ladhak, F., Rong, F., Ren, H., Yao, H., Wang, J., Santhanam, K., Orr, L., Zheng, L., Yuksekgonul, M., Suzgun, M., Kim, N., Guha, N., Chatterji, N., Khattab, O., Henderson, P., Huang, Q., Chi, R., Xie, S. M., Santurkar, S., Ganguli, S., Hashimoto, T., Icard, T., Zhang, T., Chaudhary, V., Wang, W., Li, X., Mai, Y., Zhang, Y., & Koreeda, Y. (2022). Holistic Evaluation of Language Models. https://doi.org/10.48550/arXiv.2211.09110
  38. Petroni, F., Rocktaschel, T., Lewis, P., Bakhtin, A., Wu, Y., Miller, A. H., & Riedel, S. (2019). Language Models as Knowledge Bases? https://doi.org/10.48550/arXiv.1909.01066
  39. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI blog, 1(8), 9. 
  40. Rae, J. W., Borgeaud, S., Cai, T., Millican, K., Hoffmann, J., Song, F., Aslanides, J., Henderson, S., Ring, R., Young, S., Rutherford, E., Hennigan, T., Menick, J., Cassirer, A., Powell, R., Driessche, G., Hendricks, L. A., Rauh, M., Huang, P., Glaese, A., Welbl, J., Dathathri, S., Huang, S., Uesato, J., Mellor, J., Higgins, I., Creswell, A., McAleese, N., Wu, A., Elsen, E., Jayakumar, S., Buchatskaya, E., Budden, D., Sutherland, E., Simonyan, K., Paganini, M., Sifre, L., Martens, L., Li, X. L., Kuncoro, A., Nematzadeh, A., Gribovskaya, E., Donato, D., Lazaridou, A., Mensch, A., Lespiau, J., Tsimpoukelli, M., Grigorev, N., Fritz, D., Sottiaux, T., Pajarskas, M., Pohlen, T., Gong, Z., Toyama, D., d'Autume, C. M., Li, Y., Terzi, T., Mikulik, V., Babuschkin, I., Clark, A., Casas, D. L., Guy, A., Jones, C., Bradbury, J., Johnson, M., Hechtman, B., Weidinger, L., Gabriel, I., Isaac, W., Lockhart, E., Osindero, S., Rimell, L., Dyer, C., Vinyals, O., Ayoub, K., Stanway, J., Bennett, L., Hassabis, D., Kavukcuoglu, K., & Irving, G. (2021). Scaling Language Models: Methods, Analysis & Insights from Training Gopher. https://doi.org/10.48550/ARXIV.2112.11446 
  41. Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., & Hashimoto, T. B. (2023). Stanford Alpaca: An Instruction-following Llama Model. Available: https://github.com/tatsu-lab/stanford_alpaca 
  42. Taylor, R., Kardas, M., Cucurull, G., Scialom, T., Hartshorn, A., Saravia, E., Poulton, A., Kerkez, V., & Stojnic, R. (2022). Galactica: A Large Language Model for Science. https://doi.org/10.48550/arXiv.2211.09085 
  43. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M. A., Lacroix, T., Roziere, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., & Lample, G. (2023). LLaMA: Open and Efficient Foundation Language Models. https://doi.org/10.48550/arXiv.2302.13971
  44. Wilcox, E., Qian, P., Futrell, R., Kohita, R., Levy, R., & Ballesteros, M. (2020). Structural Supervision Improves Few-shot Learning and Syntactic Generalization in Neural Language Models. https://doi.org/10.48550/arXiv.2010.05725
  45. Yarowsky, D. (1995, June). Unsupervised word sense disambiguation rivaling supervised methods. In 33rd Annual Meeting of the Association for Computational Linguistics, 189-196. 
  46. Yasunaga, M., Bosselut, A., Ren, H., Zhang, X., Manning, C. D., Liang, P. S., & Leskovec, J. (2022). Deep Bidirectional language-knowledge graph pretraining. Advances in Neural Information Processing Systems, 35, 37309-37323.