
A Proposal of Evaluation of Large Language Models Built Based on Research Data

A Proposal of Quality Evaluation Criteria for Large Language Models from the Perspective of Research Data

  • Han, Na-Eun (Korea Institute of Science and Technology Information) ;
  • Seo, Su-Jung (Korea Institute of Science and Technology Information) ;
  • Um, Jung-Ho (Korea Institute of Science and Technology Information)
  • Received : 2023.08.16
  • Accepted : 2023.09.18
  • Published : 2023.09.30

Abstract

Large Language Models (LLMs) have become the major trend in the natural language processing field. Many of these models were built on research data, yet information such as the types, limitations, and risks of using research data remains largely unknown. This study presents how to analyze and evaluate, from the perspective of research data, LLMs built with research data: LLaMA and LLaMA-based models such as Alpaca from Stanford and Vicuna from the Large Model Systems Organization (LMSYS), as well as ChatGPT from OpenAI. The quality evaluation focuses on the validity, functionality, and reliability factors of Data Quality Management (DQM). Furthermore, we adopted the Holistic Evaluation of Language Models (HELM) framework to examine its evaluation criteria and then discussed its limitations. This study presents quality evaluation criteria for LLMs built with research data and discusses future development directions.

Among the large language models proposed to date, this study focuses on the data quality of models that use research data as their primary pre-training data, such as LLaMA and LLaMA-based models, analyzes the current evaluation criteria, and proposes quality evaluation criteria from the perspective of research data. To this end, quality evaluation was discussed with a focus on validity, functionality, and reliability among the data quality evaluation factors, and the LLaMA, Alpaca, Vicuna, and ChatGPT models were compared to understand the characteristics and limitations of large language models. To analyze the evaluation criteria applied to the large language models currently in wide use, the criteria of the Holistic Evaluation of Language Models (HELM) were examined and their limitations discussed. On this basis, the study presents quality evaluation criteria for large language models that use research data as their primary pre-training data and discusses future development directions, which is meaningful in that it provides a knowledge base for the advancement of large language models.

Keywords

Acknowledgement

This work was supported by Korea Institute of Science and Technology Information (KISTI) research projects (Project Nos. K-23-L01-C03-S01 and K-23-L03-C02-S01).

References

  1. An, Seong-Won, Yu, Jae-Hong, Jo, Won-Young, No, Jae-Won, & Son, Ho-Hyun (2023). Rise of Hyper-scale LLM (Large Language Model) and Issues. Gyeonggi: Software Policy Research Institute.
  2. Azuma, Yukinaga (2018). Deep Learning that is Tangible, Practical Programming from the Basics. Tokyo: SB Creative.
  3. Han, Na-Eun (2023). Proposal of process model for research data quality management. Journal of the Korean Society for Information Management, 40(1), 51-71. https://doi.org/10.3743/KOSIM.2023.40.1.051
  4. Jo, Tae-Ho (2022). Deep Learning for Everyone - Deep Learning that Anyone can Easily Understand. Seoul: Gilbut. 
  5. Kim, Hyung-Sub (2020). A study on the data quality management evaluation model. Journal of the Korea Convergence Society, 11(7), 217-222. https://doi.org/10.15207/JKCS.2020.11.7.217 
  6. Kim, Seon-Tae, Lee, Jeong-Hoon, & Jeong, Han-Min (2017). Understanding and Managing Research Data. Daejeon: Korea Institute of Science and Technology Information. 
  7. Korea Data Agency (2006). Data Quality Management Guidelines (Ver 2.1). 
  8. Lee, Gi-Chang (2021). (Do it!) Learning Natural Language Processing with BERT and GPT: Transformer Core Principles and How to Use the Hugging Face Package. Seoul: Easyspublishing. 
  9. Lee, Kyong-Nim & Ho, Eun-Kyoung (2023). AI dialogue interface based on large language models: the state of the art AI dialogue models and seeking linguistic research topics. The Society of Korean Linguistics, 105, 345-374. https://doi.org/10.15811/jkl.2023..105.010
  10. Lee, Su-Hyeon & Jeon, Sang-Hong (2023). ChatGPT State of the Technology Industry Report. Korea Copyright Commission. 
  11. Ministry of Security and Public Administration (2014). Government Data Management Guidelines. No. 2014-13. 
  12. National Research and Development Information Processing Standards, Ministry of Science and ICT Notice No. 2020-102. 
  13. National Research Council of Science and Technology (2019). Research Data Management Guidelines (2019-07). 
  14. Park, Hyung-Kyung (2020). A study on the use of copyrightable works in machine learning. The Korean Association of Sports and Entertainment Law, 23(1), 129-152. http://doi.org/10.19051/kasel.2020.23.1.129 
  15. Park, Seong-Ho (2020). A study on whether collecting and using other people's copyrighted works for the purpose of text and data mining falls under the copyright limitations: focusing on the use of big data in artificial intelligence. Human Rights and Justice, 494, 39-69. http://doi.org/10.22999/hraj..494.202012.003 
  16. 我妻 幸長 (2018). はじめてのディープラーニング -Pythonで学ぶニューラルネットワークとバックプロパゲーション- (Machine Learning). Translated by Choi, Jae-Won (2019). Deep Learning that is Tangible, Practical Programming from the Basics. Seoul: Chaekman.
  17. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., & Amodei, D. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877-1901. 
  18. Buchanan, B., Lohn, A., Musser, M., & Sedova, K. (2021). Truth, lies, and automation. Center for Security and Emerging Technology, 1(1), 2. 
  19. Chiang, W. L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J. E., Stoica, I., & Xing, E. P. (2023). Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality. Available: https://lmsys.org/blog/2023-03-30-vicuna/
  20. Chomsky, N. (1957). Logical structure in language. Journal of the American Society for Information Science, 8(4), 284. 
  21. Dwivedi, Y. K., Kshetri, N., Hughes, L., Slade, E. L., Jeyaraj, A., Kar, A. K., Baabdullah, A. M., Koohang, A., Raghavan, V., Ahuja, M., Albanna, H., Albashrawi, M. A., Al-Busaidi, A. S., Balakrishnan, J., Barlette, Y., Basu, S., Bose, I., Brooks, L., Buhalis, D., Carter, L., & Wright, R. (2023). "So what if ChatGPT wrote it?" multidisciplinary perspectives on opportunities, challenges and implications of generative conversational ai for research, practice and policy. International Journal of Information Management, 71, 102642. 
  22. English, L. P. (2009). Information Quality Applied: Best Practices for Improving Business Information, Processes and Systems. New Jersey: Wiley. 
  23. Gehman, S., Gururangan, S., Sap, M., Choi, Y., & Smith, N. A. (2020). RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. https://doi.org/10.48550/arXiv.2009.11462
  24. Hale, J. (2001). A Probabilistic Earley Parser as a Psycholinguistic Model. In Second Meeting of the North American Chapter of the Association for Computational Linguistics. 
  25. International Organization for Standardization (2015). ISO/IEC 25024: 2015: Systems and Software Engineering-Systems and Software Quality Requirements and Evaluation (SQuaRE)-Measurement of Data Quality. ISO/IEC. 
  26. Jurafsky, D. & Martin, J. H. (2021). Speech and Language Processing (3rd ed.). California: Stanford University.
  27. Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020). Scaling Laws for Neural Language Models. https://doi.org/10.48550/arXiv.2001.08361 
  28. Kindling, M. & Strecker, D. (2022). Data Quality Assurance at Research Data Repositories. Data Science Journal, 21(1). http://doi.org/10.5334/dsj-2022-018 
  29. Lee, P., Goldberg, C., & Kohane, I. (2023). The AI Revolution in Medicine: GPT-4 and beyond. London: Pearson. 
  30. Lemley, M. A. & Casey, B. (2020). Fair learning. Texas Law Review, 99(4), 743-785.
  31. Levy, R. (2008). Expectation-based syntactic comprehension. Cognition, 106(3), 1126-1177. https://doi.org/10.1016/j.cognition.2007.05.006 
  32. Lewkowycz, A., Andreassen, A., Dohan, D., Dyer, E., Michalewski, H., Ramasesh, V., Slone, A., Anil, C., Schlag, I., Gutman-Solo, T., Wu, Y., Neyshabur, B., Gur-Ari, G., & Misra, V. (2022). Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems, 35, 3843-3857. 
  33. Lin, S., Hilton, J., & Evans, O. (2021). TruthfulQA: Measuring How Models Mimic Human Falsehoods. https://doi.org/10.48550/arXiv.2109.07958
  34. OpenAI (2023). GPT-4 Technical Report. https://doi.org/10.48550/arXiv.2303.08774
  35. Peng, B., Li, C., He, P., Galley, M., & Gao, J. (2023). Instruction Tuning with GPT-4. https://doi.org/10.48550/arXiv.2304.03277
  36. Pennycook, G., Epstein, Z., Mosleh, M., Arechar, A. A., Eckles, D., & Rand, D. G. (2021). Shifting attention to accuracy can reduce misinformation online. Nature, 592(7855), 590-595.  https://doi.org/10.1038/s41586-021-03344-2
  37. Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., Zhang, Y., Narayanan, D., Wu, Y., Kumar, A., Newman, B., Yuan, B., Yan, B., Zhang, C., Cosgrove, C., Manning, C. D., Re, C., Acosta-Navas, D., Hudson, D. A., Zelikman, E., Durmus, E., Ladhak, F., Rong, F., Ren, H., Yao, H., Wang, J., Santhanam, K., Orr, L., Zheng, L., Yuksekgonul, M., Suzgun, M., Kim, N., Guha, N., Chatterji, N., Khattab, O., Henderson, P., Huang, Q., Chi, R., Xie, S. M., Santurkar, S., Ganguli, S., Hashimoto, T., Icard, T., Zhang, T., Chaudhary, V., Wang, W., Li, X., Mai, Y., Zhang, Y., & Koreeda, Y. (2022). Holistic Evaluation of Language Models. https://doi.org/10.48550/arXiv.2211.09110
  38. Petroni, F., Rocktaschel, T., Lewis, P., Bakhtin, A., Wu, Y., Miller, A. H., & Riedel, S. (2019). Language Models as Knowledge Bases? https://doi.org/10.48550/arXiv.1909.01066
  39. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI blog, 1(8), 9. 
  40. Rae, J. W., Borgeaud, S., Cai, T., Millican, K., Hoffmann, J., Song, F., Aslanides, J., Henderson, S., Ring, R., Young, S., Rutherford, E., Hennigan, T., Menick, J., Cassirer, A., Powell, R., Driessche, G., Hendricks, L. A., Rauh, M., Huang, P., Glaese, A., Welbl, J., Dathathri, S., Huang, S., Uesato, J., Mellor, J., Higgins, I., Creswell, A., McAleese, N., Wu, A., Elsen, E., Jayakumar, S., Buchatskaya, E., Budden, D., Sutherland, E., Simonyan, K., Paganini, M., Sifre, L., Martens, L., Li, X. L., Kuncoro, A., Nematzadeh, A., Gribovskaya, E., Donato, D., Lazaridou, A., Mensch, A., Lespiau, J., Tsimpoukelli, M., Grigorev, N., Fritz, D., Sottiaux, T., Pajarskas, M., Pohlen, T., Gong, Z., Toyama, D., d'Autume, C. M., Li, Y., Terzi, T., Mikulik, V., Babuschkin, I., Clark, A., Casas, D. L., Guy, A., Jones, C., Bradbury, J., Johnson, M., Hechtman, B., Weidinger, L., Gabriel, I., Isaac, W., Lockhart, E., Osindero, S., Rimell, L., Dyer, C., Vinyals, O., Ayoub, K., Stanway, J., Bennett, L., Hassabis, D., Kavukcuoglu, K., & Irving, G. (2021). Scaling Language Models: Methods, Analysis & Insights from Training Gopher. https://doi.org/10.48550/ARXIV.2112.11446 
  41. Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., & Hashimoto, T. B. (2023). Stanford Alpaca: An Instruction-following Llama Model. Available: https://github.com/tatsu-lab/stanford_alpaca 
  42. Taylor, R., Kardas, M., Cucurull, G., Scialom, T., Hartshorn, A., Saravia, E., Poulton, A., Kerkez, V., & Stojnic, R. (2022). Galactica: A Large Language Model for Science. https://doi.org/10.48550/arXiv.2211.09085 
  43. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M. A., Lacroix, T., Roziere, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., & Lample, G. (2023). LLaMA: Open and Efficient Foundation Language Models. https://doi.org/10.48550/arXiv.2302.13971
  44. Wilcox, E., Qian, P., Futrell, R., Kohita, R., Levy, R., & Ballesteros, M. (2020). Structural Supervision Improves Few-shot Learning and Syntactic Generalization in Neural Language Models. https://doi.org/10.48550/arXiv.2010.05725
  45. Yarowsky, D. (1995, June). Unsupervised word sense disambiguation rivaling supervised methods. In 33rd Annual Meeting of the Association for Computational Linguistics, 189-196. 
  46. Yasunaga, M., Bosselut, A., Ren, H., Zhang, X., Manning, C. D., Liang, P. S., & Leskovec, J. (2022). Deep Bidirectional language-knowledge graph pretraining. Advances in Neural Information Processing Systems, 35, 37309-37323.