
Exploring automatic scoring of mathematical descriptive assessment using prompt engineering with the GPT-4 model: Focused on permutations and combinations

  • Byoungchul Shin (Suwon Foreign Language High School) ;
  • Junsu Lee (Hwahong High School) ;
  • Yunjoo Yoo (Seoul National University)
  • Received : 2024.02.29
  • Reviewed : 2024.05.03
  • Published : 2024.05.31

Abstract

In this study, we explored the feasibility of automatically scoring descriptive assessment items with GPT-4-based ChatGPT by comparing and analyzing the scoring results of teachers and of GPT-4-based ChatGPT. For this purpose, three descriptive items from the permutations and combinations unit for first-year high school students were selected from the KICE (Korea Institute for Curriculum and Evaluation) website. Items 1 and 2 admitted a single problem-solving strategy, while Item 3 admitted two or more strategies. Two teachers, each with more than eight years of teaching experience, scored the answers of 204 students, and their results were compared with those of GPT-4-based ChatGPT. Scoring prompts were constructed for each item using techniques such as Few-Shot-CoT, Self-Consistency (SC), structured prompting, and iterative prompting, and were then input into GPT-4-based ChatGPT for scoring. For Items 1 and 2, the scoring results showed a strong correlation between the teachers and GPT-4. For Item 3, which admitted multiple problem-solving strategies, the student answers were first classified by strategy using a classification prompt input into GPT-4-based ChatGPT; a scoring prompt tailored to each answer type was then applied and input into GPT-4-based ChatGPT for scoring, and these results likewise showed a strong correlation with the teachers' scoring. These findings confirm the potential of GPT-4 models combined with prompt engineering to assist teachers in scoring; the study's limitations and directions for future research are also presented.
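The pipeline described above (Few-Shot-CoT exemplars in the scoring prompt, Self-Consistency majority voting over several sampled scores, and a pre-scoring classification prompt for the multi-strategy item) could be sketched roughly as follows. This is an illustrative reconstruction, not the authors' actual prompts: all function names, the rubric wording, and the prompt phrasing are assumptions, and the model call itself is omitted.

```python
from collections import Counter

def build_fewshot_cot_prompt(rubric, examples, student_answer):
    """Assemble a Few-Shot-CoT scoring prompt: the rubric, a few
    worked scoring examples with explicit reasoning, then the
    student answer to be scored."""
    parts = [
        "You are a mathematics teacher scoring a descriptive answer.",
        f"Rubric:\n{rubric}",
    ]
    for ex in examples:  # few-shot exemplars with chain-of-thought
        parts.append(
            f"Answer: {ex['answer']}\n"
            f"Reasoning: {ex['reasoning']}\n"
            f"Score: {ex['score']}"
        )
    parts.append("Let's think step by step.")
    parts.append(f"Answer: {student_answer}\nReasoning:")
    return "\n\n".join(parts)

def build_classification_prompt(strategies, student_answer):
    """Stage-1 prompt for a multi-strategy item: label the answer
    with the problem-solving strategy it uses, so a type-specific
    scoring prompt can be applied afterwards."""
    listing = "\n".join(f"- {s}" for s in strategies)
    return (
        "Classify the following answer by the strategy it uses.\n"
        f"Strategies:\n{listing}\n"
        f"Answer: {student_answer}\nStrategy:"
    )

def self_consistent_score(sampled_scores):
    """Self-Consistency (SC): score the same answer several times
    (independent reasoning paths) and keep the majority-vote score."""
    return Counter(sampled_scores).most_common(1)[0][0]
```

In use, each student answer would be scored several times with the Few-Shot-CoT prompt and the sampled scores aggregated by `self_consistent_score`; for the multi-strategy item, `build_classification_prompt` would route the answer to the scoring prompt matching its strategy first.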

Keywords
