
Dual-scale BERT using multi-trait representations for holistic and trait-specific essay grading

  • Minsoo Cho (Language Intelligence Research Section, Electronics and Telecommunications Research Institute) ;
  • Jin-Xia Huang (Language Intelligence Research Section, Electronics and Telecommunications Research Institute) ;
  • Oh-Woog Kwon (Language Intelligence Research Section, Electronics and Telecommunications Research Institute)
  • Received : 2023.08.27
  • Accepted : 2023.12.20
  • Published : 2024.02.20

Abstract

As automated essay scoring (AES) has progressed from handcrafted techniques to deep learning, holistic scoring capabilities have emerged. However, assessing specific traits remains challenging because earlier methods lacked the depth to model holistic and multi-trait assessments jointly. To overcome this challenge, we explore providing comprehensive feedback while modeling the interconnections between holistic and trait representations. We introduce the DualBERT-Trans-CNN model, which combines transformer-based representations with a novel dual-scale bidirectional encoder representations from transformers (BERT) encoding approach at the document level. By explicitly leveraging multi-trait representations in a multi-task learning (MTL) framework, DualBERT-Trans-CNN emphasizes the interrelation between holistic and trait-based score predictions, aiming for improved accuracy. For validation, we conducted extensive tests on the ASAP++ and TOEFL11 datasets. Compared with models in the same MTL setting, ours showed a 2.0% improvement in holistic scoring. Compared with single-task learning (STL) models, it demonstrated a 3.6% improvement in average multi-trait performance on the ASAP++ dataset.
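To make the architecture described above concrete, the following is a minimal PyTorch sketch of a dual-scale, multi-trait essay scorer, assuming a Hugging Face BERT backbone. The class name DualScaleEssayScorer, the segment chunking, the Transformer+CNN pooling over segments, and the head dimensions are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
from transformers import AutoModel


class DualScaleEssayScorer(nn.Module):
    """Sketch of a dual-scale BERT scorer: one document-scale path,
    one segment-scale Transformer+CNN path, a regression head per
    trait, and a holistic head that reuses the trait predictions."""

    def __init__(self, num_traits: int, hidden: int = 768):
        super().__init__()
        self.bert = AutoModel.from_pretrained("bert-base-uncased")
        # Segment-scale path: shallow Transformer over per-segment
        # [CLS] vectors, then a 1-D convolution with max pooling.
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8,
                                           batch_first=True)
        self.seg_transformer = nn.TransformerEncoder(layer, num_layers=1)
        self.seg_cnn = nn.Conv1d(hidden, hidden, kernel_size=3, padding=1)
        # One small regression head per trait.
        self.trait_heads = nn.ModuleList(
            [nn.Linear(2 * hidden, 1) for _ in range(num_traits)])
        # Holistic head sees the fused document feature plus all trait
        # scores, explicitly coupling holistic and trait predictions.
        self.holistic_head = nn.Linear(2 * hidden + num_traits, 1)

    def forward(self, doc_ids, doc_mask, seg_ids, seg_mask):
        # Document scale: [CLS] of the (truncated) whole essay.
        doc_feat = self.bert(doc_ids, attention_mask=doc_mask
                             ).last_hidden_state[:, 0]           # (B, H)
        # Segment scale: encode each chunk, keep its [CLS] vector.
        b, n_seg, seq = seg_ids.shape
        seg_feat = self.bert(seg_ids.view(b * n_seg, seq),
                             attention_mask=seg_mask.view(b * n_seg, seq)
                             ).last_hidden_state[:, 0].view(b, n_seg, -1)
        seg_feat = self.seg_transformer(seg_feat)                # (B, S, H)
        seg_feat = self.seg_cnn(seg_feat.transpose(1, 2)).amax(dim=2)
        fused = torch.cat([doc_feat, seg_feat], dim=-1)          # (B, 2H)
        trait_scores = torch.cat([head(fused) for head in self.trait_heads],
                                 dim=-1)                         # (B, T)
        holistic = self.holistic_head(
            torch.cat([fused, trait_scores], dim=-1)).squeeze(-1)
        return holistic, trait_scores
```

In an MTL training loop of this kind, a joint objective such as `loss = mse(holistic_pred, holistic_gold) + mse(trait_pred, trait_gold)` would tie the two tasks together; the paper's actual pooling, chunking, and loss weighting may differ from this sketch.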

Acknowledgement

This work was supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (2019-0-00004, Development of semi-supervised learning language intelligence technology and Korean tutoring service for foreigners).
