Scoring Korean Written Responses Using English-Based Automated Computer Scoring Models and Machine Translation: A Case of Natural Selection Concept Test

  • Received : 2016.03.21
  • Accepted : 2016.05.05
  • Published : 2016.06.30

Abstract

This study tests the efficacy of combining English-based automated computer scoring models with machine translation to score Korean college students' written responses to natural selection concept items. To this end, I collected 128 pre-service biology teachers' written responses to a four-item instrument (512 written responses in total). Machine translation software (Google Translate) translated both the original responses and spell-corrected versions of the responses. The presence or absence of five scientific ideas and three naïve ideas in the translated responses was judged by the automated computer scoring models (EvoGrader). The computer-scored results (4096 predictions) were compared with expert-scored results. No significant differences were found between the computer-scored and expert-scored results, either in the average scores or in the statistical results derived from those averages. The Pearson correlation coefficients between computer-scored and expert-scored composite scores for each student were 0.848 for scientific ideas and 0.776 for naïve ideas. The inter-rater reliability indices (Cohen's kappa) between computer scoring and expert scoring for linguistically simple concepts (e.g., variation, competition, and limited resources) exceeded 0.8. These findings suggest that English-based automated computer scoring models combined with machine translation can be a promising method for scoring Korean college students' written responses to natural selection concept items.
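
The agreement statistics reported above (per-concept Cohen's kappa and Pearson correlations of student composite scores) can be computed from paired computer/expert judgments. The sketch below is a minimal illustration in Python using scikit-learn and SciPy; the data and variable names are hypothetical and are not taken from the study, and the translation (Google Translate) and scoring (EvoGrader) steps, which rely on external services, are not shown.

    # Minimal sketch (not the study's analysis code): agreement between
    # computer-scored and expert-scored responses, using made-up data.
    from scipy.stats import pearsonr
    from sklearn.metrics import cohen_kappa_score

    # Hypothetical presence/absence judgments (1 = idea present) for one
    # concept, one entry per written response.
    computer = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
    expert   = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

    # Inter-rater reliability for this single concept (Cohen's kappa).
    kappa = cohen_kappa_score(computer, expert)

    # Hypothetical composite scores per student (number of scientific ideas
    # detected across the four items), computer vs. expert.
    computer_composite = [3, 5, 2, 4, 1, 0, 3, 2]
    expert_composite   = [3, 4, 2, 4, 1, 1, 3, 2]
    r, p_value = pearsonr(computer_composite, expert_composite)

    print(f"Cohen's kappa: {kappa:.3f}")
    print(f"Pearson r: {r:.3f} (p = {p_value:.3f})")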

References

  1. Anderson, D. L., Fisher, K. M., & Norman, G. J. (2002). Development and evaluation of the conceptual inventory of natural selection. Journal of Research in Science Teaching, 39(10), 952-978. https://doi.org/10.1002/tea.10053
  2. Basu, S., Jacobs, C., & Vanderwende, L. (2013). Powergrading: A clustering approach to amplify human effort for short answer grading. Transactions of the Association for Computational Linguistics, 1, 391-402.
  3. Beggrow, E. P., Ha, M., Nehm, R. H., Pearl, D., & Boone, W. J. (2014). Assessing scientific practices using machine-learning methods: How closely do they match clinical interview performance?. Journal of Science Education and Technology, 23(1), 160-182. https://doi.org/10.1007/s10956-013-9461-9
  4. Cohen, J. (1968). Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70, 213-220. https://doi.org/10.1037/h0026256
  5. Cohen, J. (1992). A power primer. Psychological Bulletin, 112(1), 155-159. https://doi.org/10.1037/0033-2909.112.1.155
  6. Crossgrove, K., & Curran, K. L. (2008). Using clickers in nonmajors-and majors-level biology courses: student opinion, learning, and long-term retention of course material. CBE-Life Sciences Education, 7(1), 146-154. https://doi.org/10.1187/cbe.07-08-0060
  7. Ha, M. (2013). Assessing scientific practices using machine learning methods: Development of automated computer scoring models for written evolutionary explanations. Unpublished Doctoral Dissertation. Columbus: The Ohio State University.
  8. Ha, M., & Nehm, R. H. (2016a). The impact of misspelled words on automated computer scoring: A case study of scientific explanations. Journal of Science Education and Technology, 25, 358-374. https://doi.org/10.1007/s10956-015-9598-9
  9. Ha, M., & Nehm, R. H. (2016b). Predicting the accuracy of computer scoring of text: Probabilistic, multi-model, and semantic similarity approaches. Paper presented at the conference of the National Association for Research in Science Teaching, Baltimore, MD, April 14-17.
  10. Haudek, K. C., Prevost, L. B., Moscarella, R. A., Merrill, J., & Urban-Lurain, M. (2012). What are they thinking? Automated analysis of student writing about acid-base chemistry in introductory biology. CBE-Life Sciences Education, 11(3), 283-293. https://doi.org/10.1187/cbe.11-08-0084
  11. Kaplan, J. J., Haudek, K. C., Ha, M., Rogness, N., & Fisher, D. G. (2014). Using lexical analysis software to assess student writing in statistics. Technology Innovations in Statistics Education, 8(1).
  12. Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159-174.
  13. Leacock, C., & Chodorow, M. (2003). C-rater: Automated scoring of short-answer questions. Computers and the Humanities, 37(4), 389-405. https://doi.org/10.1023/A:1025779619903
  14. Levesque, A. A. (2011). Using clickers to facilitate development of problem-solving skills. CBE-Life Sciences Education, 10(4), 406-417. https://doi.org/10.1187/cbe.11-03-0024
  15. Liu, O. L., Rios, J. A., Heilman, M., Gerard, L., & Linn, M. C. (2016). Validation of automated scoring of science assessments. Journal of Research in Science Teaching, 53(2), 215-233. https://doi.org/10.1002/tea.21299
  16. Magnusson, S. J., Templin, M., & Boyle, R. A. (1997). Dynamic science assessment: A new approach for investigating conceptual change. The Journal of the Learning Sciences, 6(1), 91-142. https://doi.org/10.1207/s15327809jls0601_5
  17. Makiko, M., Yuta, T., & Kazuhide, Y. (2011). Phrase-based statistical machine translation via Chinese characters with small parallel corpora. IJIIP: International Journal of Intelligent Information Processing, 2(3), 52-61. https://doi.org/10.4156/ijiip.vol2.issue3.6
  18. Mathan, S. A., & Koedinger, K. R. (2005). Fostering the intelligent novice: Learning from errors with metacognitive tutoring. Educational Psychologist, 40(4), 257-265. https://doi.org/10.1207/s15326985ep4004_7
  19. Moharreri, K., Ha, M., & Nehm, R. H. (2014). EvoGrader: an online formative assessment tool for automatically evaluating written evolutionary explanations. Evolution: Education and Outreach, 7(1), 1-14.
  20. Nehm, R. H., Ha, M., & Mayfield, E. (2012). Transforming biology assessment with machine learning: automated scoring of written evolutionary explanations. Journal of Science Education and Technology, 21(1), 183-196. https://doi.org/10.1007/s10956-011-9300-9
  21. Nehm, R. H., Ha, M., Rector, M., Opfer, J. E., Perrin, L., Ridgway, J. et al. (2010). Scoring guide for the open response instrument (ORI) and evolutionary gain and loss test (ACORNS). Technical Report of National Science Foundation REESE Project 0909999.
  22. Odom, A. L., & Barrow, L. H. (1995). Development and application of a two-tier diagnostic test measuring college biology students' understanding of diffusion and osmosis after a course of instruction. Journal of Research in Science Teaching, 32(1), 45-61. https://doi.org/10.1002/tea.3660320106
  23. Opfer, J. E., Nehm, R. H., & Ha, M. (2012). Cognitive foundations for science assessment design: Knowing what students know about evolution. Journal of Research in Science Teaching, 49(6), 744-777. https://doi.org/10.1002/tea.21028
  24. Rutledge, M. L., & Warden, M. A. (1999). The development and validation of the measure of acceptance of the theory of evolution instrument. School Science and Mathematics, 99(1), 13-18. https://doi.org/10.1111/j.1949-8594.1999.tb17441.x
  25. Sato, T., Yamanishi, Y., Kanehisa, M., & Toh, H. (2005). The inference of protein-protein interactions by co-evolutionary analysis is improved by excluding the information about the phylogenetic relationships. Bioinformatics, 21(17), 3482-3489. https://doi.org/10.1093/bioinformatics/bti564
  26. Shute, V. J. (2008). Focus on formative feedback. Review of Educational Research, 78(1), 153-189. https://doi.org/10.3102/0034654307313795
  27. Weston, M., Haudek, K. C., Prevost, L., Urban-Lurain, M., & Merrill, J. (2015). Examining the impact of question surface features on students' answers to constructed-response questions on photosynthesis. CBE-Life Sciences Education, 14(2), ar19. https://doi.org/10.1187/cbe.14-07-0110
  28. Zhu, Z., Pilpel, Y., & Church, G. M. (2002). Computational identification of transcription factor binding sites via a transcription-factor-centric clustering (TFCC) algorithm. Journal of Molecular Biology, 318(1), 71-81. https://doi.org/10.1016/S0022-2836(02)00026-8