Title/Summary/Keyword: Raters

Reliability Analysis and Improvement Plan for Evaluation of Program Outcomes among Demand-driven Raters (프로그램 학습성과 평가에 대한 수요지향 평가자 간 신뢰도 분석 및 개선 방안)

  • Lee, Youngho; Shin, Younghak; Kim, Jonghwa
    • The Journal of the Korea Contents Association / v.21 no.3 / pp.410-418 / 2021
  • In a program operating under engineering education accreditation, program outcomes (POs) refer to the knowledge, skills, and attitudes a student must acquire by graduation. Capstone design is commonly used as a tool for evaluating program outcomes. This paper applies the intraclass correlation coefficient (ICC) to measure rater reliability in the assessment of program outcomes: several raters evaluate the outcomes, and their scores are used to compute the ICC. The ICC measures the reliability of ratings or measurements for clustered data, i.e., data collected in or sorted into groups; an ICC close to 1 indicates high reliability among the raters. We evaluated the proposed method's usefulness through a case analysis in which multiple raters applied the same evaluation tool, as a way of assessing the tool's objectivity. We computed ICC values for all POs and analyzed the causes of those with low values. Applying this method to two years of program-outcome evaluations in a Department of Computer Engineering, we derived guidelines for improving the program outcomes.
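
As a quick illustration of the ICC computation the paper relies on, here is a minimal sketch using the pingouin library on long-format rating data; the artifacts, raters, and scores are all hypothetical.

```python
# Minimal ICC sketch with hypothetical data (not the paper's dataset).
import pandas as pd
import pingouin as pg

# Four raters each score five capstone-design artifacts on one program outcome.
scores = pd.DataFrame({
    "artifact": [1, 2, 3, 4, 5] * 4,
    "rater": ["R1"] * 5 + ["R2"] * 5 + ["R3"] * 5 + ["R4"] * 5,
    "score": [4, 3, 5, 2, 4,
              4, 3, 4, 2, 5,
              5, 3, 5, 2, 4,
              4, 2, 5, 3, 4],
})

# pingouin reports all six ICC variants; values near 1 mean high reliability.
icc = pg.intraclass_corr(data=scores, targets="artifact",
                         raters="rater", ratings="score")
print(icc[["Type", "ICC", "CI95%"]])
```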

Analysis of error source in subjective evaluation results on Taekwondo Poomsae: Application of generalizability theory (태권도 품새 경기의 주관적 평가결과의 오차원 분석: 일반화가능도 이론 적용)

  • Cho, Eun Hyung
    • Journal of the Korean Data and Information Science Society / v.27 no.2 / pp.395-407 / 2016
  • This study applies generalizability theory (G-theory) to estimate the reliability of raters' scores on the Taekwondo Poomsae rating categories. Taking competition days and raters as multiple error sources, we analyzed the relative magnitudes of the error variances, including the interactions among factors, and conducted a D-study based on the G-study results to determine the optimal measurement conditions. The results were as follows. For the accuracy category, the estimated variance components indicated that raters were the largest source of error, followed by the rater-by-subject interaction and the between-subject variance; for the expression category, the interaction was the largest source, followed by the between-subject and rater variances. Finally, the generalizability coefficients estimated in the D-study showed that the optimal measurement condition for the accuracy category was eight raters, while stable reliability for the expression category was obtained with seven raters.
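
The one-facet (subjects x raters) G-study and D-study described above can be sketched directly from the ANOVA mean squares; the score matrix below is hypothetical, and the estimators are the standard ones for a fully crossed design.

```python
# G-study and D-study sketch for a fully crossed subjects x raters design.
import numpy as np

# Rows = subjects (competitors), columns = raters; hypothetical Poomsae scores.
X = np.array([
    [8.2, 8.0, 8.4, 8.1],
    [7.5, 7.8, 7.6, 7.4],
    [9.0, 8.7, 9.1, 8.9],
    [6.8, 7.0, 6.9, 7.2],
    [8.5, 8.6, 8.3, 8.4],
])
n_p, n_r = X.shape
grand = X.mean()

# ANOVA mean squares for persons, raters, and the residual (p x r interaction).
ms_p = n_r * np.sum((X.mean(axis=1) - grand) ** 2) / (n_p - 1)
ms_r = n_p * np.sum((X.mean(axis=0) - grand) ** 2) / (n_r - 1)
resid = X - X.mean(axis=1, keepdims=True) - X.mean(axis=0, keepdims=True) + grand
ms_pr = np.sum(resid ** 2) / ((n_p - 1) * (n_r - 1))

# G-study: variance component estimates (truncated at zero).
var_p = max((ms_p - ms_pr) / n_r, 0.0)   # subjects (universe score variance)
var_r = max((ms_r - ms_pr) / n_p, 0.0)   # raters (not in the relative coefficient)
var_pr = ms_pr                           # interaction + residual error

# D-study: generalizability coefficient as the number of raters varies.
for n in range(2, 11):
    g = var_p / (var_p + var_pr / n)
    print(f"{n} raters: E(rho^2) = {g:.3f}")
```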

A Study on Comparison of Generalized Kappa Statistics in Agreement Analysis

  • Kim, Min-Seon; Song, Ki-Jun; Nam, Chung-Mo; Jung, In-Kyung
    • The Korean Journal of Applied Statistics / v.25 no.5 / pp.719-731 / 2012
  • Agreement analysis is conducted to assess reliability among rating results obtained repeatedly on the same subjects by one or more raters. The kappa statistic is commonly used when rating scales are categorical: the simple and weighted kappa statistics measure the degree of agreement between two raters, and generalized kappa statistics measure agreement among more than two raters. In this paper, we compare the performance of four generalized kappa statistics proposed by Fleiss (1971), Conger (1980), Randolph (2005), and Gwet (2008a). We also examine how sensitive each of the four statistics is to the marginal probability distribution, depending on whether marginal balancedness and/or homogeneity hold. The performance of the four methods is compared in terms of relative bias and coverage rate through simulation studies over scenarios with different numbers of raters, subjects, and categories. A real data example is also presented to illustrate the four methods.
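
For concreteness, Fleiss' (1971) kappa for more than two raters can be computed with statsmodels as sketched below; the ratings matrix is hypothetical, and Randolph's (2005) variant is available from the same function via method="randolph".

```python
# Fleiss' generalized kappa on a hypothetical subjects x raters matrix.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Rows = subjects, columns = raters; each cell is a category label (0-2).
ratings = np.array([
    [0, 0, 1],
    [1, 1, 1],
    [2, 2, 1],
    [0, 1, 0],
    [2, 2, 2],
    [1, 0, 1],
])

# Convert to a subjects x categories count table, then compute kappa.
table, _ = aggregate_raters(ratings)
print(f"Fleiss' kappa    = {fleiss_kappa(table, method='fleiss'):.3f}")
print(f"Randolph's kappa = {fleiss_kappa(table, method='randolph'):.3f}")
```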

A Novel Fundus Image Reading Tool for Efficient Generation of a Multi-dimensional Categorical Image Database for Machine Learning Algorithm Training

  • Park, Sang Jun; Shin, Joo Young; Kim, Sangkeun; Son, Jaemin; Jung, Kyu-Hwan; Park, Kyu Hyung
    • Journal of Korean Medical Science / v.33 no.43 / pp.239.1-239.12 / 2018
  • Background: We describe a novel multi-step retinal fundus image reading system for generating high-quality, large-scale data for machine learning algorithms, and assess grader variability in the dataset generated with this system. Methods: A 5-step retinal fundus image reading tool was developed that rates image quality, presence of abnormality, findings with location information, diagnoses, and clinical significance. Each image was evaluated by 3 different graders, and agreement among graders was evaluated for each decision. Results: A total of 234,242 readings of 79,458 images were collected from 55 licensed ophthalmologists over 6 months. Of the 34,364 images graded as abnormal by at least one rater, all three raters agreed on the abnormality in 46.6%, while 69.9% were rated as abnormal by two or more raters. The rate of agreement by at least two raters on a given finding ranged from 26.7% to 65.2%, and the complete agreement rate among all three raters ranged from 5.7% to 43.3%. For diagnoses, agreement by at least two raters ranged from 35.6% to 65.6%, and the complete agreement rate from 11.0% to 40.0%. Agreement on findings and diagnoses was higher when restricted to images with prior complete agreement on abnormality. Retinal and glaucoma specialists showed higher agreement on findings and diagnoses within their corresponding subspecialties. Conclusion: This novel reading tool for retinal fundus images generated a large-scale, richly annotated dataset that can be used in future development of machine learning algorithms for automated identification of abnormal conditions and clinical decision support systems. These results underscore the importance of addressing grader variability in algorithm development.
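
The abnormality agreement rates above are simple functions of the three graders' votes per image; the sketch below shows the computation on randomly generated stand-in data, not the study's dataset.

```python
# Agreement rates among three graders on binary abnormality labels.
import numpy as np

rng = np.random.default_rng(0)
ratings = rng.integers(0, 2, size=(10_000, 3))  # 1 = abnormal, 0 = normal

votes = ratings.sum(axis=1)
flagged = votes >= 1            # abnormal by at least one grader
complete = votes == 3           # all three graders agree on abnormal
majority = votes >= 2           # abnormal by two or more graders

n_flagged = flagged.sum()
print(f"complete agreement among flagged images: {complete.sum() / n_flagged:.1%}")
print(f"majority-abnormal among flagged images:  {majority.sum() / n_flagged:.1%}")
```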

Digital enhancement of pronunciation assessment: Automated speech recognition and human raters

  • Miran Kim
    • Phonetics and Speech Sciences / v.15 no.2 / pp.13-20 / 2023
  • This study explores the potential of automated speech recognition (ASR) for assessing English learners' pronunciation. We employed ASR technology, valued for its impartiality and consistent results, to analyze speech audio files, including synthesized speech (both native-like and Korean-accented English) and recordings from a native English speaker, and thereby established baseline values for the word error rate (WER). These were then compared with results from human raters in perception experiments that assessed the speech productions of 30 first-year college students before and after a pronunciation course. Our sub-group analyses revealed positive training effects for both Whisper, an ASR tool, and the human raters, and identified distinct human rater strategies across assessment aspects such as proficiency, intelligibility, accuracy, and comprehensibility that were not observed in ASR. Despite challenges such as recognizing accented speech traits, our findings suggest that digital tools such as ASR can streamline the pronunciation assessment process. With ongoing advancements in ASR technology, its potential not only as an assessment aid but also as a self-directed learning tool for pronunciation feedback merits further exploration.
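
The word error rate used as the ASR baseline is the word-level Levenshtein distance divided by the reference length; a self-contained sketch follows, with invented example sentences.

```python
# Word error rate (WER) via edit distance over word sequences.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("she sells sea shells", "she sell sea shell"))  # 0.5
```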

Development and Application of an Online Scoring System for Constructed Response Items (서답형 문항 온라인 채점 시스템의 개발과 적용)

  • Cho, Jimin; Kim, Kyunghoon
    • The Journal of Korean Association of Computer Education / v.17 no.2 / pp.39-51 / 2014
  • In high-stakes tests for large groups, how efficiently students' responses are distributed to raters and how systematically the scoring procedures are managed are important to the overall success of the testing program. In scoring constructed response items, it is important to establish measures of rater reliability: whether individual raters make consistent judgments on the responses, and whether these judgments are similar across raters. The purpose of this study was to design, develop, and pilot-test an online scoring system for constructed response items administered as a paper-and-pencil test to large groups, and to verify the system's reliability. We show that, compared with conventional scoring methods, the online system provided richer information on the scoring process of individual raters, including intra-rater and inter-rater consistency. We found the system especially effective for obtaining reliable and valid scores for constructed response items.
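
Intra-rater and inter-rater consistency of the kind such a system reports can be summarized with Cohen's kappa, as in the sketch below; all ratings are hypothetical.

```python
# Rater consistency checks with Cohen's kappa (scikit-learn).
from sklearn.metrics import cohen_kappa_score

# Intra-rater: one rater scores the same ten responses on two occasions.
first_pass  = [2, 3, 1, 4, 2, 3, 3, 1, 4, 2]
second_pass = [2, 3, 1, 4, 2, 2, 3, 1, 4, 2]
print(f"intra-rater kappa: {cohen_kappa_score(first_pass, second_pass):.3f}")

# Inter-rater: two raters score the same responses independently.
rater_a = [2, 3, 1, 4, 2, 3, 3, 1, 4, 2]
rater_b = [2, 3, 2, 4, 1, 3, 3, 1, 3, 2]
print(f"inter-rater kappa: {cohen_kappa_score(rater_a, rater_b):.3f}")
```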

Reliability of the Modified Modified Ashworth Scale for the Muscle Tone of Poststroke Patients (뇌졸중 환자의 근긴장도 평가를 위한 개정된 개정된 Ashworth 척도의 신뢰도)

  • Kim, Tae-Ho; Kim, Yong-Wook
    • Journal of the Korean Society of Physical Medicine / v.5 no.3 / pp.477-485 / 2010
  • Purpose: The modified Ashworth scale (MAS) is widely used clinically to assess muscle spasticity, but its reliability has been disputed because of ambiguity among the grades. The purpose of this study was to establish the inter-rater reliability of a Korean translation of the modified MAS (MMAS) in stroke patients. Methods: Twenty-five patients with hemiplegia (sixteen men and nine women; ten right-sided and fifteen left-sided) were assessed by two raters, both physical therapists at a rehabilitation hospital. The raters assessed spasticity of the shoulder adductors, elbow flexors, wrist flexors, hip adductors, knee extensors, and ankle plantar flexors in the same patients according to the rating criteria of the MAS and the MMAS. Results: For the MAS, the two raters agreed on 57.3% of ratings, and the kappa value was moderate ($\kappa$=0.41). The inter-rater reliability of the MAS was fair for the wrist flexors and hip adductors and moderate for the other muscles; intra-rater reliability was good for the shoulder adductors and knee extensors and moderate for the other muscles. For the MMAS, the two raters agreed on 84.7% of ratings, and the kappa value was good ($\kappa$=0.78). The inter-rater reliability of the MMAS was moderate for the hip adductors, good for the shoulder adductors and wrist flexors, and very good for the other muscles; intra-rater reliability was good for the wrist flexors and hip adductors and very good for the other muscles. Conclusion: These results suggest that the Korean translation of the MMAS is a reliable scale for assessing spasticity in stroke patients in the clinical field.
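
The percent agreement and kappa values reported above can be computed as sketched below; the grades are hypothetical, with the ordinal MAS/MMAS scale (0, 1, 1+, 2, 3, 4) encoded as the integers 0-5. A linearly weighted kappa, which credits near-misses on an ordinal scale, is shown alongside the unweighted statistic.

```python
# Percent agreement and (weighted) kappa for two raters on an ordinal scale.
import numpy as np
from sklearn.metrics import cohen_kappa_score

rater1 = np.array([0, 2, 3, 1, 4, 2, 3, 5, 1, 2])
rater2 = np.array([0, 2, 3, 2, 4, 2, 3, 5, 1, 1])

print(f"percent agreement: {np.mean(rater1 == rater2):.1%}")
print(f"unweighted kappa:  {cohen_kappa_score(rater1, rater2):.3f}")
print(f"weighted kappa:    {cohen_kappa_score(rater1, rater2, weights='linear'):.3f}")
```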

Computerized Sunnybrook facial grading scale (SBface) application for facial paralysis evaluation

  • Jirawatnotai, Supasid; Jomkoh, Pojanan; Voravitvet, Tsz Yin; Tirakotai, Wuttipong; Somboonsap, Natthawut
    • Archives of Plastic Surgery / v.48 no.3 / pp.269-277 / 2021
  • Background: The Sunnybrook facial grading scale is a comprehensive scale for evaluating facial paralysis patients, but its results depend heavily on subjective input. This study aimed to develop and validate an automated Sunnybrook facial grading scale (SBface) to assess disfigurement due to facial paralysis more objectively. Methods: An application compatible with iOS version 11.0 and later was developed. The software automatically detects facial features in standardized photographs and generates scores following the Sunnybrook facial grading scale. Photographic data from 30 unilateral facial paralysis patients were randomly sampled for validation. Intrarater reliability was tested by conducting two identical tests at a 2-week interval, and interrater reliability was tested between the software and three facial nerve clinicians. Results: A beta version of the SBface application was tested. Intrarater reliability showed excellent congruence between the two tests. Moderate to strong positive correlations were found between the software and an otolaryngologist for the total scores of the three individual domains and the composite scores. However, 74.4% (29/39) of the subdomain items showed low to zero correlation with the human raters (κ<0.2). The correlations among the human raters showed good congruence for most of the total and composite scores, with 10.3% (4/39) of the subdomain items failing to correspond (κ<0.2). Conclusions: The SBface application is efficient and accurate for evaluating the degree of facial paralysis based on the Sunnybrook facial grading scale; however, the correlation of software-derived results with those of human raters is limited by the software algorithm and by inconsistency among the raters.
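
For orientation, the composite score that SBface automates combines the scale's three domains; the sketch below uses the commonly cited weighting (voluntary movement x4, resting symmetry x5, synkinesis unweighted), which is an assumption here rather than a detail taken from the abstract, and the item scores are hypothetical.

```python
# Sunnybrook composite score sketch (weighting assumed, scores hypothetical).
def sunnybrook_composite(resting, voluntary, synkinesis):
    """Composite = voluntary movement - resting symmetry - synkinesis."""
    resting_score = 5 * sum(resting)       # 3 resting-symmetry items
    voluntary_score = 4 * sum(voluntary)   # 5 standard movements, each rated 1-5
    synkinesis_score = sum(synkinesis)     # 5 synkinesis items, each rated 0-3
    return voluntary_score - resting_score - synkinesis_score

# Hypothetical patient: mild resting asymmetry, moderate movement deficit.
print(sunnybrook_composite(resting=[1, 0, 1],
                           voluntary=[4, 3, 4, 3, 4],
                           synkinesis=[1, 0, 1, 1, 0]))  # -> 59
```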

Measuring plagiarism in the second language essay writing context (영작문 상황에서의 표절 측정의 신뢰성 연구)

  • Lee, Ho
    • English Language & Literature Teaching / v.12 no.1 / pp.221-238 / 2006
  • This study investigates the reliability of plagiarism measurement in the ESL essay writing context, addressing two research questions: 1) How does plagiarism measurement affect test reliability from a psychometric view? and 2) How do raters conceive of plagiarism in their analytic scoring? The study uses a mixed methodology combining quantitative and qualitative techniques. Thirty-eight international students took an ESL placement writing test offered by the University of Illinois. Two native expert raters rated the students' essays on 5 analytic features (organization, content, language use, source use, plagiarism) and assigned a holistic score using a scoring benchmark. For research question 1, using G-theory and a many-facet Rasch model, the study found that plagiarism measurement threatened test reliability. For research question 2, two native raters and one non-native rater responded in email correspondence that plagiarism was not a valid analytic area to measure in a large-scale writing test, viewing it as a difficult area to measure. In conclusion, this study proposes that students receive systematic training in avoiding plagiarism, and suggests that plagiarism can be measured more reliably in small-scale classroom tests.

Analysis of Evaluator Reliability for the Raters' Calibration Training (채점자 조정(calibration) 교육 제안을 위한 평가자 신뢰도 분석)

  • Kim, Jooah; Shin, Yooseok; Seo, Jeong Taeg
    • The Journal of the Korean dental association / v.58 no.5 / pp.284-291 / 2020
  • This study analyzed changes in rater reliability in the practical-skills evaluation of students at Yonsei University College of Dentistry, and on that basis argues for the importance of rater calibration training in the practical evaluations of dental colleges. Nine professors from the Department of Conservative Dentistry, Yonsei University College of Dentistry, graded Class II restoration cases twice in 2017 and once in 2018. The intraclass correlation coefficient (ICC), a statistic used to determine the consistency of three or more raters, was calculated. ICC values increased as raters participated in calibration meetings and accumulated grading experience, showing that rater reliability is related to grading experience and to feedback from calibration meetings. Based on previous findings that grading experience and rater calibration training can produce meaningful changes in rater behavior, we propose conducting rater calibration training to ensure rater reliability.
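
Tracking the ICC across calibration rounds, as the study does, can be sketched with pingouin; both rounds of scores below are hypothetical and merely illustrate an ICC increase after a calibration meeting.

```python
# ICC before and after a hypothetical calibration meeting.
import pandas as pd
import pingouin as pg

def icc2(df):
    res = pg.intraclass_corr(data=df, targets="case",
                             raters="rater", ratings="score")
    return res.loc[res["Type"] == "ICC2", "ICC"].item()

before = pd.DataFrame({
    "case": [1, 2, 3, 4] * 3,
    "rater": ["A"] * 4 + ["B"] * 4 + ["C"] * 4,
    "score": [70, 85, 60, 90, 80, 75, 75, 70, 60, 95, 85, 65],
})
# After calibration the raters converge on similar scores per case.
after = before.assign(score=[72, 84, 62, 90, 75, 82, 64, 88, 70, 86, 63, 91])

print(f"ICC before calibration: {icc2(before):.3f}")
print(f"ICC after calibration:  {icc2(after):.3f}")
```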
