A Study on Comparison of Generalized Kappa Statistics in Agreement Analysis

  • Kim, Min-Seon (Department of Biostatistics, Yonsei University College of Medicine) ;
  • Song, Ki-Jun (Department of Biostatistics, Yonsei University College of Medicine) ;
  • Nam, Chung-Mo (Department of Biostatistics, Yonsei University College of Medicine) ;
  • Jung, In-Kyung (Department of Biostatistics, Yonsei University College of Medicine)
  • Received : 2012.06.04
  • Accepted : 2012.09.19
  • Published : 2012.10.31

Abstract

Agreement analysis is conducted to assess the reliability of ratings made repeatedly on the same subjects by one or more raters. The kappa statistic is commonly used when the rating scale is categorical. The simple and weighted kappa statistics measure the degree of agreement between two raters, while the generalized kappa statistics measure the degree of agreement among more than two raters. In this paper, we compare the performance of four generalized kappa statistics proposed by Fleiss (1971), Conger (1980), Randolph (2005), and Gwet (2008a). We also examine how sensitive each of the four generalized kappa statistics is to the marginal probability distribution, depending on whether marginal balancedness and/or homogeneity holds. The performance of the four methods is compared in terms of relative bias and coverage rate through simulation studies under various scenarios with different numbers of raters, subjects, and categories. A real data example is also presented to illustrate the four methods.
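The following sketch (not the authors' code; the function name, data layout, and toy data are assumptions for illustration) shows how the four coefficients compared in the paper differ only in their chance-agreement term. It computes Fleiss' kappa, Conger's kappa, Randolph's free-marginal kappa, and Gwet's AC1 from a subjects-by-raters matrix of categorical ratings, assuming every subject is rated by every rater.

    import numpy as np

    def multirater_kappas(ratings, categories=None):
        """Fleiss, Conger, Randolph, and Gwet AC1 agreement coefficients.

        ratings: (n_subjects, n_raters) array of categorical labels; every
        subject is assumed to be rated by every rater (no missing ratings).
        categories: full list of possible categories (defaults to those observed).
        """
        ratings = np.asarray(ratings)
        n, r = ratings.shape
        cats = np.unique(ratings) if categories is None else np.asarray(categories)
        q = len(cats)

        # n_ij: number of raters classifying subject i into category j
        counts = np.stack([(ratings == c).sum(axis=1) for c in cats], axis=1)

        # Observed agreement: average pairwise agreement per subject
        p_obs = ((counts * (counts - 1)).sum(axis=1) / (r * (r - 1))).mean()

        # Category proportions pooled over raters, and rater-specific marginals
        pi = counts.sum(axis=0) / (n * r)
        p_gj = np.stack([(ratings == c).mean(axis=0) for c in cats], axis=1)

        # The four coefficients share p_obs and differ in the chance term p_e
        pe = {
            "Fleiss (1971)":    (pi ** 2).sum(),
            "Conger (1980)":    (p_gj.mean(axis=0) ** 2).sum()
                                - (p_gj.var(axis=0, ddof=1) / r).sum(),
            "Randolph (2005)":  1.0 / q,
            "Gwet AC1 (2008a)": (pi * (1 - pi)).sum() / (q - 1),
        }
        return {name: (p_obs - e) / (1 - e) for name, e in pe.items()}

    # Toy example: 10 subjects, 3 raters, 3 categories
    rng = np.random.default_rng(0)
    print(multirater_kappas(rng.integers(1, 4, size=(10, 3))))

Only the chance-correction term changes across the four methods: Fleiss and Conger use observed marginals (pooled versus rater-specific), Randolph assumes uniform "free" marginals, and Gwet's AC1 uses a chance term that shrinks as the marginals become more skewed, which is why the coefficients react differently to marginal balancedness and homogeneity.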

Keywords

References

  1. Berry, K. J. and Mielke, P. W. (1988). A generalization of Cohen's kappa, Educational and Psychological Measurement, 48, 921-933. https://doi.org/10.1177/0013164488484007
  2. Brennan, R. L. and Prediger, D. J. (1981). Coefficient kappa: Some uses, misuses, and alternatives, Educational and Psychological Measurement, 41, 687-699. https://doi.org/10.1177/001316448104100307
  3. Cohen, J. (1960). A coefficient of agreement for nominal scales, Educational and Psychological Measurement, 20, 37-46. https://doi.org/10.1177/001316446002000104
  4. Cohen, J. (1968). Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit, Psychological Bulletin, 70, 213-220. https://doi.org/10.1037/h0026256
  5. Conger, A. J. (1980). Integration and generalization of kappas for multiple raters, Psychological Bulletin, 88, 322-328. https://doi.org/10.1037/0033-2909.88.2.322
  6. Feinstein, A. R. and Cicchetti, D. V. (1990). High agreement but low kappa: 1. The problems of two paradoxes, Journal of Clinical Epidemiology, 43, 543-549. https://doi.org/10.1016/0895-4356(90)90158-L
  7. Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters, Psychological Bulletin, 76, 378-382. https://doi.org/10.1037/h0031619
  8. Gwet, K. L. (2008a). Computing inter-rater reliability and its variance in the presence of high agreement, British Journal of Mathematical and Statistical Psychology, 61, 29-48. https://doi.org/10.1348/000711006X126600
  9. Gwet, K. L. (2008b). Variance estimation of nominal-scale interrater reliability with random selection of raters, Psychometrika, 73, 407-430. https://doi.org/10.1007/s11336-007-9054-8
  10. Gwet, K. L. (2010). Handbook of Inter-Rater Reliability, 2nd edn. Advanced Analytics, LLC.
  11. Janson, H. and Olsson, U. (2001). A measure of agreement for interval or nominal multivariate observations, Educational and Psychological Measurement, 61, 277-289. https://doi.org/10.1177/00131640121971239
  12. Janson, H. and Olsson, U. (2004). A measure of agreement for interval or nominal multivariate observations by different sets of judges, Educational and Psychological Measurement, 64, 62-70. https://doi.org/10.1177/0013164403260195
  13. Park, M. H. and Park, Y. G. (2007). A new measure of agreement to resolve the two paradoxes of Cohen's kappa, The Korean Journal of Applied Statistics, 20, 117-132. https://doi.org/10.5351/KJAS.2007.20.1.117
  14. Quenouille, M. H. (1949). Approximate tests of correlation in time-series, Journal of the Royal Statistical Society, Series B (Methodological), 11, 68-84.
  15. Randolph, J. J. (2005). Free-marginal multirater kappa: An alternative to Fleiss' fixed-marginal multirater kappa, Paper presented at the Joensuu University Learning and Instruction Symposium.
  16. Scott, W. (1955). Reliability of content analysis: The case of nominal scale coding, Public Opinion Quarterly, 19, 321-325. https://doi.org/10.1086/266577

Cited by

  1. Development of a scale to measure diabetes self-management behaviors among older Koreans with type 2 diabetes, based on the seven domains identified by the American Association of Diabetes Educators, vol. 14, no. 2, 2017, https://doi.org/10.1111/jjns.12145
  2. Measurement of Inter-Rater Reliability in Systematic Review, vol. 35, no. 1, 2015, https://doi.org/10.7599/hmr.2015.35.1.44