Fleiss' Generalized Kappa is NOT Kappa. It is a Generalized Version of Scott's Pi
Posted: Tuesday, May 3, 2011
The term "Kappa" has been used in the inter-rater reliability literature to refer to almost any chance-corrected agreement coefficient. Fleiss' generalized Kappa, for example, is NOT Kappa; it is a generalized version of Scott's Pi coefficient. This is unfortunate, since it has created some confusion among researchers as to what is Kappa and what is not. Here is my take on this issue.
Cohen (1960) proposed an agreement coefficient that became popular over time for quantifying the extent of agreement between 2 raters. It is essential to note that Cohen's coefficient only applies to 2 raters (A and B) and cannot, in its initial form, be used with 3 raters or more. If you have 2 raters A and B, you can always organize their ratings in a contingency table such as Table 1 below.
Based on Table 1 data, Cohen (1960) suggested computing the extent of agreement between the raters as:

\hat{\kappa} = \frac{p_a - p_e}{1 - p_e}, \qquad \text{where } p_e = \sum_{k=1}^{q} p_{e|k} = \sum_{k=1}^{q} p_{Ak}\,p_{Bk}.

Note that p_{e|k} = p_{Ak}\,p_{Bk} is the chance-agreement probability associated with category k, with p_{Ak} = n_{k+}/n and p_{Bk} = n_{+k}/n, which represent raters A's and B's marginal rating probabilities (the row and column totals of Table 1 divided by the number of subjects n). Moreover p_a, which is the "raw" overall agreement probability (not corrected for chance agreement), is given by:

p_a = \sum_{k=1}^{q} p_{kk},

the sum of the diagonal cell proportions of Table 1.
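As an illustration, here is a short Python sketch of these formulas. The 2-category contingency table is hypothetical; the counts are made up purely for the example:

```python
def cohens_kappa(table):
    """Cohen's (1960) kappa from a q x q contingency table of joint
    classification counts: rows are rater A's categories, columns rater B's."""
    n = sum(sum(row) for row in table)                          # number of subjects
    q = len(table)
    pa = sum(table[k][k] for k in range(q)) / n                 # raw agreement p_a
    p_A = [sum(table[k]) / n for k in range(q)]                 # A's marginals p_Ak
    p_B = [sum(row[k] for row in table) / n for k in range(q)]  # B's marginals p_Bk
    pe = sum(p_A[k] * p_B[k] for k in range(q))                 # chance agreement p_e
    return (pa - pe) / (1 - pe)

# Hypothetical example: 100 subjects, 2 categories, 60 diagonal agreements
table = [[45, 15],
         [25, 15]]
print(cohens_kappa(table))  # ≈ 0.1304
```

Here p_a = 0.60 while the marginals give p_e = 0.6 × 0.7 + 0.4 × 0.3 = 0.54, so most of the observed agreement is attributable to chance.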
This is the only coefficient that should normally be referred to as Kappa. Any other coefficient that represents even a slightly modified version of Cohen's kappa as described above should bear a name other than kappa. The name "kappa" itself is arbitrary and does not carry any particular meaning; therefore, finding an alternative name for a different coefficient should not be a problem.
Extending Kappa to 3 Raters or More
Fleiss (1971) appears to be among the first researchers to propose a kappa-like agreement coefficient for quantifying the extent of agreement among 3 raters or more. His initial goal was to extend kappa to 3 raters and more. But there is a problem: Fleiss' generalized statistic does not reduce to kappa when the number of raters is 2. Instead, it reduces to another agreement coefficient, the Pi coefficient proposed by Scott (1955). Fleiss nevertheless decided to refer to his coefficient as a generalized kappa, and this is where things got messed up. Fleiss' coefficient is in fact a generalized Pi coefficient, not a generalized Kappa. Numerous software packages, Stata among others, were developed with this misleading terminology.
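To see the reduction concretely, here is a small Python sketch (the ratings are made up for the example) that computes Fleiss' generalized coefficient and Scott's Pi independently. With 2 raters the two values coincide, which is the point made above:

```python
from collections import Counter

def fleiss_generalized(counts, r):
    """Fleiss' (1971) coefficient from an n-subjects x q-categories table
    of rating counts, each subject being rated by the same r raters."""
    n, q = len(counts), len(counts[0])
    # raw agreement: proportion of agreeing rater pairs, averaged over subjects
    pa = sum(c * (c - 1) for row in counts for c in row) / (n * r * (r - 1))
    # chance agreement: squared pooled category proportions (Scott-style)
    pis = [sum(row[k] for row in counts) / (n * r) for k in range(q)]
    pe = sum(p * p for p in pis)
    return (pa - pe) / (1 - pe)

def scotts_pi(a, b):
    """Scott's (1955) Pi for two raters given parallel lists of ratings."""
    n = len(a)
    pa = sum(x == y for x, y in zip(a, b)) / n        # raw agreement
    pooled = Counter(a) + Counter(b)                  # marginals pooled over raters
    pe = sum((c / (2 * n)) ** 2 for c in pooled.values())
    return (pa - pe) / (1 - pe)

# Two raters classify 10 subjects into categories 0, 1, 2 (made-up data)
a = [0, 0, 1, 1, 2, 2, 0, 1, 2, 0]
b = [0, 1, 1, 1, 2, 0, 0, 1, 2, 2]
counts = [[(x == k) + (y == k) for k in (0, 1, 2)] for x, y in zip(a, b)]
print(scotts_pi(a, b), fleiss_generalized(counts, r=2))  # identical values
```

Running the same data through Cohen's kappa would generally give a different value, because kappa uses each rater's own marginals for chance agreement while Fleiss/Scott pool the marginals across raters.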
Conger (1980) is the one who formally raised this problem regarding the misuse of the term kappa for multiple raters, and suggested a more genuine generalized kappa coefficient. In my book Gwet (2010), "Handbook of Inter-Rater Reliability (2nd Edition)", in chapter 4 (e.g. Table 4.7) and in the many other places where the multiple-rater kappa is mentioned, I used Conger's coefficient as the generalized kappa, and Fleiss' coefficient as the generalized Pi of Scott (1955). Bear this in mind when comparing numerical results from my book to those available in the literature.
References
Cohen, J. (1960). "A coefficient of agreement for nominal scales." Educational and Psychological Measurement, 20, 37-46.
Conger, A. J. (1980). "Integration and Generalization of Kappas for Multiple Raters," Psychological Bulletin, 88, 322-328.
Fleiss, J. L. (1971). "Measuring nominal scale agreement among many raters", Psychological Bulletin, 76, 378-382.
Gwet, K.L. (2010). Handbook of Inter-Rater Reliability (2nd Edition), Advanced Analytics, LLC
Scott, W. A. (1955). "Reliability of content analysis: the case of nominal scale coding." Public Opinion Quarterly, 19, 321-325.