Inter-Rater Reliability Discussion Corner
by Kilem L. Gwet, Ph.D.


Sample Size Determination
Posted: Monday, June 28, 2010

I have received several e-mails from researchers asking me how the sample size should be determined to ensure the validity of their inter-rater reliability results. In many instances, researchers worry about the validity of their kappa coefficient. The sample size in this context is the number of subjects who will participate in the inter-rater reliability experiment. I did not discuss this problem in the second edition of my book "Handbook of Inter-Rater Reliability," but will certainly include it in subsequent editions. The following issues must be considered before deciding on the best course of action:

1) The notion of validity must be clarified. While the "true" inter-rater reliability coefficient is based on the entire population of subjects, its estimated value (the one generally used in practice) is obtained from a sample. Any estimated inter-rater reliability coefficient that differs from its "true" value by no more than 20% of that "true" value should be considered valid. The 20% threshold is arbitrary and can be changed by the researcher; however, decreasing it will increase the required sample size.
2) The second issue to consider is that the number of subjects required depends on the specific inter-rater reliability coefficient one decides to use. The number of subjects required for Cohen's kappa is different from that required for Brennan-Prediger's coefficient or Gwet's AC1 coefficient. All these coefficients are discussed in Gwet (2008a).
3) In addition to the number of subjects, it may be of interest in some applications to determine the number of raters who should score the subjects. This will be the case when only some of the raters the researcher is interested in can be invited to participate in the study, an issue that is extensively discussed in Gwet (2008b). That situation will be treated in a subsequent post. In the current post, we confine ourselves to the situation where the number of raters is known and fixed, so that only the number of subjects must be calculated.
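To illustrate the point in item 2, that the chance-agreement probability pe differs across coefficients, here is a short sketch. The ratings below are hypothetical, and the pe formulas used (Cohen: sum of products of the two raters' marginals; Brennan-Prediger: 1/q for q categories; AC1: average classification probabilities pi_k combined as pi_k(1 − pi_k), summed and divided by q − 1) follow the definitions discussed in Gwet (2008a):

```python
from collections import Counter

# Hypothetical ratings of 10 subjects by two raters into categories "a"/"b"
rater1 = ["a", "a", "a", "a", "a", "a", "a", "b", "b", "b"]
rater2 = ["a", "a", "a", "a", "a", "a", "b", "b", "b", "b"]
cats = sorted(set(rater1) | set(rater2))
n, q = len(rater1), len(cats)

# Overall agreement probability: fraction of subjects classified identically
pa = sum(x == y for x, y in zip(rater1, rater2)) / n

p1, p2 = Counter(rater1), Counter(rater2)
pe_cohen = sum((p1[k] / n) * (p2[k] / n) for k in cats)   # Cohen's kappa
pe_bp = 1 / q                                             # Brennan-Prediger
pi = {k: (p1[k] / n + p2[k] / n) / 2 for k in cats}       # average marginals
pe_ac1 = sum(pi[k] * (1 - pi[k]) for k in cats) / (q - 1) # Gwet's AC1

for name, pe in [("Cohen", pe_cohen), ("Brennan-Prediger", pe_bp), ("AC1", pe_ac1)]:
    print(f"{name}: pe = {pe:.3f}, coefficient = {(pa - pe) / (1 - pe):.3f}")
```

Even with identical ratings, the three coefficients start from different chance-agreement probabilities, so the number of subjects needed to pin each one down to a given relative error differs as well.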

I propose one possible solution to this sample size problem below. Interested readers may also want to look at the article by Alan B. Cantor (1996) for further discussion of this issue.

For all kappa-like agreement coefficients, the required number of subjects, denoted n, depends on the relative error r, the difference pa − pe between the overall agreement probability pa and the chance-agreement probability pe, and the number of subjects N in the entire population, as follows:

   n = N / (1 + N × r² × (pa − pe)²)     (1)

For a large population, equation (1) reduces to n ≈ 1 / (r² × (pa − pe)²). Equation (1) is based on the variance formulas associated with the various kappa-like statistics discussed in Gwet (2008a). With a sample size obtained from equation (1), the difference between the calculated coefficient and its "true" value will not exceed r times the "true" value, except with probability smaller than 0.05. Equation (1) is more accurate when the "true" agreement coefficient is large, and less accurate when it is small.
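As a sketch, the sample-size computation can be coded directly from the variance-based formula. The finite-population form n = N / (1 + N r² (pa − pe)²) is assumed here; for a very large population it reduces to 1 / (r² (pa − pe)²), which reproduces the entries of Table 1 when rounded to the nearest integer:

```python
def required_sample_size(r, d, N=None):
    """Approximate number of subjects so that the estimated kappa-like
    coefficient stays within a relative error r of its "true" value.

    r : relative error, e.g. 0.20 for 20%
    d : anticipated difference pa - pe between the overall agreement
        probability and the chance-agreement probability
    N : population size; None treats the population as very large
    """
    if N is None:
        n = 1.0 / (r ** 2 * d ** 2)          # large-population limit
    else:
        n = N / (1.0 + N * r ** 2 * d ** 2)  # finite-population version
    return round(n)  # Table 1 entries appear rounded to the nearest integer

print(required_sample_size(0.20, 0.1))  # 2500, matching Table 1
```

Passing a finite N shrinks the requirement; for example, with a population of only 5,000 subjects the first cell of Table 1 drops from 2,500 to 1,667.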

Equation (1) shows that the smaller the relative error, the larger the required sample size. Likewise, the smaller the difference between the overall and chance-agreement probabilities, the larger the required sample size. Table 1 below shows the magnitude of n for different values of the relative error and the agreement probability difference. It appears from this table that the sample size falls below 100 only when the difference between the two agreement probabilities is reasonably high. This is because kappa-like agreement coefficients quickly become very unstable when the two quantities are close to one another. This difference is generally not known at the design stage. The rule of thumb I propose is to assume the best-case scenario in which the chance-agreement probability is 0, and to use an anticipated value of pa in place of pa − pe in Table 1 to obtain the absolute minimum sample size one should use. For example, if one anticipates that the raters will agree about 50% of the time, then one would use a sample size of 100, 44, or 25 depending on the error margin (20%, 30%, or 40%).

Table 1: Number of Subjects by Relative Error & Probability Difference

pa − pe   r = 20%   r = 30%   r = 40%
  0.1       2,500     1,111       625
  0.2         625       278       156
  0.3         278       123        69
  0.4         156        69        39
  0.5         100        44        25
  0.6          69        31        17
  0.7          51        23        13
  0.8          39        17        10
  0.9          31        14         8
  1.0          25        11         6
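As a check, the entries of Table 1 can be regenerated directly from the large-population approximation n ≈ 1 / (r² × (pa − pe)²), rounding each value to the nearest integer:

```python
diffs = [round(0.1 * k, 1) for k in range(1, 11)]  # pa - pe from 0.1 to 1.0
errors = [0.20, 0.30, 0.40]                        # relative errors r

print("pa-pe  " + "  ".join(f"r={int(r * 100)}%" for r in errors))
for d in diffs:
    row = [round(1.0 / (r ** 2 * d ** 2)) for r in errors]
    print(f"{d:5.1f}  " + "  ".join(f"{n:6,d}" for n in row))
```

Running this reproduces the table above, including the 2,500 subjects needed in the worst cell (r = 20%, pa − pe = 0.1).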


Cantor, A. B. (1996). Sample size calculations for Cohen's kappa. Psychological Methods, 1(2), 150–153.
Gwet, K. L. (2008a). Computing inter-rater reliability and its variance in the presence of high agreement. British Journal of Mathematical and Statistical Psychology, 61, 29–48.
Gwet, K. L. (2008b). Variance estimation of nominal-scale inter-rater reliability with random selection of raters. Psychometrika, 73(3), 407–430.
Gwet, K. L. (2010). Handbook of Inter-Rater Reliability (2nd edition). Advanced Analytics, LLC.


Please feel free to send me your comments, questions, or suggestions regarding this post. I will be happy to get back to you as soon as possible. Thanks. K.L. Gwet, Ph.D.
