What is Kappa and How Does It Measure Inter-rater Reliability?

by Audrey Schnell

The Kappa Statistic or Cohen’s* Kappa is a statistical measure of inter-rater reliability for categorical variables. In fact, it’s almost synonymous with inter-rater reliability.

Kappa is used when two raters both apply a criterion based on a tool to assess whether or not some condition occurs. Examples include:

— Two doctors rate whether or not each of 20 patients has diabetes based on symptoms.
— Two swallowing experts assess whether each of 30 stroke patients has aspirated while drinking based on their cough reaction.
— Two education researchers both assess whether each of 50 children reads proficiently based on a reading evaluation.

If the two-raters can reliably use the criterion to make the same assessment on the same targets, then their agreement will be very high and provide evidence we have reliable ratings. If they can’t, then either the criterion tool isn’t useful or the raters are not well enough trained.

Why not just use percent agreement? The Kappa statistic corrects for chance agreement and percent agreement does not.

Here is a classic example: Two raters rating subjects on a diagnosis of Diabetes.

35 times they agree on yes. 40 times they agree on no.

20 disagreements come from Rater B choosing Yes and Rater A choosing no. 15 disagreements come from Rater A choosing Yes and Rater B choosing no.

This table would result in a Kappa of .51 (I won’t go into the formulas).

But how do you know if you have a high level of agreement?

An often-heard Rule of Thumb for the Kappa statistic is:

“A Kappa Value of .70 Indicates good reliability”

Where did this originate? Cohen suggested the Kappa statistic be interpreted as:

The emphasis is on SUGGESTED.
Several benchmarks have been proposed by other authors:

It’s very similar to correlation coefficients. There isn’t a clear cut off for what you consider strong, moderate, or weak.

Why can’t we use these rules of thumb as clear cut offs? As with everything in statistics, your decision making depends on the study and the purpose.

For some studies .6 might be acceptable agreement. If you’re looking at physician’s agreement on who should have invasive surgery you pretty much want near-perfect agreement. So these are just general guidelines and it is necessary to consider the goal of the study and the consequences of inaccuracy.

One more thing to note. Most of the time, Kappa works great to measure agreement. However, there is an interesting situation where percent agreement is very high, but the Kappa statistic is very low. This is referred to as the Kappa paradox.

This can happen when nearly everyone or nearly no one is assessed as having the condition. . This affects the marginal totals in the calculation of chance agreement. The percent agreement in both table 1 and table 2 is 85%. However, the Kappa for Table 1 is much lower than for Table 2 because almost all of the agreements are Yeses and there are relatively few nos. I’ve seen situations where a researcher had almost perfect agreement and the Kappa was .31! This is the Kappa paradox.

What have we learned about Kappa? The Kappa statistic is a measure of inter-rater reliability. There is no absolute value for good agreement and depends on the nature of the study. Be aware that in some circumstances it is possible to have great agreement, but a low Kappa.

(*Cohen was a busy guy, he did lots of stuff)

Standard Non-Deviation: The Steps to Running Any Statistical Model
Get the road map for your data analysis before you begin. Learn how to make any statistical modeling – ANOVA, Linear Regression, Poisson Regression, Multilevel Model – straightforward and more efficient.

Leave a Comment

Please note that, due to the large number of comments submitted, any comments on problems related to a personal study/project will not be answered. We suggest joining Statistically Speaking, where you have access to a private forum and more resources 24/7.

Previous post:

Next post: