What is Kappa and How Does It Measure Inter-rater Reliability?

by Audrey Schnell

The Kappa statistic, or Cohen’s* Kappa, is a statistical measure of inter-rater reliability for categorical variables. In fact, it’s almost synonymous with inter-rater reliability.

Kappa is used when two raters each apply the same criterion, based on some tool or instrument, to assess whether or not a condition is present. Examples include:

— Two doctors rate whether or not each of 20 patients has diabetes based on symptoms.
— Two swallowing experts assess whether each of 30 stroke patients has aspirated while drinking based on their cough reaction.
— Two education researchers both assess whether each of 50 children reads proficiently based on a reading evaluation.

If the two raters can reliably use the criterion to make the same assessment of the same targets, their agreement will be very high, which is evidence that the ratings are reliable. If they can’t, then either the criterion tool isn’t useful or the raters are not well enough trained.

Why not just use percent agreement? Because the Kappa statistic corrects for chance agreement, and percent agreement does not.
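For reference (the original post skips the formulas, but the idea is simple), Kappa is the observed agreement corrected for the agreement you would expect by chance:

Kappa = (P_observed − P_chance) / (1 − P_chance)

where P_observed is the proportion of subjects the two raters agree on, and P_chance is the proportion of agreement expected if the two raters were rating independently, based on how often each of them says Yes.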

Here is a classic example: two raters rating subjects on a diagnosis of diabetes.

                 Rater B: Yes   Rater B: No   Total
  Rater A: Yes        35             15         50
  Rater A: No         20             40         60
  Total               55             55        110

35 times they agree on Yes, and 40 times they agree on No. 20 disagreements come from Rater B choosing Yes while Rater A chooses No, and 15 disagreements come from Rater A choosing Yes while Rater B chooses No.

These counts work out to a Kappa of about .36 (the short sketch below shows the arithmetic).
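If you’d like to see the chance correction in action, here is a minimal Python sketch (not from the original post; the function name is mine) that computes Kappa from the four cell counts above:

def cohen_kappa_2x2(yes_yes, yes_no, no_yes, no_no):
    # Cell counts are ordered (Rater A, Rater B): e.g. yes_no = A said Yes, B said No.
    n = yes_yes + yes_no + no_yes + no_no
    p_observed = (yes_yes + no_no) / n                     # proportion of actual agreement
    a_yes = (yes_yes + yes_no) / n                         # Rater A's marginal Yes rate
    b_yes = (yes_yes + no_yes) / n                         # Rater B's marginal Yes rate
    p_chance = a_yes * b_yes + (1 - a_yes) * (1 - b_yes)   # agreement expected by chance
    return (p_observed - p_chance) / (1 - p_chance)

print(round(cohen_kappa_2x2(yes_yes=35, yes_no=15, no_yes=20, no_no=40), 2))  # prints 0.36

Notice that the chance agreement for this table is 50%, so only agreement above that level counts toward Kappa.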

But how do you know if you have a high level of agreement?

An often-heard Rule of Thumb for the Kappa statistic is:

“A Kappa value of .70 indicates good reliability.”

Where did this originate? Cohen himself suggested a set of benchmark ranges for interpreting the Kappa statistic. The emphasis is on SUGGESTED. Several other benchmarks have been proposed by other authors since then; for example, the widely cited Landis and Koch (1977) scale labels .41 to .60 as moderate, .61 to .80 as substantial, and .81 to 1.00 as almost perfect agreement.

It’s very similar to the situation with correlation coefficients: there isn’t a clear cutoff for what you consider strong, moderate, or weak.

Why can’t we use these rules of thumb as clear cutoffs? As with everything in statistics, your decision depends on the study and its purpose.

For some studies, a Kappa of .6 might be acceptable agreement. But if you’re looking at physicians’ agreement on who should have invasive surgery, you pretty much want near-perfect agreement. So these are just general guidelines, and it is necessary to consider the goal of the study and the consequences of inaccuracy.

One more thing to note: most of the time, Kappa works great as a measure of agreement. However, there is an interesting situation where percent agreement is very high but the Kappa statistic is very low. This is referred to as the Kappa paradox.

This can happen when nearly everyone (or nearly no one) is assessed as having the condition, because the skewed marginal totals inflate the chance agreement used in the calculation. Two tables can have exactly the same percent agreement, say 85%, yet very different Kappas: when almost all of the agreements are Yeses and there are relatively few Nos, Kappa drops sharply (the sketch below illustrates this with hypothetical counts). I’ve seen situations where a researcher had almost perfect agreement and the Kappa was .31! This is the Kappa paradox.
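To make the paradox concrete with some hypothetical numbers (these tables are mine, not from the original post), here is a short sketch. Both tables below show 85% agreement on 100 subjects, but the first is roughly balanced between Yeses and Nos while the second is almost all Yeses:

def kappa(yes_yes, yes_no, no_yes, no_no):
    # Same chance-corrected calculation as in the earlier sketch.
    n = yes_yes + yes_no + no_yes + no_no
    p_observed = (yes_yes + no_no) / n
    a_yes = (yes_yes + yes_no) / n
    b_yes = (yes_yes + no_yes) / n
    p_chance = a_yes * b_yes + (1 - a_yes) * (1 - b_yes)
    return (p_observed - p_chance) / (1 - p_chance)

# Balanced: 42 Yes/Yes and 43 No/No agreements, 15 disagreements -> 85% agreement
print(round(kappa(42, 7, 8, 43), 2))  # prints 0.7
# Skewed: 81 Yes/Yes but only 4 No/No agreements, 15 disagreements -> also 85% agreement
print(round(kappa(81, 7, 8, 4), 2))   # prints 0.26

The skewed table’s marginals make the chance agreement very high (about 80%), so the same raw agreement translates into a much lower Kappa.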

What have we learned about Kappa? The Kappa statistic is a measure of inter-rater reliability for categorical ratings. There is no absolute value that indicates good agreement; it depends on the nature of the study. And be aware that in some circumstances it is possible to have very high percent agreement but a low Kappa.

(*Cohen was a busy guy; he did lots of stuff.)
