Last updated: 5th Jan, 2024
Cohen's Kappa Score is a statistic used to measure the performance of machine learning classification models. In this blog post, we will discuss what Cohen's Kappa Score is, how it is calculated, and walk through Python code examples so that you can see how it works!
Cohen's Kappa Score, also known as the Kappa Coefficient, is a statistical measure of inter-rater agreement for categorical data. It is named after the statistician Jacob Cohen, who developed the metric in 1960. It is generally used in situations where there are two raters, but it can also be adapted for use with more than two raters. For machine learning binary classification models, one rater is the classification model and the other rater is the real-world observer who knows the actual truth about the category of each record in the dataset.
Cohen's Kappa takes into account both the number of agreements between the raters (true positives and true negatives) and the number of disagreements (false positives and false negatives), and it measures agreement after the agreement expected by chance has been taken into account. Taking that into consideration, Cohen's Kappa score can be defined as a metric used to measure the performance of machine learning classification models based on assessing the observed agreement and the agreement by chance between the two raters (the real-world observer and the classification model).
The main use of the Cohen Kappa metric is to evaluate the consistency of the classifications, rather than their accuracy. This is particularly useful in scenarios where accuracy is not the only important factor, such as with imbalanced classes.
The Cohen Kappa Score is used to compare the predicted labels from a model with the actual labels in the data. The score ranges from -1 (worst possible performance) to 1 (best possible performance).
The following is the original paper on the Cohen Kappa Score: Cohen, J. (1960). "A coefficient of agreement for nominal scales". Educational and Psychological Measurement, 20(1), 37–46.
Cohen's Kappa can be calculated using either raw data or confusion matrix values. When Cohen's Kappa is calculated using raw data, each row in the data represents a single observation, and each column represents a rater's classification of that observation. Cohen's Kappa can also be calculated using a confusion matrix, which contains the counts of true positives, false positives, true negatives, and false negatives for each class. We will look into the details of how the Kappa score can be calculated using a confusion matrix.
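As a quick illustration of how the two representations relate, the sketch below uses scikit-learn's confusion_matrix to derive the counts from raw labels; the y_true and y_pred lists are hypothetical and only serve to show the mechanics. For binary 0/1 labels, scikit-learn lays the matrix out as [[TN, FP], [FN, TP]].

from sklearn.metrics import confusion_matrix

# Hypothetical raw labels, for illustration only
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]   # rater 1: actual labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # rater 2: predicted labels

# Rows correspond to actual labels, columns to predicted labels
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, tn, fn)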
Let’s look at the following confusion matrix representing a binary classification model where there are two classes / labels:
Let's quickly recap the concepts of true positives, false positives, true negatives and false negatives: a true positive (TP) is a positive record correctly predicted as positive, a false positive (FP) is a negative record incorrectly predicted as positive, a true negative (TN) is a negative record correctly predicted as negative, and a false negative (FN) is a positive record incorrectly predicted as negative.
In the above confusion matrix, the actual labels represent rater 1. Rater 1 is an observer of real-world events and records what actually happened. The predicted labels represent rater 2, the classification model which makes the predictions. The Cohen Kappa score assesses model performance as a function of the probability that rater 1 and rater 2 are in perfect agreement (TP + TN out of all observations), denoted as Po (observed agreement), and the probability that the two raters agree by chance, denoted as Pe (expected agreement) in the following formula.
Now that we have defined the terms, let's calculate the Cohen Kappa score. Let the total number of observations be N, i.e., N = TP + FP + FN + TN.
The first step is to calculate the probability that both the raters are in perfect agreement:
Observed Agreement, Po = (TP + TN) / N
In our example, this would be:
Po = (45 + 15) / 100 = 0.6
Next, we need to calculate the expected probability that both raters agree by chance. This is calculated by summing, for each class (Yes and No), the product of the proportion of observations that each rater assigned to that class.
Pe = {(rater 1 says Yes count) / N} x {(rater 2 says Yes count) / N} + {(rater 1 says No count) / N} x {(rater 2 says No count) / N}
So in our case, where rater 1 (actual) labels 70 of the 100 observations as Yes and rater 2 (predicted) labels 60 of them as Yes, this would be calculated as:
Pe = 0.7 x 0.6 + 0.3 x 0.4 = 0.42 + 0.12 = 0.54
Now that we have both the observed and expected agreement, we can calculate Cohen’s Kappa:
Kappa score = (Po – Pe) / (1 – Pe)
In our example, this would be:
K = (0.6 – 0.54) / (1 – 0.54) = 0.06 / 0.46 = 0.1304, or a little over 13%
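To make the arithmetic above concrete, here is a minimal sketch that computes Po, Pe and the Kappa score from confusion-matrix counts. The counts used (TP = 45, FN = 25, FP = 15, TN = 15) are inferred from the totals in the walkthrough (TP + TN = 60, N = 100, and marginal proportions of 0.7/0.3 and 0.6/0.4), so treat them as an illustration rather than a transcription of the confusion matrix shown above.

# Counts inferred from the worked example above
TP, FN = 45, 25
FP, TN = 15, 15
N = TP + FP + FN + TN                                     # 100 observations

# Observed agreement: proportion of cases where both raters agree
po = (TP + TN) / N                                        # 0.6

# Expected (chance) agreement from each rater's marginal proportions
actual_yes, actual_no = (TP + FN) / N, (FP + TN) / N      # 0.7, 0.3
pred_yes, pred_no = (TP + FP) / N, (FN + TN) / N          # 0.6, 0.4
pe = actual_yes * pred_yes + actual_no * pred_no          # 0.54

kappa = (po - pe) / (1 - pe)
print(f"Po={po}, Pe={pe:.2f}, Kappa={kappa:.4f}")         # Kappa=0.1304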
Kappa can range from -1 to +1. A value of 0 means that the agreement between the raters (real-world observer vs classification model) is no better than chance, negative values mean agreement worse than chance, and a value of 1 means perfect agreement. In most cases, anything over 0.7 is considered to be very good agreement.
The Cohen Kappa score can also be used to assess the performance of a multi-class classification model. Let's take a look at the following confusion matrix representing an image classification model (a CNN model) which classifies images into three different classes: cat, dog, and monkey.
Let's calculate the Kappa score for the above confusion matrix representing the multiclass classification model:
Po = (15 + 20 + 10) / 120 = 45/120 = 0.375
Pe = (45 x 30)/(120 x 120) + (40 x 50)/(120 x 120) + (35 x 40)/(120 x 120)
= 0.09375 + 0.13889 + 0.09722 = 0.32986
K = (Po – Pe) / (1 – Pe) = (0.375 – 0.32986) / (1 – 0.32986) = 0.04514 / 0.67014
≈ 0.0674
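The same calculation can be expressed with a small NumPy sketch. Only the diagonal (15, 20, 10), the row totals (45, 40, 35), the column totals (30, 50, 40) and the grand total (120) are recoverable from the numbers above, so the off-diagonal counts in the matrix below are an assumed arrangement consistent with those totals; they do not affect the Kappa value, which depends only on the diagonal and the marginals.

import numpy as np

# Rows = actual class, columns = predicted class (cat, dog, monkey).
# Off-diagonal counts are assumed; diagonal and marginals match the example.
cm = np.array([[15, 15, 15],
               [ 5, 20, 15],
               [10, 15, 10]])

N = cm.sum()                                            # 120 observations
po = np.trace(cm) / N                                   # 45/120 = 0.375
pe = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / N**2     # ~0.32986
kappa = (po - pe) / (1 - pe)
print(round(kappa, 4))                                  # ~0.0674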
The following picture represents the interpretation of Cohen Kappa Score:
There are a few things to keep in mind when interpreting Kappa values:
Python's sklearn.metrics module provides the cohen_kappa_score function for calculating the Kappa score or coefficient. The following is a Python code example of how to calculate the Kappa score.
from sklearn.metrics import cohen_kappa_score

# Define the array of ratings for both raters
rater1 = [0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0]
rater2 = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0]

# Calculate and print Cohen's Kappa
print(cohen_kappa_score(rater1, rater2))
The Cohen Kappa score comes out to be 0.21053.
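As a quick sanity check, continuing with the rater1 and rater2 lists defined above, the same value can be reproduced with the Po / Pe formulas from earlier in this post: the raters agree on 9 of the 15 items, and each label's marginal proportions give the chance agreement.

# Manual calculation using the observed / expected agreement formulas
n = len(rater1)
po = sum(a == b for a, b in zip(rater1, rater2)) / n              # 9/15 = 0.6
labels = set(rater1) | set(rater2)
pe = sum((rater1.count(c) / n) * (rater2.count(c) / n) for c in labels)
kappa = (po - pe) / (1 - pe)
print(round(kappa, 5))                                            # 0.21053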
Cohen Kappa scoring can also be used as a custom scorer with the cross-validation technique. The following shows how Cohen Kappa scoring can be used with cross_val_score, a utility function provided by Python Sklearn to evaluate the performance of a model by cross-validation. In the following code, notice scoring=make_scorer(cohen_kappa_score). This specifies that the scoring mechanism for evaluating the model should be the Cohen's Kappa Score.
from sklearn.model_selection import cross_val_score
from sklearn.metrics import cohen_kappa_score, make_scorer

# lr_model, features and targets are assumed to be defined already
# (a LogisticRegression instance and the feature matrix / target labels)
print(cross_val_score(lr_model,
                      features,
                      targets,
                      scoring=make_scorer(cohen_kappa_score),
                      n_jobs=-1).mean())
In the above code, cross-validation is used to assess the average Cohen's Kappa Score of the logistic regression model (lr_model) across different splits of the data. This provides a more generalized view of the model's performance, particularly in terms of how well it agrees with the true outcomes, corrected for chance agreement. This is especially useful in situations where class imbalance might make simple accuracy an unreliable metric.
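For completeness, here is a self-contained sketch of how lr_model, features and targets might be set up before running the snippet above; the synthetic, imbalanced dataset and the LogisticRegression settings are illustrative assumptions rather than part of the original example.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import cohen_kappa_score, make_scorer

# Illustrative imbalanced dataset (assumed for this sketch)
features, targets = make_classification(n_samples=1000, n_features=10,
                                        weights=[0.9, 0.1], random_state=42)
lr_model = LogisticRegression(max_iter=1000)

# Mean Cohen's Kappa across the cross-validation folds
print(cross_val_score(lr_model, features, targets,
                      scoring=make_scorer(cohen_kappa_score),
                      n_jobs=-1).mean())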
Cohen Kappa Score is a statistic used to measure the agreement between two raters. It quantifies agreement on a scale from -1 to 1, with 1 being perfect agreement, 0 being agreement no better than chance, and negative values indicating agreement worse than chance. The higher the score, the more agreement there is between the raters. The Cohen Kappa score, or Kappa coefficient, is also used for assessing the performance of machine learning classification models. It is most commonly used in research settings, but it can also be applied to other fields like marketing. Let us know if you have any questions about the Cohen Kappa Score or how it can be applied in your field.