Classification models predict the target class of a data sample, typically by estimating the probability that each instance belongs to one class or another. Evaluating the performance of a classification model is essential before using it in production to solve real-world problems. Performance metrics such as accuracy, precision, recall and F1-score assess how well a classification algorithm performs in a given context, and they help us understand the strengths and limitations of a model when it makes predictions in new situations. In this blog post, we will explore these four classification performance metrics through Python Sklearn examples.
As a data scientist, you must have a good understanding of these concepts in order to measure classification model performance.
Let's work with the Sklearn breast cancer dataset. You can load it using the following code:
import pandas as pd
import numpy as np
from sklearn import datasets
#
# Load the breast cancer data set
#
bc = datasets.load_breast_cancer()
X = bc.data
y = bc.target
The target labels in the breast cancer dataset are benign (1) and malignant (0). There are 212 records labeled malignant and 357 records labeled benign. Let's create a training and test split where 30% of the dataset is set aside for testing.
from sklearn.model_selection import train_test_split
#
# Create training and test split
#
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1, stratify=y)
Splitting the breast cancer dataset into training and test sets results in a test set with 107 records labeled benign and 64 records labeled malignant. With benign (1) as the positive class, the actual positives are 107 records and the actual negatives are 64 records.
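You can verify these counts with a quick sanity check; here is a minimal sketch using np.bincount, with the expected output (matching the figures quoted above) shown in the comments:

#
# Sanity check: class distribution in the full dataset and the test split
# (label 0 = malignant, label 1 = benign)
#
print('Full dataset:', np.bincount(y))       # expected: [212 357]
print('Test split:  ', np.bincount(y_test))  # expected: [ 64 107]

Now let's train the model and get the confusion matrix. Here is the code for training the model and printing the confusion matrix.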
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
import matplotlib.pyplot as plt
#
# Standardize the data set
#
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
#
# Fit the SVC model on the standardized training data
#
svc = SVC(kernel='linear', C=10.0, random_state=1)
svc.fit(X_train_std, y_train)
#
# Get the predictions on the standardized test data
#
y_pred = svc.predict(X_test_std)
#
# Calculate the confusion matrix
#
conf_matrix = confusion_matrix(y_true=y_test, y_pred=y_pred)
#
# Print the confusion matrix using Matplotlib
#
fig, ax = plt.subplots(figsize=(5, 5))
ax.matshow(conf_matrix, cmap=plt.cm.Oranges, alpha=0.3)
for i in range(conf_matrix.shape[0]):
    for j in range(conf_matrix.shape[1]):
        ax.text(x=j, y=i, s=conf_matrix[i, j], va='center', ha='center', size='xx-large')
plt.xlabel('Predictions', fontsize=18)
plt.ylabel('Actuals', fontsize=18)
plt.title('Confusion Matrix', fontsize=18)
plt.show()
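As a side note, if you are on a recent scikit-learn release (1.0 or later), ConfusionMatrixDisplay can render the same plot without the manual Matplotlib loop; a minimal sketch:

from sklearn.metrics import ConfusionMatrixDisplay

# Build and plot the confusion matrix directly from the predictions
ConfusionMatrixDisplay.from_predictions(y_test, y_pred, cmap=plt.cm.Oranges)
plt.title('Confusion Matrix')
plt.show()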
The following confusion matrix is printed:
The predicted results in the above diagram can be read in the following manner, given that benign (1) is the positive class.
- True Positive (TP): True positive represents the number of correct positive predictions out of the actual positive cases. Out of 107 actual positives, 104 are correctly predicted positive. Thus, the value of True Positive is 104.
- False Positive (FP): False positive represents the number of incorrect positive predictions, i.e., the number of negatives (out of 64) which get falsely predicted as positive. Out of 64 actual negatives, 3 are falsely predicted as positive. Thus, the value of False Positive is 3.
- True Negative (TN): True negative represents the number of correct negative predictions out of the actual negative cases. Out of 64 actual negatives, 61 are correctly predicted negative. Thus, the value of True Negative is 61.
- False Negative (FN): False negative represents the number of incorrect negative predictions, i.e., the number of positives (out of 107) which get falsely predicted as negative. Out of 107 actual positives, 3 are falsely predicted as negative. Thus, the value of False Negative is 3. (All four values can also be read off programmatically, as shown below.)
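For a binary problem, sklearn lays the confusion matrix out as [[TN, FP], [FN, TP]], so all four values can be unpacked directly from the conf_matrix computed earlier:

#
# Unpack the 2x2 confusion matrix; sklearn orders it as
# [[TN, FP],
#  [FN, TP]]
#
tn, fp, fn, tp = conf_matrix.ravel()
print('TP=%d, FP=%d, TN=%d, FN=%d' % (tp, fp, tn, fn))  # TP=104, FP=3, TN=61, FN=3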
Given the above definitions, let's try to understand the concepts of accuracy, precision, recall and F1-score.
What is Precision Score?
Precision: The model precision score represents the model's ability to correctly predict positives out of all the positive predictions it made. The precision score is a useful measure of prediction success when the classes are very imbalanced. Mathematically, it represents the ratio of true positives to the sum of true positives and false positives.
Precision Score = TP / (FP + TP)
The precision score from the above confusion matrix will come out to be the following:
Precision score = 104 / (3 + 104) = 104/107 = 0.972
The same score can be obtained by using the precision_score method from sklearn.metrics:
print('Precision: %.3f' % precision_score(y_test, y_pred))
What are the different real-world scenarios where the precision score can be used as an evaluation metric?
The precision score is appropriate in scenarios where false positives are costly. For example, in medical diagnosis applications, a doctor wants the machine learning model to avoid labeling a patient with pneumonia when the patient does not have the disease. Likewise, oncologists want models whose cancer predictions come with as few false-positive results as possible, and hence one would use a precision score in such cases.
Another example where the precision score is useful is credit card fraud detection. In credit card fraud detection problems, classification models are evaluated with the precision score to determine how many of the samples flagged as fraud were actually fraudulent. You would not want a high number of false positives, or else you might end up blocking many valid credit cards and causing a lot of frustration for end users.
What is Recall Score?
Recall: The model recall score represents the model's ability to correctly predict positives out of the actual positives. This is unlike precision, which measures how many of the model's positive predictions are actually positive. For example, if your machine learning model is trying to identify positive reviews, the recall score is the percentage of actual positive reviews that the model correctly predicted as positive. In other words, it measures how good the model is at identifying all the actual positives that exist in a dataset. The higher the recall score, the better the model is at finding positive examples. The recall score is a useful measure of prediction success when the classes are very imbalanced. Mathematically, it represents the ratio of true positives to the sum of true positives and false negatives.
Recall Score = TP / (FN + TP)
The recall score from the above confusion matrix will come out to be the following:
Recall score = 104 / (3 + 104) = 104/107 = 0.972
The same score can be obtained by using the recall_score method from sklearn.metrics:
print('Recall: %.3f' % recall_score(y_test, y_pred))
The recall score is useful when the labels are not equally divided among classes. For example, if there is a class imbalance ratio of 20:80 (imbalanced data), the recall score is more informative than accuracy because it shows how well the machine learning model identifies the rarer class.
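To make this concrete, here is a small hypothetical sketch with made-up 20:80 labels: a naive model that always predicts the majority class reaches 80% accuracy yet 0% recall on the minority class.

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical 20:80 imbalanced labels: 20 positives (1), 80 negatives (0)
y_true_imb = np.array([1] * 20 + [0] * 80)
# A naive model that always predicts the majority (negative) class
y_pred_imb = np.zeros(100, dtype=int)

print('Accuracy: %.2f' % accuracy_score(y_true_imb, y_pred_imb))  # 0.80, despite finding nothing
print('Recall:   %.2f' % recall_score(y_true_imb, y_pred_imb))    # 0.00, no positives identified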
What is Accuracy Score?
Model accuracy is a machine learning model performance metric defined as the ratio of true positives and true negatives to all observations. In other words, accuracy tells us how often we can expect the model to predict the correct outcome out of the total number of predictions it made. For example, assume you test your model on a dataset of 100 records and it predicts 90 of those instances correctly; the accuracy, in this case, is 90/100 = 90%. A high accuracy rate is great, but it doesn't tell us anything about the kinds of errors the model makes on new data we haven't seen before.
Mathematically, it represents the ratio of the sum of true positives and true negatives to all the predictions.
Accuracy Score = (TP + TN)/ (TP + FN + TN + FP)
The accuracy score from the above confusion matrix will come out to be the following:
Accuracy score = (104 + 61) / (104 + 3 + 61 + 3) = 165/171 = 0.965
The same score can be obtained by using the accuracy_score method from sklearn.metrics:
print('Accuracy: %.3f' % accuracy_score(y_test, y_pred))
What is F1-Score?
The model F1 score represents the model score as a function of the precision and recall scores. The F-score is a machine learning model performance metric that gives equal weight to precision and recall, making it an alternative to accuracy (it doesn't require us to know the total number of observations). It is often used as a single value that provides high-level information about the quality of the model's output. It is a useful measure in scenarios where optimizing for only one of precision or recall would make the model's performance suffer. The following points illustrate the trade-off, using a cancer-screening scenario where malignant is treated as the positive class:
- Optimizing for recall helps minimize the chance of failing to detect a malignant cancer. However, this comes at the cost of predicting malignant cancer in patients who are actually healthy (a high number of false positives).
- Optimizing for precision helps ensure that when the model predicts malignant cancer, the patient actually has it. However, this comes at the cost of missing malignant cancers more frequently (a high number of false negatives).
Mathematically, it can be represented as the harmonic mean of the precision and recall scores:
F1 Score = 2 * Precision Score * Recall Score / (Precision Score + Recall Score)
The F1 score from the above confusion matrix will come out to be the following:
F1 score = (2 * 0.972 * 0.972) / (0.972 + 0.972) = 1.89 / 1.944 = 0.972
The same score can be obtained by using the f1_score method from sklearn.metrics:
print('F1 Score: %.3f' % f1_score(y_test, y_pred))
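Two related sklearn utilities are worth mentioning here. fbeta_score generalizes the F1 score (F1 is the special case beta = 1): beta > 1 weights recall more heavily, while beta < 1 weights precision more heavily. And classification_report prints precision, recall, F1-score and support for every class in one go. A brief sketch of both:

from sklearn.metrics import fbeta_score, classification_report

# F2 score: like F1, but weights recall more heavily than precision
print('F2 Score: %.3f' % fbeta_score(y_test, y_pred, beta=2.0))
#
# Per-class precision, recall, F1-score and support in a single report
#
print(classification_report(y_test, y_pred, target_names=['malignant', 'benign']))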
Here is a summary of what you learned about precision, recall, accuracy and F1-score.
- Precision score measures how many of the model's positive predictions are actually positive.
- Recall score measures how many of the actual positives the model correctly identifies.
- Both precision and recall are useful measures of prediction success when the classes are very imbalanced.
- Accuracy score measures the ratio of the sum of true positives and true negatives to all the predictions made.
- F1-score is the harmonic mean of the precision and recall scores and is used when optimizing only for precision would yield many false negatives, or optimizing only for recall would yield many false positives.