Classification models predict the target class of a data sample, often by estimating the probability that each instance belongs to one class or another. Before such a model can be used reliably in production to solve real-world problems, its performance must be evaluated. Performance metrics assess how well a classification model performs in a given context, and they help us understand the strengths and limitations of the model when it makes predictions on new data. These metrics include **accuracy, precision, recall, and F1-score**. In this blog post, we will explore these four classification model performance metrics through Python Sklearn examples.

As a data scientist, you need a good understanding of these concepts to measure the performance of classification models. Before we get into the details of the metrics listed above, let’s understand key terms such as true positive, false positive, true negative, and false negative with the help of a confusion matrix. These terms are used across the different performance metrics.

## Terminologies – True Positive, False Positive, True Negative, False Negative

Before we get into the definitions, let’s work with the Sklearn breast cancer dataset to classify whether a particular instance belongs to the **benign** or **malignant** breast cancer class. You can load the dataset using the following code:

```
import pandas as pd
import numpy as np
from sklearn import datasets
#
# Load the breast cancer data set
#
bc = datasets.load_breast_cancer()
X = bc.data
y = bc.target
```

The target labels in the breast cancer dataset are Benign (1) and Malignant (0). There are 212 records with labels as malignant and 357 records with labels as benign. Let’s create a training and test split where 30% of the dataset is set aside for testing purposes.

```
from sklearn.model_selection import train_test_split
#
# Create training and test split
#
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1, stratify=y)
```
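The class counts can be checked directly from the label array. The snippet below is a minimal sketch that reloads the data and repeats the split with the same settings as above; `full_counts` and `test_counts` are names introduced here for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

bc = datasets.load_breast_cancer()
X, y = bc.data, bc.target

# target_names lists the classes in label order: index 0, then index 1
print('Label order:', bc.target_names)

# bincount returns the number of records per label (index 0, index 1)
full_counts = np.bincount(y)

# Same split settings as in the post
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)
test_counts = np.bincount(y_test)

print('Full data set counts (label 0, label 1):', full_counts)
print('Test set counts (label 0, label 1):', test_counts)
```

Because the split is stratified, the test set keeps roughly the same class proportions as the full dataset.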

Splitting the breast cancer dataset into training and test sets results in a test set with 107 records labeled benign (1) and 64 records labeled malignant (0). Because the Sklearn metric functions treat label 1 as the positive class by default, the **actual positive** count is 107 records and the **actual negative** count is 64 records. Let’s train the model and get the confusion matrix. Here is the code for training the model and plotting the confusion matrix.

```
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
import matplotlib.pyplot as plt
#
# Standardize the data set
#
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
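#
# Fit the SVC model
# Note: the model is fit on the unscaled features here; to use the
# standardized data, pass X_train_std to fit() and X_test_std to predict()
#
svc = SVC(kernel='linear', C=10.0, random_state=1)
svc.fit(X_train, y_train)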
#
# Get the predictions
#
y_pred = svc.predict(X_test)
#
# Calculate the confusion matrix
#
conf_matrix = confusion_matrix(y_true=y_test, y_pred=y_pred)
#
# Plot the confusion matrix using Matplotlib
#
fig, ax = plt.subplots(figsize=(5, 5))
ax.matshow(conf_matrix, cmap=plt.cm.Oranges, alpha=0.3)
for i in range(conf_matrix.shape[0]):
    for j in range(conf_matrix.shape[1]):
        ax.text(x=j, y=i, s=conf_matrix[i, j], va='center', ha='center', size='xx-large')
plt.xlabel('Predictions', fontsize=18)
plt.ylabel('Actuals', fontsize=18)
plt.title('Confusion Matrix', fontsize=18)
plt.show()
```

The following **confusion matrix** is printed:

The predicted results in the above diagram can be read in the following manner, given that **1** represents the benign class, which is treated as the positive class here.

**True Positive (TP)**: A true positive is an instance that the model predicts as positive and that is actually positive. For example, in a binary classification problem with classes “A” and “B”, if our goal is to predict class “A”, the true positives are the instances of class “A” that the model correctly predicts as class “A”. Taking a real-world example, if the model is designed to predict whether an email is spam, a true positive occurs when the model correctly flags a spam email as spam. The true positive rate is the proportion of actual positive instances that the model correctly classifies as positive. True positives indicate how well the model performs on positive instances. In the above confusion matrix, out of 107 actual positives, 104 are correctly predicted as positive. Thus, the value of True Positive is 104.

**False Positive (FP)**: A false positive occurs when the model predicts that an instance is positive when it is actually negative. False positives can lead to incorrect decision-making. For example, if a medical diagnosis model has a high false positive rate, patients may undergo unnecessary treatment. False positives also lower the overall accuracy of the model. One way to measure them is the false positive rate, which is the proportion of all actual negative examples that are predicted as positive. Whether false positives are acceptable depends on the application. In medical screening, it is often better to err on the side of caution and accept a few false positives than to miss a diagnosis entirely. In other applications, such as spam filtering, false positives can be very costly. Therefore, it is important to consider the trade-offs involved when choosing between classification models. In the above example, out of 64 actual negatives, 3 are falsely predicted as positive. Thus, the value of False Positive is 3.

**True Negative (TN)**: A true negative is an instance that the model correctly predicts as negative. For example, if the model predicts whether or not a person has a disease, a true negative occurs when the model predicts that the person does not have the disease and they actually don’t have it. A high number of true negatives relative to the actual negatives indicates that the model performs well on the negative class. True negatives are used together with false negatives, true positives, and false positives to compute performance metrics such as accuracy, precision, recall, and F1 score, so they should be interpreted in the context of those other metrics to get a complete picture of the model’s performance. Out of 64 actual negatives, 61 are correctly predicted as negative. Thus, the value of True Negative is 61.

**False Negative (FN)**: A false negative occurs when the model predicts an instance as negative when it is actually positive. False negatives can be very costly, especially in the field of medicine. For example, if a cancer screening test predicts that a patient does not have cancer when they actually do, the disease could progress without treatment. False negatives also occur in other fields, such as security or fraud detection, where a false negative may result in access being granted or a transaction being approved that should not have been allowed. In such cases false negatives are often more serious than false positives, so it is important to take them into account when evaluating a classification model. In the above example, out of 107 actual positives, 3 are falsely predicted as negative. Thus, the value of False Negative is 3.
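The four counts can also be read directly from a confusion matrix in code. The snippet below is a minimal sketch that uses a small set of made-up labels (`y_true_demo` and `y_pred_demo` are hypothetical and are not the breast cancer predictions) to show how the cells map to TN, FP, FN, and TP for binary 0/1 labels.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration only (1 = positive, 0 = negative)
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1]
y_pred_demo = [1, 1, 1, 0, 0, 1, 0, 1, 0, 1]

# For binary 0/1 labels the matrix is laid out as [[TN, FP], [FN, TP]],
# so ravel() yields the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()
print('TP=%d, FP=%d, TN=%d, FN=%d' % (tp, fp, tn, fn))
```

Rows of the matrix correspond to the actual labels and columns to the predicted labels, which matches the axis labels used in the plot above.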

Given the above definitions, let’s try and understand the concept of accuracy, precision, recall, and f1-score.

## What is Precision Score?

The **model precision score** measures the proportion of positively predicted labels that are actually correct. Precision is also known as the positive predictive value. Precision is used in conjunction with recall to trade off false positives and false negatives. Precision is affected by the class distribution: when the positive class is rarer, the same false positive rate produces more false positives relative to true positives, so precision tends to be lower. Precision can be thought of as a measure of exactness or quality. If we want to minimize false positives, we would choose a model with high precision. Conversely, if we want to minimize false negatives, we would choose a model with high recall. Precision is mainly used when we need to predict the positive class and there is a greater cost associated with false positives than with false negatives, such as in spam filtering or in some medical diagnosis settings. For example, if a model is 99% accurate but has only 50% precision, then half of the emails it flags as spam are actually not spam.

The precision score is a useful measure of the **success of prediction when the classes are very imbalanced.** Mathematically, it represents the ratio of true positive to the sum of true positive and false positive.

**Precision Score = TP / (FP + TP)**

From the above formula, you can see that the number of false positives affects the precision score. Thus, if a high precision score is important for the business requirements, focus on building models with fewer false positives.

The precision score from the above confusion matrix will come out to be the following:

**Precision score** = 104 / (3 + 104) = 104/107 = **0.972**

The same score can be obtained by using the **precision_score** function from **sklearn.metrics**:

```
print('Precision: %.3f' % precision_score(y_test, y_pred))
```
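The formula can also be applied by hand to the counts read from the confusion matrix. The helper below is a small sketch; `precision_from_counts` is a name introduced here, and returning 0.0 when the model makes no positive predictions is an assumption of this sketch.

```python
def precision_from_counts(tp, fp):
    """Precision = TP / (TP + FP); returns 0.0 if nothing was predicted positive."""
    predicted_positive = tp + fp
    return tp / predicted_positive if predicted_positive else 0.0


# Counts taken from the confusion matrix discussed in the post
precision_demo = precision_from_counts(tp=104, fp=3)
print('Precision: %.3f' % precision_demo)
```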

### Different real-world scenarios when precision scores can be used as evaluation metrics

The precision score can be used in scenarios where the machine learning model must avoid false positives. For example, in medical diagnosis applications, a doctor wants a model that will not label a patient as having pneumonia if the patient does not have the disease. Oncologists ideally want models that flag cancerous lesions without false-positive results, and hence one could use a precision score in such cases. Note that a greater number of false positives causes a lot of stress for patients, although it may not turn out to be fatal from a health perspective, because further tests can overturn the false positive prediction.

Another example where the precision score can be useful is credit card fraud detection. Here the precision score tells you how many of the transactions flagged as fraud are actually fraudulent. You would not want a high number of false positives, or you might end up blocking many legitimate credit cards and frustrating end users.

Another example where you would want greater precision is **spam filters**. A greater number of false positives in a spam filter means that important emails could be tagged as spam and moved to the spam folder, which could disrupt your day-to-day work.

## What is Recall Score?

The model recall score represents the model’s ability to correctly predict the positives out of the actual positives. This is unlike precision, which measures how many of the positive predictions made by the model are actually positive. For example, if your machine learning model is trying to identify positive reviews, the recall score is the percentage of the positive reviews that the model correctly predicts as positive. In other words, it measures how good the model is at identifying all the actual positives that exist within a dataset. Recall is also known as sensitivity or the true positive rate. A high recall score indicates that the model finds most of the positive examples, while a low recall score indicates that it misses many of them. Recall says nothing about how the model handles negative examples, so it is often used in conjunction with other performance metrics, such as precision and accuracy, to get a complete picture of the model’s performance. Mathematically, it represents the ratio of true positives to the sum of true positives and false negatives.

**Recall Score = TP / (FN + TP)**

From the above formula, you can see that the number of false negatives affects the recall score. Thus, if a high recall score is important for the business requirements, focus on building models with fewer false negatives.

The recall score from the above confusion matrix will come out to be the following:

**Recall score** = 104 / (3 + 104) = 104/107 = **0.972**

The same score can be obtained by using the **recall_score** function from **sklearn.metrics**:

```
print('Recall: %.3f' % recall_score(y_test, y_pred))
```

Recall score can be used in the scenario where the labels are not equally divided among classes. For example, if there is a class imbalance ratio of 20:80 (imbalanced data), then the recall score will be more useful than accuracy because it can provide information about how well the machine learning model identified rarer events.
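Which class counts as "positive" matters when reading a recall score. By default `recall_score` treats label 1 as positive; its `pos_label` parameter switches the class of interest. The sketch below uses made-up labels (hypothetical, not the breast cancer predictions) to show the difference. With the breast cancer data, where malignant is encoded as 0, passing `pos_label=0` would report recall on the malignant class.

```python
from sklearn.metrics import recall_score

# Hypothetical labels for illustration only
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1]
y_pred_demo = [1, 1, 1, 0, 0, 1, 0, 1, 0, 1]

# Recall with label 1 as the positive class (the default)
recall_label_1 = recall_score(y_true_demo, y_pred_demo)

# Recall with label 0 as the positive class
recall_label_0 = recall_score(y_true_demo, y_pred_demo, pos_label=0)

print('Recall (label 1 positive): %.3f' % recall_label_1)
print('Recall (label 0 positive): %.3f' % recall_label_0)
```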

### Different real-world scenarios when recall scores can be used as evaluation metrics

Recall score is an important metric to consider when measuring the effectiveness of your machine learning models. It can be used in a variety of real-world scenarios, and it’s important to always aim to improve recall and precision scores together. The following are examples of some real-world scenarios where recall scores can be used as evaluation metrics:

- In medical diagnosis, the recall score should be extremely high; otherwise, a greater number of false negatives could prove fatal to patients. A lower recall score means more false negatives, that is, more patients who have the disease but are told they do not. Those patients would be reassured that they are not suffering from the disease and would not take any further action, which could let the disease get worse and prove fatal.
- In a manufacturing system, you would want a higher recall score for machine learning models that predict the need for system maintenance. A lower recall score means more false negatives, which could result in machine downtime and hence impact the business at large.
- In a credit card fraud detection system, you would want a higher recall score for the models predicting fraudulent transactions. A lower recall score means more false negatives, which means more undetected fraud and hence losses to the business and upset users.
- In sentiment analysis, the recall score measures how many of the relevant tweets or comments are found, while the precision score is the fraction of retrieved tweets that are actually relevant. A high recall score helps when the analysis must not miss relevant items.

## Precision – Recall Tradeoff

The precision-recall tradeoff is a common issue that arises when evaluating the performance of a classification model. Precision and recall are two metrics that are often used to evaluate the performance of a classifier, and they are often in conflict with each other.

Precision measures the proportion of true positive predictions made by the model (i.e. the number of correct positive predictions divided by the total number of positive predictions). It is a useful metric for evaluating the model’s ability to avoid false positives.

Recall, on the other hand, measures the proportion of actual positive cases that were correctly predicted by the model (i.e. the number of correct positive predictions divided by the total number of actual positive cases). It is a useful metric for evaluating the model’s ability to avoid false negatives.

In general, increasing the precision of a model will decrease its recall, and vice versa. This is because most classifiers turn a score into a label using a decision threshold: raising the threshold makes positive predictions rarer and more reliable, while lowering it catches more positives but admits more false positives. For example, a model with a high precision will make few false positive predictions, but it may also miss some true positive cases. On the other hand, a model with a high recall will correctly identify most of the true positive cases, but it may also make more false positive predictions.
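The tradeoff can be illustrated by sweeping a decision threshold over model scores. The sketch below uses made-up scores and labels (`scores_demo` and `y_true_demo` are hypothetical) and three arbitrary thresholds; it is an illustration, not output from the SVC model above.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical true labels and model scores, for illustration only
y_true_demo = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]
scores_demo = [0.95, 0.90, 0.80, 0.70, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10]

results = {}
for threshold in (0.75, 0.50, 0.25):
    # Predict positive when the score reaches the threshold
    y_pred_demo = [1 if s >= threshold else 0 for s in scores_demo]
    p = precision_score(y_true_demo, y_pred_demo)
    r = recall_score(y_true_demo, y_pred_demo)
    results[threshold] = (p, r)
    print('threshold=%.2f  precision=%.3f  recall=%.3f' % (threshold, p, r))
```

On this toy data, lowering the threshold raises recall and lowers precision. Sklearn also provides `precision_recall_curve` in `sklearn.metrics` for computing the pairs over all thresholds.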

In order to evaluate a classification model, it is important to consider both precision and recall, rather than just one of these metrics. The appropriate balance between precision and recall will depend on the specific goals and requirements of the model, as well as the characteristics of the dataset. In some cases, it may be more important to have a high precision (e.g. in medical diagnosis), while in others, a high recall may be more important (e.g. in fraud detection).

To balance precision and recall, practitioners often use the F1 score, which is a combination of the two metrics. The F1 score is calculated as the harmonic mean of precision and recall, and it provides a balance between the two metrics. However, even the F1 score is not a perfect solution, as it can be difficult to determine the optimal balance between precision and recall for a given application.

## What is Accuracy Score?

**Model accuracy** is a machine learning classification model performance metric defined as the ratio of correct predictions (true positives and true negatives) to all observations. In other words, accuracy tells us how often we can expect the model to correctly predict an outcome out of the total number of predictions it makes. For example, assume that you test your model on a dataset of 100 records and that the model predicts 90 of those instances correctly. The accuracy, in this case, would be 90/100 = 90%. That accuracy looks great, but it doesn’t tell us what kinds of errors the model makes, or how it will behave on new data we haven’t seen before.

Mathematically, it represents the ratio of the sum of true positives and true negatives to all the predictions.

**Accuracy Score = (TP + TN)/ (TP + FN + TN + FP)**

The accuracy score from the above confusion matrix will come out to be the following:

**Accuracy score** = (104 + 61) / (104 + 3 + 61 + 3) = 165/171 = **0.965**

The same score can be obtained by using the **accuracy_score** function from **sklearn.metrics**:

```
print('Accuracy: %.3f' % accuracy_score(y_test, y_pred))
```

### Caution with Accuracy Metrics / Score

The following are some of the **issues with the accuracy metric**:

- The same accuracy for two different models may hide different performance on the individual classes.
- For an imbalanced dataset, accuracy is not the most effective metric to use.

One should be **cautious when relying on the accuracy metric** to evaluate model performance. Take a look at the following confusion matrix. In both cases (left and right), the accuracy is 60%. However, the two models exhibit different behaviors.

The left confusion matrix indicates that the model has a weak positive recognition rate, while the right confusion matrix indicates that the model has a strong positive recognition rate. Note that the accuracy is 60% for both models. Thus, one needs to dig deeper than the accuracy metric to understand the model performance.
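To make this concrete, here is a minimal sketch with two made-up sets of confusion-matrix counts (hypothetical values chosen for illustration, not taken from the figure). Both give 60% accuracy, yet their recall differs sharply. The helper names `accuracy_from_counts` and `recall_from_counts` are introduced here.

```python
def accuracy_from_counts(tp, fp, tn, fn):
    """Accuracy = (TP + TN) / all predictions."""
    return (tp + tn) / (tp + fp + tn + fn)


def recall_from_counts(tp, fn):
    """Recall = TP / (TP + FN)."""
    return tp / (tp + fn)


# Hypothetical counts: 5 actual positives and 5 actual negatives in each case
model_a = dict(tp=1, fn=4, tn=5, fp=0)  # rarely recognizes positives
model_b = dict(tp=5, fn=0, tn=1, fp=4)  # recognizes every positive

for name, counts in (('A', model_a), ('B', model_b)):
    acc = accuracy_from_counts(**counts)
    rec = recall_from_counts(counts['tp'], counts['fn'])
    print('Model %s: accuracy=%.2f, recall=%.2f' % (name, acc, rec))
```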

The **accuracy metric is also not reliable** for models trained on **imbalanced or skewed datasets**. Take a dataset with 95% imbalance (95% of the data belongs to the negative class). A classifier that predicts negative most of the time will have a very high accuracy simply because most of its predictions are correct by default. A better classifier that actually deals with the class imbalance is likely to have a worse accuracy score. In such a scenario of an **imbalanced dataset**, another metric, **AUC (the area under the ROC curve), is more robust than accuracy**. The ROC curve is a plot that shows the relationship between the true positive rate and the false positive rate of a classification model. Because each of these rates is computed within a single class, the AUC is not dominated by the majority class. The AUC quantifies the overall performance of the model, and a model with a higher AUC is considered to be a better classifier. It also helps to look at the confusion matrix itself when evaluating a classifier.
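The 95% scenario can be sketched with a trivial "model" that always predicts the negative class. The labels and scores below are made up for illustration; the constant score stands in for a classifier that cannot separate the classes at all.

```python
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score

# Hypothetical imbalanced labels: 95 negatives and 5 positives
y_true_demo = [0] * 95 + [1] * 5

# A "model" that always predicts negative, with the same score for every record
y_pred_demo = [0] * 100
constant_scores = [0.0] * 100

acc = accuracy_score(y_true_demo, y_pred_demo)
rec = recall_score(y_true_demo, y_pred_demo)
auc = roc_auc_score(y_true_demo, constant_scores)

print('Accuracy: %.2f' % acc)
print('Recall:   %.2f' % rec)
print('AUC:      %.2f' % auc)
```

Accuracy looks high even though not a single positive is found, while recall and AUC expose that the model has no ability to distinguish the classes.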

**The accuracy metric only considers the number of correct predictions (true positives and true negatives) made by the model.** It does not take into account the relative importance of different types of errors, such as false positives and false negatives. For example, if a model is being used to predict whether a patient has a certain disease, a false positive (predicting that a patient has the disease when they actually do not) may be less severe than a false negative (predicting that a patient does not have the disease when they actually do). In this case, using accuracy as the sole evaluation metric may not provide a clear picture of the model’s performance.

## What is F1-Score?

The **model F1 score** represents the model score as a function of the precision and recall scores. The F-score is a machine learning model performance metric that gives equal weight to both precision and recall, making it an alternative to accuracy (it does not use the number of true negatives). It’s often used as a single value that provides high-level information about the model’s output quality. It is a useful measure in scenarios where optimizing only precision or only recall causes the overall model performance to suffer. The following points illustrate the issues with optimizing only one of them, for a model in which malignant cancer is the positive class:

- Optimizing for recall helps minimize the chance of not detecting a malignant cancer. However, this comes at the cost of predicting malignant cancer in patients who are healthy (a high number of FP).
- Optimizing for precision helps ensure that a patient predicted to have malignant cancer actually has it. However, this comes at the cost of missing malignant cancer more frequently (a high number of FN).

Mathematically, it can be represented as a harmonic mean of precision and recall score.

**F1 Score = 2 * Precision Score * Recall Score / (Precision Score + Recall Score)**

The F1 score from the above confusion matrix will come out to be the following:

**F1 score** = (2 * 0.972 * 0.972) / (0.972 + 0.972) = 1.89 / 1.944 = **0.972**

The same score can be obtained by using the **f1_score** function from **sklearn.metrics**:

```
print('F1 Score: %.3f' % f1_score(y_test, y_pred))
```
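The effect of the harmonic mean is easiest to see when precision and recall differ a lot. The sketch below introduces a helper, `f1_from_scores` (a name chosen here; returning 0.0 when both inputs are zero is an assumption of this sketch), and compares it with the plain arithmetic mean on a made-up pair of scores.

```python
def f1_from_scores(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Values from the confusion matrix in the post: precision = recall = 104/107
f1_post = f1_from_scores(104 / 107, 104 / 107)
print('F1 (post example): %.3f' % f1_post)

# Hypothetical lopsided model: perfect precision but very low recall
f1_lopsided = f1_from_scores(1.0, 0.1)
arithmetic_mean = (1.0 + 0.1) / 2
print('F1: %.3f vs arithmetic mean: %.3f' % (f1_lopsided, arithmetic_mean))
```

The harmonic mean stays close to the smaller of the two values, so a model cannot reach a high F1 score by excelling at only one of precision or recall.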

## Conclusions

Here is the summary of what you learned in relation to precision, recall, accuracy, and F1-score.

- The precision score measures the proportion of positive predictions that are actually positive.
- The recall score measures the proportion of actual positives that the model correctly predicts as positive.
- Precision and recall are useful measures of prediction success when the classes are very imbalanced.
- The accuracy score measures the ratio of correct predictions (true positives plus true negatives) to all the predictions made.
- The F1-score is the **harmonic mean of precision and recall** and is used as a metric in scenarios where optimizing only precision or only recall would leave the model with a high number of false negatives or false positives, respectively.
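All of these metrics can be printed together with `classification_report` from `sklearn.metrics`. The sketch below uses made-up labels (hypothetical, not the breast cancer predictions); the class names passed to `target_names` are labels chosen here for readability and follow the sorted label order 0, 1.

```python
from sklearn.metrics import classification_report

# Hypothetical labels for illustration only
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1]
y_pred_demo = [1, 1, 1, 0, 0, 1, 0, 1, 0, 1]

# One row per class with precision, recall, F1-score and support,
# followed by the overall accuracy and averaged scores
report = classification_report(
    y_true_demo, y_pred_demo, target_names=['class 0', 'class 1'])
print(report)
```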

