In this post, you will learn how to use micro-averaging and macro-averaging methods to evaluate scoring metrics (precision, recall, F1-score) for multi-class classification problems in machine learning. You will also learn about weighted precision, recall and F1-score metrics in relation to the micro-average and macro-average scoring metrics. The concepts are explained with Python code examples.

## What & Why of Micro and Macro-averaging scoring metrics?

With binary classification, it is intuitive to score the model with metrics such as precision, recall and F1-score. With multi-class classification, however, it becomes tricky, because each class has its own precision, recall and F1-score. Some of the questions to ask are the following:

• Which metrics should be used to score a model trained for multi-class classification?
• How are precision, recall and F1-score calculated for multi-class classification models?
• When should micro-averaging and macro-averaging scores be used?

This is where the macro-averaging and micro-averaging methods come into the picture. The Python Sklearn package provides implementations of these methods, which are illustrated with examples in later sections.

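The micro-average precision and recall scores are calculated by pooling the counts of all the classes: the individual classes’ true positives (TPs) and false positives (FPs) for precision, and the true positives and false negatives (FNs) for recall. True negatives (TNs) do not enter either formula.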

The macro-average precision and recall scores are calculated as the arithmetic mean of the individual classes’ precision and recall scores.

The macro-average F1-score is calculated as the arithmetic mean of the individual classes’ F1-scores.
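The macro-average F1-score definition above can be checked with a small sketch. The labels below are made up purely for illustration, and the variable names are hypothetical. The sketch compares the mean of the per-class F1-scores with Sklearn's `average="macro"` result, and also shows that this is not the same quantity as the harmonic mean of macro-average precision and macro-average recall.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels for a 3-class problem (illustration only).
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 2, 1, 1, 2, 2, 0, 2]

# average=None returns one F1-score per class.
f1_per_class = f1_score(y_true, y_pred, average=None)

# Macro-average F1: arithmetic mean of the per-class F1-scores.
f1_macro_manual = f1_per_class.mean()
f1_macro_sklearn = f1_score(y_true, y_pred, average="macro")

# A different quantity: harmonic mean of macro precision and macro recall.
p_macro = precision_score(y_true, y_pred, average="macro")
r_macro = recall_score(y_true, y_pred, average="macro")
f1_of_macro_averages = 2 * p_macro * r_macro / (p_macro + r_macro)

print(f1_per_class, f1_macro_manual, f1_macro_sklearn, f1_of_macro_averages)
```

The first two macro values agree, while the last one generally differs, so it is worth stating which definition is meant when reporting a macro F1-score.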

### When to use micro-averaging and macro-averaging scores?

Use the micro-averaging score when there is a need to weight each instance or prediction equally.

Use the macro-averaging score when all classes need to be treated equally, so that performance on rare classes counts as much as performance on the most frequent class labels.

Use the weighted macro-averaging score when there is class imbalance (different numbers of instances for different class labels). The weighted macro-average is calculated by weighting the score of each class label by its number of true instances when calculating the average.
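To make the difference between the three averages concrete, here is a minimal sketch on deliberately imbalanced data. The labels and the "always predict the majority class" classifier are hypothetical, chosen only so that the three scores diverge clearly.

```python
from sklearn.metrics import precision_score

# Hypothetical, heavily imbalanced labels: 90 / 5 / 5 instances per class.
y_true = [0] * 90 + [1] * 5 + [2] * 5
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * 100

# zero_division=0 scores a class that is never predicted as 0
# instead of emitting a warning.
precision_micro = precision_score(y_true, y_pred, average="micro", zero_division=0)
precision_macro = precision_score(y_true, y_pred, average="macro", zero_division=0)
precision_weighted = precision_score(y_true, y_pred, average="weighted", zero_division=0)

print(f"micro:    {precision_micro:.2f}")     # 0.90
print(f"macro:    {precision_macro:.2f}")     # 0.30
print(f"weighted: {precision_weighted:.2f}")  # 0.81
```

The micro-average looks good because most instances belong to the majority class, while the macro-average exposes that two of the three classes are never predicted correctly.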

## Micro-Average & Macro-Average Precision Scores for Multi-class Classification

For a multi-class classification problem, the micro-average precision score is defined as the sum of true positives for all the classes divided by the sum of all positive predictions. The positive predictions are the sum of all true positives and false positives. Here is how it looks mathematically:

$$PrecisionMicroAvg = \frac{(TP_1 + TP_2 + … + TP_n)}{(TP_1 + TP_2 + … + TP_n + FP_1 + FP_2 + … + FP_n)}$$

The macro-average precision score is defined as the arithmetic mean of the precision scores of the individual classes. Here is how it looks mathematically:

$$PrecisionMacroAvg = \frac{(Prec_1 + Prec_2 + … + Prec_n)}{n}$$
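The two precision formulas can be sketched directly on a confusion matrix with NumPy. The matrix below is hypothetical, and it is assumed to follow the Sklearn layout (rows are actual classes, columns are predicted classes).

```python
import numpy as np

# Hypothetical 3-class confusion matrix (illustration only).
# Assumed layout: rows = actual classes, columns = predicted classes.
cm = np.array([
    [5, 1, 0],
    [2, 3, 1],
    [0, 2, 6],
])

tp = np.diag(cm)            # correct predictions per class
fp = cm.sum(axis=0) - tp    # predicted as the class, but actually another class

# Micro-average: pool the counts of all classes, then divide once.
precision_micro = tp.sum() / (tp.sum() + fp.sum())

# Macro-average: compute precision per class, then take the plain mean.
precision_per_class = tp / (tp + fp)
precision_macro = precision_per_class.mean()

print(precision_micro, precision_macro)
```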

## Micro-Average & Macro-Average Recall Scores for Multi-class Classification

For a multi-class classification problem, the micro-average recall score is defined as the sum of true positives for all the classes divided by the sum of actual positives (and not the predicted positives). The actual positives are the sum of all true positives and false negatives. Here is how it looks mathematically:

$$RecallMicroAvg = \frac{(TP_1 + TP_2 + … + TP_n)}{(TP_1 + TP_2 + … + TP_n + FN_1 + FN_2 + … + FN_n)}$$

The macro-average recall score is defined as the arithmetic mean of the recall scores of the individual classes. Here is how it looks mathematically:

$$RecallMacroAvg = \frac{(Recall_1 + Recall_2 + … + Recall_n)}{n}$$
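The recall formulas can be sketched the same way, with false negatives taken from the rows of the confusion matrix instead of the columns. The matrix is again hypothetical, with rows assumed to be actual classes and columns predicted classes. When every instance has exactly one label, the micro-average recall also equals the overall accuracy, which the last line checks.

```python
import numpy as np

# Hypothetical 3-class confusion matrix (illustration only).
# Assumed layout: rows = actual classes, columns = predicted classes.
cm = np.array([
    [5, 1, 0],
    [2, 3, 1],
    [0, 2, 6],
])

tp = np.diag(cm)            # correct predictions per class
fn = cm.sum(axis=1) - tp    # instances of the class predicted as another class

# Micro-average: pool the counts of all classes, then divide once.
recall_micro = tp.sum() / (tp.sum() + fn.sum())

# Macro-average: compute recall per class, then take the plain mean.
recall_per_class = tp / (tp + fn)
recall_macro = recall_per_class.mean()

# In single-label multi-class problems, micro recall equals accuracy.
accuracy = tp.sum() / cm.sum()

print(recall_micro, recall_macro, accuracy)
```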

## Python Examples for Micro-averaging & Macro-averaging Methods

Here is a Python code sample for calculating the micro-average and macro-average precision and recall scores of a model trained on the Sklearn IRIS dataset, which has three classes: setosa, versicolor and virginica. To get a confusion matrix with counts spread across the cells, only one feature is used for training the model. Pay attention to the training data X, which is assigned iris.data[:, [1]].

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import make_pipeline
import matplotlib.pyplot as plt
from sklearn.metrics import precision_score
#
# Load the IRIS dataset; only the second feature (sepal width) is used
#
iris = datasets.load_iris()
X = iris.data[:, [1]]
y = iris.target
#
# Create the training and test split
#
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1, stratify=y)
#
# Create a pipeline
#
pipeline = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=1))
#
# Fit the estimator , pipeline
#
pipeline.fit(X_train, y_train)
#
# Get the predictions
#
y_pred = pipeline.predict(X_test)
#
# Calculate the confusion matrix
#
conf_matrix = confusion_matrix(y_test, y_pred)
#
# Plot the confusion matrix using Matplotlib
#
fig, ax = plt.subplots(figsize=(6, 6))
ax.matshow(conf_matrix, cmap=plt.cm.Oranges, alpha=0.3)
for i in range(conf_matrix.shape[0]):
    for j in range(conf_matrix.shape[1]):
        ax.text(x=j, y=i, s=conf_matrix[i, j], va='center', ha='center', size='xx-large')

plt.xlabel('Predicted', fontsize=18)
plt.ylabel('Actuals', fontsize=18)
plt.title('Confusion Matrix', fontsize=18)
plt.show()


This is what the confusion matrix looks like for the model trained on the IRIS dataset with just one feature.

Here is the Python code for measuring the micro-average and macro-average precision scores:

# True positives prediction (diagonally) for all the classes
#
TP = 7 + 1 + 6
#
# False positives for all the classes
#
FP = 1 + 4 + 0 + 12 + 1 + 6
#
# Micro-average precision scores for all the classes
#
precisionScore_manual_microavg = TP / (TP + FP)
#
# Macro-average of precision scores of all the classes
#
precisionScore_manual_macroavg = ((7/8) + (1/8) + (6/22))/3
#
# Print the micro-average and macro-average precision scores
#
precisionScore_manual_microavg, precisionScore_manual_macroavg
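The recall scores can be worked out in the same manual way. The sketch below assumes a confusion matrix reconstructed from the counts used in the precision calculation above (diagonal 7, 1, 6; off-diagonal cells 1, 4, 0, 12, 1, 6 read row by row, with rows as actual classes and columns as predicted classes). That placement is an assumption, although it reproduces the precision denominators 8, 8 and 22; the variable names are hypothetical as well.

```python
import numpy as np

# Confusion matrix reconstructed from the counts in the manual precision
# calculation (assumed layout: rows = actual, columns = predicted).
conf_matrix_assumed = np.array([
    [7, 1, 4],
    [0, 1, 12],
    [1, 6, 6],
])

# True positives (diagonal) and false negatives (rest of each row)
TP_per_class = np.diag(conf_matrix_assumed)
FN_per_class = conf_matrix_assumed.sum(axis=1) - TP_per_class

# Micro-average recall: pooled true positives over pooled actual positives
recallScore_manual_microavg = TP_per_class.sum() / (TP_per_class.sum() + FN_per_class.sum())

# Macro-average recall: mean of the per-class recall scores
recallScore_manual_macroavg = (TP_per_class / (TP_per_class + FN_per_class)).mean()

print(recallScore_manual_microavg, recallScore_manual_macroavg)
```

Because each test instance has exactly one label, the micro-average recall comes out equal to the micro-average precision (14 / 38), and both equal the accuracy.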


The same scores can also be calculated using the Sklearn precision_score, recall_score and f1_score functions. The “average” parameter needs to be set to micro, macro or weighted to get the micro-average, macro-average or weighted-average score respectively. Here is the sample code:

#
# Average is assigned micro
#
precisionScore_sklearn_microavg = precision_score(y_test, y_pred, average='micro')
#
# Average is assigned macro
#
precisionScore_sklearn_macroavg = precision_score(y_test, y_pred, average='macro')
#
# Printing micro and macro average precision score
#
precisionScore_sklearn_microavg, precisionScore_sklearn_macroavg
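The recall and F1 scores follow the same pattern. The sketch below is self-contained: it repeats the one-feature IRIS setup from above and collects all three metrics for the three averaging modes. The `scores` dictionary is a hypothetical helper introduced only for this illustration, and `zero_division=0` is an assumed choice to score a never-predicted class as 0.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same setup as above: IRIS data restricted to one feature
iris = datasets.load_iris()
X = iris.data[:, [1]]
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y)
pipeline = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=1))
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

# Collect precision, recall and F1 for each averaging mode
scores = {}
for average in ("micro", "macro", "weighted"):
    scores[average] = {
        "precision": precision_score(y_test, y_pred, average=average, zero_division=0),
        "recall": recall_score(y_test, y_pred, average=average, zero_division=0),
        "f1": f1_score(y_test, y_pred, average=average, zero_division=0),
    }

accuracy = accuracy_score(y_test, y_pred)
for average, metrics in scores.items():
    print(average, {name: round(value, 3) for name, value in metrics.items()})
print("accuracy", round(accuracy, 3))
```

In this single-label setting the micro-average precision, recall and F1 all coincide with the accuracy, so the macro and weighted rows are the ones that carry extra information.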


## Conclusions

Here is what you learned about the micro-averaging and macro-averaging scoring metrics in relation to multi-class classification problems:

• Micro-averaging and macro-averaging scoring metrics are used for evaluating models trained for multi-class classification problems.
• Macro-averaging scores are the arithmetic mean of the individual classes’ scores for precision, recall and F1-score.
• The micro-averaging precision score is the sum of true positives for the individual classes divided by the sum of predicted positives for all classes.
• The micro-averaging recall score is the sum of true positives for the individual classes divided by the sum of actual positives for all classes.
• Use the micro-averaging score when there is a need to weight each instance or prediction equally.
• Use the macro-averaging score when all classes need to be treated equally, so that performance on rare classes counts as much as performance on the most frequent class labels.
• Use the weighted macro-averaging score when there is class imbalance (different numbers of instances for different class labels).