# Key Techniques for Evaluating Machine Learning Models

Machine learning is a powerful technique for building predictive models from different types of data. It has become the backbone of many intelligent applications, and evaluating model performance at regular intervals is key to the success of such applications. A machine learning model’s performance depends on several factors, including the type of algorithm used and how well it was trained. In this blog post, we will discuss essential techniques for evaluating machine learning model performance, along with some best practices for applying them.

The following are different techniques that can be used for evaluating machine learning model performance:

1. Root mean squared error (RMSE)
2. AUC-ROC curve
3. Logarithmic loss
4. Kappa score
5. Confusion matrix
6. Kolmogorov-Smirnov Test
7. Cross-validation techniques
8. Gini coefficient
9. Gain and lift chart
10. Chi-square test
11. Brier score

## What is the RMSE metric?

RMSE is an abbreviation for Root Mean Squared Error. It is used to evaluate the quality of regression models by measuring how far their predictions fall from the actual target values. RMSE is the square root of the average squared difference between predicted and actual values, so it is expressed in the same units as the target and penalizes large errors more heavily than small ones. It is one of the most commonly used measures of regression performance and is often reported alongside other regression metrics such as mean absolute error (MAE).

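RMSE is calculated as follows:

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

Where $y_i$ is the actual value, $\hat{y}_i$ is the predicted value and $n$ is the number of observations.

The smaller the value of RMSE, the better the regression model.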
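As a rough illustration, RMSE can be computed in a few lines of plain Python. The function names `rmse` and `mae` and the sample values are made up for this sketch; they are not part of any library.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def mae(y_true, y_pred):
    """Mean absolute error, shown only for comparison with RMSE."""
    return sum(abs(a - p) for a, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical actual and predicted values for four observations.
actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.0, 5.0, 4.0, 7.0]

print("RMSE:", rmse(actual, predicted))
print("MAE:", mae(actual, predicted))
```

In this example RMSE comes out larger than MAE because the single error of 2 is squared before averaging, which is how RMSE gives extra weight to large errors.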

## What is the AUC-ROC curve?

The AUC (Area Under the Curve) ROC (Receiver Operating Characteristic) curve is a performance evaluation technique for classification models. The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at different classification thresholds, and it shows how a model performs across the threshold values used to classify data points as positives or negatives. The AUC is the area under this curve and summarizes the curve in a single number. An AUC of about 0.5 corresponds to random classification, while an AUC of 1.0 corresponds to perfect classification (i.e., every positive is ranked above every negative).

The AUC-ROC curve can also be used in classification problems with more than two classes. In this case, a ROC curve is computed for each class by treating that class as the positive outcome and all other classes as negative (one-vs-rest), and the TPR and FPR values are found for each of these outcomes. The overall AUC is then typically reported as the average of the per-class AUC values. The true positive rate is the proportion of actual positive data points that are correctly classified as positive, and the false positive rate is the proportion of actual negative data points that are incorrectly marked as positive. True positive rate is also called sensitivity, and false positive rate is also called fall-out.

Here is a sample plot for AUC-ROC curve:

The higher the area under the curve, the better the machine learning model separates the classes.

AUC-ROC is most commonly used to measure machine learning model performance for binary classification problems. To check for overfitting, it should be computed on unseen test data rather than on the training data. The AUC-ROC curve is also useful for multi-class problems, where the number of classes is more than two.

The formula for measuring true positive rate is:

True Positive Rate = TP / (TP + FN)

The formula for measuring false positive rate is:

False Positive Rate = FP / (TN + FP)
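The two formulas above can be turned into a small sketch that sweeps the decision threshold, collects the (FPR, TPR) points and estimates the area under them with the trapezoidal rule. The helper names `roc_points` and `auc` and the sample labels and scores are illustrative assumptions, and the sketch assumes both classes are present in the data.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points obtained by lowering the threshold step by step."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        # Everything scored at or above the threshold is predicted positive.
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC points using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical labels (1 = positive) and predicted scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]

curve = roc_points(labels, scores)
print("ROC points:", curve)
print("AUC:", auc(curve))
```

Here one negative is scored above one positive, so the AUC is below 1.0 but well above the 0.5 expected from random scores.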

## What is Logarithmic loss?

Logarithmic loss (log loss) is used in classification tasks where the model outputs a probability for each class rather than only a class label, such as predicting whether a house price is above or below a certain threshold. It measures how close the predicted probabilities are to the actual labels and penalizes confident but wrong predictions heavily. Log loss is equivalent to the cross-entropy between the true labels and the predicted probabilities. It is not meant for regression problems, where the target is a real value rather than a class.

Log loss is used to evaluate the performance of classification models built using algorithms such as logistic regression, support vector machine (SVM), random forest, and gradient boosting, provided the model produces class probabilities.

The idea behind log loss is to take the negative natural logarithm of the probability that the model assigned to the true class. A probability close to 1 gives a loss close to 0, while a probability close to 0 gives a very large loss, which indicates poor machine learning model performance.

The logarithmic loss value is defined as:

$$Loss = -\log(P(Y=y))$$

Where $y$ is the true class of the data point and $P(Y=y)$ is the probability that the model assigned to that class. The log loss of a model is the average of this value over all data points.

Classification models are often trained by minimizing this loss function, and they can be evaluated with it as well. The model with the lowest logarithmic loss has the best performance. This technique is especially useful when more than one model has similar accuracy, because in such cases log loss can still distinguish the models by the quality of their predicted probabilities.

For problems with more than two classes, log loss generalizes to categorical cross-entropy, which is usually paired with a softmax output so that the predicted probabilities across all classes sum to one. The loss for each data point is still the negative logarithm of the probability assigned to its true class.
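A minimal sketch of binary log loss in plain Python is shown below. The function name `log_loss`, the clipping constant `eps` and the sample probabilities are assumptions made for illustration; clipping is a common guard against taking the logarithm of zero.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Average negative log of the probability assigned to the true class.

    y_prob holds the predicted probability of the positive class (label 1).
    """
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)       # keep p away from exactly 0 or 1
        p_true = p if y == 1 else 1 - p     # probability given to the true class
        total += -math.log(p_true)
    return total / len(y_true)


# Hypothetical labels and two sets of predicted probabilities.
labels = [1, 0, 1, 0]
mostly_right = [0.9, 0.1, 0.8, 0.2]
confidently_wrong = [0.1, 0.9, 0.8, 0.2]

print("mostly right:", log_loss(labels, mostly_right))
print("confidently wrong:", log_loss(labels, confidently_wrong))
```

The second set of predictions gets a much higher loss because two of its predictions put only 0.1 probability on the true class.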

## What is the Kappa score?

The Kappa score measures how well the predictions of a classification model agree with the actual labels, after correcting for the agreement that would be expected by chance. A score of 1 indicates perfect agreement, while a score of 0 indicates agreement no better than chance. Kappa scores above 0.75 are generally taken to indicate very good classification performance, while values below 0.40 indicate poor classification results. However, Kappa scores should be interpreted with some care, because the chance-agreement term depends on how the classes are distributed in the data set, so the same accuracy can give different Kappa values on different data sets.

It is also called Cohen’s kappa and has been used since the 1960s to assess inter-rater reliability, that is, how much two raters agree when labeling the same items. When evaluating a machine learning model, the known ground truth labels play the role of one rater and the model’s predicted labels play the role of the other.

The machine learning model is then evaluated using the Kappa score, which can be calculated as follows:

$$K=\frac{p_o - p_e}{1 - p_e}$$

Where $p_o$ is the observed agreement, which for a binary classifier is the accuracy $\frac{TP+TN}{P+N}$, and $p_e$ is the agreement expected by chance, computed from how often each class appears among the actual and the predicted labels. P and N are the number of positive and negative samples respectively in a data set. TP (True Positive) refers to positive samples to which the machine learning model has correctly assigned a positive label, whereas TN (True Negative) refers to negative samples to which the machine learning model has correctly assigned a negative label.
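As a hedged illustration, Cohen's kappa for a binary classifier can be computed from the four confusion-matrix counts: the observed agreement is the accuracy, and the chance agreement comes from the class frequencies in the actual and predicted labels. The function name `cohen_kappa` and the counts are invented for this sketch.

```python
def cohen_kappa(tp, fp, fn, tn):
    """Cohen's kappa from binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    p_o = (tp + tn) / total                      # observed agreement (accuracy)

    actual_pos, actual_neg = tp + fn, fp + tn    # class counts in the ground truth
    pred_pos, pred_neg = tp + fp, fn + tn        # class counts in the predictions
    # Chance agreement: probability that both "raters" pick the same class at random.
    p_e = (actual_pos * pred_pos + actual_neg * pred_neg) / total ** 2

    # Note: p_e equals 1 (division by zero) if only one class ever occurs.
    return (p_o - p_e) / (1 - p_e)


# Hypothetical counts: accuracy is 0.85, chance agreement is 0.5.
kappa = cohen_kappa(tp=40, fp=10, fn=5, tn=45)
print("kappa:", kappa)
```

With these counts the accuracy is 0.85 but kappa is about 0.7, since half of the agreement could have happened by chance.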

## What is a confusion matrix?

The confusion matrix is defined as a contingency table that is used to evaluate the performance of machine learning classifiers. It measures how well instances from each category are predicted by the machine learning model. The confusion matrix can be used to measure the performance of machine learning models in three ways:

• Accuracy is the proportion of all predictions that are correct.
• Precision, also known as positive predictive value, tells how many of the positives predicted by the machine learning model are actual positives.
• Recall, also known as sensitivity or true positive rate, is the machine learning model’s ability to find all positive instances in a data set.

Confusion matrices can be generated using different machine learning libraries. In the case of Python, scikit-learn provides functions that can be used to generate confusion matrices from the actual labels and the labels predicted by a machine learning model.
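To make the relationship between the confusion matrix and these measures concrete, here is a small plain-Python sketch for the binary case. The function name `confusion_counts` and the sample labels are illustrative assumptions.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


# Hypothetical actual and predicted labels for ten data points.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)   # share of predicted positives that are real positives
recall = tp / (tp + fn)      # share of real positives that were found

print("TP, FP, FN, TN:", tp, fp, fn, tn)
print("accuracy:", accuracy, "precision:", precision, "recall:", recall)
```

In scikit-learn, `sklearn.metrics.confusion_matrix(y_true, y_pred)` returns the same counts arranged as a matrix, with rows for actual classes and columns for predicted classes.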

## What is the Kolmogorov-Smirnov test?

The Kolmogorov-Smirnov (KS) test is a non-parametric statistical test that measures the difference between two probability distributions. In machine learning it can be used, for example, to compare the distribution of a model’s predictions on new data with the distribution seen earlier, or to compare the predictions of different machine learning models trained on the same dataset but using different training parameters (hyper-parameters).

The Kolmogorov-Smirnov test is also used for evaluating the equality of two empirical distribution functions (EDF) or to check if they come from a common parent distribution.

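The KS statistic is calculated as follows:

$$D=\sup_{x}\left|F_A(x) - F_B(x)\right|$$

Where $F_A$ and $F_B$ are the empirical distribution functions of the two samples being compared, for example the predictions of machine learning models A and B.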

If the KS statistic is high, there is a big difference between the two distributions, which suggests that they come from different parent distributions. When the two samples are a model’s predictions on older and on newer data, a high value indicates that the data or the model’s behavior has shifted and that the machine learning model (or its pipeline) may need to be revisited.

The KS test has been widely used in machine learning because it can evaluate how a model’s outputs change with new data over time and how they differ across machine learning models trained on the same dataset but using different hyper-parameters.
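The KS statistic D can be sketched in plain Python by evaluating both empirical distribution functions at every observed value and taking the largest gap. The function name `ks_statistic` and the two samples are assumptions for illustration, and the sketch returns only the statistic, not a p-value.

```python
import bisect


def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the empirical CDFs of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    # Empirical CDFs are step functions, so the largest gap occurs at a data point.
    for x in sorted(set(a) | set(b)):
        cdf_a = bisect.bisect_right(a, x) / len(a)   # share of a that is <= x
        cdf_b = bisect.bisect_right(b, x) / len(b)   # share of b that is <= x
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Hypothetical prediction scores from two models (or two time periods).
scores_a = [0.1, 0.2, 0.3, 0.4]
scores_b = [0.3, 0.4, 0.5, 0.6]

print("D:", ks_statistic(scores_a, scores_b))
```

If SciPy is available, `scipy.stats.ks_2samp` computes the same statistic together with a p-value.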

## Cross-validation techniques

The cross-validation technique is used to evaluate machine learning model performance. This technique helps in determining how well the machine learning model generalizes to unseen data.

There are generally two types of cross-validation techniques:

• Holdout method/Split sample approach: It divides a given dataset once into training and testing datasets (for example 80%-20%, 70%-30% or 60%-40%).
• Repeated holdout: It divides a given dataset into training and testing datasets multiple times. The best-known form is k-fold cross-validation, where k is the number of folds. For example, in five-fold cross-validation the data is divided into five sets; the machine learning model is trained on four of them and tested on the remaining one, this is repeated five times so that each set is used for testing once, and the model performance is the average of the five test results. The leave-one-out method is the special case where k equals the number of data points.
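The k-fold procedure can be sketched as a generator that yields the training and testing indices for each fold. The function name `k_fold_indices` is a hypothetical helper, and the sketch does not shuffle the data, which real projects usually do before splitting.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


# Five-fold split of ten data points: each fold tests on two points
# and trains on the other eight.
folds = list(k_fold_indices(10, 5))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

A model would be fitted on each `train_idx` and scored on the matching `test_idx`, and the scores averaged. Calling the helper with `k` equal to the number of samples gives the leave-one-out split.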

## What is the Gini coefficient?

The Gini coefficient is a statistical measure of distribution inequality, and for classification models it is usually reported on a scale from 0 to 100%. It is closely related to AUC: Gini = 2 × AUC - 1. A model with a Gini coefficient close to 0 cannot distinguish between the classes any better than random guessing. On the other hand, a high value of the Gini coefficient shows that the machine learning model can create a clear distinction between the classes. Thus, a machine learning model with a high Gini coefficient is better for classification problems than one with a low Gini coefficient.
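For a classifier, the Gini coefficient is commonly derived from the AUC as Gini = 2 × AUC - 1. The sketch below computes AUC as the share of positive-negative pairs in which the positive receives the higher score, then rescales it. The function names `auc_by_pairs` and `gini` and the sample data are illustrative assumptions.

```python
def auc_by_pairs(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5      # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


def gini(y_true, scores):
    """Gini coefficient of a classifier, rescaled from its AUC."""
    return 2 * auc_by_pairs(y_true, scores) - 1


# Hypothetical labels (1 = positive) and predicted scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]

print("AUC:", auc_by_pairs(labels, scores))
print("Gini:", gini(labels, scores))
```

A random model has an AUC near 0.5 and therefore a Gini near 0, while a perfect model has an AUC of 1.0 and a Gini of 1 (100%).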

## What is the gain and lift chart?

Gain and lift charts are a good way to evaluate how a machine learning model performs on independent test data. Both charts are built by sorting the test data from the highest to the lowest predicted probability of the positive class and then looking at how many actual positives fall in the top portion of that ranking. They are most often used for binary classification problems, such as deciding which customers to target in a campaign.

A gain chart (cumulative gains chart) plots the percentage of the population targeted on the x-axis against the percentage of all actual positives captured within that portion on the y-axis. The further the curve rises above the diagonal line of a random model, the better the machine learning model.

A lift chart evaluates how much better a machine learning model performs than random selection. The x-axis of a lift chart again represents the percentage of the population targeted, ordered by predicted score. The y-axis represents the lift, which is the ratio of the positive rate within the targeted portion to the overall positive rate in the data. A lift of 1 means the model is no better than random selection, while a lift of 3 in the top 10% means that group contains three times as many positives as a random 10% sample would.

A combination of gain and lift charts can be used to understand machine learning model performance on independent test data.
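The numbers behind cumulative gain and lift charts can be sketched as follows: rank the test data by predicted score, cut the ranking into equal portions, and for each cumulative portion compute the share of positives captured (gain) and its ratio to the share of the population targeted (lift). The function name `gain_and_lift`, the bin count and the sample data are assumptions for illustration.

```python
def gain_and_lift(y_true, scores, n_bins=5):
    """Return (population_share, gain, lift) for each cumulative bin."""
    # Sort labels from the highest to the lowest predicted score.
    ranked = [y for _, y in sorted(zip(scores, y_true),
                                   key=lambda pair: pair[0], reverse=True)]
    n = len(ranked)
    total_positives = sum(ranked)
    rows = []
    for b in range(1, n_bins + 1):
        cutoff = (n * b) // n_bins                 # size of the top portion
        captured = sum(ranked[:cutoff])            # positives found in that portion
        population_share = cutoff / n
        gain = captured / total_positives
        lift = gain / population_share             # 1.0 means no better than random
        rows.append((population_share, gain, lift))
    return rows


# Hypothetical test set of ten data points with four positives.
labels = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]

table = gain_and_lift(labels, scores)
for share, gain, lift in table:
    print(f"top {share:.0%}: gain={gain:.2f}, lift={lift:.2f}")
```

Plotting the gain column against the population share gives the gain chart, and plotting the lift column gives the lift chart. In this example the top 20% of the ranking already contains half of the positives, which corresponds to a lift of 2.5.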

## What is the Chi-square test?

The chi-square test can be used to assess the performance of machine learning classification models. It is a hypothesis testing technique that tests whether the observed frequencies of events are significantly different from their expected frequencies. The null hypothesis for the chi-square test is “there exists no difference in model performance” and its alternative hypothesis states “the model performs differently than what was expected”.

The chi-square test works on counts of categorical outcomes, so it applies to categorical variables such as class labels rather than to continuous values.
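As a minimal sketch, the chi-square statistic sums the squared differences between observed and expected counts, each divided by the expected count. The function name `chi_square_statistic` and the counts are made up for illustration; here the observed counts stand for predicted class frequencies and the expected counts for the frequencies the actual labels would imply.

```python
def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts for three classes out of 100 test points.
observed = [48, 35, 17]   # how often the model predicted each class
expected = [50, 30, 20]   # how often each class should appear

statistic = chi_square_statistic(observed, expected)

# 5.991 is the 5% critical value of the chi-square distribution
# with 2 degrees of freedom (number of categories minus one).
significant = statistic > 5.991
print("chi-square:", statistic, "significant at 5%:", significant)
```

A statistic below the critical value means the null hypothesis of no difference is not rejected at that significance level.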

## What is the Brier score?

The Brier score is defined as the mean squared error between the probabilities predicted by a machine learning model and the actual outcomes, coded as 0 or 1. It is used to evaluate the accuracy of a set of probabilistic predictions. The Brier score ranges between 0 and 1, where zero stands for perfect predictions and one stands for predictions that are always confidently wrong; always predicting a probability of 0.5 gives a score of 0.25. The lower the Brier score, the better the machine learning model’s accuracy.
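A short plain-Python sketch of the Brier score for binary outcomes follows. The function name `brier_score` and the sample values are assumptions for illustration.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    if len(y_true) != len(y_prob) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Hypothetical outcomes and predicted probabilities of the positive class.
outcomes = [1, 0, 1, 0]
probabilities = [0.9, 0.1, 0.8, 0.3]

print("model:", brier_score(outcomes, probabilities))
print("always 0.5:", brier_score(outcomes, [0.5] * 4))
print("always wrong:", brier_score(outcomes, [0.0, 1.0, 0.0, 1.0]))
```

The three calls illustrate the range of the score: a reasonable model lands close to 0, constant predictions of 0.5 give 0.25, and predictions that are always confidently wrong give 1.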