Machine learning models are built to learn from training data and make predictions on new, unseen data. A model is said to overfit when it learns patterns that exist only in the training set, achieving high accuracy on the training data but failing to generalize. On the other hand, a model underfits when it cannot capture the relationship between the input variables and the target in either the training or the test set. In this post, you will learn about some of the key concepts of overfitting and underfitting in relation to machine learning models. In addition, you will get a chance to test your understanding by attempting the quiz. The quiz will help you prepare for interview questions on underfitting and overfitting. As a data scientist, you must have a good understanding of these concepts.
Introduction to Overfitting & Underfitting
Assuming an independent and identically distributed (i.i.d.) dataset, when the prediction error on both the training and test datasets is high, the model is said to have underfitted. This is called model underfitting. Underfitting can often be addressed by increasing model complexity, for example by using boosting algorithms that combine an ensemble of weaker models to produce better predictions. Indicators of underfitting include a low R-squared value, a large standard error of the estimate in regression analysis, and residual plots of linear or logistic regression model output that show systematic patterns.
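As a minimal sketch of spotting underfitting via a low R-squared value, the snippet below (synthetic data and variable names are illustrative) fits a straight line to data generated from a quadratic function; because the model is too simple for the data, R-squared stays close to zero:

```python
import numpy as np

rng = np.random.default_rng(0)

# Nonlinear ground truth: y = x^2 plus a little noise.
x = np.linspace(-3, 3, 200)
y = x**2 + rng.normal(0, 0.3, size=x.shape)

# Underfit: a degree-1 (straight line) model cannot capture the curvature.
slope, intercept = np.polyfit(x, y, deg=1)
y_pred = slope * x + intercept

# R-squared = 1 - SS_res / SS_tot; a value near 0 signals underfitting.
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
print(f"R-squared of the linear fit: {r_squared:.3f}")
```

A model with appropriate complexity (here, a degree-2 polynomial) would push R-squared close to 1 on the same data.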
When the prediction error on the test dataset is significantly higher than on the training dataset, the model is said to have overfitted. In other words, when the model's accuracy on the training dataset is much higher than its accuracy on the test dataset, the model has overfitted. This is called model overfitting. An overfitting model may appear to work well, giving high accuracy on the sample it was trained on. However, when the model sees new input data, its performance degrades significantly because it has memorized patterns specific to the training set instead of learning patterns that generalize.
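The memorization behavior described above can be demonstrated with a small NumPy sketch (the synthetic data and helper function are illustrative, not a production implementation): a 1-nearest-neighbor classifier scores perfectly on the data it memorized but noticeably worse on held-out points when the labels are noisy:

```python
import numpy as np

rng = np.random.default_rng(42)

def one_nn_predict(X_train, y_train, X_query):
    """Predict by copying the label of the single nearest training point."""
    preds = []
    for q in X_query:
        dists = np.linalg.norm(X_train - q, axis=1)
        preds.append(y_train[np.argmin(dists)])
    return np.array(preds)

# Two noisy, overlapping classes: some labels are inherently ambiguous.
X = rng.normal(0, 1, size=(200, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(0, 1.5, size=200) > 0).astype(int)

X_train, y_train = X[:150], y[:150]
X_test, y_test = X[150:], y[150:]

# Perfect recall of memorized points, degraded accuracy on unseen points.
train_acc = (one_nn_predict(X_train, y_train, X_train) == y_train).mean()
test_acc = (one_nn_predict(X_train, y_train, X_test) == y_test).mean()
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

The large train/test gap, rather than the training accuracy itself, is the signature of overfitting.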
Overfitting of machine learning models can be minimized or resolved using regularization techniques such as LASSO (least absolute shrinkage and selection operator), which adds a penalty on the absolute size of the model coefficients and shrinks some of them exactly to zero. Techniques such as validation curves and cross-validation learning curves can be used to spot overfitting.
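A minimal sketch of LASSO regularization, assuming scikit-learn is available (the synthetic data and the `alpha` value are illustrative): with several pure-noise features, ordinary least squares assigns every feature a nonzero coefficient, while LASSO's L1 penalty zeroes out most of the irrelevant ones:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)

# 5 informative features plus 15 pure-noise features.
X = rng.normal(size=(100, 20))
true_coef = np.zeros(20)
true_coef[:5] = [3.0, -2.0, 1.5, 4.0, -1.0]
y = X @ true_coef + rng.normal(0, 0.5, size=100)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)  # alpha chosen for illustration

# LASSO's L1 penalty drives many of the noise coefficients exactly to zero.
ols_zeros = int(np.sum(np.abs(ols.coef_) < 1e-8))
lasso_zeros = int(np.sum(np.abs(lasso.coef_) < 1e-8))
print(f"OLS coefficients exactly zero:   {ols_zeros}")
print(f"Lasso coefficients exactly zero: {lasso_zeros}")
```

In practice `alpha` would be tuned via cross-validation (e.g., `LassoCV`) rather than fixed by hand.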
What are the different scenarios in which overfitting of machine learning models can happen?
Overfitting of machine learning models can happen in some of the following scenarios:
- When the training dataset is much larger than the test set, and the model learns patterns in the large input space that only minimally increase accuracy on the small test set.
- When the machine learning algorithm uses too many parameters to model the training data.
- When the hypothesis space searched by the learning algorithm is large. Let's unpack what hypothesis space means and what it means to search it. Recall that a hypothesis is an estimator of the target function. If the learning algorithm can be configured with many different hyperparameters and trained on different training datasets extracted from the same underlying dataset, a large number of models (hypotheses, h(X)) can be fit on the same data. This is called a larger hypothesis space, and the learning algorithm in such a scenario is said to have access to it. Given a larger hypothesis space, there is a higher chance that the selected model overfits the training dataset.
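One concrete way to picture hypothesis space size is polynomial degree: raising the degree enlarges the family of functions the algorithm can choose from. The sketch below (synthetic quadratic data; the specific degrees are illustrative) shows the larger space fitting the training data better while generalizing worse:

```python
import warnings

import numpy as np

warnings.filterwarnings("ignore")  # silence polyfit conditioning warnings
rng = np.random.default_rng(1)

# True function is quadratic; train and test come from the same distribution.
x_train = rng.uniform(-1, 1, 30)
y_train = x_train**2 + rng.normal(0, 0.1, 30)
x_test = rng.uniform(-1, 1, 200)
y_test = x_test**2 + rng.normal(0, 0.1, 200)

def mse(degree):
    """Train/test mean squared error of a degree-`degree` polynomial fit."""
    coefs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    return train_err, test_err

small_train, small_test = mse(2)    # small hypothesis space
large_train, large_test = mse(15)   # much larger hypothesis space

print(f"degree 2:  train={small_train:.4f}  test={small_test:.4f}")
print(f"degree 15: train={large_train:.4f}  test={large_test:.4f}")
```

The degree-15 fit has more capacity to chase noise in the 30 training points, which is exactly the overfitting risk a larger hypothesis space brings.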
What are the different scenarios in which underfitting of machine learning models can happen?
Underfitting of machine learning models can happen in some of the following scenarios:
- When the training set has far fewer observations than variables, which may lead to underfitting, i.e., a high-bias machine learning model. In such cases, the learning algorithm cannot find the relationship between the input data and the output variable because the model is not complex enough to capture it.
- When the machine learning algorithm cannot find any pattern between the training and test set variables, which may happen with high-dimensional datasets or a large number of input variables. This could be due to insufficient model complexity, too few training observations for learning the patterns, limited computing power that restricts the algorithm's ability to search for patterns in a high-dimensional space, etc.
Here is a diagram that represents underfitting vs. overfitting in the form of model performance (error) vs. model complexity.
In the above diagram, when the model complexity is low (a decision stump or a 30-NN model), the training and test errors are both high. This represents model underfitting. When the model complexity is very high (1-NN or a deep decision tree), there is a very large gap between training and test error. This represents model overfitting. The sweet spot is in between, represented by the orange dashed line. At the sweet spot, i.e., the ideal model, the gap between training and test error is small.
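The picture above can be reproduced numerically with a small NumPy k-NN sketch (synthetic data and the specific k values are illustrative): k = 1 is the most complex model (it memorizes the training set), and complexity decreases as k grows and predictions get smoothed over more neighbors:

```python
import numpy as np

rng = np.random.default_rng(7)

def knn_error(X_tr, y_tr, X_q, y_q, k):
    """Misclassification rate of a majority-vote k-NN classifier."""
    errors = 0
    for q, label in zip(X_q, y_q):
        dists = np.linalg.norm(X_tr - q, axis=1)
        nearest_labels = y_tr[np.argsort(dists)[:k]]
        pred = int(nearest_labels.mean() > 0.5)
        errors += int(pred != label)
    return errors / len(y_q)

# Noisy two-class problem: some labels are inherently ambiguous.
X = rng.normal(size=(300, 2))
y = (X[:, 0] + rng.normal(0, 1.0, size=300) > 0).astype(int)
X_tr, y_tr = X[:200], y[:200]
X_te, y_te = X[200:], y[200:]

# Sweep complexity: high (k=1), moderate (k=15), low (k=101).
results = {}
for k in (1, 15, 101):
    results[k] = (knn_error(X_tr, y_tr, X_tr, y_tr, k),
                  knn_error(X_tr, y_tr, X_te, y_te, k))
    print(f"k={k:3d}  train error={results[k][0]:.2f}  test error={results[k][1]:.2f}")
```

At k = 1 the train error is zero while the test error is not (the overfitting gap), whereas very large k smooths so aggressively that even the training error rises.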
Interview Questions on Underfitting & Overfitting
Before getting into the quiz, let’s look at some of the interview questions in relation to overfitting and underfitting concepts:
- What is overfitting and underfitting?
- What is the difference between overfitting and underfitting?
- Illustrate the relationship between training/test error and model complexity in the context of overfitting and underfitting.
Here is the quiz which can help you test your understanding of overfitting & underfitting concepts and prepare well for interviews.