In this post, you will learn about key concepts of overfitting and underfitting in relation to machine learning models. You will also get a chance to test your understanding by attempting the quiz, which will help you prepare for interview questions on underfitting and overfitting. As a data scientist, you must have a good understanding of the overfitting and underfitting concepts.
Introduction to Overfitting & Underfitting
Assuming an independent and identically distributed (i.i.d.) dataset, when the prediction error is high on both the training and test datasets, the model is said to have underfit. This is called model underfitting. When the prediction error on the test dataset is considerably higher than on the training dataset, the model is said to have overfit. In other words, when the model's accuracy on the training dataset is much higher than its accuracy on the test dataset, the model has overfit. This is called model overfitting.
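One common way to diagnose these two failure modes is to compare training and test error directly. Here is a minimal sketch using scikit-learn and a synthetic sine-curve dataset; the polynomial degrees, sample size, and noise level are illustrative choices, not prescriptions. A degree-1 fit underfits (both errors high), while a degree-15 fit overfits (training error far below test error).

```python
# Sketch: diagnosing under/overfitting by comparing train vs. test MSE.
# Synthetic data: a noisy sine curve; degrees 1, 4, 15 are illustrative.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = rng.uniform(0, 1, size=(40, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.1, size=40)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

results = {}
for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    results[degree] = (train_err, test_err)
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```

Degree 1 shows the underfitting signature (high error everywhere), degree 15 the overfitting signature (near-zero training error but a much larger test error), and degree 4 sits near the sweet spot.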
There is a higher tendency for the model to overfit the training dataset if the hypothesis space searched by the learning algorithm is large. Let's try to understand what hypothesis space means and what it means to search it. Recall that a hypothesis h(X) is an estimator of the target function. If the learning algorithm can be configured with a large number of different hyperparameter settings, and can be trained on different training datasets extracted from the same underlying dataset, it can produce a large number of different models (hypotheses) fit to the same data. This set of candidate models is the hypothesis space, and a learning algorithm that can produce more of them is said to have access to a larger hypothesis space. Given a larger hypothesis space, there is a higher possibility that the chosen model overfits the training dataset.
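The idea can be sketched with decision trees, where the `max_depth` hyperparameter controls how rich the hypothesis space is: an unbounded tree can represent far more functions than a depth-1 stump, and is correspondingly able to memorize noisy training labels. The data and depth values below are illustrative assumptions, not from the original post.

```python
# Sketch: a richer hypothesis space (deeper trees) can fit the training
# data perfectly, including its noise, at the cost of test accuracy.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(1)
X = rng.normal(size=(200, 5))
# Labels depend on X[:, 0] plus noise, so they are not perfectly learnable.
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

scores = {}
for depth in (1, 3, None):  # None = unbounded depth, the richest space here
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(f"max_depth={depth}: train={scores[depth][0]:.2f}, "
          f"test={scores[depth][1]:.2f}")
```

The unbounded tree reaches 100% training accuracy by memorizing the noisy labels, yet its test accuracy stays well below that: a concrete case of a larger hypothesis space enabling overfitting.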
Here is a diagram that represents underfitting vs. overfitting in terms of model performance error vs. model complexity.
In the above diagram, when the model complexity is low (a decision stump or a 30-NN model), both the training and test errors are high. This represents model underfitting. When the model complexity is very high (1-NN or a deep decision tree), there is a very large gap between the training and test errors. This represents model overfitting. The sweet spot lies in between, represented by the orange dashed line. At the sweet spot, i.e., the ideal model, the gap between training and test error is small.
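The k-NN end of this curve is easy to reproduce in code: smaller k means a more complex model, and 1-NN memorizes the training set outright. The following sketch uses a synthetic noisy dataset of my own choosing to trace the training/test errors the diagram describes.

```python
# Sketch: error vs. model complexity for k-NN, where smaller k = more complex.
# 1-NN gets zero training error (each point is its own nearest neighbor)
# but a clearly higher test error on noisy data.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(2)
X = rng.normal(size=(300, 2))
y = ((X[:, 0] ** 2 + X[:, 1] ** 2) < 1).astype(int)  # label: inside unit circle
y[rng.rand(300) < 0.1] ^= 1                          # flip 10% of labels (noise)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2)

errors = {}
for k in (1, 5, 15, 75):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    errors[k] = (1 - knn.score(X_tr, y_tr), 1 - knn.score(X_te, y_te))
    print(f"k={k:2d}  train error={errors[k][0]:.2f}  test error={errors[k][1]:.2f}")
```

At k=1 the training error is exactly zero while the test error is not: the large train/test gap on the right side of the diagram. Large k averages over so many neighbors that even the training error rises, mirroring the underfitting region on the left.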
Interview Questions on Underfitting & Overfitting
Before getting into the quiz, let's look at some interview questions on the overfitting and underfitting concepts:
- What is overfitting and underfitting?
- What is the difference between overfitting and underfitting?
- Illustrate the relationship between training/test error and model complexity in the context of overfitting and underfitting.
Here is the quiz, which can help you test your understanding of overfitting and underfitting and prepare well for interviews.