In this post, you will learn about the key concepts of overfitting and underfitting in relation to machine learning models. You will also get a chance to test your understanding by attempting the quiz, which will help you prepare for interview questions on underfitting and overfitting. As a data scientist, you must have a good understanding of both concepts.
Introduction to Overfitting & Underfitting
Assuming an independent and identically distributed (i.i.d.) dataset, when the prediction error on both the training and test datasets is high, the model is said to have underfit. This is called underfitting the model, or model underfitting. When the prediction error on the test dataset is quite high, or noticeably higher than on the training dataset, the model can be said to have overfit. In other words, when the model accuracy on the training dataset is much higher than the model accuracy on the test dataset, the model can be said to have overfit. This is called model overfitting.
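To make the two definitions concrete, here is a minimal sketch (not from the original post) of a helper that labels a model based on its training and test errors. The function name and the thresholds `high` and `gap` are illustrative assumptions, not standard values:

```python
def diagnose_fit(train_error, test_error, high=0.2, gap=0.1):
    """Label a model as underfitting, overfitting, or a reasonable fit.

    `high` and `gap` are illustrative thresholds chosen for this sketch;
    in practice the cutoffs depend on the problem and the error metric.
    """
    if train_error >= high and test_error >= high:
        return "underfitting"      # both errors high
    if test_error - train_error >= gap:
        return "overfitting"       # test error much higher than training error
    return "reasonable fit"

print(diagnose_fit(0.35, 0.38))  # both errors high -> underfitting
print(diagnose_fit(0.02, 0.30))  # large train/test gap -> overfitting
print(diagnose_fit(0.08, 0.11))  # low errors, small gap -> reasonable fit
```

Note that both definitions compare the same two numbers: underfitting is about both errors being high, while overfitting is about the gap between them.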
There is a higher tendency for the model to overfit the training dataset if the hypothesis space searched by the learning algorithm is large. Let's try to understand what hypothesis space means and what searching the hypothesis space means. If the learning algorithm can be configured with a large number of different hyperparameters, and can be trained on different training datasets extracted from the same underlying dataset, it can produce a large number of different models (hypotheses, h(X)) on the same data. Recall that a hypothesis is an estimator of the target function. Thus, a large number of models can be fit on the same dataset; this is what is meant by a larger hypothesis space, and the learning algorithm in such a scenario can be said to have access to a larger hypothesis space. Given this larger hypothesis space, there is a higher possibility that the model will overfit the training dataset.
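As an illustration of this point (an example added here, not from the original post), consider polynomial models: degree-1 polynomials form a small hypothesis space, while polynomials of degree n-1 through n points form a much richer one. A rich enough hypothesis space can drive the training error all the way to zero by memorizing the data, noise included. The sketch below uses Lagrange interpolation in plain Python to fit noisy samples of a roughly linear trend exactly:

```python
def lagrange_fit(points):
    """Return the degree n-1 polynomial passing through all n points."""
    def p(x):
        total = 0.0
        for i, (xi, yi) in enumerate(points):
            term = yi
            for j, (xj, _) in enumerate(points):
                if i != j:
                    term *= (x - xj) / (xi - xj)  # basis polynomial factor
            total += term
        return total
    return p

# noisy samples of an underlying linear trend y ~ x (made-up toy data)
data = [(0, 0.1), (1, 0.9), (2, 2.3), (3, 2.8), (4, 4.2)]
p = lagrange_fit(data)

# the rich hypothesis space memorizes every point, noise and all
train_error = sum((p(x) - y) ** 2 for x, y in data)
print(train_error)  # 0.0 on the training set
```

Zero training error here is not a sign of a good model: the interpolating polynomial wiggles between the points and will generalize poorly, which is exactly the overfitting risk a larger hypothesis space brings.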
Here is a diagram which represents underfitting vs overfitting in the form of model performance error vs model complexity.
In the above diagram, when the model complexity is low (a decision stump or a 30-NN model), the training and test errors are both high. This represents model underfitting. When the model complexity is very high (1-NN or a deep decision tree), there is a very large gap between training and test error. This represents model overfitting. The sweet spot is in between, represented by the orange dashed line. At the sweet spot, i.e., the ideal model, there is a small gap between training and test error.
Interview Questions on Underfitting & Overfitting
Before getting into the quiz, let's look at some interview questions related to the overfitting and underfitting concepts:
- What is overfitting and underfitting?
- What is the difference between overfitting and underfitting?
- Illustrate the relationship between training/test error and model complexity in the context of overfitting and underfitting.
Here is the quiz which can help you test your understanding of overfitting & underfitting concepts and prepare well for interviews.
#1. Assuming i.i.d. training and test data, for some random model that has not been fit on the training dataset, the training error is expected to be _________ the test error
#2. Assuming i.i.d. training and test data, for the model that has been fit on the training dataset, the training error is expected to be _________ the test error
#3. The training error or accuracy of the model fit on i.i.d. training and test datasets provides an __________ biased estimate of the generalization performance
#4. In case of underfitting, both the training and test error are _________
#5. In case of overfitting, the gap between training and test error is ___________
#6. In case of overfitting, the training error is _________ than test error
#7. Given the larger hypothesis space, there is a higher tendency for the model to ________ the training dataset
#8. Given the following type of decision tree model, which may result in model underfitting?
#9. Given the following type of decision tree model, which may result in model overfitting?
#10. Given the following models trained using K-NN, the model which could result in overfitting will most likely have the value of K as ___________
#11. Given the following models trained using K-NN, the model which could result in underfitting will most likely have the value of K as ___________
#12. A model suffering from underfitting will most likely have _____________
#13. A model suffering from overfitting will most likely have _____________