Overfitting & Underfitting Concepts & Interview Questions


Machine learning models are built to learn from training data and make predictions on new, unseen data. A model is said to overfit when it learns patterns that exist only in the training set: it predicts with high accuracy on the training data but fails to generalize. On the other hand, a model underfits when it cannot find the pattern or relationship between variables in either the training or the test dataset. In this post, you will learn about some of the key concepts of overfitting and underfitting in relation to machine learning models. In addition, you will also get a chance to test your understanding by attempting the quiz. The quiz will help you prepare for interview questions on underfitting and overfitting. As a data scientist, you must have a good understanding of the overfitting and underfitting concepts.


Introduction to Overfitting & Underfitting

Assuming an independent and identically distributed (i.i.d.) dataset, when the prediction error on both the training and the test dataset is high, the model is said to have underfitted. This is called model underfitting. The underfitting problem can often be resolved using techniques such as boosting, which combines an ensemble of weak learners in a way that produces better predictions than any single one of them. Underfitting can be recognized by signs such as a low R-squared value, a large standard error of the estimate in regression analysis, or residual plots that show clear structure left unexplained by a linear or logistic regression model.
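As a small illustrative sketch of how boosting can lift an underfit model, the snippet below fits a single decision stump (which underfits a nonlinear dataset) and then a gradient-boosted ensemble of stumps. The dataset, the scikit-learn estimators, and the hyperparameters are assumptions chosen for demonstration, not part of the discussion above.

```python
# Sketch: an underfit decision stump vs. a boosted ensemble of stumps.
# Dataset and hyperparameters are arbitrary illustrative choices.
from sklearn.datasets import make_moons
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=1000, noise=0.25, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# A single stump (one split) is too simple for the curved class boundary:
# both training and test accuracy stay low, the signature of underfitting.
stump = DecisionTreeClassifier(max_depth=1).fit(X_tr, y_tr)
print(f"stump   train={stump.score(X_tr, y_tr):.2f} test={stump.score(X_te, y_te):.2f}")

# Boosting combines many weak learners (stumps) into a stronger ensemble,
# improving accuracy on both the training and the test set.
boosted = GradientBoostingClassifier(max_depth=1, n_estimators=200,
                                     random_state=42).fit(X_tr, y_tr)
print(f"boosted train={boosted.score(X_tr, y_tr):.2f} test={boosted.score(X_te, y_te):.2f}")
```

Note that both errors drop together, which is the expected behavior when the original problem was underfitting rather than overfitting.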

When the prediction error on the test dataset is considerably higher than on the training dataset, the model is said to have overfitted. In other words, when the model's accuracy on the training dataset is much higher than its accuracy on the test dataset, the model has overfitted. This is called model overfitting. An overfitting model may work well on the sample of data it was trained on and report high accuracy, but when it sees new input data its performance degrades significantly, because the model has memorized patterns specific to the training set rather than learning a generalization.
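Here is a minimal sketch of how this train/test gap shows up in practice, using an unconstrained decision tree on a synthetic dataset with label noise. All names and settings below are illustrative assumptions.

```python
# Sketch: a gap between training and test accuracy signals overfitting.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y=0.2 flips 20% of the labels, injecting noise the model can memorize.
X, y = make_classification(n_samples=500, n_features=20, flip_y=0.2,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# An unconstrained (fully grown) tree memorizes the training set, noise included.
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
train_acc = tree.score(X_tr, y_tr)
test_acc = tree.score(X_te, y_te)
print(f"train={train_acc:.2f} test={test_acc:.2f}")  # near-perfect train, much lower test
```

The large gap between the two scores, rather than either score in isolation, is the diagnostic for overfitting.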

Overfitting can be minimized or resolved using regularization techniques such as LASSO (least absolute shrinkage and selection operator), which penalizes large coefficients and can shrink some of them exactly to zero, unlike an unregularized least-squares fit. Diagnostic tools such as validation curves and cross-validation can be used to spot overfitting.
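The following is a small illustrative sketch of LASSO's coefficient shrinkage, assuming scikit-learn and a synthetic regression problem in which only two of ten features matter; the alpha value is an arbitrary choice.

```python
# Sketch: LASSO shrinks coefficients of irrelevant features toward zero,
# while ordinary least squares assigns every feature a nonzero weight.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
# Only features 0 and 1 carry signal; the other 8 are pure noise.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("OLS nonzero coefs:  ", int(np.sum(np.abs(ols.coef_) > 1e-6)))
print("LASSO nonzero coefs:", int(np.sum(np.abs(lasso.coef_) > 1e-6)))
```

The L1 penalty drives most of the noise-feature coefficients to exactly zero while keeping the informative ones, which is the selection behavior that helps a linear model resist overfitting.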

What are different scenarios in which machine learning models overfitting can happen?

Overfitting of machine learning models can happen in some of the following scenarios:

  • When the training dataset is small or noisy relative to the complexity of the input space, so the algorithm learns patterns specific to those particular samples that do not carry over to new data.
  • When the machine learning algorithm uses too many parameters to model the training data.
  • When the hypothesis space searched by the learning algorithm is large. Let's understand what hypothesis space means and what it means to search it. If the learning algorithm used for fitting the model has a large number of different hyperparameter settings, and can be trained on different training datasets extracted from the same data, the result is a large number of candidate models (hypotheses, h(X)) that can be fit to the same data. Recall that a hypothesis is an estimator of the target function. The learning algorithm in such a scenario is said to have access to a larger hypothesis space, and given this larger hypothesis space, there is a higher possibility that the chosen model overfits the training dataset.
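One way to see the effect of a larger hypothesis space is to grow the polynomial degree of a regression model: each higher degree strictly contains the lower-degree models, so the training error keeps falling even as the risk of overfitting grows. The sketch below assumes scikit-learn and synthetic sine data; degrees 1, 3, and 15 are arbitrary illustrative choices.

```python
# Sketch: enlarging the hypothesis space (higher polynomial degree)
# lets the model fit the training data ever more closely.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(-3, 3, size=30)).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(scale=0.3, size=30)  # noisy sine targets

train_errors = []
for degree in (1, 3, 15):  # progressively larger hypothesis spaces
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(x, y)
    mse = np.mean((model.predict(x) - y) ** 2)  # training error only
    train_errors.append(mse)
    print(f"degree={degree:2d} training MSE={mse:.4f}")
```

The degree-15 model drives training error lowest by chasing the noise, which is exactly the overfitting risk that a large hypothesis space creates; training error alone can never reveal it, which is why a held-out test set is needed.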

What are different scenarios in which machine learning models underfitting can happen?

Underfitting of machine learning models can happen in some of the following scenarios:

  • When the training set has too few observations to learn the signal from, which may lead to underfitting and a high-bias model. In such cases, the machine learning algorithm cannot find the relationship between the input data and the output variable because the model is not complex enough to capture it.
  • When the machine learning algorithm cannot find a pattern shared by the training and test set variables, which may happen with a high-dimensional dataset or a large number of input variables. This could be due to insufficient model complexity, too few training observations for learning the patterns, limited computing power that restricts the algorithm's ability to search for patterns in a high-dimensional space, etc.

Here is a diagram that represents the underfitting vs overfitting in form of model performance error vs model complexity.

Overfitting and Underfitting represented using model error vs complexity
Fig 1. Overfitting and Underfitting represented using model error vs complexity

In the above diagram, when the model complexity is low (a decision stump or a 30-NN model), the training and test error are both high. This represents model underfitting. When the model complexity is very high (1-NN or a deep decision tree), there is a very large gap between training and test error. This represents model overfitting. The sweet spot is in between, represented by the orange dashed line. At the sweet spot, i.e., the ideal model, there is a small gap between training and test error.
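The K-NN regimes from Fig 1 can be sketched as follows, assuming scikit-learn and a synthetic two-moons dataset (the noise level and K values are illustrative choices): K=1 memorizes the training set and shows a large train/test gap, while K=30 smooths the decision boundary and narrows the gap.

```python
# Sketch: K-NN complexity regimes. Small K = high complexity (overfit risk),
# large K = low complexity (underfit risk).
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=600, noise=0.35, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=7)

gaps = {}
for k in (1, 10, 30):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    train_acc, test_acc = knn.score(X_tr, y_tr), knn.score(X_te, y_te)
    gaps[k] = train_acc - test_acc  # the overfitting signature from Fig 1
    print(f"K={k:2d} train={train_acc:.2f} test={test_acc:.2f} gap={gaps[k]:.2f}")
```

Every training point is its own nearest neighbor, so K=1 scores perfectly on the training set while its test accuracy lags well behind; the sweet spot of Fig 1 corresponds to an intermediate K.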

Interview Questions on Underfitting & Overfitting

Before getting into the quiz, let’s look at some of the interview questions in relation to overfitting and underfitting concepts:

  • What is overfitting and underfitting?
  • What is the difference between overfitting and underfitting?
  • Illustrate the relationship between training / test error and model complexity in the context of overfitting & underfitting.

Here is the quiz which can help you test your understanding of overfitting & underfitting concepts and prepare well for interviews.


#1. Assuming i.i.d. training and test data, for some random model that has not been fit on the training dataset, the training error is expected to be _________ the test error

#2. Assuming i.i.d. training and test data, for the model that has been fit on the training dataset, the training error is expected to be _________ the test error

#3. The training error or accuracy of a model fit on i.i.d. training and test data provides an __________ biased estimate of the generalization performance

#4. In case of underfitting, both the training and test error are _________

#5. In case of overfitting, the gap between training and test error is ___________

#6. In case of overfitting, the training error is _________ than test error

#7. Given the larger hypothesis space, there is a higher tendency for the model to ________ the training dataset

#8. Given the following type of decision tree model, which may result in model underfitting?

#9. Given the following type of decision tree model, which may result in model overfitting?

#10. Given the following models trained using K-NN, the model which could result in overfitting will most likely have the value of K as ___________

#11. Given the following models trained using K-NN, the model which could result in underfitting will most likely have the value of K as ___________

#12. A model suffering from underfitting will most likely have _____________

#13. A model suffering from overfitting will most likely have _____________


Ajitesh Kumar

I have recently been working in the area of Data Science and Machine Learning / Deep Learning. In addition, I am also passionate about various technologies including programming languages such as Java/JEE, JavaScript, Python, R, and Julia, and technologies such as Blockchain, mobile computing, cloud-native technologies, application security, cloud computing platforms, and big data. I would love to connect with you on LinkedIn.
Posted in Data Science, Interview questions, Machine Learning.
