Machine Learning – Validation Techniques (Interview Questions)

Validation techniques in machine learning are used to estimate the error rate of an ML model so that the estimate is as close as possible to the true error rate on the population. If the available data is large enough to be representative of the population, you may not need validation techniques. In real-world scenarios, however, we work with samples of data that may not be truly representative of the population. This is where validation techniques come into the picture.

In this post, you will briefly learn about the validation techniques listed below, and you will also find a practice test whose questions could be used for interview preparation.

  • Resubstitution
  • Hold-out
  • K-fold cross-validation
  • LOOCV
  • Random subsampling
  • Bootstrapping

Revision Notes – Machine Learning Validation Techniques

  • Resubstitution: If all of the data is used for training the model and the error rate is evaluated by comparing predicted outcomes with actual values from the same training data set, the resulting error is called the resubstitution error. Because the model is evaluated on data it has already seen, this estimate tends to be optimistic. This technique is called the resubstitution validation technique.
  • Hold-out: To avoid evaluating the model on the same data it was trained on, the data is split into two separate data sets, labeled training and test. The split can be 60-40, 70-30 or 80-20. This technique is called the hold-out validation technique. With a random split, the classes may end up unevenly distributed across the training and test data sets. To fix this, the training and test data sets are created so that each contains the classes in roughly the same proportions as the full data set. This process is called stratification.
  • K-fold cross-validation: In this technique, the data is divided into k folds. In each iteration, k-1 folds are used for training and the remaining fold is used for testing, and this is repeated k times so that each fold is used for testing exactly once, as illustrated in Figure 1.

    Figure 1. K-fold cross-validation

    The advantage is that the entire data set is used for both training and testing. The error rate of the model is the average of the error rates of the individual iterations. This technique can also be seen as a form of the repeated hold-out method. The error estimate can be made more reliable by using stratification.

  • Leave-one-out cross-validation (LOOCV): In this technique, all of the data except one record is used for training, and that one record is used for testing. This process is repeated N times if there are N records. The advantage is that the entire data set is used for both training and testing. The error rate of the model is the average of the error rates of the individual iterations. Figure 2 represents the LOOCV validation technique.

    Figure 2. LOOCV validation technique

  • Random subsampling: In this technique, the data set is randomly split into training and test data sets multiple times. In each iteration, a set of records is chosen at random to form the test data set, and the remaining records form the training data set. The error rate of the model is the average of the error rates of the individual iterations. Figure 3 represents the random subsampling validation technique.

    Figure 3. Random Subsampling validation technique

  • Bootstrapping: In this technique, the training data set is randomly selected with replacement. The remaining examples that were not selected for training are used for testing. Unlike K-fold cross-validation, the number of test examples is likely to change from iteration to iteration, because it depends on which records happen not to be drawn. The error rate of the model is the average of the error rates of the individual iterations. Figure 4 represents the same.

    Figure 4. Bootstrapping validation technique
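
The splitting schemes described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not a reference implementation: it works on record indices rather than real data, all function names (holdout_split, k_fold_splits, loocv_splits, random_subsampling_splits, bootstrap_split, average_error) are hypothetical names chosen for this example, and stratification is left out for brevity.

```python
import random


def holdout_split(n, test_fraction=0.2, seed=0):
    """Hold-out: one random split of indices 0..n-1 into (train, test)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = int(round(n * test_fraction))
    return idx[n_test:], idx[:n_test]


def k_fold_splits(n, k, seed=0):
    """K-fold: every fold is used as the test set exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal-sized folds
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def loocv_splits(n):
    """LOOCV: k-fold with k = n, so each test set holds a single record."""
    return k_fold_splits(n, n)


def random_subsampling_splits(n, n_iter, test_fraction=0.2, seed=0):
    """Random subsampling: repeat an independent hold-out split n_iter times."""
    for i in range(n_iter):
        yield holdout_split(n, test_fraction, seed=seed + i)


def bootstrap_split(n, seed=0):
    """Bootstrap: draw n training indices with replacement; test on the rest."""
    rng = random.Random(seed)
    train = [rng.randrange(n) for _ in range(n)]
    chosen = set(train)
    test = [j for j in range(n) if j not in chosen]  # size varies per draw
    return train, test


def average_error(splits, error_fn):
    """Average the per-iteration error rates given by error_fn(train, test)."""
    errors = [error_fn(train, test) for train, test in splits]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    # Show which records land in the test set of each of the 5 folds.
    for train, test in k_fold_splits(10, 5):
        print(sorted(test))
```

In practice, error_fn would train a model on the training indices and return its error rate on the test indices; average_error then gives the averaged error rate that each of the iterative techniques reports.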

Practice Test – ML Model Validation Techniques

Given that 100% of the data is used for training, the validation technique can be called ______

Given that 80% of the data is selected for training and the remaining 20% for testing, this validation technique can be called _______

Given that 80% of the data is selected for training and the remaining 20% for testing, and this process is carried out four times and the error rate is averaged, this validation technique can be called _______

Given 1000 records, 1000 models are trained, each with 999 records as the training sample and the remaining 1 record for testing, and the error rate is averaged. This validation technique can be called _______

The process of making sure that the classes are split in equal proportions between the training and test samples is called _________

In the K-fold cross-validation technique, a large value of k could lead to which of the following in relation to the error rate?

In the K-fold cross-validation technique, a small value of k could lead to which of the following in relation to the error rate?

The most common choice for k in the K-fold cross-validation technique is _______

For a sparse data set, which of the following validation techniques could be preferred?

For N records, LOOCV can also be called as N-fold cross-validation

Summary

In this post, you learned about different validation techniques used for finding error rates of machine learning models.

Did you find this article useful? Do you have any questions or suggestions about this article in relation to machine learning model validation techniques? Leave a comment and ask your questions and I shall do my best to address your queries.

Ajitesh Kumar

Ajitesh has recently been working in the area of AI and machine learning. Currently, his research area includes Safe & Quality AI. In addition, he is passionate about various technologies, including programming languages such as Java/JEE and Javascript, and technologies such as Blockchain, mobile computing, cloud-native technologies, application security, cloud computing platforms and big data.

He has also authored the book, Building Web Apps with Spring 5 and Angular.