In this post, you will learn about the concepts of K-fold cross-validation with a Python code example. K-fold cross-validation is a data-splitting technique in which the data is divided into K > 1 folds; it is also referred to as k-fold CV or simply k-fold. The technique can be implemented easily in Python using the scikit-learn package, which provides convenient utilities for computing cross-validation scores. It is important to learn cross-validation concepts in order to perform model tuning, with the end goal of choosing the model that has the highest generalization performance. As a data scientist / machine learning engineer, you must have a good understanding of cross-validation concepts in general.
What and Why of K-fold Cross Validation
K-fold cross-validation is defined as a method for estimating the performance of a model on unseen data. It is a technique used for hyperparameter tuning such that the model with the most optimal value of hyperparameters can be trained. It is a resampling technique without replacement. The advantage of this approach is that each example is used for training and validation (as part of a test fold) exactly once. This yields a lower-variance estimate of the model performance than the holdout method. This technique is used because it helps to avoid overfitting, which can occur when a model is trained using all of the data. By using k-fold cross-validation, we are able to “test” the model on k different data sets, which helps to ensure that the model is generalizable.
The following is done in this technique for training, validating and testing the model:
- The dataset is split into training and test dataset.
- The training dataset is then split into K-folds.
- Out of the K folds, (K-1) folds are used for training.
- The remaining fold is used for validation.
- The model with specific hyperparameters is trained on the (K-1) training folds and evaluated on the validation fold. The performance of the model is recorded.
- The above steps (steps 3, 4 and 5) are repeated until each of the K folds has been used for validation. This is why the technique is called K-fold cross-validation.
- Finally, the mean and standard deviation of the model performance are computed from the scores of the K models recorded in step 5.
- Steps 3 to 7 are repeated for different values of the hyperparameters.
- Finally, the hyperparameters that yield the most optimal mean and standard deviation of the model scores are selected.
- The model is then trained using the entire training data set (step 2) and its performance is computed on the test data set (step 1).
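The steps above can be sketched in scikit-learn roughly as follows. This is a minimal example, using the Iris dataset as a stand-in for your own data and logistic regression as a placeholder model (both are illustrative assumptions, not part of the original post):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Step 1: hold out a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Steps 2-6: K-fold cross-validation on the training data only.
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, val_idx in kfold.split(X_train, y_train):
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train[train_idx], y_train[train_idx])
    scores.append(model.score(X_train[val_idx], y_train[val_idx]))

print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))

# Step 9: retrain on the full training set, evaluate once on the test set.
final_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print('Test accuracy: %.3f' % final_model.score(X_test, y_test))
```

In a full tuning loop, the cross-validation block would be repeated for each hyperparameter candidate before the final retraining step.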
Here is the diagram representing steps 2 to 7. The diagram is taken from the book Python Machine Learning by Dr. Sebastian Raschka and Vahid Mirjalili. It summarises the concept behind K-fold cross-validation with K = 10.
Why use Cross-validation technique?
The conventional technique for training and testing a model is to split the data into two parts, termed the training and test splits. For data of decent size, a 70:30 training-to-test ratio is commonly used. Here are a few challenges that motivate the use of the cross-validation technique:
- Challenges with the training-test split: In order to train a model of optimal performance, the hyperparameters are tweaked to achieve good model performance on the test data. However, this approach carries the risk of overfitting on the test set, because the hyperparameters can be tweaked until the estimator performs optimally on it. This way, knowledge about the test set can "leak" into the model, and the evaluation metrics no longer report on generalization performance.
- Challenges with the training-validation-test split: In order to take care of the above issue, three splits get created: training, validation and test. The model hyperparameters get tuned using the training and validation sets, and, finally, the model generalization performance is determined using the test split. However, this technique also has its shortcomings. By partitioning the data into three sets, the number of samples that can be used for learning the model is reduced, and the results depend on a particular random choice for the pair of (train, validation) sets.
To overcome the above challenges, the cross-validation technique is used. As described earlier in this section, two splits are created: training and test. Cross-validation is then applied to the training data by creating K folds, of which (K-1) folds are used for training and the remaining fold for validation. This process is repeated K times, and the model performance for a particular set of hyperparameters is calculated by taking the mean and standard deviation across the K models. The hyperparameters giving the most optimal model are selected. Finally, the model is trained again on the full training data set using the most optimal hyperparameters, and the generalization performance is computed by evaluating the model on the test dataset. The diagram given below represents the same.
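This workflow of tuning hyperparameters with cross-validation on the training split and then scoring once on the held-out test split is what scikit-learn's GridSearchCV automates. Here is a minimal sketch, using the Iris dataset and an SVC with an illustrative grid of C values (both assumptions for the example):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Each hyperparameter candidate is scored with 10-fold CV on the training data.
grid = GridSearchCV(SVC(), param_grid={'C': [0.1, 1.0, 10.0]}, cv=10)
grid.fit(X_train, y_train)

print('Best hyperparameters:', grid.best_params_)
# refit=True (the default) retrains on the whole training set with the best
# hyperparameters, so the test set is touched exactly once.
print('Test accuracy: %.3f' % grid.score(X_test, y_test))
```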
K-fold cross-validation is also used for model selection, where it is compared against other model selection techniques such as the Akaike information criterion and Bayesian information criterion.
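For model selection, one can simply compare the cross-validation score distributions of the candidate models. A minimal sketch, assuming the Iris dataset and two arbitrary candidate models chosen for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
candidates = {
    'logistic_regression': LogisticRegression(max_iter=1000),
    'decision_tree': DecisionTreeClassifier(max_depth=3, random_state=0),
}
# The model with the better mean CV score (and acceptable spread) is preferred.
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=10)
    print('%s: %.3f +/- %.3f' % (name, scores.mean(), scores.std()))
```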
When to select what values of K?
Here are the guidelines on when to select what value of K:
- The standard value of K is 10, used with data of decent size.
- For a very large data set, one can use K = 5. This still gives an accurate estimate of the average model performance while reducing the computational cost of refitting and evaluating the model on the different folds.
- The number of folds is increased if the data is relatively small. However, larger values of K increase the runtime of the cross-validation algorithm and yield performance estimates with higher variance, since the training folds become more similar to each other.
- For very small data sets, the leave-one-out cross-validation (LOOCV) technique is used. In this technique, each validation fold consists of just one record.
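LOOCV is available in scikit-learn as the LeaveOneOut splitter. Here is a minimal sketch on the Iris dataset (an assumption for the example); note that it fits the model once per sample, so it gets expensive on larger data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
loo = LeaveOneOut()
# One fold per sample: each iteration validates on a single record.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=loo)
print('Number of folds:', len(scores))
print('LOOCV accuracy: %.3f' % scores.mean())
```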
It is recommended to use stratified k-fold cross-validation in order to achieve better bias and variance estimates, especially in cases of unequal class proportions.
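The effect of stratification can be checked directly: with StratifiedKFold, each test fold preserves the class proportions of the full dataset. A small sketch on the Iris dataset (an assumption for the example), which has 50 samples per class:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold

X, y = load_iris(return_X_y=True)   # balanced: 50 samples per class
skf = StratifiedKFold(n_splits=5)
for i, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    # Each test fold keeps the original 1:1:1 class proportions.
    print('Fold %d test-class counts: %s' % (i + 1, np.bincount(y[test_idx])))
```

With imbalanced classes the same property holds approximately, which is why stratification gives more reliable per-fold estimates than plain KFold.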
K-fold Cross-Validation with Python (using Cross-Validation Generators)
In this section, you will learn how to use cross-validation generators such as StratifiedKFold to compute cross-validation scores. The cross-validation generators return the indices of the training and test splits. These indices can be used to create the training and test splits and train different models. Later, the mean and standard deviation of the performance of the different models are computed to assess the effectiveness of the hyperparameter values and tune them further.
Here is the Python code which illustrates the usage of the class StratifiedKFold (sklearn.model_selection) for creating training and test splits. The code can be found on this Kaggle page, K-fold cross-validation example. Pay attention to some of the following in the Python code given below:
- An instance of StratifiedKFold is created by passing the number of folds (n_splits=10).
- The split method is invoked on the instance of StratifiedKFold to get the indices of the training and test splits for each fold.
- The training and test data are passed to the instance of the pipeline.
- The scores of the different models are calculated.
- Finally, the mean and standard deviation of the model scores are computed.
```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

#
# Create an instance of Pipeline
#
pipeline = make_pipeline(StandardScaler(),
                         RandomForestClassifier(n_estimators=100, max_depth=4))
#
# Create an instance of StratifiedKFold which can be used to get the indices
# of the different training and test folds; X_train and y_train are assumed
# to be pandas objects created from an earlier train/test split
#
strtfdKFold = StratifiedKFold(n_splits=10)
kfold = strtfdKFold.split(X_train, y_train)
scores = []

for k, (train, test) in enumerate(kfold):
    pipeline.fit(X_train.iloc[train, :], y_train.iloc[train])
    score = pipeline.score(X_train.iloc[test, :], y_train.iloc[test])
    scores.append(score)
    print('Fold: %2d, Training/Test Split Distribution: %s, Accuracy: %.3f'
          % (k + 1, np.bincount(y_train.iloc[train]), score))

print('\n\nCross-Validation accuracy: %.3f +/- %.3f'
      % (np.mean(scores), np.std(scores)))
```
Here is what the output from the above code execution would look like:
K-fold Cross-Validation with Python (using Sklearn.cross_val_score)
Here is the Python code which can be used to apply the cross-validation technique for model tuning (hyperparameter tuning). The code can be found on this Kaggle page, K-fold cross-validation example. Pay attention to some of the following in the code given below:
- The cross_val_score function of the sklearn.model_selection module is used for computing the cross-validation scores. This is one of the simplest ways to apply cross-validation: it computes the scores by repeatedly splitting the data into a training and a testing set, training the estimator on the training set and computing the score on the testing set for each iteration of cross-validation. The inputs to cross_val_score include an estimator (having fit and predict methods), a cross-validation object and the input dataset.
- The input estimator to the cross_val_score can be either an estimator or a pipeline (sklearn.pipeline).
- One other input to cross_val_score is the cross-validation object, which is assigned to the parameter cv. The parameter cv can take one of the following values:
- An integer that represents the number of folds in a StratifiedKFold cross validator.
- If cv is not specified, 5-fold cross-validation is applied.
- An instance of a cross-validation splitter, such as KFold, StratifiedKFold, ShuffleSplit or LeaveOneOut.
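For a classifier, the two ways of specifying cv are interchangeable: passing an integer k is equivalent to passing an unshuffled StratifiedKFold with k folds. A small sketch on the Iris dataset with logistic regression (both assumptions for the example):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# cv as an integer: StratifiedKFold with 10 folds is used for a classifier.
scores_int = cross_val_score(model, X, y, cv=10)

# cv as an explicit splitter instance: equivalent for this classifier.
scores_obj = cross_val_score(model, X, y, cv=StratifiedKFold(n_splits=10))

print(np.allclose(scores_int, scores_obj))   # True
```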
```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

#
# Create an instance of Pipeline
#
pipeline = make_pipeline(StandardScaler(),
                         RandomForestClassifier(n_estimators=100, max_depth=4))
#
# Pass the instance of the pipeline along with the training data;
# cv=10 represents StratifiedKFold with 10 folds
#
scores = cross_val_score(pipeline, X=X_train, y=y_train, cv=10, n_jobs=1)

print('Cross Validation accuracy scores: %s' % scores)
print('Cross Validation accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))
```
Here is what the output would look like as a result of executing the above code:
What are disadvantages of k-fold cross validation?
The main disadvantage of k-fold cross-validation is that it can be slow to execute, since the model has to be refit and evaluated k times (although the folds can usually be evaluated in parallel, e.g. via the n_jobs parameter in scikit-learn). Additionally, k-fold cross-validation is not always the best option for all types of data sets. For example, with very few samples the per-fold estimates become noisy; in such cases, a different type of cross-validation, such as leave-one-out cross-validation, might be more appropriate. Standard k-fold cross-validation is also not suitable for time series data, since assigning samples to folds without regard to their order lets the model train on observations from the future. Finally, k should not be chosen too small (pessimistically biased estimates, since each training fold is much smaller than the full training set) or too large (higher runtime and higher-variance estimates).
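For time series data, scikit-learn provides TimeSeriesSplit, which always places the validation indices after the training indices so the model never trains on future observations. A small sketch on a toy array of 10 samples:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)
tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(X):
    # Training indices always precede test indices: no look-ahead leakage.
    print('train:', train_idx, 'test:', test_idx)
```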
Here is the summary of what you learned in this post about k-fold cross-validation:
- K-fold cross validation is used for model tuning / hyperparameters tuning.
- K-fold cross validation involves splitting the data into training and test data sets, applying K-fold cross-validation on the training data set and selecting the model with the most optimal performance.
- There are several cross validation generators such as KFold, StratifiedKFold which can be used for this technique.
- The sklearn.model_selection module's cross_val_score helper function can be used for applying K-fold cross validation in a simple manner.
- Use LOOCV method for very small data sets.
- For very large data sets, one can use the value of K as 5.
- The value of K = 10 is standard value of K.
- It is recommended to use stratified k-fold cross-validation in order to achieve better bias and variance estimates, especially in cases of unequal class proportions.