Learning Curves Explained with Python Sklearn Example


In this post, you will learn how to use learning curves, with a Python (Sklearn) code example, to determine model bias and variance. Knowing how to use learning curves will help you assess / diagnose whether the model is suffering from high bias (underfitting) or high variance (overfitting), and whether increasing the number of training samples could help solve the bias or variance problem. 

Some of the following topics are covered in this post:

  • Why learning curves?
  • Python Sklearn example for Learning curve

Why Learning Curves?

A learning curve in machine learning is used to assess how a model will perform with a varying number of training samples. This is achieved by monitoring the training and validation scores (model accuracy) as the number of training samples increases. The following diagram shows training accuracy (orange dashed line), validation accuracy (blue line) and desired model accuracy (black dashed line). 


Fig 1. Learning curves representing high bias and high variance

Pay attention to some of the following in the above diagram:

  • High Bias Models (Underfitting): The plot on the left side represents a model with both low training and low cross-validation accuracy. This indicates that the model underfits the training data and is thus a case of high bias. You may notice that as the training sample size increases, the training accuracy decreases and the validation accuracy increases; however, the validation accuracy remains far from the desired accuracy. Some of the ways to address the high-bias issue are the following:
    • Add more features: Increase the number of parameters of the model, for example, by creating additional features
    • Decrease the degree of regularization: Reduce the regularization strength, for example, in support vector machine (SVM) or logistic regression classifiers.
  • High Variance Models (Overfitting): The plot on the right side represents a model with a large gap between training and validation accuracy, where the training accuracy is higher than the validation accuracy. Such models suffer from high variance (overfitting). You may notice that as the training sample size increases, the training accuracy decreases and the validation accuracy increases; however, the training accuracy remains much greater than both the validation accuracy and the desired accuracy. Some of the ways to address this problem of overfitting are the following:
    • Add more data: Collect more training data. This may not always help, though, as additional data may also bring in more noise.
    • Remove less important features: Reduce the complexity of the model by removing noisy features. For unregularized models, you can use feature selection or feature extraction techniques to decrease the number of features.
    • Increase the regularization parameter: Increase the regularization strength, for example, in support vector machine (SVM) or logistic regression classifiers (see the sketch after this list).
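
As a minimal sketch of the regularization knob mentioned in the bullets above (assuming sklearn's LogisticRegression and SVC classifiers, where the parameter C is the inverse of the regularization strength, so a smaller C means stronger regularization):

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# In sklearn, C is the inverse of regularization strength:
# decrease C to regularize more strongly (helps high variance / overfitting),
# increase C to regularize less (helps high bias / underfitting).
lr_more_regularized = LogisticRegression(C=0.01, max_iter=10000)
lr_less_regularized = LogisticRegression(C=10.0, max_iter=10000)
svm_more_regularized = SVC(C=0.01)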

Python Sklearn Example for Learning Curve

In this section, you will see how to assess model learning using the Python Sklearn breast cancer dataset. Pay attention to some of the following in the code given below:

  • An instance of a pipeline, created using the sklearn.pipeline make_pipeline method, is used as the estimator. You could just as well use any machine learning algorithm that supports the fit and predict methods as the estimator.
  • The learning_curve method takes a cross-validation strategy as an input parameter. In this example, passing cv=10 results in 10-fold stratified cross-validation (StratifiedKFold), since the estimator is a classifier. You can use any other cross-validation strategy instead.
  • Training sizes at 10 intervals are created using np.linspace(0.1, 1.0, 10), as shown below.
  • Because cross-validation is used, the average (mean) accuracies for the training and validation data are calculated across the folds.
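
For reference, learning_curve treats float values in (0, 1] as fractions of the maximum training set size, so the call below uses ten evenly spaced fractions (a small check, assuming NumPy is installed):

import numpy as np

# Ten evenly spaced fractions of the maximum training set size
print(np.linspace(0.1, 1.0, 10))
# -> approximately [0.1, 0.2, 0.3, ..., 0.9, 1.0]

The complete example is given below.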
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve
from sklearn import datasets
import numpy as np
import matplotlib.pyplot as plt
#
# Load Breast Cancer Dataset
#
bc = datasets.load_breast_cancer()
X = bc.data
y = bc.target
#
# Create training and test split
#
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=1)
#
# Create a pipeline; This will be passed as an estimator to learning curve method
#
pipeline = make_pipeline(StandardScaler(), 
                        LogisticRegression(penalty='l2', solver='lbfgs', random_state=1, max_iter=10000))
#
# Use learning curve to get training and test scores along with train sizes
#
train_sizes, train_scores, test_scores = learning_curve(estimator=pipeline, X=X_train, y=y_train,
                                                       cv=10, train_sizes=np.linspace(0.1, 1.0, 10),
                                                     n_jobs=1)
#
# Calculate training and test mean and std
#
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)
#
# Plot the learning curve
#
plt.plot(train_sizes, train_mean, color='blue', marker='o', markersize=5, label='Training Accuracy')
plt.fill_between(train_sizes, train_mean + train_std, train_mean - train_std, alpha=0.15, color='blue')
plt.plot(train_sizes, test_mean, color='green', marker='+', markersize=5, linestyle='--', label='Validation Accuracy')
plt.fill_between(train_sizes, test_mean + test_std, test_mean - test_std, alpha=0.15, color='green')
plt.title('Learning Curve')
plt.xlabel('Training Data Size')
plt.ylabel('Model accuracy')
plt.grid()
plt.legend(loc='lower right')
plt.show()

Here is what the plot looks like:

Fig 2. Learning curve representing training and validation scores vs training data size

Note some of the following in the above learning curve plot:

  • For a training sample size of less than 200, the difference between training and validation accuracy is much larger. This is a sign of overfitting.
  • For a training sample size greater than 200, the gap narrows and the model generalizes better. This is a sign of a good bias-variance trade-off (see the sketch after this list to read these numbers directly off the arrays computed above).
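
To put numbers on the gap seen in the plot, you can print the mean training and validation accuracies computed earlier (a minimal sketch that reuses the train_sizes, train_mean and test_mean arrays from the code above; the exact values will vary from run to run):

# Gap between mean training and validation accuracy at each training size
for size, tr, val in zip(train_sizes, train_mean, test_mean):
    print(f"n={size:4d}  train={tr:.3f}  validation={val:.3f}  gap={tr - val:.3f}")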

Conclusions

Here is the summary of what you learned in this post:

  • Use learning curve as a mechanism to diagnose machine learning model bias-variance problem.
  • For a model with underfitting / high bias, both the training and validation scores are very low and also lower than the desired accuracy.
  • In order to reduce underfitting, consider adding more features. Or, consider reducing the degree of regularization for models (built using SVM, logistic regression, etc.) that support regularization.
  • For a model with overfitting / high variance, there is a large gap between training and validation accuracy. Also, the training accuracy may turn out to be higher than the desired accuracy.
  • In order to reduce overfitting, consider adding more training data (although adding data may not always work) or removing less important features. For regularized models, consider increasing the regularization strength, but take caution or else the model will underfit.