In this post, you will learn about the concept of bagging, along with a Bagging Classifier Python code example. Bagging is also called bootstrap aggregation. It relies on a data sampling technique in which data is sampled with replacement. A bagging classifier combines the predictions of different estimators, which in turn helps reduce variance.
In this post, you will learn about the following topics:
- Introduction to Bagging and Bagging Classifier
- Bagging Classifier Python example
Introduction to Bagging & Bagging Classifier
A bagging classifier is an ensemble meta-estimator created by fitting multiple versions of a base estimator, each trained on a modified version of the training data set produced by the bagging sampling technique (data sampled with replacement) or by another sampling technique. The final predictor (also called the bagging classifier) combines the predictions made by each estimator / classifier by voting (classification) or by averaging (regression). Read more details about this technique in the paper Bias, Variance and Arcing Classifiers by Leo Breiman. While creating the individual estimators, one can configure the number of samples and/or features to be considered when fitting each of them. Take a look at the diagram below to get a better understanding of bagging classification.
Pay attention to some of the following in the above diagram:
- The training data set consists of data samples represented using different colors.
- Random samples are drawn with replacement. This essentially means that there could be duplicate data in each of the samples.
- Each sample is used to train a different estimator / classifier, represented as classifier 1, classifier 2, …, classifier n.
- An ensemble classifier is created which takes the predictions from the different estimators and makes the final prediction.
- The performance of the ensemble classifier is tested using the training data set.
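The steps in the list above can be sketched in a few lines of plain Python. This is only an illustrative toy, not how Sklearn implements bagging: the `MeanThresholdStump` class, the helper function names and the ten-row data set are all made up for this example, and the stump assumes a single numeric feature where class 1 tends to have the larger values.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, so duplicates can occur."""
    return [rng.choice(rows) for _ in rows]


class MeanThresholdStump:
    """Hypothetical toy base estimator for one numeric feature and labels 0/1."""

    def fit(self, rows):
        xs0 = [x for x, label in rows if label == 0]
        xs1 = [x for x, label in rows if label == 1]
        if xs0 and xs1:
            # Threshold halfway between the two class means.
            self.threshold = (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2
        else:
            # A bootstrap sample may miss a class; fall back to the overall mean.
            xs = xs0 or xs1
            self.threshold = sum(xs) / len(xs)
        return self

    def predict(self, x):
        return 1 if x > self.threshold else 0


def fit_bagged_stumps(rows, n_estimators, seed=0):
    """Train one stump per bootstrap sample."""
    rng = random.Random(seed)
    return [MeanThresholdStump().fit(bootstrap_sample(rows, rng))
            for _ in range(n_estimators)]


def predict_by_vote(estimators, x):
    """Combine the individual predictions by majority vote."""
    votes = Counter(est.predict(x) for est in estimators)
    return votes.most_common(1)[0][0]


# Made-up training data: (feature value, label)
train_rows = [(1.0, 0), (1.5, 0), (2.0, 0), (2.5, 0), (3.0, 0),
              (6.0, 1), (6.5, 1), (7.0, 1), (7.5, 1), (8.0, 1)]

# An odd number of estimators avoids tied votes for two classes.
stumps = fit_bagged_stumps(train_rows, n_estimators=25)
low_prediction = predict_by_vote(stumps, 0.0)
high_prediction = predict_by_vote(stumps, 9.0)
print(low_prediction, high_prediction)
```

Each stump sees a slightly different sample, so the learned thresholds differ from stump to stump; the vote smooths out those differences.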
A bagging classifier helps reduce the variance of individual estimators by introducing randomisation into the training stage of each estimator and building an ensemble out of all of them. Note that high variance means that changing the training data set changes the trained estimator by a great deal.
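One rough way to look at this notion of variance is to train the same model on several different training sets and measure how much its predictions on a fixed evaluation set change. The sketch below does this for a single decision tree and for bagged decision trees on the Sklearn breast cancer data set. The choice of ten half-size training subsets, 25 estimators and the `mean_prediction_variance` helper are assumptions made for illustration. One would typically expect the bagged figure to be the lower of the two, although that is not guaranteed for every data set or seed.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_pool, X_eval, y_pool, y_eval = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y)

rng = np.random.RandomState(1)
tree_preds, bagged_preds = [], []
for _ in range(10):
    # Simulate "a different training set" by drawing half of the pool at random.
    idx = rng.choice(len(X_pool), size=len(X_pool) // 2, replace=False)
    tree = DecisionTreeClassifier(random_state=1)
    bagged = BaggingClassifier(DecisionTreeClassifier(random_state=1),
                               n_estimators=25, random_state=1)
    tree_preds.append(tree.fit(X_pool[idx], y_pool[idx]).predict(X_eval))
    bagged_preds.append(bagged.fit(X_pool[idx], y_pool[idx]).predict(X_eval))


def mean_prediction_variance(preds):
    # Variance of the 0/1 prediction for each evaluation point across the
    # ten training sets, averaged over all evaluation points.
    return float(np.var(np.array(preds), axis=0).mean())


tree_variance = mean_prediction_variance(tree_preds)
bagged_variance = mean_prediction_variance(bagged_preds)
print(f"single tree: {tree_variance:.4f}, bagged trees: {bagged_variance:.4f}")
```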
A bagging classifier goes by one of the following names, depending on the sampling technique used for creating the training samples:
- Pasting: When random subsets of the data are drawn without replacement (bootstrap = False), the algorithm is called Pasting.
- Bagging: When random subsets of the data are drawn with replacement (bootstrap = True), the algorithm is called Bagging. It is also called bootstrap aggregation.
- Random Subspace: When random subsets of the features are drawn, the algorithm is called Random Subspace.
- Random Patches: When random subsets of both samples and features are drawn, the algorithm is called Random Patches.
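The four variants above map onto a few `BaggingClassifier` parameters: `bootstrap`, `max_samples` and `max_features`. The following is a minimal sketch of that mapping. The decision tree base estimator, the fraction 0.5 and the 10 estimators are arbitrary choices for illustration, not values used elsewhere in this post.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
base = DecisionTreeClassifier(random_state=1)

variants = {
    # Subsets of samples drawn without replacement, all features used
    "pasting": BaggingClassifier(base, n_estimators=10, max_samples=0.5,
                                 bootstrap=False, random_state=1),
    # Subsets of samples drawn with replacement, all features used
    "bagging": BaggingClassifier(base, n_estimators=10, max_samples=0.5,
                                 bootstrap=True, random_state=1),
    # All samples used, random subsets of features
    "random_subspace": BaggingClassifier(base, n_estimators=10, max_samples=1.0,
                                         bootstrap=False, max_features=0.5,
                                         random_state=1),
    # Random subsets of both samples and features
    "random_patches": BaggingClassifier(base, n_estimators=10, max_samples=0.5,
                                        max_features=0.5, random_state=1),
}

for name, model in variants.items():
    model.fit(X, y)
    rows = model.estimators_samples_[0]      # row indices seen by the first estimator
    cols = model.estimators_features_[0]     # feature indices seen by the first estimator
    print(f"{name}: rows drawn={len(rows)}, "
          f"distinct rows={np.unique(rows).size}, features={len(cols)}")
```

For pasting, the number of distinct rows equals the number of rows drawn, whereas a bootstrap sample usually contains repeats.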
In this post, the bagging classifier is created using Sklearn BaggingClassifier with the number of estimators set to 100, max_features set to 10 and max_samples set to 100. Samples are drawn with the default setting (bootstrap = True, that is, with replacement). Since both samples and features are drawn at random, the method applied is random patches.
When to use Bagging Classifier?
A bagging classifier helps reduce the variance of unstable classifiers (those having high variance). Unstable classifiers include those trained using algorithms such as decision trees, which are found to have high variance and low bias. Thus, one gets the most benefit from a bagging classifier with algorithms such as decision trees. Stable classifiers, such as linear discriminant analysis, which have low variance, may not benefit much from the bagging technique.
That said, one can still apply a bagging classifier with base estimators built using any algorithm. In this post, for illustration purposes, the base estimator is trained using Logistic Regression.
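To check the claim about unstable and stable classifiers on your own data, you could compare each base estimator with its bagged counterpart using cross-validation. The sketch below does this for a decision tree and for linear discriminant analysis on the Sklearn breast cancer data set. The 25 estimators and 5 folds are assumptions made for illustration, and no particular scores are claimed here since they depend on the data and the library version.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

candidates = {
    "decision tree": DecisionTreeClassifier(random_state=1),  # unstable, high variance
    "LDA": LinearDiscriminantAnalysis(),                      # stable, low variance
}

results = {}
for name, base in candidates.items():
    bagged = BaggingClassifier(base, n_estimators=25, random_state=1)
    results[name] = {
        # Mean accuracy over 5 cross-validation folds
        "single": cross_val_score(base, X, y, cv=5).mean(),
        "bagged": cross_val_score(bagged, X, y, cv=5).mean(),
    }
    print(f"{name}: single={results[name]['single']:.3f}, "
          f"bagged={results[name]['bagged']:.3f}")
```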
Bagging Classifier Python Example
In this section, you will learn how to use the Python Sklearn BaggingClassifier to fit a model using the bagging algorithm. To illustrate how a bagging classifier helps improve the generalization performance of the model, the following is done:
- A model is fit using LogisticRegression algorithm
- A model is fit using BaggingClassifier with base estimator as LogisticRegression
The Sklearn breast cancer dataset is used for fitting the model.
Model fit using Logistic Regression
Here is the code which can be used to fit a logistic regression model:

    import numpy as np
    from sklearn import datasets
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.ensemble import BaggingClassifier
    from sklearn.model_selection import GridSearchCV

    # Load the breast cancer dataset
    bc = datasets.load_breast_cancer()
    X = bc.data
    y = bc.target

    # Create training and test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1, stratify=y)

    # Pipeline Estimator
    pipeline = make_pipeline(StandardScaler(), LogisticRegression(random_state=1))

    # Fit the model
    pipeline.fit(X_train, y_train)

    # Model scores on test and training data
    print('Model test Score: %.3f, ' % pipeline.score(X_test, y_test),
          'Model training Score: %.3f' % pipeline.score(X_train, y_train))

The model comes up with a test score of 0.965 and a training score of 0.991. The gap between the two suggests that the model tends to overfit the training data.
Model fit using Bagging Classifier
In this section, we will fit a bagging classifier using the hyperparameters listed below, with the base estimator being a pipeline built using Logistic Regression. Note that you can further perform a Grid Search or Randomized Search to find the most appropriate estimator.
- n_estimators = 100
- max_features = 10
- max_samples = 100
Here is what the Python code looks like for the Bagging Classifier model:

    # Pipeline Estimator
    pipeline = make_pipeline(StandardScaler(), LogisticRegression(random_state=1))

    # Instantiate the bagging classifier.
    # The base estimator is passed positionally because its keyword is
    # base_estimator in older scikit-learn releases and estimator in newer ones.
    bgclassifier = BaggingClassifier(pipeline, n_estimators=100,
                                     max_features=10, max_samples=100,
                                     random_state=1, n_jobs=5)

    # Fit the bagging classifier
    bgclassifier.fit(X_train, y_train)

    # Model scores on test and training data
    print('Model test Score: %.3f, ' % bgclassifier.score(X_test, y_test),
          'Model training Score: %.3f' % bgclassifier.score(X_train, y_train))

The model comes up with a test score of 0.965 and a training score of 0.974. The model still tends to overfit, but the gap between the training and test scores narrows from 0.026 for the plain Logistic Regression model to 0.009. The test score itself is unchanged at 0.965, so the evidence for better generalization here is the smaller gap rather than a higher test accuracy.
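As mentioned earlier in this section, a grid search can be used to look for better hyperparameter values. Here is a minimal sketch of how that could be set up with GridSearchCV. The grid values below are hypothetical and deliberately small, and the number of estimators is reduced to 20 so that the search finishes quickly; they are not tuned recommendations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y)

# Same base estimator as in the post: scaling followed by Logistic Regression
pipeline = make_pipeline(StandardScaler(), LogisticRegression(random_state=1))
bagging = BaggingClassifier(pipeline, n_estimators=20, random_state=1)

# Hypothetical grid: number of samples and features drawn for each estimator
param_grid = {
    "max_samples": [50, 100, 200],
    "max_features": [5, 10, 20],
}

# 5-fold cross-validated search over every combination in the grid
search = GridSearchCV(bagging, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best cross-validation accuracy: %.3f" % search.best_score_)
print("Test accuracy of the refit model: %.3f" % search.score(X_test, y_test))
```

RandomizedSearchCV follows the same pattern and may be preferable when the grid is large.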
Here is the summary of what you learned about the bagging classifier and how to go about training / fitting a bagging classifier model:
- A bagging classifier is an ensemble classifier created from multiple estimators, which can be trained using different sampling techniques: pasting (samples drawn without replacement), bagging or bootstrap aggregation (samples drawn with replacement), random subspaces (random features are drawn) and random patches (random samples and features are drawn).
- A bagging classifier helps reduce the variance of individual estimators by sampling the training data and combining the predictions.
- Consider using a bagging classifier for algorithms which result in unstable classifiers (classifiers having high variance). For example, a decision tree results in an unstable classifier having high variance and low bias.