Last updated: 25th Nov, 2023
Bagging is a type of an ensemble machine learning approach that combines the outputs from many learner to improve performance. The bagging algorithm works by dividing the training set into smaller subsets. These subsets are then processed through different machine-learning models. After processing, the predictions from each model are combined. This combination of predictions is used to generate an overall prediction for each instance in the original data. In this blog post, you will learn about the concept of Bagging along with Bagging Classifier Python code example.
Bagging can be used in machine learning for both classification and regression problem. The bagging classifier technique is utilized across a range of machine learning algorithms. These include decision stumps, artificial neural networks (such as multi-layer perceptron), support vector machines, and maximum entropy classifiers.
Check out my detailed post on machine learning titled What is machine learning? Concepts & Examples. It does provide different definitions along with explaining different aspects of machine learning which will be useful for creating good concepts around machine learning. Here is the sample excerpt:
Machine learning is about approximating mathematical functions (equations) representing real-world scenarios. These mathematical functions are also referred to as “mathematical models” or just models. The reason why machine learning models are called function approximations because it will be extremely difficult to find exact function which can be used to predict or estimate real world scenarios.
A bagging classifier, in the context of classification, is an ensemble model formed by training multiple versions of a basic model / base estimator on various iterations of the training data set. This data set is modified using a bagging sampling technique, where data is sampled with replacement, or through other methods. Each version of the basic model learns from a slightly different set of data, contributing to a more robust overall classification model. The bagging sampling technique can result in a training set consisting of a duplicate dataset or a unique data set. This sampling technique is also called bootstrap aggregation. The final predictor (also called a bagging classifier) combines the predictions made by each estimator/classifier by voting (classification) or by averaging (regression). Read more details about this technique in this paper, Bias, Variance and Arcing Classifiers by Leo Breiman.
While creating each of the individual models, one can configure the number of samples and/or features that need to be considered while fitting the individual estimators. Take a look at the diagram below to get a better understanding of the bagging classification.
The basic idea is to repeatedly (with replacement) sample observations from the training set and use each observation as a “bootstrap sample” in fitting different estimators that will be combined in the final result. Pay attention to some of the following in the above diagram:
Here is another view of the bagging classifier. Note how different bags of training data set is created with replacement. Different bags may represent data related to different features which can be used to train different models.
The following are some of the reasons for using bagging classifiers:
Bagging Classifier can be termed as some of the following based on the sampling technique used for creating training samples:
In this post, the bagging classifier is created using Sklearn BaggingClassifier with a number of estimators set to 100, max_features set to 10, and max_samples set to 100 and the sampling technique used is the default (bagging). The method applied is random patches as both the samples and features are drawn in a random manner.
In this section, you will learn about how to use Python Sklearn BaggingClassifier for fitting the model using the Bagging algorithm. The following is done to illustrate how the Bagging Classifier help improves the generalization performance of the model. In order to demonstrate the aspects of generalization performance, the following is done:
The Sklearn breast cancer dataset is used for fitting the model.
Here is the code which can be used to fit a logistic regression model:
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV
#
# Load the breast cancer dataset
#
bc = datasets.load_breast_cancer()
X = bc.data
y = bc.target
#
# Create training and test split
#
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1, stratify=y)
#
# Pipeline Estimator
#
pipeline = make_pipeline(StandardScaler(),
LogisticRegression(random_state=1))
#
# Fit the model
#
pipeline.fit(X_train, y_train)
#
# Model scores on test and training data
#
print('Model test Score: %.3f, ' %pipeline.score(X_test, y_test),
'Model training Score: %.3f' %pipeline.score(X_train, y_train))
The model comes up with the following scores. Note that the model tends to overfit the data as the test score is 0.965 and the training score is 0.991.
In this section, we will fit a bagging classifier using different hyperparameters such as the following and base estimator as pipeline built using Logistic Regression. Note that you can further perform a Grid Search or Randomized search to get the most appropriate estimator.
Here is how the Python code will look like for the Bagging Classifier model:
#
# Pipeline Estimator
#
pipeline = make_pipeline(StandardScaler(),
LogisticRegression(random_state=1))
#
# Instantiate the bagging classifier
#
bgclassifier = BaggingClassifier(base_estimator=pipeline, n_estimators=100,
max_features=10,
max_samples=100,
random_state=1, n_jobs=5)
#
# Fit the bagging classifier
#
bgclassifier.fit(X_train, y_train)
#
# Model scores on test and training data
#
print('Model test Score: %.3f, ' %bgclassifier.score(X_test, y_test),
'Model training Score: %.3f' %bgclassifier.score(X_train, y_train))
The model comes up with the following scores. Note that the model tends to overfit the data as the test score is 0.965 and the training score is 0.974. However, the model will give better generalization performance than the model fit with Logistic Regression.
Here is the summary of what you learned about the bagging classifier and how to go about training / fitting a bagging classifier model:
Artificial Intelligence (AI) agents have started becoming an integral part of our lives. Imagine asking…
In the ever-evolving landscape of agentic AI workflows and applications, understanding and leveraging design patterns…
In this blog, I aim to provide a comprehensive list of valuable resources for learning…
Have you ever wondered how systems determine whether to grant or deny access, and how…
What revolutionary technologies and industries will define the future of business in 2025? As we…
For data scientists and machine learning researchers, 2024 has been a landmark year in AI…