
Decoding Bagging in Random Forest: Examples


This blog provides an overview of how bagging, or bootstrap aggregating, improves the effectiveness of Random Forest machine learning models. You will learn about the process of creating multiple data subsets through bootstrap sampling, building individual decision trees for each subset, and how this diversity among trees reduces overfitting, leading to more accurate and robust random forest models. The post also explains how the aggregation of predictions from these trees ensures a balanced and less biased overall model. You will also learn through a Python code example.

What is Bagging?

Before we delve into Random Forest, it’s crucial to understand the concept of bagging. Bagging is a general ensemble technique in machine learning that aims to improve the stability and accuracy of algorithms. It involves creating multiple versions of a predictor and using these to get an aggregated prediction.

“Bagging” is short for Bootstrap Aggregating. The term is derived from two key concepts:

  1. Bootstrap sampling: This refers to a statistical method for resampling a dataset. In bootstrap sampling, multiple subsets of a dataset are created by randomly selecting data points with replacement. This means each subset can contain repeated data points, and some of the original data points may be left out. Bootstrap sampling is a way to estimate properties of a population (like the mean or variance) by repeatedly resampling the available data. In the context of bagging, each subset created through bootstrap sampling is used to train a separate model. (A short code sketch of bootstrap sampling and aggregation follows at the end of this section.)


  2. Aggregating: After training multiple models on different bootstrap samples, bagging involves combining (or aggregating) their predictions. In classification tasks, this is often done through a majority voting system, where the final prediction is the one that the majority of the models agree upon. In regression tasks, it typically involves averaging the predictions of all models.

So, the term “bagging” encapsulates the entire process: creating multiple models by training them on bootstrapped subsets and then aggregating their predictions to form a final, more robust prediction. This technique is widely used to improve the stability and accuracy of machine learning algorithms, especially in reducing the variance of models that tend to overfit their training data.
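
To make these two steps concrete, here is a minimal sketch of bagging done by hand with NumPy and scikit-learn’s DecisionTreeRegressor as the base learner. The dataset, base model, and number of models are illustrative assumptions, not part of any prescribed recipe:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (illustrative)
X, y = make_regression(n_samples=500, n_features=5, noise=0.3, random_state=0)

rng = np.random.default_rng(0)
n_models = 10
models = []

# Step 1 - Bootstrap sampling: draw row indices with replacement, train one model per sample
for _ in range(n_models):
    idx = rng.integers(0, len(X), size=len(X))  # sampling with replacement
    models.append(DecisionTreeRegressor(random_state=0).fit(X[idx], y[idx]))

# Step 2 - Aggregating: average the models' predictions (for regression)
all_preds = np.array([m.predict(X) for m in models])
bagged_pred = all_preds.mean(axis=0)
print("First 5 bagged predictions:", bagged_pred[:5])

In a classification setting, the aggregation step would be a majority vote over the models’ predicted classes instead of an average.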

Bagging vs Random Forest: How Do They Work Together?

Random Forest is a sophisticated version of the traditional bagging method. It applies the concept of bagging to decision trees. Here is how bagging is related to the Random Forest algorithm:

  1. Bootstrap Sampling in Random Forest: The bagging in Random Forest starts with creating multiple subsets of the training dataset using bootstrap sampling. Each subset is generated by randomly selecting observations from the original dataset with replacement, meaning the same observation can appear more than once in a subset. Consider a dataset of housing prices with features like size, location, age, and amenities. Random Forest starts by creating different subsets of this dataset (say Subset 1, Subset 2, etc.) using bootstrap sampling.
  2. Training Individual Decision Trees: In a Random Forest, each decision tree is trained on a different bootstrap sample. This use of varied data subsets helps create a diverse set of trees. Diverse trees are crucial in bagging because they ensure that the model captures various aspects and patterns in the data, reducing the likelihood of overfitting. Continuing with our housing example, this means that a decision tree (Tree 1) is trained on Subset 1, and similarly for the other subsets. These trees are not identical because they are trained on slightly different data. Tree 1 might be more influenced by size and location, while Tree 2 might give more weight to age and amenities.
  3. Incorporating Feature Randomness: Apart from bootstrap sampling, Random Forest introduces an additional layer of randomness by randomly selecting a subset of features for splitting nodes in each tree. This feature randomness further diversifies the trees and is a distinctive aspect of Random Forest compared to traditional bagging methods. Continuing with our example, when Tree 1 decides to split at the root node, it might only consider size and amenities, not age or location (see the parameter sketch after this list).
  4. Aggregating Predictions: Once all the trees are trained, the Random Forest algorithm aggregates their predictions. For classification using random forest, this is often a majority voting system: each tree ‘votes’ for a class, and the class receiving the most votes is the final prediction. For regression, the final prediction is typically the average of all the trees’ predictions.
  5. Enhancing Accuracy and Stability: The fundamental principle of bagging is to improve model accuracy and reduce overfitting. In Random Forest, the combination of bootstrap sampling and feature randomness helps in building a model that is not only accurate but also robust to variations in the data. The aggregation of predictions from multiple diverse trees leads to a model that performs better on unseen data compared to individual trees.
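
If you use scikit-learn, the steps above correspond directly to a handful of RandomForestRegressor (or RandomForestClassifier) constructor parameters. The specific values below are illustrative assumptions rather than recommended settings, and defaults vary across versions and estimators:

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(
    n_estimators=100,     # number of decision trees in the forest
    bootstrap=True,       # step 1: train each tree on a bootstrap sample
    max_samples=0.8,      # optional: fraction of rows drawn (with replacement) per tree
    max_features="sqrt",  # step 3: random subset of features considered at each split
    n_jobs=-1,            # trees are independent, so they can be trained in parallel
    random_state=42,
)
# rf.fit(X_train, y_train) followed by rf.predict(X_test) performs step 4:
# the forest averages the trees' predictions (a classifier uses majority voting).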

Examples and Applications of Bagging in Random Forest Models

The following illustrates the application of bagging in random forest models:

  1. Real Estate Pricing: A real estate company wants to predict housing prices based on various features. Random Forest can be employed where each tree might focus on different aspects – one tree might give more importance to location while another to house size. The final price prediction is an average of all these trees, leading to a more balanced and accurate estimation.
  2. Medical Diagnosis: In a medical diagnosis application, suppose we’re trying to predict whether a patient has a certain disease based on symptoms and test results. Each decision tree in the Random Forest might focus on different symptoms or combinations of symptoms. The final diagnosis is based on the majority vote from all trees, potentially leading to a more reliable diagnosis than relying on a single decision tree.

Benefits and Drawbacks of Bagging in Random Forest

Bagging in Random Forest offers several benefits:

  • Reduces Overfitting: Overfitting occurs when a model captures noise in the data rather than the underlying pattern. Random Forest reduces this risk by averaging the outputs of multiple decision trees, each trained on different bootstrap samples of the data. Since each tree sees only a portion of the data, its individual biases (overemphasis on certain patterns or anomalies in the data) are less likely to dominate the final model. The aggregated output is more generalized and robust, performing better on unseen data.
  • Handles Large Data Well: Random Forest is particularly effective on large datasets. Each tree is trained on its own bootstrap sample, which by default is the same size as the original data but contains only a portion of the unique observations (roughly two-thirds on average), and because the trees are independent of one another, they can be trained in parallel. This makes training easier to scale and ensures that the ensemble captures a wide range of patterns and relationships in the data.
  • Feature Importance: One of the notable aspects of Random Forest is its ability to rank the importance of features in predicting the target variable. It does this by observing how random re-shuffling of each feature affects the model accuracy, thereby identifying which features contribute most to the predictive power of the model. This feature importance ranking is invaluable in understanding the driving factors behind the model’s predictions and can be used for feature selection in further modeling (a short code sketch follows this list).
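
To illustrate the feature importance point above, here is a minimal sketch using scikit-learn’s permutation_importance, which re-shuffles each feature and measures the resulting drop in model performance. It assumes a fitted model and a held-out test set, such as the random_forest, X_test, and y_test created in the Python example later in this post:

from sklearn.inspection import permutation_importance

# Re-shuffle each feature n_repeats times and measure the drop in score
result = permutation_importance(random_forest, X_test, y_test, n_repeats=10, random_state=42)
for i in result.importances_mean.argsort()[::-1]:
    print(f"Feature {i}: importance {result.importances_mean[i]:.3f} (+/- {result.importances_std[i]:.3f})")

Random Forest models in scikit-learn also expose an impurity-based feature_importances_ attribute, which is computed from the trees themselves rather than by re-shuffling.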

However, there are drawbacks:

  • Complexity: A Random Forest model, comprising multiple decision trees, is inherently more complex than a single decision tree. This complexity makes it more challenging to visualize and interpret the model. Unlike a single decision tree, where decisions can be traced through a clear path, the ensemble nature of Random Forest makes such traceability and interpretability difficult. This complexity can be a drawback in applications where model interpretability is crucial, like in certain areas of healthcare or finance.
  • Computationally Intensive: The process of training multiple decision trees on different subsets of data and then aggregating their predictions requires more computational power and memory than training a single decision tree. This increased computational demand means that Random Forests may not be suitable for scenarios with limited computational resources or where real-time model training and predictions are required.

Bagging in Random Forest: Understanding with Python Example

The following Python code creates a simple dataset, applies a Random Forest model to it, and then illustrates how the individual trees differ and how their predictions are aggregated:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Create a synthetic regression dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.3, random_state=42)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize and train a Random Forest Regressor
random_forest = RandomForestRegressor(n_estimators=10, random_state=42)
random_forest.fit(X_train, y_train)

# Predict using the Random Forest model
predictions = random_forest.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, predictions)
print(f"Mean Squared Error: {mse}")

# Get the leaf index each tree assigns to each training sample; trees trained
# on different bootstrap samples partition the data differently
leaf_indices = np.array([tree.apply(X_train) for tree in random_forest.estimators_])

# Print leaf indices of the first few samples for the first few trees
print("Leaf indices for the first few samples from the first few trees:")
for i in range(min(3, leaf_indices.shape[0])):  # First 3 trees
    print(f"Tree {i} leaf indices: {leaf_indices[i, :5]}")  # First 5 samples

# Demonstrate aggregation: the forest's prediction is the average of the individual trees' predictions
tree_predictions = [tree.predict(X_test) for tree in random_forest.estimators_]
manual_average = np.mean(tree_predictions, axis=0)

# Example: compare the first tree's predictions, the manual average, and the forest's aggregated prediction
print(f"Predictions by the first tree: {tree_predictions[0][:5]}")
print(f"Manual average of all trees:   {manual_average[:5]}")
print(f"Aggregated predictions:        {predictions[:5]}")

The above Python code does the following:

  1. Creates a synthetic dataset for a regression task.
  2. Splits the dataset into training and testing sets.
  3. Trains a Random Forest Regressor on the training data.
  4. Predicts and evaluates the model on the test data.
  5. Shows the leaf indices assigned to training samples by the first few trees, illustrating that each tree (trained on a different bootstrap sample) partitions the data differently.
  6. Compares the predictions of an individual tree, the manual average of all trees, and the aggregated predictions from the entire forest.
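
As a quick follow-up, you can compare a single decision tree trained on the same split against the forest to see the variance reduction that bagging provides. This reuses the variables defined in the code above; the exact numbers will vary with the data and random seed:

from sklearn.tree import DecisionTreeRegressor

# A single unpruned tree, trained on the same split as the forest
single_tree = DecisionTreeRegressor(random_state=42).fit(X_train, y_train)
single_mse = mean_squared_error(y_test, single_tree.predict(X_test))
print(f"Single tree MSE:   {single_mse:.3f}")
print(f"Random forest MSE: {mse:.3f}")  # typically lower, reflecting reduced variance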

Conclusion

Bagging in Random Forest is a powerful technique in machine learning, offering robustness and accuracy. By understanding and applying this concept effectively, we can tackle complex predictive problems across various fields, from finance to healthcare. Like any tool, its effectiveness depends on the skill of the user and the appropriateness of its application to the problem at hand. With the examples and insights provided, you’re now better equipped to harness the power of bagging in Random Forest in your data science endeavors.

