Random forest classifiers are popular machine learning algorithms used for classification. In this post, you will learn the concepts behind random forest classifiers and how to train one using the Python Sklearn library. The code will be helpful if you are a beginner data scientist or just want a quick code sample to get started with training a machine learning model using the random forest algorithm.
What is a random forest classifier & how do they work?
Random forests are a type of machine learning algorithm used for classification and regression tasks. A classifier model takes input data and assigns it to one of several categories. For example, given a set of images of dogs and cats, a classifier could be used to predict whether each image shows a dog or a cat. In a nutshell, a random forest works by building multiple decision trees, each trained on a random subset of the data. A decision tree makes a prediction by applying a sequence of learned if/else tests to the input features until it reaches a leaf that names a category. Random forests take this one step further by building many such trees and combining their results, by majority vote for classification or by averaging for regression. This helps reduce the chance of overfitting, which is when a model performs well on the training data but poorly on new data. Random forests are a powerful tool for machine learning and can be used for a variety of tasks such as facial recognition, fraud detection, predicting consumer behavior, and stock market prediction.
Random forest can be considered as an ensemble of several decision trees. The idea is to aggregate the predictions of multiple decision trees and produce a final outcome by majority voting (for classification) or averaging (for regression). This helps a model trained using random forest generalize better to the larger population. In addition, the model becomes less susceptible to overfitting / high variance. Here are the key steps of the random forest algorithm:
- Take a random sample of size n (randomly choose n examples with replacement, i.e. a bootstrap sample).
- Grow a decision tree from the above sample. At each node:
  - Select m features at random out of all the features.
  - Split the node using whichever of those m features gives the best split according to the objective function (for example, maximizing the information gain).
- Repeat the above steps k times, once for each of the k trees specified.
- Aggregate the predictions of the different trees and come up with a final prediction based on majority voting (classification) or averaging (regression).
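The steps above can be sketched by hand with scikit-learn's DecisionTreeClassifier as the building block. This is a minimal illustration under stated assumptions, not how RandomForestClassifier is implemented internally: the helper names fit_simple_forest and predict_majority are made up for this example, k=25 trees is an arbitrary choice, and the iris dataset is used only because it is small. The per-split feature sampling is delegated to the tree's max_features option.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def fit_simple_forest(X, y, k=25, seed=0):
    """Hypothetical helper: train k trees, each on its own bootstrap sample."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    trees = []
    for _ in range(k):
        # Step 1: bootstrap sample of size n, drawn with replacement
        idx = rng.integers(0, n, size=n)
        # Step 2: max_features="sqrt" makes the tree consider a random
        # subset of the features at every split
        tree = DecisionTreeClassifier(
            criterion="entropy",
            max_features="sqrt",
            random_state=int(rng.integers(0, 1_000_000)),
        )
        tree.fit(X[idx], y[idx])
        trees.append(tree)  # Step 3: repeat for k trees
    return trees


def predict_majority(trees, X):
    """Hypothetical helper: every tree votes, the most common label wins."""
    votes = np.stack([tree.predict(X) for tree in trees]).astype(int)
    # votes has shape (k, n_samples); count the votes column by column
    return np.array([np.bincount(column).argmax() for column in votes.T])


X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

trees = fit_simple_forest(X_train, y_train)
y_pred = predict_majority(trees, X_test)
accuracy = float(np.mean(y_pred == y_test))
print("Accuracy of the hand-rolled forest: %.3f" % accuracy)
```

Note that scikit-learn's own RandomForestClassifier combines the trees by averaging their predicted class probabilities rather than counting hard votes, so its predictions can differ slightly from this sketch.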
The diagram below represents the above-mentioned steps:
Note how random samples of the data (drawn using bootstrap sampling) with different feature sets are taken and used to create decision trees of different sizes. This is why the set of trees is called a random forest. The prediction is an aggregation of the classification output from each of the decision trees.
Here is another interesting image which I could find on the internet. Picture courtesy (Jinsol Kim page)
How is a random forest classifier less likely to overfit?
Here is how a random forest is less likely to overfit:
- Each tree is trained on a random bootstrap sample of the data rather than on the full training set, so no two trees see exactly the same examples.
- Each tree considers only a random subset of the features when splitting, so the trees differ from one another and tend to make different errors.
- The results of the multiple decision trees are combined by voting or averaging, so the errors of individual trees tend to cancel out, which reduces the variance of the overall model.
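One way to see this effect is to compare a single fully grown decision tree with a random forest on the same noisy data. The sketch below assumes a synthetic dataset from make_classification with 10% of the labels flipped (flip_y=0.1) and 100 trees; these settings are chosen purely for illustration, and the exact numbers will vary with the data and the seed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with label noise (an assumption made for this illustration)
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# A single unpruned tree versus an ensemble of 100 trees
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

scores = {}
for name, model in [("single tree", tree), ("random forest", forest)]:
    # Store (training accuracy, test accuracy) for each model
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print("%-14s train accuracy: %.3f  test accuracy: %.3f" % ((name,) + scores[name]))
```

A large gap between training and test accuracy is the usual sign of overfitting. The unpruned tree fits the training set perfectly, noise included, so comparing its test accuracy with that of the forest shows how much the averaging helps on this particular dataset.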
What hyperparameters can be tuned for random forest classifiers?
The following represents some of the hyperparameters that can be tuned for random forest classifiers:
- criterion: The function to measure the quality of a split. Supported criteria are "gini" for the Gini impurity and "entropy" for the information gain. Gini impurity is defined as one minus the sum of the squared probabilities of each class, while information gain is defined as the decrease in entropy from a parent node to its child nodes. A decrease in entropy can be understood as an increase in the purity of the nodes. In other words, with the entropy criterion each tree in the random forest tries to maximize the information gain at each split.
- max_depth: The maximum depth of each tree. If None, nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.
- min_samples_split: The minimum number of samples required to split an internal node.
- random_state: The random seed that controls the bootstrap sampling of the data and the random selection of features. Fixing it makes the results reproducible.
- n_estimators: The number of decision trees in the random forest.
- max_features: The maximum number of features to consider when looking for the best split.
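To make the two split criteria concrete, here is a small worked example that computes Gini impurity, entropy, and information gain for one candidate split. The function names and the toy labels are made up for this illustration; scikit-learn computes these quantities internally and does not expose them under these names.

```python
import numpy as np


def gini_impurity(labels):
    """1 minus the sum of squared class probabilities."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)


def entropy(labels):
    """Shannon entropy in bits over the classes present in labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))


def impurity_decrease(parent, left, right, impurity=entropy):
    """Parent impurity minus the size-weighted impurity of the two children.

    With impurity=entropy this is the information gain.
    """
    n = len(parent)
    children = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - children


# Toy node with 4 samples of each class, split into two children
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
left = np.array([0, 0, 0, 1])
right = np.array([0, 1, 1, 1])

print("Gini impurity of parent: %.3f" % gini_impurity(parent))            # 0.500
print("Entropy of parent:       %.3f" % entropy(parent))                  # 1.000
print("Information gain:        %.3f" % impurity_decrease(parent, left, right))  # 0.189
print("Gini decrease:           %.3f" % impurity_decrease(parent, left, right, gini_impurity))  # 0.125
```

Both criteria reward the same thing, children that are purer than their parent. In practice they often select similar splits, so the choice between "gini" and "entropy" is usually a minor tuning decision.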
Advantages & disadvantages of Random forest classifier
Random forests are often more accurate than a single decision tree because they reduce the variance of the model and are therefore less likely to overfit. This is done by aggregating the predictions of the individual trees. Random forests also tend to be more robust to outliers than a single decision tree, and some implementations can handle missing values. They are also relatively easy to tune: they usually perform reasonably well with default settings, and only a handful of hyperparameters need attention.
The following represents some of the key disadvantages of using a random forest classifier:
- Random forest classifiers can be slow to train. However, the accuracy and flexibility of random forest models make them worth the extra time investment.
- Random Forest classifiers can be difficult to interpret.
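Although a forest of many trees cannot be read like a single tree, a fitted scikit-learn forest exposes a feature_importances_ attribute that gives a coarse view of which inputs the model relies on. The sketch below is only an illustration: it assumes the iris dataset and 100 trees, and impurity-based importances are a rough summary rather than a full explanation of the model.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()

# Fit a forest on all four iris features (settings chosen for illustration)
forest = RandomForestClassifier(n_estimators=100, random_state=1)
forest.fit(iris.data, iris.target)

# Impurity-based importances, one value per feature, summing to 1
importances = dict(zip(iris.feature_names, forest.feature_importances_))
for name, importance in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<20} {importance:.3f}")
```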
Random Forest Classifier – Python Code Example
Here are the steps that can be followed to implement random forest classification models in Python:
- Load the required libraries: The first step is to load the required libraries. We will need the random forest classifier from scikit-learn and NumPy.
- Import the dataset: Next, we will import the dataset. For this example, we will use the iris dataset that is included in scikit-learn. This dataset contains 150 samples of irises, each of which has four features: sepal length, sepal width, petal length, and petal width. The goal is to predict the species of iris based on these features. The code in this post keeps only the two petal features so that the decision regions can be plotted in two dimensions.
- Split the dataset into training and test sets: We will split the dataset into training and test sets. We will use 70% of the data for training and 30% for testing.
- Train the model on the training set: Next, we will train the random forest classifier on the training set.
- Make predictions on the test set: Next, we will make predictions on the test set and evaluate the accuracy of our model.
- Hyperparameter tuning: Once we have a basic model working, we can improve its performance by tuning the hyperparameters.
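The tuning step listed last can be sketched with scikit-learn's GridSearchCV, which tries every combination in a parameter grid using cross-validation on the training set. The grid values below are assumptions picked for illustration, not recommended settings, and the example uses all four iris features.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# Candidate values to try (illustrative choices only)
param_grid = {
    "n_estimators": [10, 50, 100],
    "max_depth": [None, 2, 4],
    "criterion": ["gini", "entropy"],
}

# 5-fold cross-validation over every combination in the grid
search = GridSearchCV(
    RandomForestClassifier(random_state=1),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best cross-validation accuracy: %.3f" % search.best_score_)
# After fitting, the search object predicts with the best model refit on all training data
print("Test accuracy: %.3f" % search.score(X_test, y_test))
```

The test set is touched only once, after the search has finished, so the reported test accuracy is not biased by the tuning itself.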
Here is a code sample for training a Random Forest Classifier in Python. Note the usage of the n_estimators hyperparameter, which is set to 5 in this example.
```python
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from mlxtend.plotting import plot_decision_regions
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier

# Load IRIS data set; keep only petal length and petal width
iris = datasets.load_iris()
X = iris.data[:, 2:]
y = iris.target

# Create training/test data split (70% / 30%, stratified by class)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# Create an instance of Random Forest Classifier
forest = RandomForestClassifier(criterion='gini',
                                n_estimators=5,
                                random_state=1,
                                n_jobs=2)

# Fit the model
forest.fit(X_train, y_train)

# Measure model performance
y_pred = forest.predict(X_test)
print('Accuracy: %.3f' % accuracy_score(y_test, y_pred))
```
The model accuracy comes out to be 97.8%. Here is what the decision regions look like when plotted with the plot_decision_regions function from the mlxtend.plotting module.
```python
import numpy as np
import matplotlib.pyplot as plt
from mlxtend.plotting import plot_decision_regions

# Combine the training and test data so that all points are plotted
X_combined = np.vstack((X_train, X_test))
y_combined = np.hstack((y_train, y_test))

# plot_decision_regions function takes "forest" as classifier
fig, ax = plt.subplots(figsize=(7, 7))
plot_decision_regions(X_combined, y_combined, clf=forest)
plt.xlabel('petal length [cm]')
plt.ylabel('petal width [cm]')
plt.legend(loc='upper left')
plt.tight_layout()
plt.show()
```
Here is what the diagram looks like:
In this post, you learned some of the following:
- Random forest is an ensemble of decision trees. It can be used for both classification and regression tasks.
- Random forest helps avoid overfitting, which is one of the key problems with decision tree classifiers.
- For creating a random forest, multiple trees are created using different bootstrap samples of the data and different random subsets of the features.