Random Forest Classifier Python Code Example


In this post, you will learn how to train a Random Forest classifier using the Python Sklearn library. This code will be helpful if you are a beginner data scientist or simply want a quick code sample to get started with training a machine learning model using the Random Forest algorithm. The following topics will be covered:

  • A brief introduction to the Random Forest classifier
  • A Python code example for training a Random Forest classifier

Brief Introduction to Random Forest Classifier

A random forest can be considered an ensemble of several decision trees. The idea is to aggregate the prediction outcomes of multiple decision trees and produce a final prediction via an averaging mechanism (majority voting for classification). This helps a model trained with random forest generalize better to the larger population, and it makes the model less susceptible to overfitting / high variance. Here are the key steps of the random forest algorithm (a concrete sketch follows the list):

  • Take a random sample of size n (randomly choose n examples with replacement).
  • Grow a decision tree from the above sample as follows:
    • Select m features at random out of all the features.
    • Create the tree by splitting the data on those m features according to the objective function (maximizing the information gain).
  • Repeat the above steps for the specified number of trees, k.
  • Aggregate the prediction outcomes of the different trees and arrive at the final prediction based on majority voting (or averaging, for regression).
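
To make these steps concrete, here is a minimal from-scratch sketch of the algorithm. It is illustrative only: fit_forest and predict_forest are hypothetical helper names (not part of sklearn), each tree is a sklearn DecisionTreeClassifier grown on a bootstrap sample, and integer class labels are assumed.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, k=5, max_features='sqrt', random_state=1):
    #
    # Grow k trees, each on a bootstrap sample of size n, with
    # m = sqrt(total features) considered at every split
    #
    rng = np.random.RandomState(random_state)
    n = X.shape[0]
    trees = []
    for _ in range(k):
        idx = rng.choice(n, size=n, replace=True)  # sample with replacement
        tree = DecisionTreeClassifier(max_features=max_features,
                                      random_state=rng.randint(10**6))
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def predict_forest(trees, X):
    #
    # Aggregate per-tree predictions by majority voting
    #
    votes = np.stack([tree.predict(X) for tree in trees])
    return np.apply_along_axis(
        lambda col: np.bincount(col.astype(int)).argmax(), 0, votes)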

Random Forest Classifier – Python Code Example

Here is the code sample for training a Random Forest classifier in Python. Note the usage of the n_estimators hyperparameter, which sets the number of trees in the forest; it is set to 5 in the example below.

import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from mlxtend.plotting import plot_decision_regions
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier

#
# Load IRIS data set
#
iris = datasets.load_iris()
X = iris.data[:, 2:]
y = iris.target

#
# Create training/ test data split
#
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)

#
# Create an instance of Random Forest Classifier
#
forest = RandomForestClassifier(criterion='gini',
                                 n_estimators=5,
                                 random_state=1,
                                 n_jobs=2)
#
# Fit the model
#
forest.fit(X_train, y_train)

#
# Measure model performance
#
y_pred = forest.predict(X_test)
print('Accuracy: %.3f' % accuracy_score(y_test, y_pred))
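
Once the model is fit, you can also inspect sklearn's impurity-based feature importances via the feature_importances_ attribute. This is an optional check; the zip over iris.feature_names[2:] matches the two petal columns selected above.

#
# Optional: inspect impurity-based feature importances
#
for name, score in zip(iris.feature_names[2:], forest.feature_importances_):
    print('%s: %.3f' % (name, score))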

The model accuracy comes out to be 97.8%. Here is how the decision regions look after plotting them with the plot_decision_regions function from the mlxtend.plotting module.

import numpy as np
from mlxtend.plotting import plot_decision_regions

X_combined = np.vstack((X_train, X_test))
y_combined = np.hstack((y_train, y_test))

#
# plot_decision_regions function takes "forest" as classifier
#
fig, ax = plt.subplots(figsize=(7, 7))
plot_decision_regions(X_combined, y_combined, clf=forest)
plt.xlabel('petal length [cm]')
plt.ylabel('petal width [cm]')
plt.legend(loc='upper left')
plt.tight_layout()
plt.show()

Here is how the diagram looks:

Fig 1. Decision boundaries created by Random Forest Classifier

Conclusion

In this post, you learned the following:

  • Random forest is an ensemble of decision trees.
  • Random forest helps avoid overfitting, which is one of the key problems with a single decision tree classifier.
  • To create a random forest, multiple trees are built using different bootstrap samples and feature subsets.
  • One of the key hyperparameters of random forest is the number of trees, represented by n_estimators (see the tuning sketch below).
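
As a quick illustration of tuning n_estimators, here is a minimal sketch using sklearn's GridSearchCV on the training split from the example above. The grid values are arbitrary choices for demonstration, not recommendations.

from sklearn.model_selection import GridSearchCV

#
# Search a small, arbitrary grid of tree counts with 5-fold CV
#
param_grid = {'n_estimators': [5, 25, 50, 100]}
grid = GridSearchCV(RandomForestClassifier(random_state=1),
                    param_grid, cv=5)
grid.fit(X_train, y_train)
print('Best n_estimators:', grid.best_params_['n_estimators'])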
Ajitesh Kumar