In this post, you will learn how to train a Random Forest Classifier using the Python Sklearn library. The code will be helpful if you are a beginner data scientist or just want a quick code sample to get started with training a machine learning model using the Random Forest algorithm. The following topics will be covered:
- Brief introduction to Random Forest
- Python code example for training a random forest classifier
Brief Introduction to Random Forest Classifier
A random forest is an ensemble of several decision trees. The idea is to aggregate the predictions of multiple decision trees into a final outcome, using majority voting for classification or averaging for regression. This helps a model trained using random forest generalize better to the larger population. In addition, the model becomes less susceptible to overfitting (high variance) than a single decision tree. Here are the key steps of the random forest algorithm:
- Draw a random bootstrap sample of size n (randomly choose n examples from the training set with replacement).
- Grow a decision tree from the above sample. At each node:
  - Select m features at random out of all the features.
  - Split the node using the one of these m features that gives the best split according to the objective function (for example, maximising the information gain).
- Repeat the above steps k times, once for each of the k trees specified.
- Aggregate the prediction outcome of different trees and come up with final prediction based on majority voting or averaging.
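The steps above can be sketched in plain Python. This is a minimal illustration, not how scikit-learn implements random forests: each tree is reduced to a single split (a decision stump) to keep the code short, and the names fit_stump, fit_forest and predict, the toy dataset, and the defaults k=25 and m=1 are all assumptions made for this sketch.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def fit_stump(X, y, feature_ids):
    """Return (feature, threshold, left_label, right_label) for the split
    with the highest information gain among the given features.

    A depth-1 tree (stump) stands in for a full decision tree here.
    """
    parent = entropy(y)
    best = None  # (gain, feature, threshold, left_label, right_label)
    for f in feature_ids:
        for threshold in sorted({row[f] for row in X}):
            left = [label for row, label in zip(X, y) if row[f] <= threshold]
            right = [label for row, label in zip(X, y) if row[f] > threshold]
            if not left or not right:
                continue  # not a real split
            weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
            gain = parent - weighted
            if best is None or gain > best[0]:
                best = (gain, f, threshold,
                        Counter(left).most_common(1)[0][0],
                        Counter(right).most_common(1)[0][0])
    if best is None:  # no valid split found: always predict the majority class
        majority = Counter(y).most_common(1)[0][0]
        return (0, float("inf"), majority, majority)
    return best[1:]


def fit_forest(X, y, k=25, m=1, seed=0):
    """Grow k stumps, each on its own bootstrap sample and random feature subset."""
    rng = random.Random(seed)
    n, n_features = len(X), len(X[0])
    forest = []
    for _ in range(k):
        idx = [rng.randrange(n) for _ in range(n)]  # n draws with replacement
        X_boot = [X[i] for i in idx]
        y_boot = [y[i] for i in idx]
        feature_ids = rng.sample(range(n_features), m)  # m random features
        forest.append(fit_stump(X_boot, y_boot, feature_ids))
    return forest


def predict(forest, row):
    """Aggregate the individual tree predictions by majority voting."""
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return Counter(votes).most_common(1)[0][0]


# Hypothetical toy data: two features, two well-separated classes.
X = [[1.0, 5.0], [1.5, 4.0], [2.0, 6.0], [2.5, 5.5],
     [6.0, 1.0], [6.5, 2.0], [7.0, 1.5], [7.5, 0.5]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

forest = fit_forest(X, y)
print(predict(forest, [0.5, 7.0]), predict(forest, [8.0, 0.2]))
```

Because a stump has only one node, choosing the m features once per tree is the same as choosing them at each node. A full implementation would repeat the feature sampling and splitting recursively until a stopping criterion is met.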
Random Forest Classifier – Python Code Example
Here is a code sample for training a Random Forest Classifier in Python. Note the usage of the n_estimators hyperparameter, which sets the number of trees in the forest. The value of n_estimators is set to 5 in the code below.
```python
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from mlxtend.plotting import plot_decision_regions
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier

#
# Load IRIS data set
#
iris = datasets.load_iris()
X = iris.data[:, 2:]
y = iris.target
#
# Create training/ test data split
#
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)
#
# Create an instance of Random Forest Classifier
#
forest = RandomForestClassifier(criterion='gini', n_estimators=5, random_state=1, n_jobs=2)
#
# Fit the model
#
forest.fit(X_train, y_train)
#
# Measure model performance
#
y_pred = forest.predict(X_test)
print('Accuracy: %.3f' % accuracy_score(y_test, y_pred))
```
The model accuracy on the test set comes out to be 97.8%. Here is how the decision regions look when plotted with the plot_decision_regions function from the mlxtend.plotting module.
```python
import numpy as np
from mlxtend.plotting import plot_decision_regions

# Combine the training and test data so that all points are plotted
X_combined = np.vstack((X_train, X_test))
y_combined = np.hstack((y_train, y_test))
#
# plot_decision_regions function takes "forest" as classifier
#
fig, ax = plt.subplots(figsize=(7, 7))
plot_decision_regions(X_combined, y_combined, clf=forest)
plt.xlabel('petal length [cm]')
plt.ylabel('petal width [cm]')
plt.legend(loc='upper left')
plt.tight_layout()
plt.show()
```
[Figure: decision regions of the random forest classifier, plotted over petal length and petal width]
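To get a feel for the n_estimators hyperparameter, one option is to retrain the classifier with a few different numbers of trees and compare the test accuracy. The sketch below reuses the same IRIS petal features and train/test split as the example above. The candidate values (1, 5, 25, 100) are arbitrary choices for illustration, and on a dataset this small the differences may be minor.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Same data and split as in the example above
iris = datasets.load_iris()
X = iris.data[:, 2:]  # petal length and petal width
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

# Train one forest per candidate value (values chosen only for illustration)
scores = {}
for n_trees in (1, 5, 25, 100):
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=1)
    forest.fit(X_train, y_train)
    scores[n_trees] = accuracy_score(y_test, forest.predict(X_test))

for n_trees, acc in scores.items():
    print(f'n_estimators={n_trees:>3}: test accuracy = {acc:.3f}')
```

A single train/test split gives a noisy estimate, so the numbers printed here are best read as a rough indication rather than a tuning result.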
In this post, you learned the following:
- Random forest is an ensemble of decision trees.
- Random forest helps reduce overfitting, which is one of the key problems with decision tree classifiers.
- For creating a random forest, multiple trees are grown using different bootstrap samples and random feature subsets.
- One of the key hyperparameters of random forest is the number of trees, set using n_estimators.
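As a rough way to check the overfitting point on your own data, you can compare the training and test accuracy of a single decision tree against a random forest. The sketch below does this on the same IRIS petal features. The choice of 25 trees and the default tree settings are assumptions for illustration, and on a small, easy dataset like IRIS the gap between the two models may be small or absent.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
X = iris.data[:, 2:]  # petal length and petal width
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

models = {
    'single decision tree': DecisionTreeClassifier(random_state=1),
    'random forest (25 trees)': RandomForestClassifier(n_estimators=25, random_state=1),
}

# A large gap between training and test accuracy hints at overfitting
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f'{name}: train = {results[name][0]:.3f}, test = {results[name][1]:.3f}')
```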