Feature Importance using Random Forest Classifier – Python


In this post, you will learn how to use the Sklearn Random Forest Classifier (RandomForestClassifier) to determine feature importance, with a Python code example. This is useful for feature selection, that is, finding the most important features when solving a classification machine learning problem. Data scientists need to understand feature importance and feature selection techniques in order to train machine learning models on the most appropriate features. Recall that other feature selection techniques include L1-norm regularization and greedy search algorithms such as sequential backward / sequential forward selection. The following topics are covered in this post:

  • Why feature importance?
  • Random Forest for feature importance
  • Using Sklearn RandomForestClassifier for Feature Importance

Why Feature Importance?

Determining feature importance is one of the key steps of the machine learning model development pipeline. The outcome of this step is a set of features along with a measure of their importance. Once the importance of each feature is determined, the features can be selected accordingly. Note that keeping only the key features results in models that need less computation, and it can reduce the generalization error caused by the noise that less important features introduce.

Random Forest for Feature Importance

With the random forest algorithm, feature importance can be measured as the average impurity decrease computed over all decision trees in the forest. This works regardless of whether the data is linearly separable or not.
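As a rough illustration of the averaging described above, the sketch below fits a small forest on the Sklearn wine data and rebuilds the forest-level importances from the individual trees. The variable names and the settings (n_estimators=50, random_state=0) are assumptions chosen for illustration. In current scikit-learn versions the forest-level value appears to be the mean of the per-tree importances, renormalized to sum to 1, and the final line checks that.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

X, y = load_wine(return_X_y=True)

# Assumed settings, chosen only to keep the example small and repeatable
forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X, y)

# Each fitted tree exposes its own impurity-based importances
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])

# Average over the trees, then renormalize so the values sum to 1
manual = per_tree.mean(axis=0)
manual = manual / manual.sum()

# True if the manual average matches the attribute reported by the forest
print(np.allclose(manual, forest.feature_importances_))
```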

Sklearn RandomForestClassifier for Feature Importance

Sklearn RandomForestClassifier can be used for determining feature importance. After the model is fitted, the importance values can be accessed via its feature_importances_ attribute. The Sklearn wine data set is used for illustration. Here are the steps:

  • Create training and test split
  • Train the model using RandomForestClassifier
  • Get the feature importance value
  • Visualize the feature importance

Create the Train / Test Split

Here is the Python code for creating the training and test split of the Sklearn wine dataset. The code also shows how to move between a NumPy array (ndarray) and a Pandas DataFrame by converting the NumPy arrays to DataFrames.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import datasets
# Load the wine dataset
wine = datasets.load_wine()
df = pd.DataFrame(wine.data)
df[13] = wine.target
df.columns = ['alcohol', 'malic_acid', 'ash', 'ash_alcalinity', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoids_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od_dilutedwines', 'proline', 'class']
# Create training and test split
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, :-1], df.iloc[:, -1:], test_size=0.3, random_state=1)
# Feature scaling: fit the scaler on the training data only, then apply it to both splits
sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)
# Training / test DataFrames (the scaler returns NumPy arrays, so restore the column names)
cols = ['alcohol', 'malic_acid', 'ash', 'ash_alcalinity', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoids_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od_dilutedwines', 'proline']
X_train_std = pd.DataFrame(X_train_std, columns=cols)
X_test_std = pd.DataFrame(X_test_std, columns=cols)
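As an alternative sketch of the same split, the wine data can be loaded directly as a DataFrame so that the column names come from the dataset instead of being typed by hand. The use of as_frame=True and stratify=y is an assumption added here for illustration and is not part of the code above. Scaling is left out of this sketch because decision trees split on thresholds and are not sensitive to the scale of the features.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

# as_frame=True returns the features as a DataFrame with named columns
wine = load_wine(as_frame=True)
X, y = wine.data, wine.target

# stratify=y (an assumption) keeps the class proportions similar in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

print(X_train.shape, X_test.shape)
print(list(X.columns))
```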

Train the model using Sklearn RandomForestClassifier

Here is the Python code for training the RandomForestClassifier model on the training data set created in the previous section:

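from sklearn.ensemble import RandomForestClassifier
# Create the classifier (default hyperparameters and random_state=1 are assumed here)
forest = RandomForestClassifier(random_state=1)
# Train the model
forest.fit(X_train_std, y_train.values.ravel())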
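Before reading the importances, it can be worth checking that the fitted model performs reasonably on the held-out data. The following is a minimal, self-contained sketch of that check. The classifier settings and the variable name test_accuracy are assumptions, and the exact score will depend on the scikit-learn version.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# Assumed settings: defaults plus a fixed seed for repeatability
forest = RandomForestClassifier(random_state=1)
forest.fit(X_train, y_train)

# score() returns the mean accuracy on the given data
test_accuracy = forest.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
```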

Determine feature importance values

Here is the Python code for determining feature importance. The feature_importances_ attribute gives the importance of each feature, in the order in which the features appear in the training dataset. Note that np.argsort returns indices in ascending order, so the result is reversed with [::-1] to put the most important feature first.

import numpy as np

importances = forest.feature_importances_
# Sort the feature importance in descending order
sorted_indices = np.argsort(importances)[::-1]
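The sorted indices can also be used to print a ranked list of feature names next to their importance values. The sketch below is self-contained and refits a forest on the full wine data, so its numbers will differ from those of the model trained above. The classifier settings are assumptions.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

wine = load_wine()

# Assumed settings: defaults plus a fixed seed for repeatability
forest = RandomForestClassifier(random_state=1)
forest.fit(wine.data, wine.target)

importances = forest.feature_importances_
sorted_indices = np.argsort(importances)[::-1]

# Print the features from most to least important
for rank, idx in enumerate(sorted_indices, start=1):
    print(f"{rank:2d}. {wine.feature_names[idx]:<30} {importances[idx]:.4f}")
```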

Visualize the feature importance

With the sorted indices in place, the following python code will help create a bar chart for visualizing feature importance.

import matplotlib.pyplot as plt

plt.title('Feature Importance')
plt.bar(range(X_train.shape[1]), importances[sorted_indices], align='center')
plt.xticks(range(X_train.shape[1]), X_train.columns[sorted_indices], rotation=90)
# Make room for the rotated labels and display the chart
plt.tight_layout()
plt.show()

Here is how the matplotlib.pyplot visualization plot looks:

Fig 1. Visualization plot for feature importance using RandomForestClassifier
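Once the importances are known, they can drive the feature selection mentioned at the start of the post. One possible way to do this, sketched below, is scikit-learn's SelectFromModel with an already fitted forest. The threshold of "mean" (keep features whose importance is at least the mean importance) is an assumption chosen for illustration, and a different cut-off may suit a given problem better.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

wine = load_wine()
X, y = wine.data, wine.target

# Assumed settings: defaults plus a fixed seed for repeatability
forest = RandomForestClassifier(random_state=1)
forest.fit(X, y)

# prefit=True reuses the fitted forest; "mean" keeps the features whose
# importance is at least the mean importance
selector = SelectFromModel(forest, threshold="mean", prefit=True)
X_selected = selector.transform(X)

# Names of the features that passed the threshold
selected = [
    name for name, keep in zip(wine.feature_names, selector.get_support()) if keep
]
print(X_selected.shape)
print(selected)
```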
Ajitesh Kumar