In this post, you will learn how to use the Sklearn Random Forest classifier (RandomForestClassifier) to determine feature importance, with a Python code example. Feature importance helps with feature selection by identifying the most important features when solving a classification problem. Data scientists need to understand feature importance and feature selection techniques in order to train machine learning models on the most appropriate features. Recall that other feature selection techniques include L1-norm regularization and greedy search algorithms such as sequential backward and sequential forward selection. The following topics are covered in this post:
- Why feature importance?
- Random Forest for feature importance
- Using Sklearn RandomForestClassifier for Feature Importance
Why Feature Importance?
Determining feature importance is one of the key steps of the machine learning model development pipeline. The outcome of the feature importance stage is a set of features along with a measure of their importance. Once the importance of the features has been determined, the most important ones can be selected for training. Selecting only the key features lowers the computational cost of the model and can reduce the generalization error caused by noise from less important features.
Random Forest for Feature Importance
With the random forest algorithm, feature importance can be measured as the average impurity decrease computed over all decision trees in the forest. This works regardless of whether the data is linearly separable.
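To make the "average over all trees" idea concrete, here is a minimal sketch that recomputes the forest-level importance from the individual trees. It assumes the wine dataset used later in this post and a forest of 100 trees; both choices are only for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

# Assumption: wine data and 100 trees, chosen only for illustration
X, y = load_wine(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=1)
forest.fit(X, y)

# Each fitted tree exposes its own impurity-based importances
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])

# Average across the trees, then renormalize so the values sum to 1
manual = per_tree.mean(axis=0)
manual = manual / manual.sum()

# Compare the hand-computed average with the forest-level attribute
print(np.allclose(manual, forest.feature_importances_))
```

The per-tree array also makes it possible to look at the spread of a feature's importance across trees, for example with per_tree.std(axis=0).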
Sklearn RandomForestClassifier for Feature Importance
Sklearn RandomForestClassifier can be used to determine feature importance. It collects the feature importance values, which can be accessed via the feature_importances_ attribute after fitting the model. The Sklearn wine data set is used for illustration. Here are the steps:
- Create training and test split
- Train the model using RandomForestClassifier
- Get the feature importance value
- Visualize the feature importance
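Before walking through the steps in detail, it may help to see what the feature_importances_ attribute holds. The sketch below fits a forest on the full wine data (no split or scaling, purely to keep it short) and checks the basic properties of the attribute: one non-negative value per feature, normalized to sum to 1.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

wine = load_wine()

# Assumption: default hyperparameters, fitted on the full data for brevity
forest = RandomForestClassifier(random_state=1)
forest.fit(wine.data, wine.target)

importances = forest.feature_importances_

# One importance value per input feature
print(importances.shape)

# The values are normalized, so they add up to 1 (up to rounding)
print(round(importances.sum(), 6))

# Impurity-based importances are never negative
print((importances >= 0).all())
```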
Create the Train / Test Split
Here is the Python code for creating the training and test split of the Sklearn wine dataset. The code also shows how to move between NumPy arrays (ndarray) and Pandas DataFrames by converting the scaled NumPy arrays back to DataFrames.
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn import datasets
    #
    # Load the wine dataset; add the target as the last column
    #
    wine = datasets.load_wine()
    df = pd.DataFrame(wine.data)
    df[13] = wine.target
    df.columns = ['alcohol', 'malic_acid', 'ash', 'ash_alcalinity', 'magnesium',
                  'total_phenols', 'flavanoids', 'nonflavanoids_phenols',
                  'proanthocyanins', 'color_intensity', 'hue', 'od_dilutedwines',
                  'proline', 'class']
    #
    # Create training and test split
    #
    X_train, X_test, y_train, y_test = train_test_split(
        df.iloc[:, :-1], df.iloc[:, -1:], test_size=0.3, random_state=1)
    #
    # Feature scaling (fit on the training data only)
    #
    sc = StandardScaler()
    sc.fit(X_train)
    X_train_std = sc.transform(X_train)
    X_test_std = sc.transform(X_test)
    #
    # Training / Test Dataframe
    #
    cols = ['alcohol', 'malic_acid', 'ash', 'ash_alcalinity', 'magnesium',
            'total_phenols', 'flavanoids', 'nonflavanoids_phenols',
            'proanthocyanins', 'color_intensity', 'hue', 'od_dilutedwines',
            'proline']
    X_train_std = pd.DataFrame(X_train_std, columns=cols)
    X_test_std = pd.DataFrame(X_test_std, columns=cols)
Train the model using Sklearn RandomForestClassifier
Here is the Python code for training the RandomForestClassifier model on the training data set created in the previous section:
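    from sklearn.ensemble import RandomForestClassifier
    #
    # Train the model (default hyperparameters; random_state fixed for reproducibility)
    #
    forest = RandomForestClassifier(random_state=1)
    forest.fit(X_train_std, y_train.values.ravel())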
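Importance values are only meaningful if the model fits the data reasonably well, so it can be worth checking accuracy on the held-out split before reading them. The following is a self-contained sketch of that check. It assumes the same 70/30 split with random_state=1 and skips the scaling step, since tree-based models do not depend on the scale of the features.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)

# Assumption: same split parameters as in the post, no feature scaling
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

forest = RandomForestClassifier(random_state=1)
forest.fit(X_train, y_train)

# score() returns mean accuracy for classifiers
train_acc = forest.score(X_train, y_train)
test_acc = forest.score(X_test, y_test)
print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
```

A large gap between the two numbers would suggest overfitting, in which case the importance ranking should be read with more caution.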
Determine feature importance values
Here is the Python code for determining feature importance. The feature_importances_ attribute gives the importance of each feature in the order in which the features appear in the training dataset. Note that argsort returns indices in ascending order, so the result is reversed with [::-1] to put the most important feature first.
    import numpy as np

    importances = forest.feature_importances_
    #
    # Sort the feature importance in descending order
    #
    sorted_indices = np.argsort(importances)[::-1]
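The sorted indices can be used to print a ranked list of features and to drop the less important ones. Below is a minimal, self-contained sketch. It fits on the full wine data for brevity and uses SelectFromModel with threshold="mean", which is an illustrative cutoff (keep features whose importance is at least the average importance) rather than a recommended one.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

wine = load_wine()
X, y = wine.data, wine.target

# Assumption: default hyperparameters, fitted on the full data for brevity
forest = RandomForestClassifier(random_state=1).fit(X, y)

importances = forest.feature_importances_
sorted_indices = np.argsort(importances)[::-1]

# Print features from most to least important
for rank, idx in enumerate(sorted_indices, start=1):
    print(f"{rank:2d}. {wine.feature_names[idx]:30s} {importances[idx]:.4f}")

# Keep only the features whose importance is at least the mean importance.
# prefit=True tells the selector to reuse the already fitted forest.
selector = SelectFromModel(forest, threshold="mean", prefit=True)
X_selected = selector.transform(X)
print(X_selected.shape)
```

A numeric threshold (for example threshold=0.1) can be passed instead of "mean" when a fixed cutoff is preferred.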
Visualize the feature importance
With the sorted indices in place, the following Python code creates a bar chart for visualizing feature importance.
    import matplotlib.pyplot as plt

    # One bar per feature, ordered from most to least important
    plt.title('Feature Importance')
    plt.bar(range(X_train.shape[1]), importances[sorted_indices], align='center')
    plt.xticks(range(X_train.shape[1]), X_train.columns[sorted_indices], rotation=90)
    plt.tight_layout()
    plt.show()
Here is what the matplotlib.pyplot plot looks like: