
Feature Importance & Random Forest – Sklearn Python Example

Last updated: 9th Dec, 2023

When building machine learning classification and regression models, understanding which features most significantly impact your model’s predictions can be as crucial as the predictions themselves. This post delves into the concept of feature importance in the context of one of the most popular algorithms available – the Random Forest. Whether used for classification or regression tasks, Random Forest not only offers robust and accurate predictions but also provides insightful metrics for finding the most important features in your dataset. You will learn how to use the Random Forest regression and classification algorithms to determine feature importance, with Sklearn Python code examples. Understanding feature importance and feature selection techniques is very important for data scientists, as it allows them to train machine learning models on the most relevant features. Recall that other feature selection techniques include L-norm regularization and greedy search algorithms such as sequential backward / sequential forward selection.

What & Why of Feature Importance?

Feature importance is a key concept in machine learning that refers to the relative importance of each feature in the training data. In other words, it tells us which features contribute most to predicting the values of the target variable accurately. Determining feature importance is one of the key steps of the machine learning model development pipeline. Feature importance can be calculated in a number of ways, but all methods typically rely on calculating some sort of score that measures how often a feature is used in the model and how much it contributes to the overall predictions.

Beyond Random Forest, feature importance in Python can be assessed using Linear Models for coefficient analysis, Gradient Boosting Machines (XGBoost, LightGBM) for built-in importance metrics, Permutation Importance for model-independent assessment, SHAP values for detailed explanations, and dimensionality reduction using PCA.
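
For instance, permutation importance, one of the model-independent techniques mentioned above, can be computed with sklearn’s permutation_importance function. The following is a minimal sketch; the wine dataset, the RandomForestClassifier, and the hyperparameters are chosen here purely for illustration.

from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Illustrative dataset and model; swap in your own
X, y = load_wine(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

model = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_train, y_train)

# Shuffle each feature on held-out data and measure the resulting drop in score
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=1)
for name, score in zip(X.columns, result.importances_mean):
    print(f"{name}: {score:.4f}")

Because permutation importance is computed on held-out data, it reflects how much the model actually relies on each feature at prediction time rather than how often the feature was used during training.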

Feature importance example

Imagine a healthcare organization developing a machine learning model to predict the likelihood of patients being readmitted to the hospital within 30 days of discharge. They use a variety of features like patient age, diagnosis, length of stay, number of previous admissions, medication types, and lifestyle factors such as smoking status. By applying feature importance analysis using a Random Forest algorithm, the team discovers that the length of stay, number of previous admissions, and certain types of medication are the most important features in predicting readmissions. Surprisingly, lifestyle factors like smoking status, initially thought to be significant, have minimal impact according to the model. This insight allows the healthcare team to focus on optimizing post-discharge plans especially for patients with longer hospital stays and multiple admissions, potentially tailoring medication strategies to reduce readmission rates.

Why is feature importance important?

The following are some of the key reasons why feature importance matters:

  • Model Simplification: By identifying which features contribute most to the model’s predictions, we can simplify the model by selecting only the most important features. This not only reduces the complexity but also speeds up the training process.
  • Model Interpretability: Knowing which features are most important in a model helps in interpreting the model’s decisions. This is particularly important in fields where understanding the decision-making process is as important as the decision itself, such as in healthcare or finance.
  • Improved Accuracy: Sometimes, removing irrelevant or less important features can actually improve the model’s accuracy by reducing the noise in the data.
  • Enhanced Feature Engineering: Insights from feature importance can guide the process of feature engineering, helping in creating more meaningful features that could improve model performance.
  • Regulatory and Ethical Compliance: In industries with strict regulatory requirements, understanding feature importance is crucial to ensure that models do not rely on discriminatory or sensitive features.
  • Detecting Data Quality Issues: If unexpected features are identified as highly important, it might indicate issues with data quality or biases in the data, prompting a reevaluation of the dataset.

Feature importance & feature selection go hand-in-hand

Feature importance is used to select features for building models, debugging models, and understanding the data. The outcome of the feature importance stage is a set of features together with a measure of their importance. Once the importance of the features has been determined, the features can be selected appropriately, as shown in the sketch below. Note that selecting only the key features yields models with lower computational complexity while reducing the generalization error introduced by the noise of less important features.
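
As an illustration of how feature importance feeds directly into feature selection, the following minimal sketch uses sklearn’s SelectFromModel with a random forest; the wine dataset and the "mean" threshold are assumptions made purely for illustration.

from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Illustrative dataset; swap in your own features and target
X, y = load_wine(return_X_y=True, as_frame=True)

# Keep only features whose importance exceeds the mean importance across features
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=1),
    threshold="mean"
)
selector.fit(X, y)
print("Selected features:", list(X.columns[selector.get_support()]))

The threshold can be tightened (e.g., "1.5*mean") or replaced with the max_features parameter to keep a fixed number of top features.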

Feature importance metrics

Sklearn’s impurity-based feature importances are reported on a scale from 0 to 1, with 0 indicating that a feature was never used in a split and larger values indicating a greater contribution; the scores are normalized so that they sum to 1 across all features. Permutation-based importance values, by contrast, can be negative, which indicates that shuffling the feature actually improved the model’s score, i.e., the feature is adding noise rather than signal.
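
The following short sketch, using the wine dataset purely as an example, demonstrates the impurity-based convention in sklearn: the scores are non-negative and normalized to sum to 1.

import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

# Illustrative dataset and model
X, y = load_wine(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

importances = model.feature_importances_
print(importances.min() >= 0)              # True: impurity-based scores are non-negative
print(np.isclose(importances.sum(), 1.0))  # True: they are normalized to sum to 1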

Feature Importance using Random Forest

As discussed in the previous section, feature importance in machine learning involves evaluating the significance of the input features in predicting the target variable. The random forest algorithm, encompassing both the classifier and regressor variants, stands out for its inherent ability to rank features based on their importance. The random forest classifier feature importance and the random forest regressor feature importance are derived from the average decrease in impurity across all trees within the model, a quantity exposed through the feature_importances_ attribute in the sklearn library.

Understanding what feature importance in random forest entails is key to model optimization and interpretability. Whether one employs the RandomForestClassifier or RandomForestRegressor from sklearn, the technique remains consistent. It allows for insightful feature importance visualization, making it easier to communicate the results. In Python, libraries such as sklearn provide straightforward methods to not only compute but also to retrieve and visualize this information, through feature_importances_ or by using methods tailored for feature selection in Python.

Python Sklearn RandomForestClassifier for Feature Importance

Sklearn RandomForestClassifier can be used for determining feature importance. The feature importance values are computed during training and can be accessed via the feature_importances_ attribute after fitting the RandomForestClassifier model. The Sklearn wine dataset is used for illustration purposes. Here are the steps:

  • Create training and test split
  • Train the model using RandomForestClassifier
  • Get the feature importance value
  • Visualize the feature importance

Create the Train / Test Split

Here is the Python code for creating the training and test split of the Sklearn wine dataset. The code demonstrates how to work with a Pandas dataframe and a Numpy array (ndarray) interchangeably by converting the Numpy arrays to Pandas dataframes.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import datasets
#
# Load the wine dataset
#
wine = datasets.load_wine()
df = pd.DataFrame(wine.data)
df[13] = wine.target
cols = ['alcohol', 'malic_acid', 'ash', 'ash_alcalinity', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od_dilutedwines', 'proline']
df.columns = cols + ['class']
#
# Create training and test split
#
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, :-1], df.iloc[:, -1:], test_size=0.3, random_state=1)
#
# Feature scaling
#
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
#
# Training / test dataframes with the feature names restored
#
X_train_std = pd.DataFrame(X_train_std, columns=cols)
X_test_std = pd.DataFrame(X_test_std, columns=cols)

Train the model using Sklearn RandomForestClassifier

Here is the Python code for training the RandomForestClassifier model using the training and test data sets created in the previous section:

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=500,
                                random_state=1)
#
# Train the model
#
forest.fit(X_train_std, y_train.values.ravel())

Determine feature importance values

Here is the Python code which can be used for determining feature importance. The attribute feature_importances_ gives the importance of each feature in the order in which the features appear in the training dataset. Note how the indices are arranged in descending order of importance using the argsort method (the most important feature appears first).

import numpy as np

importances = forest.feature_importances_
#
# Sort the feature importances in descending order
#
sorted_indices = np.argsort(importances)[::-1]

# All columns except the last ('class') are feature labels
feat_labels = df.columns[:-1]

for f in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30, 
                            feat_labels[sorted_indices[f]], 
                            importances[sorted_indices[f]]))

Running the code above prints the features ranked from most to least important along with their importance scores.

Interpreting feature importance using visualization plot

We will create a plot for interpreting feature importance from the output of the random forest classifier. With the sorted indices in place, the following Python code will help create a bar chart for visualizing feature importance.

import matplotlib.pyplot as plt

plt.title('Feature Importance')
plt.bar(range(X_train.shape[1]), importances[sorted_indices], align='center')
plt.xticks(range(X_train.shape[1]), X_train.columns[sorted_indices], rotation=90)
plt.tight_layout()
plt.show()

Here is how the matplotlib.pyplot visualization of feature importance looks. Because the bars are ordered by the sorted indices, the most important features appear on the left of the plot.

Fig 1. Visualization plot for feature importance using RandomForestClassifier

Python Sklearn RandomForestRegressor for Feature Importance

The following Python code snippet demonstrates how to extract and visualize feature importance from a Random Forest Regressor using the Boston housing dataset from sklearn. The code begins by importing the necessary modules, loading the dataset, and then splitting it into features and the target variable. A Random Forest Regressor model is instantiated and fitted to the data. The feature_importances_ attribute of the fitted model is used to obtain the importance scores of each feature. These importance scores indicate the relative contribution of each feature to the model’s predictions. Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2; with recent versions of sklearn, a dataset such as fetch_california_housing can be substituted.

import matplotlib.pyplot as plt
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor

# Load the Boston housing dataset
# (load_boston was removed in scikit-learn 1.2; use fetch_california_housing with newer versions)
boston = load_boston()
X, y = boston.data, boston.target

# Create a Random Forest Regressor
rfr = RandomForestRegressor(n_estimators=100, random_state=42)

# Fit the model
rfr.fit(X, y)

# Get feature importances
importances = rfr.feature_importances_
indices = range(len(importances))

# Rearrange feature names so they match the sorted feature importances
names = [boston.feature_names[i] for i in importances.argsort()]

# Plot Feature Importance
plt.figure()
plt.title("Feature Importance with Random Forest Regressor")
plt.barh(indices, sorted(importances), align='center')
plt.yticks(indices, names)
plt.xlabel('Relative Importance')
plt.show()

Interpreting feature importance using visualization plot

We will create a plot for interpreting feature importance from the output of the random forest regressor. The visualization plots the importance score of each feature used in training the model. Let’s look into how to interpret feature importance from this plot.

The plot generated above displays the feature importance as determined by a Random Forest Regressor trained on the Boston housing dataset. Each bar represents a feature from the dataset, and the length of the bar signifies the relative importance of that feature in predicting the target variable, which in this case is the median value of owner-occupied homes.

From this plot, we can deduce which features the Random Forest algorithm found most predictive for housing prices. Features with longer bars are deemed more significant by the model and thus have a greater impact on the model’s predictions. For instance, if ‘RM’ (average number of rooms per dwelling) has a long bar, it is considered a strong predictor of housing prices.

However, it’s important to note that while feature importance provides valuable insights, it does not indicate the nature of the relationship between the feature and the target (whether positive or negative) and may not capture dependencies between features. Thus, it should be one of many tools used for model interpretation.

Frequently Asked Questions (FAQs)

The following are some of the most common FAQs related to assessing feature importance using random forest algorithms for regression and classification:

  • Can feature importance be determined using decision tree model?
    • Decision tree models assess feature importance by measuring how much each feature contributes to reducing a criterion like Gini impurity or mean squared error at each node. This importance is aggregated across the tree, with the sum normalized, indicating the relative influence of each feature (see the short sketch after this list).
  • Can feature importance be found with random forest regression models?
    • Yes; this is demonstrated in this blog post.
  • Can feature importance be found with random forest classifier models?
    • Yes; this is demonstrated in this blog post.
  • How to use random forest for feature selection?
    • To use Random Forest for feature selection, train the model on your dataset and then evaluate the feature importances provided by the model. These importances are based on how much each feature decreases the impurity in the model’s decision trees. Select the most important features based on a threshold or a specific number of top features. This approach helps in identifying the most relevant features for the prediction task.
  • How is Gini impurity related with feature importance?
    • Gini impurity is a measure used in decision trees, including Random Forests, to quantify how often a randomly chosen element would be incorrectly labeled if it was randomly labeled according to the distribution of labels in the subset. In the context of feature importance, a feature is considered more important if splitting on that feature significantly decreases the Gini impurity in the nodes of the tree. Essentially, a greater decrease in Gini impurity due to a split on a particular feature indicates higher importance of that feature in the decision-making process of the tree. This decrease in impurity is aggregated over all trees in a Random Forest to assess the overall importance of each feature.
  • What are different feature importance algorithms?
    • Various algorithms assess feature importance in machine learning. Tree-based models like Decision Trees and Random Forests use Gini Importance, based on node impurity reduction. Permutation Feature Importance evaluates the impact of feature shuffling on model performance. Linear model coefficients indicate importance, requiring feature normalization. SHAP and LIME provide advanced, model-agnostic explanations of feature contributions. Principal Component Analysis (PCA) assists in feature reduction, while techniques like layer-wise relevance propagation (LRP) are used in neural networks.
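
As a quick illustration of the decision tree case discussed in the first FAQ above, here is a minimal sketch; the wine dataset and the max_depth setting are assumptions made for illustration. The scores come from the total Gini impurity reduction attributed to each feature, normalized to sum to 1.

from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

# Illustrative dataset; a single decision tree exposes the same feature_importances_ attribute
X, y = load_wine(return_X_y=True, as_frame=True)
tree = DecisionTreeClassifier(criterion="gini", max_depth=4, random_state=1).fit(X, y)

# Print features in descending order of importance, skipping unused ones
for name, score in sorted(zip(X.columns, tree.feature_importances_), key=lambda t: -t[1]):
    if score > 0:
        print(f"{name}: {score:.4f}")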
