Last updated: 9th Dec, 2023
When building machine learning classification and regression models, understanding which features most significantly impact your model’s predictions can be as crucial as the predictions themselves. This post delves into the concept of feature importance in the context of one of the most popular algorithms available – the Random Forest. Whether used for classification or regression tasks, Random Forest not only offers robust and accurate predictions but also provides insightful metrics to find the most important features in your dataset. You will learn about how to use Random Forest regression and classification algorithms for determining feature importance using Sklearn Python code example. It is very important to understand feature importance and feature selection techniques for data scientists to use most important features for training machine learning models. Recall that other feature selection techniques includes L-norm regularization techniques, greedy search algorithms techniques such as sequential backward / sequential forward selection etc.
Feature importance is a key concept in machine learning that refers to the relative importance of each feature in the training data. In other words, it tells us which features are most important in terms of predicting the values of the target variable with most accuracy. Determining feature importance is one of the key steps of machine learning model development pipeline. Feature importance can be calculated in a number of ways, but all methods typically rely on calculating some sort of score that measures how often a feature is used in the model and how much it contributes to the overall predictions.
Beyond Random Forest, feature importance in Python can be assessed using Linear Models for coefficient analysis, Gradient Boosting Machines (XGBoost, LightGBM) for built-in importance metrics, Permutation Importance for model-independent assessment, SHAP values for detailed explanations, and dimensionality reduction using PCA.
Imagine a healthcare organization developing a machine learning model to predict the likelihood of patients being readmitted to the hospital within 30 days of discharge. They use a variety of features like patient age, diagnosis, length of stay, number of previous admissions, medication types, and lifestyle factors such as smoking status. By applying feature importance analysis using a Random Forest algorithm, the team discovers that the length of stay, number of previous admissions, and certain types of medication are the most important features in predicting readmissions. Surprisingly, lifestyle factors like smoking status, initially thought to be significant, have minimal impact according to the model. This insight allows the healthcare team to focus on optimizing post-discharge plans especially for patients with longer hospital stays and multiple admissions, potentially tailoring medication strategies to reduce readmission rates.
The following are some of the key reasons representing the importance of the feature importance:
Feature importance is used to select features for building models, debugging models, and understanding the data. The outcome of feature importance stage is a set of features along with the measure of their importance. Once the importance of features get determined, the features can be selected appropriately. One can apply feature selection and feature importance techniques to select the most important features. Note that the selection of key features results in models requiring optimal computational complexity while ensuring reduced generalization error as a result of noise introduced by less important features.
Feature importance can be measured on a scale from 0 to 1, with 0 indicating that the feature has no importance and 1 indicating that the feature is absolutely essential. Feature importance values can also be negative, which indicates that the feature is actually harmful to the model performance.
As discussed in prevision section, feature importance in machine learning involves evaluating the significance of the input features in predicting the target variable. The random forest algorithm, encompassing both the classifier and regressor variants, stands out for its inherent ability to rank features based on their importance. The random forest classifier feature importance and the random forest regressor feature importance are derived from the average decrease in impurity across all trees within the model, a process that is well-handled by the feature_importances_ attribute in the sklearn library.
Understanding what feature importance in random forest entails is key to model optimization and interpretability. Whether one employs the RandomForestClassifier or RandomForestRegressor from sklearn, the technique remains consistent. It allows for insightful feature importance visualization, making it easier to communicate the results. In Python, libraries such as sklearn provide straightforward methods to not only compute but also to retrieve and visualize this information, through feature_importances_ or by using methods tailored for feature selection in Python.
Sklearn RandomForestClassifier can be used for determining feature importance. It collects the feature importance values so that the same can be accessed via the feature_importances_ attribute after fitting the RandomForestClassifier model. Sklearn wine data set is used for illustration purpose. Here are the steps:
Here is the python code for creating training and test split of Sklearn Wine dataset. The code demonstrates how to work with Pandas dataframe and Numpy array (ndarray) alternatively by converting Numpy arrays to Pandas Dataframe.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import datasets
#
# Load the wine datasets
#
wine = datasets.load_wine()
df = pd.DataFrame(wine.data)
df[13] = wine.target
df.columns = ['alcohal', 'malic_acid', 'ash', 'ash_alcalinity', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoids_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od_dilutedwines', 'proline', 'class']
#
# Create training and test split
#
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, :-1], df.iloc[:, -1:], test_size = 0.3, random_state=1)
#
# Feature scaling
#
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
#
# Training / Test Dataframe
#
cols = ['alcohal', 'malic_acid', 'ash', 'ash_alcalinity', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoids_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od_dilutedwines', 'proline']
X_train_std = pd.DataFrame(X_train_std, columns=cols)
X_test_std = pd.DataFrame(X_test_std, columns=cols)
Here is the python code for training RandomForestClassifier model using training and test data set created in the previous section:
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators=500,
random_state=1)
#
# Train the mode
#
forest.fit(X_train_std, y_train.values.ravel())
Here is the python code which can be used for determining feature importance. The attribute, feature_importances_ gives the importance of each feature in the order in which the features are arranged in training dataset. Note how the indices are arranged in descending order while using argsort method (most important feature appears first)
import numpy as np
importances = forest.feature_importances_
#
# Sort the feature importance in descending order
#
sorted_indices = np.argsort(importances)[::-1]
feat_labels = df.columns[1:]
for f in range(X_train.shape[1]):
print("%2d) %-*s %f" % (f + 1, 30,
feat_labels[sorted_indices[f]],
importances[sorted_indices[f]]))
The following will be printed representing the feature importances.
We will create a plot for interpreting feature importance from the output of random forest classifier. With the sorted indices in place, the following python code will help create a bar chart for visualizing feature importance.
import matplotlib.pyplot as plt
plt.title('Feature Importance')
plt.bar(range(X_train.shape[1]), importances[sorted_indices], align='center')
plt.xticks(range(X_train.shape[1]), X_train.columns[sorted_indices], rotation=90)
plt.tight_layout()
plt.show()
Here is how the matplotlib.pyplot visualization plot representing feature importance looks like. Let’s look into how to interpret feature importance using this plot.
The following Python code snippet demonstrates how to extract and visualize feature importance from a Random Forest Regressor using the Boston housing dataset from sklearn. The code begins by importing the necessary modules, loading the dataset, and then splitting it into features and the target variable. A Random Forest Regressor model is instantiated and fitted to the data. The feature_importances_ attribute of the fitted model is used to obtain the importance scores of each feature. These importance scores indicate the relative contribution of each feature to the model’s predictions.
import matplotlib.pyplot as plt from sklearn.datasets import load_boston from sklearn.ensemble import RandomForestRegressor # Load the Boston housing dataset boston = load_boston() X, y = boston.data, boston.target # Create a Random Forest Regressor rfr = RandomForestRegressor(n_estimators=100, random_state=42) # Fit the model rfr.fit(X, y) # Get feature importances importances = rfr.feature_importances_ indices = range(len(importances)) # Rearrange feature names so they match the sorted feature importances names = [boston.feature_names[i] for i in importances.argsort()] # Plot Feature Importance plt.figure() plt.title("Feature Importance with Random Forest Regressor") plt.barh(indices, sorted(importances), align='center') plt.yticks(indices, names) plt.xlabel('Relative Importance') plt.show()
We will create a plot for interpreting feature importance from the output of random forest regressor. The following visualization plot plots feature scores for features used in training the model. Let’s look into how to interpret feature importance from this plot.
The plot generated above displays the feature importance as determined by a Random Forest Regressor trained on the Boston housing dataset. Each bar represents a feature from the dataset, and the length of the bar signifies the relative importance of that feature in predicting the target variable, which in this case is the median value of owner-occupied homes.
From this plot, we can deduce which features the Random Forest algorithm found most predictive for housing prices. Features with longer bars are deemed more significant by the model and thus have a greater impact on the model’s predictions. For instance, if ‘RM’ (average number of rooms per dwelling) has a long bar, it is considered a strong predictor of housing prices.
However, it’s important to note that while feature importance provides valuable insights, it does not indicate the nature of the relationship between the feature and the target (whether positive or negative) and may not capture dependencies between features. Thus, it should be one of many tools used for model interpretation
The following are some of the most common FAQs related to assessing feature importance using random forest algorithms for regression and classification:
Artificial Intelligence (AI) agents have started becoming an integral part of our lives. Imagine asking…
In the ever-evolving landscape of agentic AI workflows and applications, understanding and leveraging design patterns…
In this blog, I aim to provide a comprehensive list of valuable resources for learning…
Have you ever wondered how systems determine whether to grant or deny access, and how…
What revolutionary technologies and industries will define the future of business in 2025? As we…
For data scientists and machine learning researchers, 2024 has been a landmark year in AI…
View Comments
Thanks very useful info easy to understand