In this post, you will learn about one of feature selection techniques namely sequential forward selection with Python code example. Refer to my earlier post on sequential backward selection technique for feature selection. Sequential forward selection algorithm is a part of sequential feature selection algorithms. Some of the following topics will be covered in this post:
Sequential feature selection algorithms including sequential forward selection algorithm belongs to the family of greedy search algorithms which are used to reduce an initial d-dimensional feature space to a k-dimensional feature subspace where k < d. The idea is to select a subset of features that is most relevant to the problem, which results in optimal computation efficiency while achieving reduced generalization error by filtering out irrelevant features (that acts as a noise).
Some of the common techniques used for feature selection includes regularization techniques (L1 / L2 norm) and sequential forward / backward feature selection. For those algorithms such as K-Nearest Neighbours which do not support regularisation techniques, sequential feature selection is commonly used. Here is a good read on sequential feature selection technique. Note that the sequential feature selection techniques are based on greedy search algorithms which applies combinatorial methods for feature search.
Sequential forward selection algorithm is about execution of the following steps to search the most appropriate features out of N features to fit in K-features subset.
Here is the Python code sample for the algorithm. Note some of the following in the code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from itertools import combinations
from sklearn.base import clone
from sklearn.metrics import accuracy_score
class SequentialForwardSelection():
'''
Instantiate with Estimator and given number of features
'''
def __init__(self, estimator, k_features):
self.estimator = clone(estimator)
self.k_features = k_features
'''
X_train - Training data Pandas dataframe
X_test - Test data Pandas dataframe
y_train - Training label Pandas dataframe
y_test - Test data Pandas dataframe
'''
def fit(self, X_train, X_test, y_train, y_test):
max_indices = tuple(range(X_train.shape[1]))
total_features_count = len(max_indices)
self.subsets_ = []
self.scores_ = []
self.indices_ = []
'''
Iterate through the feature space to find the first feature
which gives the maximum model performance
'''
scores = []
subsets = []
for p in combinations(max_indices, r=1):
score = self._calc_score(X_train.values, X_test.values, y_train.values, y_test.values, p)
scores.append(score)
subsets.append(p)
#
# Find the single feature having best score
#
best_score_index = np.argmax(scores)
self.scores_.append(scores[best_score_index])
self.indices_ = list(subsets[best_score_index])
self.subsets_.append(self.indices_)
#
# Add a feature one by one until k_features is reached
#
dim = 1
while dim < self.k_features:
scores = []
subsets = []
current_feature = dim
'''
Add the remaining features one-by-one from the remaining feature set
Calculate the score for every feature combinations
'''
idx = 0
while idx < total_features_count:
if idx not in self.indices_:
indices = list(self.indices_)
indices.append(idx)
score = self._calc_score(X_train.values, X_test.values, y_train.values, y_test.values,indices)
scores.append(score)
subsets.append(indices)
idx += 1
#
# Get the index of best score
#
best_score_index = np.argmax(scores)
#
# Record the best score
#
self.scores_.append(scores[best_score_index])
#
# Get the indices of features which gave best score
#
self.indices_ = list(subsets[best_score_index])
#
# Record the indices of features for best score
#
self.subsets_.append(self.indices_)
dim += 1
self.k_score_ = self.scores_[-1]
'''
Transform training, test data set to the data set
havng features which gave best score
'''
def transform(self, X):
return X.values[:, self.indices_]
'''
Train models with specific set of features
indices - indices of features
'''
def _calc_score(self, X_train, X_test, y_train, y_test, indices):
self.estimator.fit(X_train[:, indices], y_train.ravel())
y_pred = self.estimator.predict(X_test[:, indices])
score = accuracy_score(y_test, y_pred)
return score
Here is the code which represents how an instance of LogisticRegression can be passed with training and test data set and the best features are derived. Although regularization technique can be used with LogisticRegression, this is just used for illustration purpose.
#
# Instantiate the estimator - LogisticRegression
#
lr = LogisticRegression(C=1.0, random_state=1)
#
# Number of features
#
k_features = 5
#
# Instantiate SequentialBackwardSearch
#
sfs = SequentialForwardSelection(lr, k_features)
#
# Fit the data to determine the k_features which give the
# most optimal model performance
#
sfs.fit(X_train, X_test, y_train, y_test)
#
# Transform the training data set to dataset having k_features
# giving most optimal model performance
#
X_train_sfs = sfs.transform(X_train)
#
# Transform the test data set to dataset having k_features
#
X_test_sfs = sfs.transform(X_test)
Here is a glimpse of the training data used in the above example:
Here is the plot representing the model performance vs number of features which got derived from executing sequential forward selection algorithm.
k_features = [len(k) for k in sfs.subsets_]
plt.plot(k_features, sfs.scores_, marker='o')
plt.ylim([0.7, 1.02])
plt.ylabel('Accuracy')
plt.xlabel('Number of features')
plt.grid()
plt.tight_layout()
plt.show()
Here is how the plot would look like:
In recent years, artificial intelligence (AI) has evolved to include more sophisticated and capable agents,…
Adaptive learning helps in tailoring learning experiences to fit the unique needs of each student.…
With the increasing demand for more powerful machine learning (ML) systems that can handle diverse…
Anxiety is a common mental health condition that affects millions of people around the world.…
In machine learning, confounder features or variables can significantly affect the accuracy and validity of…
Last updated: 26 Sept, 2024 Credit card fraud detection is a major concern for credit…