Feature Selection vs Feature Extraction: Machine Learning


Machine learning has become an increasingly important tool for businesses and researchers alike in recent years. From identifying patterns in data to making predictions about future outcomes, machine learning algorithms are now being used in a wide variety of fields. However, the success of these algorithms often depends on the quality of the features used to train them. This is where the concepts of feature selection and feature extraction come in.

In this blog post, we’ll explore the difference between feature selection and feature extraction, two key techniques used in machine learning to optimize feature sets for better model performance. Both are used for dimensionality reduction, which is one of the most important aspects of training machine learning models because it reduces model complexity and overfitting. We’ll discuss why it’s important to understand these concepts, and we’ll provide examples of how they can be applied in real-world scenarios. Whether you’re a data scientist looking to improve your machine learning models, or a business owner looking to better understand the potential of machine learning, understanding the basics of feature selection and feature extraction is essential. So, let’s dive in and explore these concepts in more detail.

Feature Selection Concepts & Techniques

Feature selection is a process in machine learning that involves identifying and selecting the most relevant subset of features out of the original features in a dataset to be used as inputs for a model. The goal of feature selection is to improve model performance by reducing the number of irrelevant or redundant features that may introduce noise or bias into the model.

The importance of feature selection lies in its ability to improve model accuracy and efficiency by reducing the dimensionality of the dataset. By selecting only the most important features, the model can focus on the key variables that have the greatest impact on the outcome and ignore irrelevant or redundant features that merely add noise to the data. This can result in faster training times, improved accuracy, reduced generalization error caused by noisy or irrelevant features, and more robust models that are less prone to overfitting.

If we don’t adopt feature selection when training a machine learning model, we may encounter several problems. Firstly, including too many features in the model can lead to the curse of dimensionality, where the model becomes computationally expensive and may struggle to generalize well to new data. Secondly, including irrelevant or redundant features can introduce noise and bias into the model, leading to overfitting and reduced performance on new data. Therefore, feature selection is crucial to ensure that the model is accurate, efficient, and generalizes well to new data.

We need to do feature selection when the dataset contains a large number of features, or when the features are highly correlated, redundant, or irrelevant.

The following are some of the important feature selection techniques:

Regularization techniques such as L1 norm regularization

L1 norm regularization, also known as Lasso regularization, is a common regularization technique used in feature selection. It works by adding a penalty term that encourages the model to select only the most important features, while reducing the weights of irrelevant or redundant features to zero. L1 norm regularization introduces sparsity into the feature weights, meaning that only a subset of the features have non-zero weights. The other features are effectively ignored by the model, resulting in a form of automatic feature selection. L1 norm regularization can be particularly useful in cases where the dataset contains many features, and some of them are irrelevant or redundant.
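
As a minimal sketch (not part of the original post), the following shows how an L1-penalized Lasso model combined with scikit-learn’s SelectFromModel keeps only the features with non-zero coefficients; the synthetic dataset and the alpha value are illustrative assumptions.

# Illustrative sketch: L1 (Lasso) regularization as automatic feature selection
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel

# Synthetic data: 20 features, only 5 of which are informative
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=0.1, random_state=42)

# The L1 penalty drives the coefficients of uninformative features to zero
lasso = Lasso(alpha=1.0, random_state=42)
selector = SelectFromModel(lasso)
selector.fit(X, y)

# Features with non-zero coefficients are kept; the rest are dropped
X_selected = selector.transform(X)
print("Original number of features:", X.shape[1])
print("Selected number of features:", X_selected.shape[1])
print("Selected feature indices:", selector.get_support(indices=True))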

Feature importance techniques for feature selection

Feature importance techniques fit an estimator such as a Random Forest to the data and select features based on an attribute such as feature_importances_ . This attribute of the Random Forest estimator provides a score for each feature in the dataset, indicating how important that feature is for making predictions. The scores are calculated from the reduction in impurity (e.g., Gini impurity or entropy) achieved by splitting the data on that feature. The feature with the highest score is considered the most important, while features with low scores can be considered less important or even irrelevant. The code below demonstrates this on the IRIS dataset.

# Import required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Load the IRIS dataset
iris = load_iris()

# Split data into features (X) and target variable (y)
X = iris.data
y = iris.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the Random Forest classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Get feature importances
importances = rf.feature_importances_

# Print feature importances
for feature, importance in zip(iris.feature_names, importances):
    print(f'{feature}: {importance}')

Running this prints each feature name along with its relative importance score.
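
To go from importance scores to an actual reduced feature set, one option is to wrap the estimator in scikit-learn’s SelectFromModel, which by default keeps the features whose importance exceeds the mean importance. The snippet below is an illustrative sketch, not part of the original example; it reuses the X_train, y_train, and iris variables from the code above.

from sklearn.feature_selection import SelectFromModel

# Wrap a fresh Random Forest in SelectFromModel; features whose importance
# exceeds the mean importance (the default threshold) are kept
selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=42))
selector.fit(X_train, y_train)

# Reduce the training data to the selected features only
X_train_selected = selector.transform(X_train)
print("Original shape:", X_train.shape)
print("Reduced shape:", X_train_selected.shape)
print("Kept features:", [iris.feature_names[i]
                         for i in selector.get_support(indices=True)])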

Greedy search algorithms for feature selection

Greedy search algorithms, such as sequential forward selection and sequential backward selection (backward elimination), are useful for algorithms (such as K-nearest neighbours, K-NN) that do not support regularization techniques.
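
As an illustration, here is a minimal sketch of greedy forward selection using scikit-learn’s SequentialFeatureSelector with a K-NN classifier on the IRIS dataset; the number of neighbours and the number of features to select are arbitrary choices made for the example.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

# Greedy forward selection: start with no features and add, one at a time,
# the feature that most improves the cross-validated accuracy of the K-NN model
knn = KNeighborsClassifier(n_neighbors=3)
sfs = SequentialFeatureSelector(knn, n_features_to_select=2,
                                direction="forward", cv=5)
sfs.fit(X, y)

print("Selected features:", [iris.feature_names[i]
                             for i in sfs.get_support(indices=True)])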

Different types of feature selection techniques

According to the utilized training data (labeled, unlabeled, or partially labeled), feature selection methods can be divided into supervised, unsupervised, and semi-supervised methods. According to their relationship with learning methods, feature selection methods can be classified into the following:

  • Filter methods: The filter model only considers the association between the feature and the class label. Filter methods involve ranking features based on a statistical measure and selecting a subset of the top-ranked features for the model. The filter method is independent of the model and can be used with any machine learning algorithm (a short code sketch of a filter method and a wrapper method follows this section). The most common statistical measures used for ranking features include some of the following:
    • Pearson correlation coefficient
    • Chi-squared test
  • Wrapper methods: Wrapper methods are a class of feature selection techniques that select subsets of features by evaluating the performance of a machine learning model. Unlike filter methods, wrapper methods use the model’s performance on the training data as a criterion for selecting features. They involve repeatedly training and evaluating a model on different subsets of features, and selecting the subset that achieves the best performance. Wrapper methods have several advantages, including their ability to handle complex interactions between features and to select features that are important for a specific model, rather than for the dataset as a whole. However, they can be computationally expensive and may overfit the training data if the number of features is too large. There are several types of wrapper methods, including some of the following:
    • Forward selection
    • Backward elimination
  • Embedded methods: In embedded methods, features are selected as part of the learning model’s training process, and the feature selection result is produced automatically once training is finished. Unlike filter and wrapper methods, embedded methods are built into the algorithm and select the most relevant features during model training. They typically add a penalty term to the loss function during training, which encourages the model to select only the most important features. The penalty term can be based on L1 or L2 regularization and is used to constrain the weights of the features. Features with low weights are effectively ignored by the model, while features with high weights are considered important for making predictions.

According to the evaluation criterion, feature selection methods can be derived from correlation, Euclidean distance, consistency, dependence and information measures. According to the type of output, feature selection methods can be divided into feature rank (weighting) and subset selection models.
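
The short sketch below illustrates a filter method (ranking features with the chi-squared statistic via SelectKBest) and a wrapper-style method (recursive feature elimination with a logistic regression model) on the IRIS dataset; the choice of k = 2 features and the logistic regression estimator are assumptions made for the example.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2, RFE
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X, y = iris.data, iris.target

# Filter method: rank features by the chi-squared statistic and keep the top 2
filter_selector = SelectKBest(score_func=chi2, k=2)
X_filter = filter_selector.fit_transform(X, y)
print("Filter (chi2) kept:", [iris.feature_names[i]
                              for i in filter_selector.get_support(indices=True)])

# Wrapper-style method: recursive feature elimination, repeatedly fitting the
# model and discarding the weakest feature until 2 features remain
wrapper_selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2)
wrapper_selector.fit(X, y)
print("Wrapper (RFE) kept:", [iris.feature_names[i]
                              for i in wrapper_selector.get_support(indices=True)])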

Feature Extraction Concepts & Techniques

Feature extraction is about extracting/deriving information from the original feature set to create a new feature subspace. The primary idea behind feature extraction is to compress the data while maintaining most of the relevant information. As with feature selection techniques, these techniques are also used to reduce the number of features from the original feature set in order to reduce model complexity and overfitting, enhance computational efficiency, and reduce generalization error. The following are different types of feature extraction techniques:

  • Principal component analysis (PCA) for unsupervised data compression. Here is a detailed post on feature extraction using PCA with a Python example. It gives a good understanding of how PCA finds the directions of maximum variance in high-dimensional data and projects the data onto a new subspace with equal or fewer dimensions than the original one. This is explained with the example of identifying the Taj Mahal (one of the seven wonders of the world) from a top view or side view based on the dimensions in which there is maximum variance. The diagram below shows the dimensions of maximum variance (PCA1 and PCA2) resulting from PCA. A short scikit-learn sketch of PCA, LDA, and kernel PCA follows this list.

    (Diagram: principal component analysis explained)
  • Linear discriminant analysis (LDA) as a supervised dimensionality reduction technique for maximizing class separability
  • Nonlinear dimensionality reduction via kernel principal component analysis (KPCA)
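
Here is a brief, illustrative sketch of how PCA, LDA, and kernel PCA construct a new feature subspace using scikit-learn on the IRIS dataset; the number of components and the RBF kernel parameter are assumptions made for the example.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, KernelPCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X, y = iris.data, iris.target

# Standardize features so that each contributes comparably to the variance
X_std = StandardScaler().fit_transform(X)

# PCA (unsupervised): project onto the 2 directions of maximum variance
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)
print("PCA explained variance ratio:", pca.explained_variance_ratio_)

# LDA (supervised): project onto directions that maximize class separability
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X_std, y)

# Kernel PCA: nonlinear dimensionality reduction via the RBF kernel
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.1)
X_kpca = kpca.fit_transform(X_std)

print("Original shape:", X.shape, "-> PCA:", X_pca.shape,
      "LDA:", X_lda.shape, "KPCA:", X_kpca.shape)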

When to use Feature Selection & Feature Extraction

The key difference between feature selection and feature extraction techniques used for dimensionality reduction is that while the original features are maintained in the case of feature selection algorithms, the feature extraction algorithms transform the data onto a new feature space.

Feature selection techniques can be used if the requirement is to maintain the original features, unlike the feature extraction techniques which derive useful information from data to construct a new feature subspace. Feature selection techniques are used when model explainability is a key requirement.

Feature extraction techniques can be used to improve the predictive performance of the models, especially, in the case of algorithms that don’t support regularization.

Unlike feature selection, feature extraction usually transforms the original data into new features with stronger pattern recognition ability; the original features can be regarded as having comparatively weak recognition ability.

Quiz – Test your knowledge

Here is a quick quiz you can use to check your knowledge on feature selection vs feature extraction.

  • True or false: feature selection and feature extraction methods are one and the same.
  • Name techniques that can be used for feature extraction.
  • Name techniques that can be used for feature selection.
  • In which technique (feature selection or feature extraction) is the original feature set maintained?
  • Which technique is recommended when the original feature set is required to be maintained?
  • Which technique is recommended when model interpretability is a key requirement?
