Last updated: 3rd May, 2024
Have you ever wondered why some machine learning models perform exceptionally well while others don’t? Could the magic ingredient be something other than the algorithm itself? The answer is often “Yes,” and the magic ingredient is feature engineering. Good feature engineering can make or break a model.
In this blog, we will demystify various techniques for feature engineering, including feature extraction, interaction features, encoding categorical variables, feature scaling, and feature selection. To demonstrate these methods, we’ll use a real-world dataset containing car sales data. This dataset includes a variety of features such as ‘Company Name’, ‘Model Name’, ‘Price’, ‘Model Year’, ‘Mileage’, and more. Through this dataset, we’ll explore how to improve a machine learning model by implementing effective feature engineering techniques while leveraging Python code.
Whether you’re a seasoned data scientist or a machine learning beginner, you’ll find these techniques invaluable. Before getting started, we load the dataset and get the summary information.
import pandas as pd
# Load the Excel file into a DataFrame
file_path = '/path/Clean Data_pakwheels.xlsx'
df = pd.read_excel(file_path)
# Print summary information (df.info() prints directly and returns None)
df.info()
# Display the first few rows of the DataFrame
df.head()
The dataset contains 13 features, including:
Feature extraction derives new, more informative features from the raw ones, which can improve your model’s performance; some methods, such as Principal Component Analysis (PCA), also reduce the dimensionality of the dataset. While there are various methods like PCA, polynomial features, encoding & binning, etc., in this blog, we’ll focus on feature aggregation. Feature aggregation combines two or more existing features to create new ones that capture essential information from the original set. By doing so, we aim to enhance the predictive power of our machine learning model while retaining interpretability.
Given the dataset, here are some potential feature aggregation ideas:
The following Python code creates aggregated features from the existing set of raw features.
# Make a copy of the original DataFrame for feature engineering
df_encoded = df.copy()
# Create aggregated features
df_encoded['Age_vs_Mileage'] = df_encoded['Model Year'] / (df_encoded['Mileage'] + 1) # Added 1 to avoid division by zero
df_encoded['Engine_Efficiency'] = df_encoded['Engine Capacity'] / (df_encoded['Mileage'] + 1) # Added 1 to avoid division by zero
df_encoded['Model_Year_and_Engine_Capacity'] = df_encoded['Model Year'] + df_encoded['Engine Capacity']
# Display the first few rows with the new aggregated features
df_encoded[['Age_vs_Mileage', 'Engine_Efficiency', 'Model_Year_and_Engine_Capacity']].head()
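Because ‘Model Year’ is a calendar year rather than an age, one possible variation is to derive an explicit age first and aggregate on that. The snippet below is only a minimal sketch: the toy values, the `REFERENCE_YEAR` constant, and the `Car_Age` and `Mileage_per_Year` column names are assumptions made for illustration and are not part of the original dataset.

```python
import pandas as pd

# Toy rows standing in for the car sales data (values are made up for illustration)
toy = pd.DataFrame({
    'Model Year': [2010, 2018, 2022],
    'Mileage': [120000, 45000, 0],
})

REFERENCE_YEAR = 2024  # assumption: the year in which the data is analysed

# Explicit age of each car in years
toy['Car_Age'] = REFERENCE_YEAR - toy['Model Year']

# Average distance driven per year; +1 avoids division by zero for brand-new cars
toy['Mileage_per_Year'] = toy['Mileage'] / (toy['Car_Age'] + 1)

print(toy)
```

A ratio built on age, instead of on the raw year, is easier to read: a high value suggests a heavily used car regardless of when it was built.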
Categorical variables need to be transformed into a numerical format before they can be passed to most machine learning algorithms. The following Python code uses scikit-learn’s LabelEncoder (sklearn.preprocessing.LabelEncoder).
from sklearn.preprocessing import LabelEncoder
# Initialize LabelEncoder
label_encoder = LabelEncoder()
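# Encode categorical columns (assumed here to be the text/object-typed columns)
categorical_columns = df_encoded.select_dtypes(include='object').columns
for column in categorical_columns:
    # fit_transform learns a fresh category-to-integer mapping for each column
    df_encoded[column] = label_encoder.fit_transform(df[column])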
# Display the first few rows of the encoded DataFrame
df_encoded.head()
The categorical variables are now encoded using label encoding. This transformation converts each unique category to an integer, allowing us to use these features in machine learning algorithms that require numerical input. Check out my related post, When to use LabelEncoder – Python Example.
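To see what the encoder actually does, it helps to inspect the mapping it learns. The sketch below uses a hypothetical ‘Fuel Type’ column with made-up values; the real dataset’s categorical columns may differ.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical column with made-up values
toy = pd.DataFrame({'Fuel Type': ['Petrol', 'Diesel', 'Hybrid', 'Petrol']})

encoder = LabelEncoder()
toy['Fuel Type Encoded'] = encoder.fit_transform(toy['Fuel Type'])

# classes_ holds the categories in sorted order; a category's position is its integer code
mapping = {str(category): code for code, category in enumerate(encoder.classes_)}

print(toy)
print(mapping)
```

Note that the integer codes follow alphabetical order and carry no real ranking. scikit-learn documents LabelEncoder as intended for target labels, so for input features with no natural order, an alternative such as one-hot encoding may be a safer choice for linear or distance-based models.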
Feature scaling is a crucial component of feature engineering, especially when you’re working with algorithms sensitive to the scale of input variables, such as k-NN, Support Vector Machines, and neural networks. Diverse ranges of numerical features can lead to a disproportionate impact on the model. For instance, a feature like ‘Mileage’ that ranges in the thousands could dominate a feature like ‘Model Year,’ which might only vary between 2000 and 2020. By scaling features to a similar range, we ensure that no particular feature disproportionately influences the model’s performance. Scaling can help gradient-based models converge faster and often improves accuracy for scale-sensitive algorithms, since each feature contributes more equally to the prediction.
The following Python code example performs feature scaling using StandardScaler (sklearn.preprocessing.StandardScaler). You can also learn about feature scaling in detail in one of my related blogs: Feature Scaling in Machine Learning – Python Examples.
from sklearn.preprocessing import StandardScaler
# Initialize StandardScaler
scaler = StandardScaler()
# List of columns to scale
columns_to_scale = ['Price', 'Model Year', 'Mileage', 'Engine Capacity']
# Scale the selected columns
df_encoded[columns_to_scale] = scaler.fit_transform(df_encoded[columns_to_scale])
# Display the first few rows of the scaled DataFrame
df_encoded.head()
The selected numerical variables have been scaled using Standard Scaling, which transforms them to have a mean of 0 and a standard deviation of 1. This is often a good practice, especially for algorithms that are sensitive to the scale of the input features.
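Standard scaling applies z = (x − mean) / standard deviation to each column. As a quick sanity check, the sketch below compares StandardScaler with that formula computed by hand on a few made-up mileage values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up mileage values, shaped as a single column
mileage = np.array([[10000.0], [50000.0], [90000.0]])

scaled = StandardScaler().fit_transform(mileage)

# Manual z-score; StandardScaler uses the population standard deviation (ddof=0)
manual = (mileage - mileage.mean()) / mileage.std()

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

In a real modeling workflow, the scaler would typically be fitted on the training split only and then applied to the test split, so that no information from the test data leaks into training.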
The next step in our feature engineering journey is feature selection, a process pivotal for enhancing model performance by focusing only on the most impactful features. While Recursive Feature Elimination (RFE) and Random Forest algorithms are popular for their straightforward implementation and effectiveness, they’re far from being the only options. Other methods include forward selection, backward selection, LASSO, correlation matrix with heatmap, etc.
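For comparison, here is a minimal sketch of Recursive Feature Elimination. It runs on synthetic data from `make_regression` and not on the car sales dataset, and the choice of a linear regression estimator and of three features to keep are assumptions made purely for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic regression data: 8 features, only 3 of which are informative
X, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                       noise=0.1, random_state=0)

# Repeatedly fit the estimator and drop the weakest feature until 3 remain
selector = RFE(estimator=LinearRegression(), n_features_to_select=3)
selector.fit(X, y)

print(selector.support_)   # boolean mask of the features that were kept
print(selector.ranking_)   # 1 = kept; larger numbers were eliminated earlier
```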
We will use the Random Forest algorithm to compute feature importances, which can then be used to select the top 5 features. Here is the code:
from sklearn.ensemble import RandomForestRegressor
# Initialize Random Forest Regressor
rf = RandomForestRegressor(n_estimators=100, random_state=42)
# Fit the model to the entire dataset
rf.fit(df_encoded.drop('Price', axis=1), df_encoded['Price'])
# Get feature importances
feature_importances = rf.feature_importances_
# Create a DataFrame for feature importances
feature_importance_df = pd.DataFrame({
    'Feature': df_encoded.drop('Price', axis=1).columns,
    'Importance': feature_importances,
}).sort_values(by='Importance', ascending=False)
# Display the feature importances
feature_importance_df
The feature importances obtained from the Random Forest model are as follows:
The percentages represent the relative importance of each feature in predicting the car price, according to the Random Forest model. As we can see, ‘Engine Capacity’ and ‘Model Year’ are the most influential features.
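Once the importances are ranked, keeping only the top 5 features is a small extra step. The sketch below shows one way to do it on synthetic data; the `feature_0` … `feature_7` column names are placeholders, not columns from the car sales dataset.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the encoded dataset (placeholder column names)
X_arr, y = make_regression(n_samples=300, n_features=8, n_informative=4,
                           random_state=42)
X = pd.DataFrame(X_arr, columns=[f'feature_{i}' for i in range(8)])

rf = RandomForestRegressor(n_estimators=50, random_state=42)
rf.fit(X, y)

# Rank features by importance and keep the five highest
importances = pd.Series(rf.feature_importances_, index=X.columns)
top_features = importances.sort_values(ascending=False).head(5).index.tolist()
X_selected = X[top_features]

print(top_features)
print(X_selected.shape)
```

The reduced `X_selected` frame can then be used to retrain a model and check whether performance holds up with fewer inputs.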
Interaction features are a type of derived feature created from existing features in a dataset to capture the combined effect of two or more variables on the target variable. These are particularly useful in regression, classification, and other predictive modeling tasks, as they can reveal complex relationships and dependencies between variables that are not apparent when considering the variables individually.
In their simplest form, interaction features are pairwise features calculated as the product of two features; for binary features, this is analogous to a logical AND. For example, if you are predicting the effectiveness of a marketing campaign (target variable), and you have two features, advertising budget and seasonality, an interaction feature could be the product of these two features. This interaction could capture the effect of increasing the advertising budget during the high season versus the low season. For categorical variables, interaction features often involve creating dummy variables that represent the combination of categories across two or more features. For example, combining gender (“male”, “female”) and product type (“clothes”, “electronics”) could lead to interaction features like “male_clothes” and “female_electronics”.
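Both kinds of interaction described above can be sketched in a few lines of pandas. The data and column names below (`ad_budget`, `high_season`, `gender`, `product_type`) are made up to mirror the marketing example and are assumptions for illustration only.

```python
import pandas as pd

# Made-up marketing data mirroring the example in the text
toy = pd.DataFrame({
    'ad_budget': [100.0, 200.0, 100.0, 200.0],
    'high_season': [0, 0, 1, 1],  # 1 = high season, 0 = low season
    'gender': ['male', 'female', 'male', 'female'],
    'product_type': ['clothes', 'electronics', 'electronics', 'clothes'],
})

# Numeric interaction: the budget only "counts" during the high season
toy['budget_x_season'] = toy['ad_budget'] * toy['high_season']

# Categorical interaction: combine the categories, then create dummy variables
toy['gender_product'] = toy['gender'] + '_' + toy['product_type']
dummies = pd.get_dummies(toy['gender_product'])

print(toy[['budget_x_season', 'gender_product']])
print(dummies.columns.tolist())
```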
Adding interaction features can significantly increase the number of features in a dataset, potentially leading to high dimensionality problems such as overfitting, where a model performs well on training data but poorly on unseen data.
While interaction features can improve model performance, they also make the model more complex. This can lead to longer training times and harder-to-interpret models.
Determining which interactions are meaningful and should be included in the model can be challenging. It often requires domain knowledge or techniques like feature importance analysis.
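The growth in feature count is easy to quantify: n features produce n(n − 1)/2 pairwise interactions. As a rough illustration, the sketch below uses scikit-learn’s PolynomialFeatures with `interaction_only=True` on a small made-up array with four features.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Made-up data: 3 rows, 4 features
X = np.arange(12, dtype=float).reshape(3, 4)

# Keep the original features and add every pairwise product (no squares, no bias)
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_inter = poly.fit_transform(X)

n_features = X.shape[1]
print(X_inter.shape)                # 4 original + 6 pairwise columns
print(comb(n_features, 2))          # number of pairwise interactions
```

With four features the result is still manageable, but the number of pairs grows quadratically, which is why pruning interactions with domain knowledge or feature importance analysis matters.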
Feature engineering is not just another step in the machine learning pipeline; it’s an art that requires a deep understanding of the domain as well as the data. In this blog, we’ve explored various feature engineering techniques such as feature extraction through aggregation, encoding categorical variables, feature scaling, feature selection using Random Forest feature importances, and interaction features. Through a real-world dataset on car sales, we showed how these techniques can be applied in Python to prepare data for a machine learning model.
Remember, the best features depend on the problem at hand, and often, it’s the quality of the features, not the complexity of the model, that determines success. So, the next time you’re working on a machine learning project, give ample time to feature engineering—it might just be the “magic ingredient” that makes your model stand out.