While training machine learning models, we come across the need for scaling features in order to have different features contribute to the predictions in an appropriate manner. Without scaling, features with larger numerical ranges can dominate those with smaller ranges, leading to biased or inefficient learning. In this post you will learn about this feature engineering technique namely feature scaling with Python code examples using which you could significantly improve performance of machine learning models. To demonstrate the technique, the models will be trained using Perceptron (single-layer neural network) classifier.
Feature scaling is a method used to standardize the range of independent variables or features of data. In data processing, it is also known as data normalization or standardization. Feature scaling is generally performed during the data pre-processing stage, before training models using machine learning algorithms. The goal is to transform the data so that each feature is in the same range (e.g. between -1 to 1 or 0 to 1). This ensures that no single feature dominates the others, and makes training and tuning quicker and more effective.
Feature scaling can be accomplished using a variety of linear and non-linear methods, including min-max scaling, z-score standardization, clipping, winsorizing, taking logarithm of inputs before scaling, etc. Which method you choose will depend on your data and your machine learning algorithm.
To understand the feature scaling concept with an example, consider a dataset with two features, age and salary. Age is usually distributed between 0 and 80 years, while salary is usually distributed between 0 and 1 million dollars. If we apply linear regression algorithm to this dataset without feature scaling, the algorithm will give more weight to the salary feature since it has a much larger range. However, by rescaling both features to the range [-1, 1], we can give both features equal weight and improve the performance of our machine learning algorithm.
Machine learning algorithms using optimizing or learning algorithms such as gradient descent optimizers which are tuned to work well with numbers in the [-1, 1] range, require scaling the numeric values to lie in that range. The primary reason is the following:
Optimization algorithms such as Gradient descent require more steps to converge with larger magnitudes as the curvature of the loss function increases. The reason is that the derivatives of features with larger relative magnitudes will tend to be larger as well, and this will result in abnormal and undesirable weight updates. For larger weight updates, it will require more steps to converge and thereby increase the computation load.
Scaling the data such that it lies in the range [–1, 1] would make the error function more spherical. Thus, models trained with transformed / scaled data in the range of [-1, 1] will tend to converge faster and are therefore faster/cheaper to train. In addition, the [–1, 1] range offers the highest floating point precision. The following code can be run to check the computation time it takes to train model with raw data and the training data:
import timeit
from sklearn import datasets, linear_model
#
# Load the Sklearn diabetes data set
#
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)
#
# Create scaled data set
#
raw = diabetes_X[:, None, 2]
max_raw = max(raw)
min_raw = min(raw)
scaled = (2*raw - max_raw - min_raw)/(max_raw - min_raw)
#
# Define method for training a linear regression model with
# raw and scaled data set
#
def train_raw_data():
linear_model.LinearRegression().fit(raw, diabetes_y)
def train_scaled_data():
linear_model.LinearRegression().fit(scaled, diabetes_y)
#
# Use the timeit method to measure the
# execution of training method
#
raw_time = timeit.timeit(train_raw_data, number=1000)
scaled_time = timeit.timeit(train_scaled_data, number=1000)
#
# Print the time taken to
# train the model with raw data and scaled data
#
raw_time, scaled_time
Another reason why one must go for scaling dataset is that some machine learning algorithms and techniques are very sensitive to the relative magnitudes of the different features. For instance, the clustering algorithm such as k-means uses the Euclidean distance as its proximity measure. This will result in algorithm rely heavily on features with relatively larger magnitudes.
Yet another reason why one must go for feature scaling is that the lack of scaling affects the efficacy of L1 or L2 regularization since the magnitude of weights for a feature depends on the magnitude of values of that feature, and so different features will be affected differently by regularization. By scaling all features to lie between [–1, 1], it can be ensured that there is not much of a difference in the relative magnitudes of different features.
There are different techniques which can be followed to scale the feature before you use the data for training your models. These techniques include min-max scaling and Z-score normalization (also known as standard scaling).
Feature scaling is performed when the dataset contains features that are highly varying in magnitudes, units, and ranges. The following is the details related to different kind of scaling as briefed above:
x_scaled = (x1 -x1_min)/(x1_max – x1_min)
The problem with min-max scaling is that the maximum and minimum value (x1_max and x1_min) would need to be estimated from the training dataset, and they often turn out to be outlier values. Thus, the real data often gets shrunk to a very narrow range in the [0, 1] band.x_scaled = (x1 – x1_mean)/x1_stddev
The picture below represents the formula for both standardization and min-max scaling.
import numpy as np from sklearn.preprocessing import StandardScaler # Example data data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]) # Create the scaler and scale the data scaler = StandardScaler() scaled_data = scaler.fit_transform(data) # Display the scaled data print(scaled_data)
import numpy as np
from scipy.stats import mstats
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 25, 30, 100, 101, 102, 103, 108, 109, 112, 115, 125])
# Winsorizing at the 10% and 90% percentiles
winsorized_data = mstats.winsorize(data, limits=[0.1, 0.1])
print(winsorized_data)
The following gets printed. Note the winsorized data in the beginning and end of the array.
[ 3 3 3 4 5 6 7 8 9 10 15 25 30 100 101 102 103 108 109 112 112 112]
Non-linear transformation: Consider the scenario when the training data is found to be skewed and neither uniformly or normally distributed? In that case, it is recommended to apply non-linear transformation to the training data before going for scaling it. One of the most common trick is to take the logarithm of the training data and then apply one of the scaling techniques as discussed above. Other common non-linear transformations techniques include taking the sigmoid and polynomial expansions (square, square root, cube, cube root, and so on) before applying the scaling techniques.
In this section, we will look at the end-to-end flow of training a Perceptron model while doing feature scaling. We can see a significant improvement in the performance of models when features are scaled using appropriate method discussed above.
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
The above code represents StandardScaler class of sklearn.preprocessing module. The fit method of StandardScaler is used to estimate sample mean and standard deviation for each feature using training data. The transform method is then used to estimate the standardized value of features using those estimated parameters (mean & standard deviation).
The next step is to train a Perceptron model and measure the accuracy:
prcptrnFS = Perceptron(eta0=0.1, random_state=1)
prcptrnFS.fit(X_train_std, Y_train)
Y_predict_std = prcptrnFS.predict(X_test_std)
print("Misclassified examples %d" %(Y_test != Y_predict_std).sum())
from sklearn.metrics import accuracy_score
print("Accuracy Score %0.3f" % accuracy_score(Y_test, Y_predict_std))
The accuracy score comes out to be 0.978 with the number of misclassified examples as 1. You can note that the accuracy score increased by almost 40%. Thus, it is recommended to perform do feature scaling before training the model.
That’s all for now. I hope you found this article helpful. If you have any questions, please don’t hesitate to let me know. I would also be happy to provide more information on linear and non-linear feature transformation feature scaling techniques such as min-max scaling, Z-score normalization, logarithmic transformation, etc. of training data if you are interested.
Artificial Intelligence (AI) agents have started becoming an integral part of our lives. Imagine asking…
In the ever-evolving landscape of agentic AI workflows and applications, understanding and leveraging design patterns…
In this blog, I aim to provide a comprehensive list of valuable resources for learning…
Have you ever wondered how systems determine whether to grant or deny access, and how…
What revolutionary technologies and industries will define the future of business in 2025? As we…
For data scientists and machine learning researchers, 2024 has been a landmark year in AI…