# Feature Scaling in Machine Learning: Python Examples

In this post, you will learn about feature scaling, a simple technique that can improve machine learning models, with Python code examples. The models will be trained using the Perceptron (single-layer neural network) classifier.

First, let’s quickly understand what feature scaling is and why one needs it.

## What is Feature Scaling and Why does one need it?

Feature scaling is a method used to standardize the range of the independent variables, or features, of a dataset. In data processing, it is also known as data normalization or standardization. Feature scaling is generally performed during the data pre-processing stage, before training models with machine learning algorithms. The goal is to transform the data so that each feature is in the same range (e.g., between -1 and 1). This keeps any single feature from dominating the others purely because of its units, and it often makes training and tuning quicker and more effective. Feature scaling can be accomplished using a variety of linear and non-linear methods, including min-max scaling, z-score standardization, clipping, winsorizing, and taking the logarithm of inputs before scaling. Which method you choose will depend on your data and your machine learning algorithm.

Consider a dataset with two features, age and salary. Age usually lies between 0 and 80 years, while salary usually lies between 0 and 1 million dollars. If we apply a machine learning algorithm to this dataset without feature scaling, many algorithms (for example, those based on distances or on gradient descent) will effectively give more weight to the salary feature simply because it has a much larger range. By rescaling both features to the range [-1, 1], we put them on an equal footing, which can improve the performance of the machine learning algorithm.
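
As a rough illustration of the age and salary example, the following sketch rescales both features to [-1, 1]. The sample values and the helper name `rescale_to_unit_range` are made up for this illustration; only NumPy is assumed.

```python
import numpy as np

# Hypothetical sample: ages in years and salaries in dollars
age = np.array([22.0, 35.0, 47.0, 58.0, 80.0])
salary = np.array([30_000.0, 55_000.0, 120_000.0, 400_000.0, 1_000_000.0])


def rescale_to_unit_range(values):
    """Linearly map values so the minimum becomes -1 and the maximum becomes 1."""
    v_min, v_max = values.min(), values.max()
    return (2 * values - v_max - v_min) / (v_max - v_min)


age_scaled = rescale_to_unit_range(age)
salary_scaled = rescale_to_unit_range(salary)

# Before scaling the ranges differ by orders of magnitude; afterwards both span 2
print("raw ranges:   ", np.ptp(age), np.ptp(salary))
print("scaled ranges:", np.ptp(age_scaled), np.ptp(salary_scaled))
```

After rescaling, a difference of one unit means "half of the observed range" for both features, so neither one dominates just because of its units.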

### Why feature scaling in the first place?

Many machine learning models are trained with optimizers such as gradient descent, which tend to work well when the inputs lie roughly in the [-1, 1] range, so it helps to scale the numeric values into that range. The primary reason is the following:

Optimization algorithms such as gradient descent require more steps to converge when feature magnitudes differ widely, because the curvature of the loss function then differs sharply from one direction to another. The derivatives with respect to the weights of features with larger relative magnitudes tend to be larger as well, which results in abnormal and undesirable weight updates. To keep those updates stable, the learning rate has to be small, so more steps are needed to converge and the computational load increases.
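
The effect can be sketched with a plain batch gradient descent on a least-squares problem. Everything here is an assumption made for illustration: the synthetic age and salary data, the helper `gd_steps`, the step size (one over the largest curvature of the loss), and the stopping rule (relative distance to the exact least-squares solution).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic data (an assumption): age in years, salary in dollars
age = rng.uniform(20, 70, n)
salary = rng.uniform(20_000, 200_000, n)
y = 5.0 + 0.5 * age + 1e-4 * salary + rng.normal(0, 1, n)


def gd_steps(features, target, tol=1e-6, max_steps=10_000):
    """Run batch gradient descent on mean squared error; return the steps used."""
    X = np.column_stack([np.ones(len(features)), features])  # add intercept column
    w_best = np.linalg.lstsq(X, target, rcond=None)[0]       # exact reference solution
    hessian = X.T @ X / len(X)
    step = 1.0 / np.linalg.eigvalsh(hessian).max()           # 1 / largest curvature
    w = np.zeros(X.shape[1])
    for k in range(1, max_steps + 1):
        w -= step * (X.T @ (X @ w - target) / len(X))
        if np.linalg.norm(w - w_best) <= tol * np.linalg.norm(w_best):
            return k
    return max_steps  # budget exhausted without reaching the tolerance


raw = np.column_stack([age, salary])
scaled = (raw - raw.mean(axis=0)) / raw.std(axis=0)

raw_steps = gd_steps(raw, y)
scaled_steps = gd_steps(scaled, y)
print("steps with raw features:   ", raw_steps)
print("steps with scaled features:", scaled_steps)
```

With the raw features, the step size is dictated by the salary column, so the weights for the intercept and for age barely move within the step budget. With standardized features, the curvature is similar in every direction and far fewer steps are needed.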

Scaling the data so that it lies in the range [-1, 1] makes the error function more spherical. Thus, models trained with scaled data in the range [-1, 1] tend to converge faster and are therefore cheaper to train. In addition, floating-point numbers are most densely spaced near zero, so values in the [-1, 1] range are represented with high precision. The following code can be run to compare the time it takes to train a model with the raw data and with the scaled data:

import timeit
from sklearn import datasets, linear_model
#
# Load the Sklearn diabetes data set
#
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)
#
# Create scaled data set
#
raw = diabetes_X[:, None, 2]
max_raw = max(raw)
min_raw = min(raw)
scaled = (2*raw - max_raw - min_raw)/(max_raw - min_raw)
#
# Define method for training a linear regression model with
# raw and scaled data set
#
def train_raw_data():
    linear_model.LinearRegression().fit(raw, diabetes_y)

def train_scaled_data():
    linear_model.LinearRegression().fit(scaled, diabetes_y)

#
# Use the timeit method to measure the
# execution of training method
#
raw_time = timeit.timeit(train_raw_data, number=1000)
scaled_time = timeit.timeit(train_scaled_data, number=1000)
#
# Print the time taken to
# train the model with raw data and scaled data
#
print(raw_time, scaled_time)


Another reason to scale a dataset is that some machine learning algorithms and techniques are very sensitive to the relative magnitudes of the different features. For instance, a clustering algorithm such as k-means uses the Euclidean distance as its proximity measure, so the algorithm relies heavily on the features with relatively larger magnitudes.
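
A small sketch of how the Euclidean distance behaves on unscaled data is shown below. The three people and the assumed per-feature spreads are hypothetical numbers chosen for illustration.

```python
import numpy as np

# Hypothetical people described by (age in years, salary in dollars)
a = np.array([25.0, 50_000.0])
b = np.array([65.0, 50_500.0])  # 40 years older, 500 dollars more
c = np.array([26.0, 52_000.0])  # 1 year older, 2,000 dollars more

# Raw Euclidean distances: the salary difference decides which point is "closer"
d_ab = np.linalg.norm(a - b)
d_ac = np.linalg.norm(a - c)
print("raw distances:   ", d_ab, d_ac)

# Divide each feature by an assumed typical spread before measuring distance
spread = np.array([15.0, 20_000.0])  # assumed standard deviations of age and salary
d_ab_scaled = np.linalg.norm((a - b) / spread)
d_ac_scaled = np.linalg.norm((a - c) / spread)
print("scaled distances:", d_ab_scaled, d_ac_scaled)
```

On the raw data, `a` looks closer to `b` even though they are 40 years apart, because a few hundred dollars outweigh decades of age. After dividing by the spreads, `a` is closer to `c`, which matches intuition better.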

Yet another reason for feature scaling is that the lack of scaling affects the efficacy of L1 or L2 regularization: the magnitude of the weight for a feature depends on the magnitude of the values of that feature, so different features are affected differently by regularization. By scaling all features to lie in [-1, 1], it can be ensured that there is not much of a difference in the relative magnitudes of different features.
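
The interaction with regularization can be sketched by fitting the same feature in two different units. The synthetic data, the `alpha` value, and the helper `slope_per_km` are assumptions for this illustration; `LinearRegression` and `Ridge` (L2 regularization) come from sklearn.linear_model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)

# Synthetic data (an assumption): the same distance expressed in two units
x_km = rng.uniform(0.0, 5.0, size=(100, 1))
x_m = x_km * 1000.0
y = 3.0 * x_km[:, 0] + rng.normal(0.0, 0.5, size=100)


def slope_per_km(model, X, units_per_km):
    """Fit the model and express its single coefficient per kilometre."""
    return model.fit(X, y).coef_[0] * units_per_km


# Without regularization, the unit of the feature does not matter
ols_km = slope_per_km(LinearRegression(), x_km, 1.0)
ols_m = slope_per_km(LinearRegression(), x_m, 1000.0)

# With the same L2 penalty, the feature in kilometres is shrunk much more
ridge_km = slope_per_km(Ridge(alpha=100.0), x_km, 1.0)
ridge_m = slope_per_km(Ridge(alpha=100.0), x_m, 1000.0)

print("OLS slope per km:  ", ols_km, ols_m)
print("Ridge slope per km:", ridge_km, ridge_m)
```

The feature measured in metres needs only a tiny weight, so the penalty barely touches it, while the identical feature measured in kilometres is shrunk noticeably. Scaling the features first makes the penalty treat them evenly.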

### Different types of feature scaling (Linear & Non-Linear transformations)

Feature scaling is performed when the dataset contains features that vary widely in magnitudes, units, and ranges. The following are the details of the different kinds of scaling mentioned above:

• Min-max scaling: Min-max scaling, also known as min-max normalization, linearly rescales each feature to a fixed range before the data is fed into a machine learning algorithm. The goal is to ensure that all features are on a similar scale ([-1, 1] or [0, 1]), which makes training more efficient. For example, imagine we are training a model to predict house prices. If one of the features is the size of the house in square feet, its values are much larger than those of a feature such as the number of bedrooms, so we would want to scale it before feeding it into the model. Otherwise, the model may place too much importance on this feature and produce inaccurate predictions. The following is the formula for min-max scaling to the range [0, 1]:

x_scaled = (x1 - x1_min)/(x1_max - x1_min)

The problem with min-max scaling is that the maximum and minimum values (x1_max and x1_min) have to be estimated from the training dataset, and they often turn out to be outliers. As a result, most of the real data gets squeezed into a very narrow part of the target range.

• Z-score normalization: Z-score normalization, also known as Z-score standardization or mean-variance scaling, rescales each feature so that it has a mean of zero and a standard deviation of one. This is useful for machine learning models that need features on a comparable scale in order to produce accurate results; for example, it is often used when training neural networks. Z-score normalization can be applied to data with any distribution, but it is most effective when the data is approximately normally distributed. Because it is a linear transformation, it does not change the shape of the distribution: skewed data stays skewed after standardization, which can still hurt some models. Compared with min-max scaling, it does not require prior knowledge of what a reasonable range is, since it scales the input linearly using the mean and standard deviation estimated over the training dataset, and it depends less directly on the most extreme values, although outliers still influence the estimated mean and standard deviation. The following represents the formula for Z-score normalization. The same is implemented in StandardScaler, whose usage is shown later in this post.

x_scaled = (x1 - x1_mean)/x1_stddev

The picture below represents the formula for both standardization and min-max scaling.

• Non-linear transformation: Consider the scenario in which the training data is skewed and neither uniformly nor normally distributed. In that case, it is recommended to apply a non-linear transformation to the training data before scaling it. One of the most common tricks is to take the logarithm of the training data and then apply one of the scaling techniques discussed above. Other common non-linear transformations include the sigmoid and polynomial expansions (square, square root, cube, cube root, and so on), again applied before the scaling techniques.
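
The three approaches above can be compared on a tiny skewed sample. This is a minimal sketch with made-up values and hypothetical helper names (`min_max`, `z_score`); it only shows how one extreme value affects each transformation.

```python
import numpy as np

# Hypothetical skewed feature: five ordinary values and one extreme outlier
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 1000.0])


def min_max(values):
    """Rescale linearly to [0, 1] using the observed minimum and maximum."""
    return (values - values.min()) / (values.max() - values.min())


def z_score(values):
    """Rescale linearly to zero mean and unit standard deviation."""
    return (values - values.mean()) / values.std()


x_minmax = min_max(x)
x_z = z_score(x)
x_log_minmax = min_max(np.log(x))  # non-linear step first, then min-max

np.set_printoptions(precision=3, suppress=True)
print("min-max:      ", x_minmax)
print("z-score:      ", x_z)
print("log + min-max:", x_log_minmax)
```

In this sample, both linear methods leave the five ordinary values packed closely together, because the outlier drives the minimum/maximum and the mean/standard deviation. Taking the logarithm first spreads the ordinary values over a noticeably larger part of the range.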

In this post, we will learn to use standardization (also known as z-score normalization) for feature scaling. We will use the StandardScaler class from the sklearn.preprocessing module.

## Train a Perceptron Model without Feature Scaling

Here is the code for training a model without feature scaling. First, let’s load the dataset and create the features and labels. In this post, the Iris dataset is used. In the code below, X is created as the feature data, consisting of sepal length and petal length, and Y holds the labels.

from sklearn import datasets
iris = datasets.load_iris()
X = iris.data[:, [0, 2]]
Y = iris.target


The next step is to create the training and test split. The sklearn.model_selection module provides the function train_test_split, which can be used for creating the training / test split. Note that stratification is not used.

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1)
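
For comparison, here is a sketch of the same split with stratification turned on through the `stratify` parameter of train_test_split. The variable names with the `_s` suffix are introduced only for this illustration, and this stratified split is not the one used in the rest of the post.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X = iris.data[:, [0, 2]]
Y = iris.target

# Same test size and seed, but class proportions are preserved via stratify=Y
X_train_s, X_test_s, Y_train_s, Y_test_s = train_test_split(
    X, Y, test_size=0.3, random_state=1, stratify=Y
)

# Count how many examples of each class ended up in each part
print("train class counts:", np.bincount(Y_train_s))
print("test class counts: ", np.bincount(Y_test_s))
```

Since Iris has 50 examples per class, a stratified split keeps the three classes balanced in both parts, whereas an unstratified split may not.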


The next step is to create an instance of the Perceptron classifier and train the model using the X_train features and Y_train labels. The code below uses the Perceptron class of the sklearn.linear_model module.

from sklearn.linear_model import Perceptron

prcptrn = Perceptron(eta0=0.1, random_state=1)
prcptrn.fit(X_train, Y_train)


The next step is to measure the model accuracy. This can be done using the function accuracy_score of the sklearn.metrics module or by calling the score method on the Perceptron instance.

from sklearn.metrics import accuracy_score
Y_predict = prcptrn.predict(X_test)
print("Misclassified examples %d" %(Y_test != Y_predict).sum())
print("Accuracy Score %.3f" %accuracy_score(Y_test, Y_predict))
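
The score method mentioned above can be sketched as follows. The block repeats the loading, splitting, and training steps so that it runs on its own; the names `acc_from_metric` and `acc_from_score` are introduced only for this illustration.

```python
from sklearn import datasets
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, Y = iris.data[:, [0, 2]], iris.target
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=1
)

prcptrn = Perceptron(eta0=0.1, random_state=1)
prcptrn.fit(X_train, Y_train)

# Two ways to get the same number: explicit predictions, or the score shortcut
acc_from_metric = accuracy_score(Y_test, prcptrn.predict(X_test))
acc_from_score = prcptrn.score(X_test, Y_test)
print("accuracy_score: %.3f" % acc_from_metric)
print("score method:   %.3f" % acc_from_score)
```

For classifiers, score returns the mean accuracy on the given data, so both lines report the same value.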


The accuracy score comes out to be 0.578, with 19 misclassified examples.

## Train a Perceptron Model with Feature Scaling

Feature scaling is done with the help of the following code. This step comes right after creating the training and test split.

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)


The above code uses the StandardScaler class of the sklearn.preprocessing module. The fit method of StandardScaler estimates the sample mean and standard deviation of each feature using the training data. The transform method then computes the standardized feature values using those estimated parameters (mean and standard deviation). Note that the test data is transformed with the parameters estimated from the training data.
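
To make the fit and transform steps more concrete, the sketch below inspects the fitted parameters through the `mean_` and `scale_` attributes of StandardScaler and reproduces transform by hand. It reloads and splits the data so that it runs on its own; `manual_test_std` is a name introduced for this illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X, Y = iris.data[:, [0, 2]], iris.target
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=1
)

sc = StandardScaler().fit(X_train)

# Parameters estimated from the training data only
print("mean_: ", sc.mean_)
print("scale_:", sc.scale_)

# transform applies (x - mean_) / scale_ to each column
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
manual_test_std = (X_test - sc.mean_) / sc.scale_

print("train mean after scaling:", X_train_std.mean(axis=0))
print("train std after scaling: ", X_train_std.std(axis=0))
```

The standardized training data has a mean of zero and a standard deviation of one per feature. The test data is only approximately so, because it reuses the training statistics.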

The next step is to train a Perceptron model and measure the accuracy:

prcptrnFS = Perceptron(eta0=0.1, random_state=1)
prcptrnFS.fit(X_train_std, Y_train)

Y_predict_std = prcptrnFS.predict(X_test_std)
print("Misclassified examples %d" %(Y_test != Y_predict_std).sum())

from sklearn.metrics import accuracy_score
print("Accuracy Score %0.3f" % accuracy_score(Y_test, Y_predict_std))


The accuracy score comes out to be 0.978 with the number of misclassified examples as 1.

Note that the accuracy score increased by 40 percentage points, from 0.578 to 0.978.

Thus, it is recommended to perform feature scaling before training the model.
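
One possible way to make the scaling step hard to forget is to bundle it with the classifier using make_pipeline from sklearn.pipeline. This is a sketch of an alternative arrangement, not the code used for the results above; the names `model` and `pipeline_accuracy` are introduced only for this illustration.

```python
from sklearn import datasets
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X, Y = iris.data[:, [0, 2]], iris.target
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=1
)

# The pipeline fits the scaler on the training data, then trains the Perceptron
model = make_pipeline(StandardScaler(), Perceptron(eta0=0.1, random_state=1))
model.fit(X_train, Y_train)

# At prediction time the stored training statistics are applied automatically
pipeline_accuracy = model.score(X_test, Y_test)
print("Accuracy Score %.3f" % pipeline_accuracy)
```

Because the scaler lives inside the pipeline, the test data is always transformed with the statistics learned from the training data, which avoids accidentally fitting the scaler on the test set.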

## Conclusion

That’s all for now. I hope you found this article helpful. If you have any questions, please don’t hesitate to let me know. I would also be happy to provide more information on linear and non-linear feature scaling techniques for training data, such as min-max scaling, Z-score normalization, and logarithmic transformation, if you are interested.