Python – Improve Model Performance using Feature Scaling

In this post, you will learn about a simple technique, feature scaling, that you can use to improve your machine learning models. The models will be trained using the Perceptron (single-layer neural network) classifier.

First and foremost, let's quickly understand what feature scaling is and why one needs it.

What is Feature Scaling and Why does one need it?

Feature Scaling is a technique for standardizing the independent features present in the data to a fixed range. It is performed when the dataset contains features that vary widely in magnitude, units, and range. The following are different kinds of scaling:

  • Min-max scaling: The input value is scaled to the range [0, 1]. The drawback is that one needs to estimate the minimum and maximum values from the given training data set, and the presence of outliers needs to be dealt with, which adds complexity. The following is the formula for min-max scaling:

    x_scaled = (x1 - x1_min) / (x1_max - x1_min)

  • Z-score normalization: Z-score normalization addresses the problem of outliers without requiring prior knowledge of what the reasonable range is, by linearly scaling the input using the mean and standard deviation estimated over the training dataset. The following is the formula for Z-score normalization; it is implemented in StandardScaler, whose usage is shown later in this post. A small code sketch of both formulas follows this list.

    x_scaled = (x1 - x1_mean) / x1_stddev
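
To make the two formulas concrete, here is a minimal NumPy sketch; the feature values in the array are made up purely for illustration:

import numpy as np

# Hypothetical raw values of a single feature (illustration only)
x1 = np.array([4.3, 5.8, 6.1, 7.9])

# Min-max scaling: the minimum maps to 0 and the maximum maps to 1
x1_minmax = (x1 - x1.min()) / (x1.max() - x1.min())

# Z-score normalization: zero mean and unit standard deviation
x1_zscore = (x1 - x1.mean()) / x1.std()

print(x1_minmax)
print(x1_zscore)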

In this post, we will learn to use the Standardization (also known as z-score normalization) technique for feature scaling. We will use the StandardScaler class from the sklearn.preprocessing package.

Train a Perceptron Model without Feature Scaling

Here is the code for training a model without feature scaling. First and foremost, let's load the dataset and create the feature matrix and label vector. In this post, the Iris dataset has been used. In the code below, X is created as the training data whose features are sepal length and petal length.

from sklearn import datasets

# Load the Iris dataset
iris = datasets.load_iris()
# Use sepal length (column 0) and petal length (column 2) as the two features
X = iris.data[:, [0, 2]]
Y = iris.target

The next step is to create the training and test split. The sklearn.model_selection module provides the train_test_split function, which can be used to create the training / test split. Note that stratification is not used here; a stratified variant is sketched after the code below.

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1)
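
If you do want the class proportions of Y preserved in both splits, train_test_split accepts a stratify argument. The following is a minimal variant of the split above; note that the accuracy figures reported later in this post were obtained with the non-stratified split:

# Stratified variant: preserves the class proportions of Y in both splits
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=1, stratify=Y)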

The next step is to create an instance of the Perceptron classifier and train the model using the X_train data and Y_train labels. The code below uses the Perceptron class of the sklearn.linear_model module.

from sklearn.linear_model import Perceptron

# eta0 is the constant by which the updates are multiplied (the learning rate)
prcptrn = Perceptron(eta0=0.1, random_state=1)
prcptrn.fit(X_train, Y_train)

The next step is to measure the model accuracy. This can be done using the accuracy_score function of the sklearn.metrics module or by calling the score method on the Perceptron instance; the latter is shown after the code below.

from sklearn.metrics import accuracy_score
Y_predict = prcptrn.predict(X_test)
print("Misclassified examples %d" %(Y_test != Y_predict).sum())
print("Accuracy Score %.3f" %accuracy_score(Y_test, Y_predict))

The accuracy score comes out to be 0.578, with the number of misclassified examples being 19.

Train a Perceptron Model with Feature Scaling

Feature scaling is done with the help of the following code. This step follows right after creating the training and test split.

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
# Estimate the per-feature mean and standard deviation from the training data only
sc.fit(X_train)
# Apply the same transformation to both the training and the test data
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

The above code uses the StandardScaler class of the sklearn.preprocessing module. The fit method of StandardScaler estimates the sample mean and standard deviation of each feature from the training data. The transform method then computes the standardized value of the features using those estimated parameters (mean & standard deviation).
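
If you want to inspect the estimated parameters, the fitted scaler exposes them as attributes, and fit_transform combines both steps on the training data. A short illustrative sketch:

# Per-feature mean and standard deviation estimated from X_train
print(sc.mean_)
print(sc.scale_)

# Equivalent shortcut: fit and transform the training data in one call
X_train_std = sc.fit_transform(X_train)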

The next step is to train a Perceptron model and measure the accuracy:

prcptrnFS = Perceptron(eta0=0.1, random_state=1)
prcptrnFS.fit(X_train_std, Y_train)

Y_predict_std = prcptrnFS.predict(X_test_std)
print("Misclassified examples %d" % (Y_test != Y_predict_std).sum())
print("Accuracy Score %.3f" % accuracy_score(Y_test, Y_predict_std))

The accuracy score comes out to be 0.978, with the number of misclassified examples being 1.

You can note that the accuracy score increased by almost 40 percentage points (from 0.578 to 0.978).

Thus, it is recommended to perform feature scaling before training the model.
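
One convenient way to ensure that scaling always happens before training is to chain the scaler and the classifier in a Pipeline. The following is a minimal sketch, not part of the walkthrough above, that reproduces the scaled workflow:

from sklearn.pipeline import make_pipeline

# The pipeline fits the scaler on the training data and applies it before the classifier
pipe = make_pipeline(StandardScaler(), Perceptron(eta0=0.1, random_state=1))
pipe.fit(X_train, Y_train)
print("Accuracy Score %.3f" % pipe.score(X_test, Y_test))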
