Feature Scaling & Stratification for Model Performance (Python)

In this post, you will learn how to improve the performance of machine learning models using techniques such as feature scaling and stratification. The following topics are covered, and the concepts are explained using Python code samples.

  • What is feature scaling and why is it needed?
  • What is stratification?
  • Training Perceptron model without feature scaling and stratification
  • Training Perceptron model with feature scaling
  • Training Perceptron model with feature scaling and stratification

What is Feature Scaling and Why is it needed?

Feature scaling is a technique for bringing the features in a data set onto a comparable scale or into a fixed range. It is needed when the data consists of features of varying magnitudes, units and ranges. Without it, features with larger values can dominate the training of models that are sensitive to feature magnitude.

In Python, a popular way of doing feature scaling is to use the StandardScaler class of the sklearn.preprocessing module.
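As a minimal sketch of what standardization does, the snippet below compares a hand-computed version (subtract the column mean, divide by the column standard deviation) with StandardScaler. The toy array X_demo and its values are made up for illustration and are not part of the original post.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales
X_demo = np.array([[1.0, 1000.0],
                   [2.0, 2000.0],
                   [3.0, 3000.0],
                   [4.0, 4000.0]])

# Standardize by hand: z = (x - column mean) / column standard deviation
X_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Standardize with StandardScaler
X_scaled = StandardScaler().fit_transform(X_demo)

# Check that both approaches agree
print(np.allclose(X_manual, X_scaled))

# After scaling, each column has mean ~0 and standard deviation ~1
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```

After scaling, both columns are on the same scale even though the raw values differ by three orders of magnitude.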

What is Stratification?

Stratification is a technique used to ensure that subsampling without replacement produces subsets (here, the training and the test set) in which each class is represented in the same proportion as in the full data set. Without stratification, the statistics of the sample can be affected. The smaller the sample, the more subsampling without replacement tends to affect its statistics.

For example, in the Iris dataset found in sklearn.datasets, the 150 samples are distributed as 50 (virginica), 50 (versicolor) and 50 (setosa).

Note that there are three different classes and the data set is small (150 samples). When creating two splits, e.g., a training and a test data set, we need to ensure that the class distribution is not altered so that the sample statistics are preserved. This is where stratification is needed. Note that if the data set is large enough, subsampling without replacement may not affect the sample statistics that much.
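The effect can be checked directly by counting classes in each split. The sketch below splits the Iris data once without and once with the stratify argument of train_test_split and prints the class counts using np.bincount. The variable names (X_demo, Y_train_plain, and so on) are chosen here for illustration only.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_demo, Y_demo = iris.data, iris.target

# Plain random split: class counts in each subset can drift
_, _, Y_train_plain, Y_test_plain = train_test_split(
    X_demo, Y_demo, test_size=0.3, random_state=1)

# Stratified split: class proportions of Y_demo are preserved
_, _, Y_train_strat, Y_test_strat = train_test_split(
    X_demo, Y_demo, test_size=0.3, random_state=1, stratify=Y_demo)

print("Full data set:    ", np.bincount(Y_demo))
print("Plain train/test: ", np.bincount(Y_train_plain), np.bincount(Y_test_plain))
print("Strat. train/test:", np.bincount(Y_train_strat), np.bincount(Y_test_strat))
```

With a 70/30 split of 50 samples per class, the stratified training set holds 35 samples of each class and the test set holds 15 of each.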

In the following sections, we will see how the model performance improves with feature scaling and stratification. 

The following Python modules and classes are used in the code given in the following sections:

import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score

from sklearn import datasets

Training Perceptron Model without Feature Scaling & Stratification

Here is the Python code for training a model without feature scaling and stratification:

iris = datasets.load_iris()
X = iris.data
Y = iris.target

# Create training and test split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1)

# Execute np.bincount(Y_train) to check the class distribution of the training split.
# The output comes out as something like array([36, 32, 37]).
# Had we used stratification, the output would have been array([35, 35, 35]).

# Train the Perceptron model
prcptrn = Perceptron(eta0=0.1, random_state=1)
prcptrn.fit(X_train, Y_train)

# Measure the accuracy
print("Accuracy Score %.3f" %prcptrn.score(X_test, Y_test))

The accuracy score of the model trained without feature scaling and stratification comes out to be 73.3%.

Training Perceptron Model with Feature Scaling

In this section, we will apply the feature scaling technique. Feature scaling can be done using different techniques such as standardization or min-max normalization. For standardization, the StandardScaler class of the sklearn.preprocessing module is used. For min-max normalization, the MinMaxScaler class of the same module is used.
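For comparison, here is a minimal sketch of min-max normalization with MinMaxScaler on the Iris training split. The variable names (X_tr, X_te, mms) are illustrative. As with StandardScaler, the scaler is fit on the training data only and then applied to the test data.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

iris = datasets.load_iris()
X_tr, X_te, _, _ = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=1)

# Learn the per-feature min and max from the training data only
mms = MinMaxScaler()
X_tr_mm = mms.fit_transform(X_tr)

# Reuse the training min/max for the test data
X_te_mm = mms.transform(X_te)

# Each training feature now spans the range [0, 1]
print(X_tr_mm.min(axis=0), X_tr_mm.max(axis=0))
```

Scaled test values can fall slightly outside [0, 1] when the test set contains values beyond the training minimum or maximum.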

In this example, we will use StandardScaler for feature scaling.

# Create Training and Test Split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1)

# Fit the scaler on the training data only, then scale both training and test data
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

# Train / Fit the Perceptron model
prcptrn = Perceptron(eta0=0.1, random_state=1)
prcptrn.fit(X_train_std, Y_train)

# Predict on the scaled test data and measure the model performance
Y_predict_std = prcptrn.predict(X_test_std)
print("Accuracy Score %0.3f" % accuracy_score(Y_test, Y_predict_std))

The accuracy score of the model trained with feature scaling comes out to be 86.7%. Note that this model performs better than the previous model, which was trained / fit without feature scaling.

Training Perceptron Model with Feature Scaling & Stratification

In this section, we will train the model using both feature scaling and stratification. Note the argument stratify=Y, which means that stratification is done based on the classes found in Y.

# Create training and test split based on stratification
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1, stratify=Y)

# Perform Feature Scaling
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

# Train / Fit the Perceptron model
prcptrn = Perceptron(eta0=0.1, random_state=1)
prcptrn.fit(X_train_std, Y_train)

# Predict on the scaled test data and measure the model performance
Y_predict_std = prcptrn.predict(X_test_std)
print("Accuracy Score %0.3f" % accuracy_score(Y_test, Y_predict_std))

The accuracy score of the model trained with feature scaling and stratification comes out to be 95.6%. Note that this model performs better than the previous two models, which were trained / fit without stratification. One can verify the stratification by executing np.bincount(Y_train), which prints array([35, 35, 35]). This shows that Y_train contains an equal number of samples from each class.
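The three experiments can also be run side by side. The sketch below wraps the steps from this post in a single helper. The function name evaluate_perceptron and its scale / stratify flags are hypothetical and introduced here for illustration only. Exact scores may vary with the scikit-learn version.

```python
from sklearn import datasets
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def evaluate_perceptron(scale, stratify):
    """Train a Perceptron on Iris and return its test accuracy (hypothetical helper)."""
    iris = datasets.load_iris()
    X_all, Y_all = iris.data, iris.target

    # Optionally stratify the split on the class labels
    X_tr, X_te, Y_tr, Y_te = train_test_split(
        X_all, Y_all, test_size=0.3, random_state=1,
        stratify=Y_all if stratify else None)

    # Optionally standardize, fitting the scaler on the training data only
    if scale:
        scaler = StandardScaler().fit(X_tr)
        X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

    model = Perceptron(eta0=0.1, random_state=1)
    model.fit(X_tr, Y_tr)
    return accuracy_score(Y_te, model.predict(X_te))


for scale, stratify in [(False, False), (True, False), (True, True)]:
    acc = evaluate_perceptron(scale, stratify)
    print("scaling=%s, stratification=%s: accuracy %.3f" % (scale, stratify, acc))
```

Keeping random_state fixed in both the split and the model makes the three runs comparable, since only the scaling and stratification settings change between them.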

 

 

Ajitesh Kumar
Posted in AI, Data Science, Machine Learning.