In this post, you will learn how to improve machine learning model performance using techniques such as feature scaling and stratification. The concepts are explained with Python code samples.
Feature scaling is a technique for standardizing the features present in the data to a fixed range. It is applied when the data consists of features of varying magnitudes, units, and ranges.
In Python, the most popular way to perform feature scaling is to use the StandardScaler class of the sklearn.preprocessing module.
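As a quick illustration, here is a minimal sketch (using made-up toy numbers) of what StandardScaler does: each feature column is transformed to zero mean and unit variance.
import numpy as np
from sklearn.preprocessing import StandardScaler
# Toy feature matrix with two features of very different magnitudes
X_toy = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
scaler = StandardScaler()
X_toy_std = scaler.fit_transform(X_toy)
# Each column now has mean ~0 and standard deviation ~1
print(X_toy_std)
# [[-1.22474487 -1.22474487]
#  [ 0.          0.        ]
#  [ 1.22474487  1.22474487]]
Note that both columns end up on the same scale even though their original magnitudes differed by a factor of 100.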
Stratification is a technique used to ensure that subsampling without replacement produces subsets — the training set and the test set — in which each class is represented in the same proportion as in the original data. Skipping stratification can distort the statistics of the sample. The degree to which subsampling without replacement affects the statistics of a sample is inversely proportional to the size of the sample.
For example, in the IRIS dataset found in sklearn.datasets, the class distribution of the 150 samples is 50 (virginica), 50 (versicolor), and 50 (setosa).
Note that there are three different classes and the dataset is small (150 samples). To create two splits, e.g., a training and a test dataset, we need to ensure that the class distribution does not get altered, so that the sample statistics are preserved. This is where stratification is needed. Note that if the dataset is large enough, subsampling without replacement may not affect the sample statistics that much.
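As a quick check, here is a minimal sketch (using the same IRIS data loaded later in this post) that compares the training-split class counts with and without stratification via np.bincount:
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
X, Y = datasets.load_iris(return_X_y=True)
# Without stratification, the class counts in the training split can drift
_, _, Y_train_plain, _ = train_test_split(X, Y, test_size=0.3, random_state=1)
print(np.bincount(Y_train_plain))   # e.g. [36 32 37]
# With stratification, each class keeps its original proportion
_, _, Y_train_strat, _ = train_test_split(X, Y, test_size=0.3, random_state=1, stratify=Y)
print(np.bincount(Y_train_strat))   # [35 35 35]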
In the following sections, we will see how model performance improves with feature scaling and stratification.
The following Python modules and classes are used in the code given in the following sections:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score
from sklearn import datasets
Here is the Python code for training the model without feature scaling and stratification:
iris = datasets.load_iris()
X = iris.data
Y = iris.target
# Create training and test split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1)
# Execute np.bincount(Y_train) to check the class distribution of the training
# split. The output comes out as something like array([36, 32, 37]). Had we
# used stratification, the output would have been array([35, 35, 35]).
# Train the Perceptron model
prcptrn = Perceptron(eta0=0.1, random_state=1)
prcptrn.fit(X_train, Y_train)
# Measure the accuracy
print("Accuracy Score %.3f" %prcptrn.score(X_test, Y_test))
The accuracy score of the model trained without feature scaling and stratification comes out to be 73.3%.
In this section, we will apply the feature scaling technique. Feature scaling can be done using different techniques, such as standardization or min-max normalization. For standardization, the StandardScaler class of the sklearn.preprocessing module is used. For min-max normalization, the MinMaxScaler class of the same module is used.
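For completeness, here is a minimal sketch (with made-up toy numbers, not used in the rest of this post) of the min-max alternative, which rescales each feature to the [0, 1] range:
import numpy as np
from sklearn.preprocessing import MinMaxScaler
X_toy = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
mms = MinMaxScaler()
# Each column is mapped to [0, 1] via (x - min) / (max - min)
print(mms.fit_transform(X_toy))
# [[0.  0. ]
#  [0.5 0.5]
#  [1.  1. ]]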
In this example, we will use StandardScaler for feature scaling.
# Create Training and Test Split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1)
# Do feature scaling for training and test data set
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
# Train / Fit the Perceptron model
prcptrn = Perceptron(eta0=0.1, random_state=1)
prcptrn.fit(X_train_std, Y_train)
# Measure the model performance
print("Accuracy Score %0.3f" % accuracy_score(Y_test, Y_predict_std))
The accuracy score of the model trained with feature scaling comes out to be 86.7%. Note that this model performs better than the previous model, which was trained without feature scaling.
In this section, we will train the model using both feature scaling and stratification. Note the stratify=Y argument, which indicates that stratification is done based on the classes found in Y.
# Create training and test split based on stratification
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1, stratify=Y)
# Perform Feature Scaling
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
# Train / Fit the Perceptron model
prcptrn = Perceptron(eta0=0.1, random_state=1)
prcptrn.fit(X_train_std, Y_train)
# Measure the model performance
print("Accuracy Score %0.3f" % accuracy_score(Y_test, Y_predict_std))
The accuracy score of model trained with feature scaling & stratification comes out to be 95.6%. Note that model has a higher performance than the previous two models which was trained / fit without feature scaling. One can test the stratification by executing np.bincount(Y_train). This would print the output consisting of array([35, 35, 35]). This represents that Y_train consists of equal distribution of all the classes.