MinMaxScaler vs StandardScaler – Python Examples

MinMaxScaler vs StandardScaler

In machine learning, MinMaxScaler and StandardScaler are two scaling algorithms for continuous variables. MinMaxScaler rescales each feature so that its minimum and maximum values map to 0 and 1 respectively. StandardScaler, in contrast, centers each feature at a mean of 0 and scales it to a standard deviation of 1; the resulting values are not confined to a fixed range. In this blog post, you will learn about the concepts and differences between MinMaxScaler and StandardScaler with the help of Python code examples. Both are classes provided by the sklearn.preprocessing module and are used for feature scaling. As a data scientist, you will need to learn these concepts in order to train machine learning models using algorithms that require features to be on the same scale. For algorithms such as random forests and decision trees, which are scale-invariant, you do not need to use these feature scaling techniques.

Here is the sample Pandas data frame which will be used later in this post for illustration purposes:

import pandas as pd
import numpy as np

arr = np.array([['M', 81.4, 82.2, 44, 6.1, 120000, 'no'],
               ['M', 75.2, 86.2, 40, 5.9, 80000, 'no'],
               ['F', 80.0, 83.2, 34, 5.4, 210000, 'yes'],
               ['F', 85.4, 72.2, 46, 5.6, 50000, 'yes'],
               ['M', 68.4, 87.2, 28, 5.11, 70000, 'no']])
#
# Create Pandas DataFrame
#
df = pd.DataFrame(arr)
df.columns = ['gender', 'hsc_p', 'ssc_p', 'age', 'height', 'salary', 'suffer_from_disease']
#
# Convert the string data type to int and float appropriately
#
df[['age', 'salary']] = df[['age', 'salary']].astype(int)
df[['ssc_p', 'hsc_p', 'height']] = df[['ssc_p', 'hsc_p', 'height']].astype(float)

Here is what the data frame looks like:

Fig 1. Sample Pandas Dataframe

Why is Feature Scaling needed?

Feature scaling transforms the values of different numerical features so that they fall within a similar range. It is used to prevent supervised learning models from being biased toward features that happen to have larger numeric ranges. For example, if your model is a linear regression trained with gradient descent or regularization and you do not scale the features, the features with larger values can dominate the weight updates or the penalty term. This gives some variables undue influence over others and can hurt prediction performance. This is why it is important to use scaling algorithms to bring your feature values onto a common scale.

Putting all features on the same scale helps avoid problems such as the following:

  • Loss in accuracy
  • Increase in computational cost as data values vary widely over different orders of magnitude.

For example, in the data set used in this post, pay attention to the feature values of salary, age, and height. The values of salary range from 50000 to 210000 (in the above example), while age typically falls between 1 and 100 and height between 4 ft and 7 ft. When such a data set is used with algorithms such as gradient descent optimization or K-nearest neighbors, the features with larger values (here, salary) dominate the weight updates or the distance calculations. This results in sub-optimal models. This is where feature scaling comes into the picture: the idea is to transform the feature values into a similar range so that machine learning algorithms behave better and produce better models.
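
To make the distance problem concrete, here is a minimal sketch using three rows (age, height, salary) taken from the sample data. The helper name euclidean and the inline min-max scaling are illustrative choices, not part of the original post.

```python
import numpy as np

# Three rows from the sample data: [age, height, salary]
X = np.array([
    [44, 6.1, 120000],
    [40, 5.9, 80000],
    [34, 5.4, 210000],
], dtype=float)

def euclidean(a, b):
    """Plain Euclidean distance between two feature vectors."""
    return np.sqrt(np.sum((a - b) ** 2))

# Distance between the first two rows on the raw features
raw_dist = euclidean(X[0], X[1])
salary_gap = abs(X[0, 2] - X[1, 2])

# Min-max scale every column to [0, 1] and measure the distance again
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
scaled_dist = euclidean(X_scaled[0], X_scaled[1])

print("raw distance:   ", raw_dist)
print("salary gap only:", salary_gap)
print("scaled distance:", scaled_dist)
```

On the raw features the distance is almost exactly the salary difference, so age and height barely matter. After scaling, all three features contribute on a comparable footing.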

Feature scaling is not important for algorithms such as random forests or decision trees, which are scale-invariant. These models split on one feature at a time using thresholds, so the scale of a feature's values does not affect the performance of the trained model.
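
One way to check this claim is to train the same decision tree on raw and on scaled features and compare the predictions. The following is a rough sketch on the IRIS data set; the variable names and the choice of MinMaxScaler are assumptions made for illustration. Expect the two models to agree on all or nearly all test samples (floating-point rounding at a split threshold could, in principle, flip an occasional prediction).

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

# Fit the scaler on the training data only
scaler = MinMaxScaler().fit(X_train)

# Same model and same random_state; only the feature scale differs
tree_raw = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(
    scaler.transform(X_train), y_train)

pred_raw = tree_raw.predict(X_test)
pred_scaled = tree_scaled.predict(scaler.transform(X_test))

# Fraction of test samples on which both trees predict the same class
agreement = float((pred_raw == pred_scaled).mean())
print("agreement between raw and scaled trees:", agreement)
```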

Normalization vs Standardization

The two common approaches to bringing different features onto the same scale are normalization and standardization.

What is Normalization?

Normalization refers to rescaling the features to the range [0, 1], which is a special case of min-max scaling. Normalization is useful when the data is needed in a bounded interval. To normalize the data, min-max scaling can be applied to one or more feature columns. Here is the formula for normalizing data based on min-max scaling:

$$x_{norm} = \frac{x - x_{min}}{x_{max} - x_{min}}$$
Fig 2. Normalizing data based on min-max scaling concepts

Here is what a Python function for normalizing one or more columns looks like:

def normalize(values):
    return (values - values.min())/(values.max() - values.min()) 

To apply the normalization technique to one or more feature columns, you can use the following Python code (with reference to the dataset used in this post). Note the use of the apply method, which applies the normalize function shown above to multiple feature columns at once.

cols = ['hsc_p', 'ssc_p', 'age', 'height', 'salary']
#
# Normalize the feature columns
#
df[cols] = df[cols].apply(normalize)
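
Since [0, 1] is only a special case of min-max scaling, the same idea extends to an arbitrary target range [a, b]. The sketch below uses a hypothetical helper, min_max_scale, written for this illustration, and compares it with the feature_range parameter of sklearn's MinMaxScaler on the salary values from the sample data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

def min_max_scale(values, a=0.0, b=1.0):
    """Rescale values linearly so that min -> a and max -> b."""
    values = np.asarray(values, dtype=float)
    unit = (values - values.min()) / (values.max() - values.min())
    return a + unit * (b - a)

salary = np.array([120000, 80000, 210000, 50000, 70000], dtype=float)

unit_scaled = min_max_scale(salary)        # range [0, 1]
sym_scaled = min_max_scale(salary, -1, 1)  # range [-1, 1]

# The same result using sklearn (expects a 2-D array: samples x features)
sk_scaled = MinMaxScaler(feature_range=(-1, 1)).fit_transform(
    salary.reshape(-1, 1)).ravel()

print(unit_scaled)  # first value: (120000 - 50000) / (210000 - 50000) = 0.4375
print(sym_scaled)
print(sk_scaled)
```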

What is Standardization?

The standardization technique centers each feature column at mean 0 with a standard deviation of 1, so that the feature columns have the same parameters (mean and variance) as a standard normal distribution. Unlike min-max scaling, which squeezes the data into a limited range of values, standardization does not bound the values, so it retains information about how far outliers lie from the mean and tends to make the algorithm less sensitive to them. Here is the formula for standardization:

$$z = \frac{x - \mu}{\sigma}$$
Fig 3. Standardization formula (where μ is the mean and σ is the standard deviation of the feature column)
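
To see how the two techniques treat an outlier, here is a small sketch on made-up numbers (five ordinary values and one extreme value; the data is purely illustrative). Keep in mind that the mean and standard deviation are themselves pulled by the outlier, so standardization is less affected than min-max scaling only in a relative sense.

```python
import numpy as np

# Five ordinary values and one outlier (illustrative data)
values = np.array([10.0, 11.0, 12.0, 13.0, 14.0, 1000.0])

# Min-max scaling: the outlier is pinned to 1 and the rest is squeezed near 0
min_max = (values - values.min()) / (values.max() - values.min())

# Standardization: values are unbounded, so the outlier keeps its distance
# from the mean, measured in standard deviations
z_scores = (values - values.mean()) / values.std()

print(np.round(min_max, 4))
print(np.round(z_scores, 4))
```

With min-max scaling, the five ordinary values all land below 0.01 because the outlier defines the top of the range. With standardization, the outlier shows up as a z-score above 2 instead of being clamped to the edge of a fixed interval.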

Here is what a Python function for standardizing one or more columns looks like:

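def standardize(values):
    # Note: pandas .std() defaults to the sample standard deviation (ddof=1),
    # whereas sklearn's StandardScaler uses the population value (ddof=0).
    return (values - values.mean())/values.std()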

To apply the standardization technique to one or more feature columns, you can use the following Python code (with reference to the dataset used in this post). Note the use of the apply method, which applies the standardize function to multiple feature columns at once.

cols = ['hsc_p', 'ssc_p', 'age', 'height', 'salary']
#
# Standardize the feature columns. If the DataFrame was already normalized above, recreate it first so that the raw values are standardized.
#
df[cols] = df[cols].apply(standardize)
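
One detail worth checking is which standard deviation is used. The sketch below, on the age values from the sample data, compares the pandas default (sample standard deviation, ddof=1) with the population standard deviation (ddof=0) that StandardScaler uses. The variable names are chosen for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

age = pd.Series([44, 40, 34, 46, 28], dtype=float)

# pandas default: sample standard deviation (ddof=1)
pandas_z = (age - age.mean()) / age.std()

# Population standard deviation (ddof=0), as used by StandardScaler
population_z = (age - age.mean()) / age.std(ddof=0)

# StandardScaler expects a 2-D array: samples x features
sklearn_z = StandardScaler().fit_transform(
    age.to_numpy().reshape(-1, 1)).ravel()

print(pandas_z.to_numpy())
print(population_z.to_numpy())
print(sklearn_z)
```

Both versions are centered at 0; they differ only by a constant factor, which rarely matters for modeling but explains small mismatches when comparing hand-rolled results with sklearn output.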

MinMaxScaler for Normalization

MinMaxScaler is a class from sklearn.preprocessing which is used for normalization. Here is the sample code:

from sklearn.preprocessing import MinMaxScaler

mmscaler = MinMaxScaler()
cols = ['hsc_p', 'ssc_p', 'age', 'height', 'salary']
df[cols] = mmscaler.fit_transform(df[cols])
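
After fitting, the scaler keeps the per-column minimum and maximum it learned, and the transformation can be undone. Here is a short sketch using the age and salary columns of the sample data as a plain NumPy array (an illustrative setup, separate from the data frame above):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Columns: age, salary (values from the sample data)
X = np.array([[44, 120000],
              [40, 80000],
              [34, 210000],
              [46, 50000],
              [28, 70000]], dtype=float)

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Per-column statistics learned during fit
print("data_min_:", scaler.data_min_)
print("data_max_:", scaler.data_max_)

# inverse_transform maps the scaled values back to the original units
X_restored = scaler.inverse_transform(X_scaled)
print(X_restored)
```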

When normalizing training and test data sets, fit the MinMaxScaler estimator on the training data set only, and then use the same fitted estimator to transform both the training and the test data sets. The following code demonstrates this, where X holds the feature values and y holds the corresponding labels. The IRIS data set is used for illustration purposes.

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

iris = datasets.load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)

mmscaler = MinMaxScaler()
X_train_norm = mmscaler.fit_transform(X_train)
X_test_norm = mmscaler.transform(X_test)
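
Because the scaler learns its minimum and maximum from the training data only, test values that fall outside the training range are mapped outside [0, 1]. The sketch below shows this on a tiny made-up example, along with the clip option of MinMaxScaler (available in scikit-learn 0.24 and later), which limits transformed values to the target range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative data: the test set contains values below and above
# the range seen during training (10 to 30)
X_train = np.array([[10.0], [20.0], [30.0]])
X_test = np.array([[5.0], [25.0], [40.0]])

scaler = MinMaxScaler().fit(X_train)
X_test_norm = scaler.transform(X_test)
print(X_test_norm.ravel())     # approximately [-0.25, 0.75, 1.5]

# With clip=True, transformed values are limited to feature_range
clipping_scaler = MinMaxScaler(clip=True).fit(X_train)
X_test_clipped = clipping_scaler.transform(X_test)
print(X_test_clipped.ravel())  # approximately [0.0, 0.75, 1.0]
```

Whether clipping is appropriate depends on the model; it hides how far outside the training range a value was.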

StandardScaler for Standardization

StandardScaler is a class from sklearn.preprocessing which is used for standardization. Here is the sample code:

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
cols = ['hsc_p', 'ssc_p', 'age', 'height', 'salary']
df[cols] = sc.fit_transform(df[cols])
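
The fitted StandardScaler stores the per-column mean and standard deviation in its mean_ and scale_ attributes and reuses them for any new data. Here is a brief sketch using the age and height columns of the sample data as a NumPy array; the new sample is a made-up value for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: age, height (values from the sample data)
X = np.array([[44, 6.1],
              [40, 5.9],
              [34, 5.4],
              [46, 5.6],
              [28, 5.11]])

sc = StandardScaler()
X_std = sc.fit_transform(X)

# Statistics learned during fit
print("mean_: ", sc.mean_)
print("scale_:", sc.scale_)

# A new (hypothetical) sample is scaled with the stored statistics,
# not with statistics computed from the new sample itself
new_sample = np.array([[30, 5.5]])
new_std = sc.transform(new_sample)
manual_std = (new_sample - sc.mean_) / sc.scale_
print(new_std, manual_std)
```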

When standardizing training and test data sets, fit the StandardScaler estimator on the training data set only, and then use the same fitted estimator to transform both the training and the test data sets. The following code demonstrates this. The IRIS data set is used for illustration purposes.

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)

sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)
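
A convenient way to enforce the fit-on-training-only rule is to put the scaler and the model into a single sklearn Pipeline. The following is one possible sketch; the choice of K-nearest neighbors with n_neighbors=5 and 5-fold cross-validation are assumptions made for illustration. During cross-validation, the scaler is refit on the training portion of each fold, so no statistics leak from the held-out portion.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

# Scaling and classification chained into one estimator
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

# The scaler is fit separately inside each cross-validation fold
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print("cross-validation accuracy per fold:", cv_scores)

# Fit on the full training set, then evaluate on the untouched test set
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print("test accuracy:", test_accuracy)
```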

Conclusion

Here are the key takeaways:

  • Feature scaling transforms feature values into a similar range so that machine learning algorithms behave better and produce better models.
  • Feature scaling is not required for algorithms such as random forests or decision trees.
  • Standardization and normalization are the two most common techniques for feature scaling.
  • Normalization transforms the feature values to fall within a bounded interval, typically [0, 1].
  • Standardization transforms the feature values to have a mean of 0 and a standard deviation of 1.
  • Standardization does not bound the values, so it retains information about outliers and tends to make the algorithm less sensitive to them than min-max scaling.
  • The MinMaxScaler class of sklearn.preprocessing is used for normalization of features.
  • The StandardScaler class of sklearn.preprocessing is used for standardization of features.
Ajitesh Kumar