MinMaxScaler vs StandardScaler – Python Examples


In this post, you will learn about the concepts of and differences between MinMaxScaler and StandardScaler with the help of Python code examples. Both are classes provided by the sklearn.preprocessing module and are used for feature scaling. As a data scientist, you will need to understand these concepts in order to train machine learning models using algorithms that require features to be on the same scale. For algorithms such as random forests and decision trees, which are scale invariant, you do not need these feature scaling techniques.

The following topics are covered in this post:

  • Why is feature scaling needed?
  • Normalization vs Standardization
  • MinMaxScaler for normalization
  • StandardScaler for standardization

Here is the sample Pandas data frame that will be used later in this post for illustration purposes:

import pandas as pd
import numpy as np

arr = np.array([['M', 81.4, 82.2, 44, 6.1, 120000, 'no'],
                ['M', 75.2, 86.2, 40, 5.9, 80000, 'no'],
                ['F', 80.0, 83.2, 34, 5.4, 210000, 'yes'],
                ['F', 85.4, 72.2, 46, 5.6, 50000, 'yes'],
                ['M', 68.4, 87.2, 28, 5.11, 70000, 'no']])
#
# Create Pandas DataFrame
#
df = pd.DataFrame(arr)
df.columns = ['gender', 'hsc_p', 'ssc_p', 'age', 'height', 'salary', 'suffer_from_disease']
#
# Convert the string data type to int and float appropriately
#
df[['age', 'salary']] = df[['age', 'salary']].astype(int)
df[['ssc_p', 'hsc_p', 'height']] = df[['ssc_p', 'hsc_p', 'height']].astype(float)

Here is what the data frame looks like:

Fig 1. Sample Pandas DataFrame

Why is Feature Scaling needed?

Feature scaling is about transforming the values of different numerical features so that they fall within a similar range. For example, in the data set used in this post, pay attention to the feature values of salary, age, and height. The values of salary lie in the range 50,000 to 210,000 (in the above example), while the values of age lie roughly in the range 1 to 100 and the values of height in the range 4 ft to 7 ft. When such a data set is fed to algorithms such as gradient descent optimization or K-nearest neighbours, the algorithm ends up optimizing weights or computing distances that are dominated by the features with larger values, which results in sub-optimal models. This is where feature scaling comes into the picture: transforming all features onto a similar scale allows such algorithms to behave better, resulting in optimal models.
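
To make this concrete, here is a minimal sketch (using two rows from the sample data above) showing how the Euclidean distance used by K-nearest neighbours is dominated by salary until the features are scaled:

import numpy as np

# Two rows (age, height, salary) taken from the sample data frame
a = np.array([44, 6.1, 120000])
b = np.array([40, 5.9, 80000])
#
# Without scaling, the distance is dominated entirely by the salary difference
#
print(np.linalg.norm(a - b))      # ~40000.0
#
# Min-max scale each feature using the column min/max from the sample data
#
col_min = np.array([28, 5.11, 50000])
col_max = np.array([46, 6.10, 210000])
a_scaled = (a - col_min) / (col_max - col_min)
b_scaled = (b - col_min) / (col_max - col_min)
#
# After scaling, all three features contribute to the distance
#
print(np.linalg.norm(a_scaled - b_scaled))   # ~0.39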

Feature scaling is not important for algorithms such as random forest or decision tree, which are scale invariant. The scale of the feature values does not impact the performance of models trained using these algorithms.

Normalization vs Standardization

The two common approaches to bringing different features onto the same scale are normalization and standardization.

What is Normalization?

Normalization refers to rescaling features to the range [0, 1], which is a special case of min-max scaling. To normalize the data, min-max scaling can be applied to one or more feature columns. Here is the formula for normalizing data based on min-max scaling. Normalization is useful when the data needs to lie within a bounded interval.

x_norm = (x − x_min) / (x_max − x_min)

Fig 2. Normalizing data based on min-max scaling

This is what a Python function for normalizing one or more columns would look like:

def normalize(values):
    # Rescale each value to [0, 1] using the column's min and max
    return (values - values.min()) / (values.max() - values.min())

In order to apply the normalization technique to one or more feature columns, one could use the following Python code (with reference to the data set used in this post). Note the usage of the apply method, which applies the normalize function shown above to multiple feature columns at once.

cols = ['hsc_p', 'ssc_p', 'age', 'height', 'salary']
#
# Normalize the feature columns
#
df[cols] = df[cols].apply(normalize)
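
After this transformation, every value in the normalized columns falls within [0, 1], which is easy to verify:

#
# Each normalized column should now have min 0.0 and max 1.0
#
print(df[cols].min())
print(df[cols].max())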

What is Standardization?

The standardization technique is used to center the feature columns at mean 0 with standard deviation 1, so that the feature columns have the same parameters as a standard normal distribution. Unlike normalization, which scales the data to a limited range of values, standardization maintains useful information about outliers and makes the algorithm less sensitive to them. Here is the formula for standardization.

x_std = (x − μ) / σ, where μ is the mean and σ is the standard deviation of the feature column

Fig 3. Standardization formula

This is what a Python function for standardizing one or more columns would look like:

def standardize(values):
    # Center on the column mean and scale to unit variance
    # (note: Pandas std() defaults to the sample standard deviation, ddof=1)
    return (values - values.mean()) / values.std()

In order to apply the standardization technique to one or more feature columns, one could use the following Python code (with reference to the data set used in this post). Note the usage of the apply method, which applies the standardize function to multiple feature columns at once.

cols = ['hsc_p', 'ssc_p', 'age', 'height', 'salary']
#
# Standardize the feature columns; recreate the data frame from the original
# data first if you already normalized it above
#
df[cols] = df[cols].apply(standardize)
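
After this transformation, each column should have a mean of (numerically) 0 and a standard deviation of 1, which can be checked quickly:

#
# Means should be ~0 (up to floating point error) and standard deviations 1.0
#
print(df[cols].mean().round(10))
print(df[cols].std())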

MinMaxScaler for Normalization

MinMaxScaler is a class from sklearn.preprocessing which is used for normalization. Here is the sample code:

from sklearn.preprocessing import MinMaxScaler

mmscaler = MinMaxScaler()
cols = ['hsc_p', 'ssc_p', 'age', 'height', 'salary']
df[cols] = mmscaler.fit_transform(df[cols])
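
As a sanity check, MinMaxScaler produces the same result as the manual normalize function defined earlier. Here is a minimal sketch, assuming df_original is a hypothetical fresh copy of the sample data frame (the transformations above have already modified df in place):

# df_original is assumed to be a fresh, untransformed copy of the sample data
manual = df_original[cols].apply(normalize)
scaled = mmscaler.fit_transform(df_original[cols])
#
# Both approaches yield identical values
#
print(np.allclose(manual.values, scaled))   # True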

When normalizing the training and test data sets, the MinMaxScaler estimator is fit on the training data set only, and the same fitted estimator is then used to transform both the training and the test data sets. This prevents information from the test set from leaking into the scaler. The following code demonstrates this, where X consists of the feature data and y of the corresponding labels, which are then split into training and test sets. The IRIS data set is used for illustration purposes.

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

iris = datasets.load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)

mmscaler = MinMaxScaler()
X_train_norm = mmscaler.fit_transform(X_train)
X_test_norm = mmscaler.transform(X_test)
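
In practice, an easy way to guarantee that the scaler is fit on the training data only is to wrap it in a Pipeline together with the downstream estimator. Here is a sketch using KNeighborsClassifier as an example of a scale-sensitive estimator (any such estimator would do):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

#
# The pipeline fits the scaler on the training data only, so evaluation
# on the test set stays free of data leakage
#
pipe = Pipeline([
    ('scaler', MinMaxScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=5))
])
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))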

StandardScaler for Standardization

StandardScaler is a class from sklearn.preprocessing which is used for standardization. Here is the sample code:

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
cols = ['hsc_p', 'ssc_p', 'age', 'height', 'salary']
df[cols] = sc.fit_transform(df[cols])
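
One subtlety worth noting: StandardScaler divides by the population standard deviation (ddof=0), whereas the manual standardize function above relies on Pandas' std(), which defaults to the sample standard deviation (ddof=1). On small data sets the two outputs therefore differ slightly. A minimal sketch, again assuming df_original is a hypothetical fresh copy of the sample data frame:

# StandardScaler uses the population standard deviation (ddof=0)
manual_population = (df_original[cols] - df_original[cols].mean()) / df_original[cols].std(ddof=0)
scaled = sc.fit_transform(df_original[cols])
print(np.allclose(manual_population.values, scaled))              # True
#
# Pandas std() defaults to ddof=1, so the manual standardize function
# differs slightly on this small sample
#
print(np.allclose(df_original[cols].apply(standardize), scaled))  # False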

When standardizing the training and test data sets, the StandardScaler estimator is fit on the training data set only, and the same fitted estimator is then used to transform both the training and the test data sets. The following code demonstrates this. The IRIS data set is used for illustration purposes.

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)

sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)

Conclusion

Here are some key takeaways from this post:

  • Feature scaling is about transforming feature values onto a similar scale so that machine learning algorithms behave better and produce optimal models.
  • Feature scaling is not required for algorithms such as random forest or decision tree.
  • Standardization and normalization are the two most common techniques for feature scaling.
  • Normalization is about transforming the feature values to fall within a bounded interval (between a min and a max).
  • Standardization is about transforming the feature values to fall around a mean of 0 with a standard deviation of 1.
  • Standardization maintains useful information about outliers and makes the algorithm less sensitive to them, in contrast to min-max scaling.
  • The MinMaxScaler class of sklearn.preprocessing is used for normalization of features.
  • The StandardScaler class of sklearn.preprocessing is used for standardization of features.