Handle Class Imbalance using Class Weight – Python

In this post, you will learn how to handle class imbalance by adjusting class weights when solving a machine learning classification problem. The technique is illustrated with a Python code example using Sklearn.

What is Class Imbalance?

Class imbalance is one of the most common problems in classification tasks in domains such as healthcare and banking (fraud detection). For example, if you want to build a model that classifies a transaction as fraudulent or not, the dataset will be highly imbalanced because fraudulent transactions are rare. The challenge in building high-performing models on such data is to deal with the highly skewed class distribution, which is referred to as the imbalanced classification problem. An imbalanced classification problem occurs when the classes in the dataset have a highly unequal number of samples. Here is what class imbalance looks like:

 

Fig 1. Representation of Class Imbalance

Several techniques can be used to handle class imbalance when training machine learning models on a dataset with imbalanced classes. Some of them are listed below:

  • Using class weight: This technique assigns a larger penalty to wrong predictions on the minority class.
  • Under-sampling the data related to majority classes
  • Over-sampling data related to minority classes
  • Generating synthetic training examples: One of the most widely used algorithms for this is the Synthetic Minority Over-sampling Technique (SMOTE). This will be covered in future posts.

Python packages such as Imbalanced Learn can be used to apply techniques such as under-sampling majority classes, over-sampling minority classes and SMOTE. In this post, class weight is the technique used to tackle class imbalance.
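
To make the two resampling ideas in the list above concrete, here is a minimal sketch of random over-sampling and random under-sampling. It uses sklearn.utils.resample rather than Imbalanced Learn, and the toy data (X_toy, y_toy with 90 majority and 10 minority samples) is an assumption made purely for illustration.

```python
import numpy as np
from sklearn.utils import resample

# Toy imbalanced data (assumed for illustration): 90 majority, 10 minority samples
rng = np.random.RandomState(1)
X_toy = rng.normal(size=(100, 2))
y_toy = np.array([0] * 90 + [1] * 10)

X_maj, X_min = X_toy[y_toy == 0], X_toy[y_toy == 1]

# Random over-sampling: draw minority rows with replacement up to the majority count
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=1)
X_over = np.vstack((X_maj, X_min_up))
y_over = np.hstack((np.zeros(len(X_maj), dtype=int), np.ones(len(X_min_up), dtype=int)))

# Random under-sampling: draw majority rows without replacement down to the minority count
X_maj_down = resample(X_maj, replace=False, n_samples=len(X_min), random_state=1)
X_under = np.vstack((X_maj_down, X_min))
y_under = np.hstack((np.zeros(len(X_maj_down), dtype=int), np.ones(len(X_min), dtype=int)))

print(np.bincount(y_over))   # 90 samples in each class
print(np.bincount(y_under))  # 10 samples in each class
```

Both variants change the training data itself, whereas class weights leave the data untouched and change only how much each mistake costs during training.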

How to create a Sample Dataset having Class Imbalance?

In this section, you will learn how to create an imbalanced dataset (imbalanced class distribution) from the Sklearn breast cancer dataset. The original dataset has 212 samples in the malignant class and 357 in the benign class. Keeping all benign samples and only 30 malignant ones gives an imbalanced distribution like the following:

Benign class – 357 
Malignant class – 30

This is how you could create the above mentioned imbalanced class distribution using Python Sklearn and Numpy:

from sklearn import datasets
import numpy as np

bc = datasets.load_breast_cancer()
X = bc.data
y = bc.target

X_imb = np.vstack((X[y == 1], X[y == 0][:30]))
y_imb = np.hstack((y[y == 1], y[y == 0][:30]))

The above code creates a new Numpy array by appending 30 records vertically (numpy vstack method) whose label is 0 (malignant) to the 357 records whose label is 1 (benign) taking the total record count to 387. Similarly, it appends 30 malignant labels to 357 benign labels horizontally (numpy hstack method).
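
As a quick sanity check, the shapes and per-class counts of the new arrays can be inspected with np.bincount. The snippet repeats the construction above so that it runs on its own.

```python
import numpy as np
from sklearn import datasets

bc = datasets.load_breast_cancer()
X, y = bc.data, bc.target

# Keep all benign samples (label 1) and only the first 30 malignant ones (label 0)
X_imb = np.vstack((X[y == 1], X[y == 0][:30]))
y_imb = np.hstack((y[y == 1], y[y == 0][:30]))

print(X_imb.shape)         # 387 rows, 30 features
print(np.bincount(y_imb))  # 30 malignant (label 0), 357 benign (label 1)
```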

Handling Class Imbalance using Class Weight – Python Example

In this section, you will learn about a technique for handling class imbalance while training models with Python Sklearn. Many Sklearn classification algorithms (for example LogisticRegression, SVC and RandomForestClassifier) have a parameter named class_weight, although not every classifier supports it. The different types of input to this parameter let you handle class imbalance in different ways. By default, when no value is passed, every class is assigned the same weight of 1. In case of class imbalance, here are the values you can pass:

  • balanced: Passing 'balanced' as class_weight uses the values of y (labels) to automatically adjust the weights so that they are inversely proportional to the class frequencies in the input data. The weights are calculated as n_samples / (n_classes * np.bincount(y))
  • {class_label: weight}: Let’s say, there are two classes labelled as 0 and 1. Passing input to class_weight as class_weight={0:2, 1:1} means class 0 has weight 2 and class 1 has weight 1.
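
The 'balanced' formula can be checked by hand. The sketch below uses a label vector y_demo (a name assumed for illustration) with the same 30/357 split as the imbalanced dataset, and compares the manual calculation with Sklearn's compute_class_weight helper.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Labels with the same 30 / 357 split as the imbalanced dataset (illustrative)
y_demo = np.array([0] * 30 + [1] * 357)

# Manual calculation: n_samples / (n_classes * np.bincount(y))
n_classes = 2
manual = len(y_demo) / (n_classes * np.bincount(y_demo))

# The same weights computed by Sklearn
auto = compute_class_weight(class_weight='balanced', classes=np.array([0, 1]), y=y_demo)

print(manual)  # class 0: 387 / 60 = 6.45, class 1: 387 / 714, roughly 0.542
print(auto)
```

With these weights a mistake on a minority-class sample costs roughly 12 times as much as a mistake on a majority-class sample, which matches the 357:30 ratio between the classes.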

The code sample below illustrates class_weight in the {class_label: weight} format. Class 0 (malignant) is the minority class, so it is given the larger weight:

from scipy import stats
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

#
# Create training and test split out of imbalanced data set created above
#
X_train, X_test, y_train, y_test = train_test_split(X_imb, y_imb, test_size=0.3, random_state=1, stratify=y_imb)
#
# Create pipeline with LogisticRegression and class_weight as {0:3, 1:1}
#
pipeline = make_pipeline(StandardScaler(), LogisticRegression(random_state=1, class_weight={0:3, 1:1}))
#
# Create a randomized search over the regularization parameter C
#
param_distributions = [{'logisticregression__C': stats.expon(scale=100)}]
rs = RandomizedSearchCV(estimator=pipeline, param_distributions=param_distributions, cv=10, scoring='accuracy', refit=True, n_jobs=1, random_state=1)
#
# Fit the model
#
rs.fit(X_train, y_train)
#
# Find the best score, params and accuracy on the test dataset
#
print('Best Score:', rs.best_score_, '\nBest Params:', rs.best_params_)
print('Test Accuracy: %0.3f' % rs.score(X_test, y_test))

Fig 2. Model Accuracy on Test Data
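
Accuracy alone can look high on an imbalanced dataset even when the minority class is predicted poorly, so it can help to look at per-class results as well. The following is a minimal sketch, not part of the original example, that fits the same kind of pipeline with three class_weight settings and reports the recall on the malignant class (label 0) along with the confusion matrix. The max_iter=1000 setting and the choice of recall as the metric are assumptions made for illustration; the numbers you get depend on the split.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Rebuild the imbalanced dataset so that the snippet runs on its own
bc = datasets.load_breast_cancer()
X, y = bc.data, bc.target
X_imb = np.vstack((X[y == 1], X[y == 0][:30]))
y_imb = np.hstack((y[y == 1], y[y == 0][:30]))

X_train, X_test, y_train, y_test = train_test_split(
    X_imb, y_imb, test_size=0.3, random_state=1, stratify=y_imb)

results = {}
settings = [('default', None), ('balanced', 'balanced'), ('manual', {0: 3, 1: 1})]
for name, cw in settings:
    model = make_pipeline(
        StandardScaler(),
        LogisticRegression(random_state=1, class_weight=cw, max_iter=1000))
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        # Recall on the minority (malignant) class, label 0
        'recall_malignant': recall_score(y_test, y_pred, pos_label=0),
        # Rows are true labels, columns are predicted labels, in the order [0, 1]
        'confusion': confusion_matrix(y_test, y_pred, labels=[0, 1]),
    }
    print(name, 'recall on malignant class: %0.3f' % results[name]['recall_malignant'])
    print(results[name]['confusion'])
```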

Conclusions

Here is what you learned about handling class imbalance in an imbalanced dataset using class_weight:

  • An imbalanced classification problem occurs when the classes in the dataset have a highly unequal number of samples.
  • Class imbalance means that the number of data samples in one of the classes is very low compared to the other classes.
  • One common technique is to set class_weight='balanced' when creating an instance of the algorithm.
  • Another technique is to assign different weights to different class labels using syntax such as class_weight={0:2, 1:1}. Class 0 is assigned a weight of 2 and class 1 is assigned a weight of 1.
  • Other popular techniques for handling class imbalance in machine learning classification problems include under-sampling the majority class, over-sampling the minority class and generating synthetic training examples.
  • Python packages such as Imbalanced Learn can be used to implement techniques such as SMOTE, under-sampling of the majority class and over-sampling of the minority class.
Ajitesh Kumar