How to deal with Class Imbalance in Python

In this post, you will learn how to deal with class imbalance by adjusting class weights when solving a machine learning classification problem. This is illustrated using a Python Sklearn code example.

What is Class Imbalance?

Class imbalance refers to a problem in machine learning where the classes in the data are not equally represented. For example, if there are 100 data points and 90 of them belong to Class A and 10 belong to Class B, then the classes are imbalanced. A model trained on such data tends to be biased towards the more common class: because it sees far more examples of the majority class, it is more likely to learn and predict that class. As a result, the model can look accurate overall while performing poorly on the minority class, which is often the class of most interest.

Class imbalance is one of the most common problems in classification tasks in domains such as healthcare and banking (fraud detection). For example, if you want to build a model that classifies a transaction as fraudulent or not, the dataset will be highly imbalanced because fraudulent transactions are rare. Building a high-performing model therefore requires addressing the highly skewed class distribution; such problems are referred to as imbalanced classification problems. Class imbalance can be difficult to overcome, but being aware of it allows you to take steps to mitigate its effects. Here is what class imbalance would look like:

 

Fig 1. Representation of Class Imbalance
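
To make the effect of a skewed distribution concrete, here is a minimal sketch using a 90/10 split. The arrays y_true and y_pred are made-up labels for illustration, not taken from a real dataset. A "model" that always predicts the majority class reaches 90% accuracy while never detecting a single minority sample.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 90 samples of class 0 (majority), 10 of class 1 (minority)
y_true = np.array([0] * 90 + [1] * 10)

# A "model" that ignores its input and always predicts the majority class
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
minority_recall = recall_score(y_true, y_pred, pos_label=1)

print("Class counts:", np.bincount(y_true))           # [90 10]
print("Accuracy:", accuracy)                           # 0.9
print("Recall on minority class:", minority_recall)    # 0.0
```

This is why accuracy on its own can be a misleading measure for imbalanced problems.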

Several techniques can be used to handle class imbalance when training machine learning models on imbalanced datasets:

  • Using class weight: Class weighting is a common method for addressing class imbalance. It adjusts the cost function of the model so that misclassifying an observation from the minority class is penalized more heavily than misclassifying an observation from the majority class. This can improve the model's performance on the minority class without changing the data itself. However, class weighting does not create new data points and cannot compensate for a lack of data, so it is sometimes combined with other methods such as oversampling.
  • Under-sampling the majority class: Under-sampling randomly removes samples from the majority class until the class distribution is more balanced; this is also called random undersampling. It should be applied only to the training set, after the data has been split into train and test sets, so that the test set keeps the original class distribution. Because under-sampling discards data, it can hurt model performance when the minority class is very small and most of the majority samples have to be thrown away. It is therefore important to consider carefully whether under-sampling is the right approach for your dataset.
  • Over-sampling the minority class: Over-sampling randomly selects samples from the minority class and replicates them until the classes are balanced, so that all classes are equally represented in the training data. The model then sees minority examples as often as majority ones without any data being discarded, which makes the technique attractive when the dataset is small. Because the added samples are exact copies, however, over-sampling adds no new information and can increase the risk of overfitting to the minority samples. As with under-sampling, it should be applied only to the training set.
  • Generation of synthetic training examples: One of the most widely used algorithms for generating synthetic training examples is the Synthetic Minority Over-sampling Technique (SMOTE). SMOTE creates new minority-class samples by interpolating between an existing minority sample and one of its nearest minority-class neighbors. The synthetic samples are then added to the training set to make it more balanced. Because the new samples are not exact copies, SMOTE can reduce the overfitting associated with plain replication, although it may also produce unrealistic samples where the classes overlap.
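
Random under-sampling and over-sampling can be sketched with the resample helper from sklearn.utils. This is a minimal illustration on made-up toy data (the arrays X and y below are hypothetical), not a full resampling pipeline.

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.RandomState(1)
# Hypothetical toy data: 90 majority samples (label 0), 10 minority samples (label 1)
X = rng.normal(size=(100, 2))
y = np.array([0] * 90 + [1] * 10)

X_maj, X_min = X[y == 0], X[y == 1]

# Random under-sampling: keep only as many majority samples as there are minority samples
X_maj_down = resample(X_maj, replace=False, n_samples=len(X_min), random_state=1)
X_under = np.vstack((X_maj_down, X_min))
y_under = np.hstack((np.zeros(len(X_maj_down), dtype=int), np.ones(len(X_min), dtype=int)))

# Random over-sampling: draw minority samples with replacement until they match the majority
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=1)
X_over = np.vstack((X_maj, X_min_up))
y_over = np.hstack((np.zeros(len(X_maj), dtype=int), np.ones(len(X_min_up), dtype=int)))

print("Under-sampled counts:", np.bincount(y_under))  # [10 10]
print("Over-sampled counts:", np.bincount(y_over))    # [90 90]
```

In a real project, resampling like this would normally be applied to the training split only, so that the test set still reflects the original class distribution.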

Python packages such as imbalanced-learn (imported as imblearn) can be used to apply under-sampling of the majority class, over-sampling of minority classes, and SMOTE. In this post, class weights will be used for tackling class imbalance.
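
To give a feel for how synthetic examples are generated, here is a simplified SMOTE-style sketch written with NumPy and Sklearn's NearestNeighbors. It is an illustration of the interpolation idea only, not the imbalanced-learn implementation; the function name smote_like_samples and the toy data are assumptions made for this example.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_like_samples(X_min, n_new, k=5, random_state=1):
    """Generate n_new synthetic points by interpolating between minority neighbors."""
    rng = np.random.RandomState(random_state)
    # Ask for k + 1 neighbors because each point's nearest neighbor is itself
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neighbor_idx = nn.kneighbors(X_min, return_distance=False)[:, 1:]

    base = rng.randint(0, len(X_min), size=n_new)               # random minority samples
    picked = neighbor_idx[base, rng.randint(0, k, size=n_new)]  # one neighbor for each
    gap = rng.uniform(0, 1, size=(n_new, 1))                    # position along the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Hypothetical toy minority class: 10 points in 2 dimensions
X_min = np.random.RandomState(0).normal(size=(10, 2))
X_new = smote_like_samples(X_min, n_new=80)
print(X_new.shape)  # (80, 2)
```

In practice, the SMOTE class from imbalanced-learn (imblearn.over_sampling.SMOTE, used through its fit_resample method) is the usual choice rather than a hand-written version like this.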

How to create a Sample Dataset having Class Imbalance?

In this section, you will learn how to create an imbalanced dataset (imbalanced class distribution) using the Sklearn breast cancer dataset. The original class distribution is 212 samples for the malignant class and 357 for the benign class. An imbalanced version of it could look like the following:

Benign class – 357 
Malignant class – 30

This is how you could create the above mentioned imbalanced class distribution using Python Sklearn and Numpy:

from sklearn import datasets
import numpy as np

bc = datasets.load_breast_cancer()
X = bc.data
y = bc.target

X_imb = np.vstack((X[y == 1], X[y == 0][:30]))
y_imb = np.hstack((y[y == 1], y[y == 0][:30]))

The above code creates a new Numpy array by appending 30 records vertically (numpy vstack method) whose label is 0 (malignant) to the 357 records whose label is 1 (benign) taking the total record count to 387. Similarly, it appends 30 malignant labels to 357 benign labels horizontally (numpy hstack method).
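
As a quick sanity check, the class counts and array shapes can be verified with np.bincount. The snippet repeats the construction above so that it runs on its own.

```python
import numpy as np
from sklearn import datasets

bc = datasets.load_breast_cancer()
X, y = bc.data, bc.target

# Same construction as above: all benign samples (label 1) plus the first 30 malignant (label 0)
X_imb = np.vstack((X[y == 1], X[y == 0][:30]))
y_imb = np.hstack((y[y == 1], y[y == 0][:30]))

print("Target names:", bc.target_names.tolist())   # ['malignant', 'benign']
print("Original counts:", np.bincount(y))          # [212 357]
print("Imbalanced counts:", np.bincount(y_imb))    # [ 30 357]
print("Shapes:", X_imb.shape, y_imb.shape)         # (387, 30) (387,)
```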

Handling Class Imbalance using Class Weight – Python Example

In this section, you will learn how to handle class imbalance while training models using Python Sklearn code. Many Sklearn classifiers, including LogisticRegression, SVC, DecisionTreeClassifier and RandomForestClassifier, accept a parameter named class_weight, although not every classifier does. The value passed to this parameter determines how class imbalance is handled. By default, when no value is passed (class_weight=None), every class is assigned the same weight of 1. To handle class imbalance, the following values can be passed:

  • balanced: Passing class_weight='balanced' makes the estimator use the values of y (the labels) to automatically adjust weights inversely proportional to class frequencies in the input data. The weights are calculated as n_samples / (n_classes * np.bincount(y)).
  • {class_label: weight}: Let’s say, there are two classes labeled as 0 and 1. Passing input to class_weight as class_weight={0:2, 1:1} means class 0 has weight 2 and class 1 has weight 1.
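
The 'balanced' formula can be worked through by hand for the imbalanced breast cancer dataset (30 samples of class 0 and 357 of class 1). The sketch below builds a label array with those counts and compares the manual calculation with Sklearn's compute_class_weight helper.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Labels with the same counts as the imbalanced dataset: 30 of class 0, 357 of class 1
y = np.array([0] * 30 + [1] * 357)

# Manual calculation of the 'balanced' heuristic
n_samples, n_classes = len(y), len(np.unique(y))
manual = n_samples / (n_classes * np.bincount(y))

# The same weights computed by scikit-learn's helper
weights = compute_class_weight(class_weight="balanced", classes=np.unique(y), y=y)

print(manual)   # approximately [6.45, 0.542]
print(weights)  # same values as above
```

Here 387 / (2 * 30) = 6.45 for the minority class and 387 / (2 * 357) is roughly 0.542 for the majority class, so an error on a minority sample costs about twelve times as much as an error on a majority sample.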

In the code sample given below, class_weight in the {class_label: weight} format is illustrated. Class 0 (malignant, the minority class) is given a weight of 3 and class 1 (benign) a weight of 1:

from scipy import stats
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

#
# Create training and test split out of imbalanced data set created above
#
X_train, X_test, y_train, y_test = train_test_split(X_imb, y_imb, test_size=0.3, random_state=1, stratify=y_imb)
#
# Create pipeline with LogisticRegression and class_weight as {0:3, 1:1}
#
pipeline = make_pipeline(StandardScaler(), LogisticRegression(random_state=1, class_weight={0:3, 1:1}))
#
# Create a randomized search over the regularization parameter C
#
param_distributions = [{'logisticregression__C': stats.expon(scale=100)}]
rs = RandomizedSearchCV(estimator=pipeline, param_distributions=param_distributions, cv=10, scoring='accuracy', refit=True, n_jobs=1, random_state=1)
#
# Fit the model
#
rs.fit(X_train, y_train)
#
# Find the best score, params and accuracy on the test dataset
#
print('Best Score:', rs.best_score_, '\nBest Params:', rs.best_params_)
print('Test Accuracy: %0.3f' % rs.score(X_test, y_test))
Fig 2. Model Accuracy on Test Data
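
Accuracy alone can hide poor performance on the minority class, so it may be worth looking at the confusion matrix and the recall of the malignant class as well. The following is a minimal sketch, not part of the original experiment, that compares three class_weight settings on the same imbalanced data. The loop and the results dictionary are illustrative choices; actual numbers depend on your Sklearn version and are not shown here.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Rebuild the imbalanced dataset so that this snippet runs on its own
bc = datasets.load_breast_cancer()
X, y = bc.data, bc.target
X_imb = np.vstack((X[y == 1], X[y == 0][:30]))
y_imb = np.hstack((y[y == 1], y[y == 0][:30]))

X_train, X_test, y_train, y_test = train_test_split(
    X_imb, y_imb, test_size=0.3, random_state=1, stratify=y_imb)

results = {}
for cw in (None, "balanced", {0: 3, 1: 1}):
    model = make_pipeline(StandardScaler(),
                          LogisticRegression(random_state=1, class_weight=cw))
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Rows are true classes, columns are predicted classes (0 = malignant, 1 = benign)
    cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
    malignant_recall = recall_score(y_test, y_pred, pos_label=0)
    results[str(cw)] = (cm, malignant_recall)
    print("class_weight =", cw)
    print(cm)
    print("Malignant recall: %0.3f" % malignant_recall)
```

If recall on the minority class matters more than overall accuracy, the scoring argument of RandomizedSearchCV could also be changed to a metric such as 'balanced_accuracy' or 'f1'.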

Conclusions

Here is what you learned about handling class imbalance in an imbalanced dataset using class_weight:

  • An imbalanced classification problem occurs when the classes in the dataset have a highly unequal number of samples.
  • Class imbalance means the count of data samples related to one of the classes is very low in comparison to other classes.
  • One of the common techniques is to assign class_weight='balanced' when creating an instance of the algorithm.
  • Another technique is to assign different weights to different class labels using syntax such as class_weight={0:2, 1:1}. Class 0 is assigned a weight of 2 and class 1 is assigned a weight of 1.
  • Other popular techniques for handling class imbalance in machine learning classification problems include undersampling of the majority class, oversampling of the minority class, and generating synthetic training examples (SMOTE).
  • Python packages such as Imbalanced Learn can be used to implement techniques such as SMOTE, undersampling of majority class, and oversampling of the minority class.
Ajitesh Kumar
