Handling Class Imbalance using Sklearn Resample

In this post, you will learn how to tackle class imbalance when training machine learning classification models on an imbalanced dataset. The approach is illustrated with a Python Sklearn example. In the same context, you may check out my earlier post on handling class imbalance using class_weight. As a data scientist, it is worth learning these techniques, because you will often come across the class imbalance problem while working on classification problems.

Here is how the class imbalance in the dataset can be visualized:

Fig 1. Class imbalance in the data set

Before looking at the Python code example for the sklearn.utils resample method, let's create an imbalanced dataset. We will build it from the Sklearn breast cancer dataset by keeping all benign records (label 1) and only the first 30 malignant records (label 0). Here is the code sample:

from sklearn import datasets
import numpy as np
bc = datasets.load_breast_cancer()
X = bc.data
y = bc.target
X_imbalanced = np.vstack((X[y == 1], X[y == 0][:30]))
y_imbalanced = np.hstack((y[y == 1], y[y == 0][:30]))

The code creates an imbalanced dataset in which the 212 records labeled as malignant are reduced to 30. Thus, the total record count becomes 387: benign tumour (357) + malignant tumour (30).
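As a quick sanity check, the class counts of the imbalanced dataset can be printed. The snippet below is a minimal sketch that rebuilds the dataset so it runs on its own; the variable name class_counts is introduced here for illustration only.

```python
import numpy as np
from sklearn import datasets

bc = datasets.load_breast_cancer()
X, y = bc.data, bc.target

# Keep every benign record (label 1) and only the first 30 malignant records (label 0)
X_imbalanced = np.vstack((X[y == 1], X[y == 0][:30]))
y_imbalanced = np.hstack((y[y == 1], y[y == 0][:30]))

# bc.target_names maps label 0 to 'malignant' and label 1 to 'benign'
class_counts = np.bincount(y_imbalanced)
for label, count in enumerate(class_counts):
    print(bc.target_names[label], count)
```

This should report 30 malignant and 357 benign records, matching the counts described above.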

The next step is to use the resample method to oversample the minority class (malignant tumour records in this example) or to undersample the majority class (benign tumour records).

Resample method for Over Sampling Minority Class

The idea is to oversample the minority class records by sampling with replacement. Two parameters matter here: replace, which turns sampling with replacement on or off, and n_samples, which sets the number of samples the minority class will be oversampled to. In addition, you can pass stratify to draw the sample in a stratified fashion. Once the sampling is done, the balanced dataset is created by appending the oversampled records to the majority class records. Here is the code covering the following aspects:

  • Oversampling of minority class
  • Creating balanced data set by appending the oversampled dataset
import numpy as np
from sklearn.utils import resample
# Create oversampled training data set for minority class (label 0, malignant).
# replace=True samples with replacement; random_state is fixed only for reproducibility.
X_oversampled, y_oversampled = resample(X_imbalanced[y_imbalanced == 0], 
                                        y_imbalanced[y_imbalanced == 0],
                                        replace=True,
                                        n_samples=X_imbalanced[y_imbalanced == 1].shape[0],
                                        random_state=1)
# Append the oversampled minority class to the majority class records and related labels
X_balanced = np.vstack((X_imbalanced[y_imbalanced == 1], X_oversampled))
y_balanced = np.hstack((y_imbalanced[y_imbalanced == 1], y_oversampled))
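To make the effect of replace, n_samples and stratify easier to see, here is a small sketch on toy data. The arrays X_toy and y_toy are hypothetical and exist only for this illustration; the calls themselves use sklearn.utils.resample as described above.

```python
import numpy as np
from sklearn.utils import resample

# Hypothetical toy data: 8 majority rows (label 1) and 2 minority rows (label 0)
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([1] * 8 + [0] * 2)

minority_mask = y_toy == 0
n_majority = int((~minority_mask).sum())

# replace=True draws with replacement, so the 2 minority rows are repeated to reach 8 rows
X_min_over, y_min_over = resample(
    X_toy[minority_mask], y_toy[minority_mask],
    replace=True, n_samples=n_majority, random_state=0,
)

# Append the oversampled minority rows to the untouched majority rows
X_toy_balanced = np.vstack((X_toy[~minority_mask], X_min_over))
y_toy_balanced = np.hstack((y_toy[~minority_mask], y_min_over))
print(np.bincount(y_toy_balanced))  # both classes now have 8 rows

# stratify=y_toy draws a sample that follows the class proportions of y_toy
X_strat, y_strat = resample(
    X_toy, y_toy, replace=False, n_samples=5, stratify=y_toy, random_state=0,
)
print(np.bincount(y_strat))
```

Note that stratify preserves the original class ratio, so on its own it does not balance the classes; balancing comes from resampling the minority class separately, as in the first call.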

Once the balanced dataset is created by oversampling the minority class, model training is carried out in the usual manner. Here is the rest of the code for training. The code below uses the following steps for training and scoring the model:

  • Create the training and test split
  • Create the pipeline
  • Create a randomized search (RandomizedSearchCV) for model tuning
  • Fit the RandomizedSearchCV estimator
  • Score the model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import RandomizedSearchCV
from scipy import stats
# Create training and test split using the balanced dataset 
# created by oversampling
X_train, X_test, y_train, y_test = train_test_split(X_balanced, y_balanced, test_size=0.3, 
                                                    random_state=1, stratify=y_balanced)
# Create the pipeline
pipeline = make_pipeline(StandardScaler(), LogisticRegression(random_state=1))
# Create the randomized search estimator; C is sampled from an exponential distribution
param_distributions = [{'logisticregression__C': stats.expon(scale=100)}]
# random_state is fixed so that the sampled C values are reproducible
rs = RandomizedSearchCV(estimator=pipeline, param_distributions=param_distributions, 
                        cv=10, scoring='accuracy', refit=True, n_jobs=1,
                        random_state=1)
# Fit the model
rs.fit(X_train, y_train)
# Score the model
print('Best Score:', rs.best_score_, '\nBest Params:', rs.best_params_)
print('Test Accuracy: %0.3f' % rs.score(X_test, y_test))
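One variation worth knowing, which is not part of the walkthrough above, is to split the imbalanced data first and oversample only the training portion, so that the test set contains original records only. The snippet below is a minimal sketch of that variation under the same dataset assumptions; the names X_imb, X_tr_bal, model and minority_recall are introduced here for illustration, and the randomized search is left out to keep it short.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

# Rebuild the imbalanced dataset: 357 benign (label 1) + 30 malignant (label 0)
bc = datasets.load_breast_cancer()
X, y = bc.data, bc.target
X_imb = np.vstack((X[y == 1], X[y == 0][:30]))
y_imb = np.hstack((y[y == 1], y[y == 0][:30]))

# Split first, so the test set holds only original (non-resampled) records
X_tr, X_te, y_tr, y_te = train_test_split(
    X_imb, y_imb, test_size=0.3, random_state=1, stratify=y_imb)

# Oversample the minority class (label 0) inside the training portion only
n_majority_tr = int((y_tr == 1).sum())
X_min, y_min = resample(X_tr[y_tr == 0], y_tr[y_tr == 0],
                        replace=True, n_samples=n_majority_tr, random_state=1)
X_tr_bal = np.vstack((X_tr[y_tr == 1], X_min))
y_tr_bal = np.hstack((y_tr[y_tr == 1], y_min))

# Train the same kind of pipeline on the balanced training data
model = make_pipeline(StandardScaler(), LogisticRegression(random_state=1))
model.fit(X_tr_bal, y_tr_bal)

# Recall on label 0 shows the share of malignant test records that are detected
minority_recall = recall_score(y_te, model.predict(X_te), pos_label=0)
print('Train class counts:', np.bincount(y_tr_bal))
print('Test class counts:', np.bincount(y_te))
print('Minority recall: %0.3f' % minority_recall)
```

With this ordering the test set stays imbalanced, so a per-class metric such as recall on the minority class is usually more informative than accuracy alone.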

Resample method for Under Sampling Majority Class

In this section, you will learn how to use the resample method to undersample the majority class. In the code below, the majority class (label 1) is downsampled to 30 records, the size of the minority class, using the parameter n_samples=X_imbalanced[y_imbalanced == 0].shape[0].

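# Undersample the majority class (label 1) down to the minority class size.
# replace=False keeps each selected record unique; random_state is fixed only for reproducibility.
X_undersampled, y_undersampled = resample(X_imbalanced[y_imbalanced == 1], y_imbalanced[y_imbalanced == 1],
                replace=False,
                n_samples=X_imbalanced[y_imbalanced == 0].shape[0],
                random_state=1)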

Once that is done, the new balanced data set is created and then split into training and test sets using the following code.

# Create balanced training / test data set using undersampled majority class records
X_balanced = np.vstack((X_imbalanced[y_imbalanced == 0], X_undersampled))
y_balanced = np.hstack((y_imbalanced[y_imbalanced == 0], y_undersampled))
# Create training and test data split
X_train, X_test, y_train, y_test = train_test_split(X_balanced, y_balanced, test_size=0.3, 
                                                    random_state=1, stratify=y_balanced)
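The undersampling steps above can also be wrapped in a small reusable function. The sketch below assumes a binary problem; undersample_majority is a hypothetical helper name, not part of Sklearn, and the data is randomly generated purely for illustration.

```python
import numpy as np
from sklearn.utils import resample

def undersample_majority(X, y, majority_label, random_state=None):
    """Hypothetical helper: shrink the majority class to the minority class size.

    Assumes a binary problem, so every row not in the majority class
    belongs to the minority class.
    """
    majority_mask = y == majority_label
    n_minority = int((~majority_mask).sum())
    # replace=False samples without replacement, so no majority row is repeated
    X_maj, y_maj = resample(X[majority_mask], y[majority_mask],
                            replace=False, n_samples=n_minority,
                            random_state=random_state)
    X_out = np.vstack((X[~majority_mask], X_maj))
    y_out = np.hstack((y[~majority_mask], y_maj))
    return X_out, y_out

# Synthetic data with the same class sizes as the example: 357 vs 30 records
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(387, 4))
y_demo = np.array([1] * 357 + [0] * 30)

X_under, y_under = undersample_majority(X_demo, y_demo, majority_label=1, random_state=1)
print(np.bincount(y_under))  # 30 records per class
```

Undersampling discards most of the majority class records, so it tends to suit cases where the majority class is large enough that the remaining sample is still representative.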

The above can be followed by the usual code for training and scoring the model.


Here is what you learned about using the sklearn.utils resample method for creating a balanced data set from an imbalanced dataset:

  • The sklearn.utils resample method can be used to tackle class imbalance in an imbalanced dataset.
  • The sklearn.utils resample method can be used both to undersample the majority class records and to oversample the minority class records.
Ajitesh Kumar