
In this post, you will learn how to tackle class imbalance when training machine learning classification models on an imbalanced dataset. The approach is illustrated with a Python scikit-learn (sklearn) example. In the same context, you may check out my earlier post on handling class imbalance using class_weight. As a data scientist, it is important to learn these techniques, as you will often come across the class imbalance problem while working on classification problems.
[Figure: class imbalance in the dataset]

Before looking at the Python code example for the sklearn.utils resample method, let's create a dataset with class imbalance. We will build it from the sklearn breast cancer dataset by keeping all benign records (label 1) and only the first 30 malignant records (label 0). Here is the code sample:
from sklearn import datasets
import numpy as np
bc = datasets.load_breast_cancer()
X = bc.data
y = bc.target
X_imbalanced = np.vstack((X[y == 1], X[y == 0][:30]))
y_imbalanced = np.hstack((y[y == 1], y[y == 0][:30]))
The code creates an imbalanced dataset in which the 212 records labeled as malignant are reduced to 30. Thus, the total record count becomes benign tumour (357) + malignant tumour (30) = 387.
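As a quick sanity check, the class counts described above can be confirmed with np.bincount. This is a minimal sketch that rebuilds the same arrays so it runs on its own; the name class_counts is introduced here for illustration only.

```python
import numpy as np
from sklearn import datasets

bc = datasets.load_breast_cancer()
X, y = bc.data, bc.target

# Keep all benign records (label 1) and only the first 30 malignant records (label 0)
X_imbalanced = np.vstack((X[y == 1], X[y == 0][:30]))
y_imbalanced = np.hstack((y[y == 1], y[y == 0][:30]))

# np.bincount counts records per integer label:
# index 0 is malignant, index 1 is benign (same order as bc.target_names)
class_counts = np.bincount(y_imbalanced)
for name, count in zip(bc.target_names, class_counts):
    print(name, int(count))
```

The counts should match the 30 malignant and 357 benign records mentioned above.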
The next step is to use the resample method either to oversample the minority class (malignant tumour records in this example) or to undersample the majority class (benign tumour records).
Resample method for Over Sampling Minority Class
The idea is to oversample the minority class by sampling with replacement. Two parameters matter here: replace, which must be True so that records can be drawn more than once, and n_samples, the number of samples to which the minority class is oversampled (here, the size of the majority class). In addition, you can use stratify to draw the sample in a stratified fashion. Once the sampling is done, the balanced dataset is created by appending the oversampled records to the majority class records. Here is the code covering the following aspects:
- Oversampling of minority class
- Creating balanced data set by appending the oversampled dataset
import numpy as np
from sklearn.utils import resample
#
# Create oversampled training data set for minority class
#
X_oversampled, y_oversampled = resample(X_imbalanced[y_imbalanced == 0],
y_imbalanced[y_imbalanced == 0],
replace=True,
n_samples=X_imbalanced[y_imbalanced == 1].shape[0],
random_state=123)
#
# Append the oversampled minority class to training data and related labels
#
X_balanced = np.vstack((X_imbalanced[y_imbalanced == 1], X_oversampled))
y_balanced = np.hstack((y_imbalanced[y_imbalanced == 1], y_oversampled))
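To see what the replace, n_samples, and stratify parameters do in isolation, here is a small sketch on a toy array. The toy data and the variable names (X_toy, y_toy, and so on) are made up for illustration and are not part of the breast cancer example.

```python
import numpy as np
from sklearn.utils import resample

# Toy data: 8 rows of class 1 (majority) and 2 rows of class 0 (minority)
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([1] * 8 + [0] * 2)

# replace=True allows n_samples to exceed the rows available (oversampling),
# so the 2 minority rows are drawn repeatedly until there are 8 samples
X_min_over = resample(X_toy[y_toy == 0], replace=True, n_samples=8, random_state=0)

# replace=False draws distinct rows, so n_samples cannot exceed the rows available
X_maj_under = resample(X_toy[y_toy == 1], replace=False, n_samples=2, random_state=0)

# stratify keeps the class proportions of the given labels in the drawn sample
X_strat, y_strat = resample(X_toy, y_toy, replace=False, n_samples=5,
                            stratify=y_toy, random_state=0)

print(X_min_over.shape, X_maj_under.shape, np.bincount(y_strat))
```

With replace=True the oversampled array can only contain copies of the two original minority rows, which is why oversampling adds no new information, only weight.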
Once the balanced dataset is created by oversampling the minority class, model training is carried out in the usual manner. Here is the rest of the code for training. The code below uses the following steps for training and scoring the model:
- Create the training and test split
- Create the pipeline
- Create a randomized search (RandomizedSearchCV) for model tuning
- Fit the RandomizedSearchCV estimator
- Score the model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV
import scipy as sc
import scipy.stats  # load the stats submodule explicitly so that sc.stats is available
#
# Create training and test split using the balanced dataset
# created by oversampling
#
X_train, X_test, y_train, y_test = train_test_split(X_balanced, y_balanced, test_size=0.3,
random_state=1, stratify=y_balanced)
#
# Create the pipeline
#
pipeline = make_pipeline(StandardScaler(), LogisticRegression(random_state=1))
#
# Create the randomized search estimator
#
param_distributions = [{'logisticregression__C': sc.stats.expon(scale=100)}]
rs = RandomizedSearchCV(estimator=pipeline, param_distributions = param_distributions,
cv = 10, scoring = 'accuracy', refit = True, n_jobs = 1,
random_state=1)
#
# Fit the model
#
rs.fit(X_train, y_train)
#
# Score the model
#
print('Best Score:', rs.best_score_, '\nBest Params:', rs.best_params_)
print('Test Accuracy: %0.3f' % rs.score(X_test, y_test))
Resample method for Under Sampling Majority Class
In this section, you will learn how to use the resample method to undersample the majority class. In the code below, the majority class (label 1) is downsampled to 30 records, the size of the minority class, using the parameter n_samples=X_imbalanced[y_imbalanced == 0].shape[0]. Here is the code:
X_undersampled, y_undersampled = resample(X_imbalanced[y_imbalanced == 1], y_imbalanced[y_imbalanced == 1],
replace=False,  # sample without replacement so that no majority record is duplicated
n_samples=X_imbalanced[y_imbalanced == 0].shape[0],
random_state=123)
Once that is done, the new balanced dataset is created and then split into training and test sets using the following code.
#
# Create balanced training / test data set using undersampled majority class records
#
X_balanced = np.vstack((X_imbalanced[y_imbalanced == 0], X_undersampled))
y_balanced = np.hstack((y_imbalanced[y_imbalanced == 0], y_undersampled))
#
# Create training and test data split
#
X_train, X_test, y_train, y_test = train_test_split(X_balanced, y_balanced, test_size=0.3,
random_state=1, stratify=y_balanced)
The above can be followed by the usual code for training and scoring the model.
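For completeness, here is a compact, self-contained sketch of the undersampling route from start to finish. It assumes sampling without replacement and, to keep the example short, fits a plain StandardScaler + LogisticRegression pipeline instead of the randomized search used earlier; the name test_accuracy is introduced here for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

# Rebuild the imbalanced dataset: 357 benign (1) and 30 malignant (0) records
bc = datasets.load_breast_cancer()
X, y = bc.data, bc.target
X_imbalanced = np.vstack((X[y == 1], X[y == 0][:30]))
y_imbalanced = np.hstack((y[y == 1], y[y == 0][:30]))

# Undersample the majority class down to the minority class size
n_minority = X_imbalanced[y_imbalanced == 0].shape[0]
X_undersampled, y_undersampled = resample(X_imbalanced[y_imbalanced == 1],
                                          y_imbalanced[y_imbalanced == 1],
                                          replace=False,
                                          n_samples=n_minority,
                                          random_state=123)

# Combine minority records with the undersampled majority records
X_balanced = np.vstack((X_imbalanced[y_imbalanced == 0], X_undersampled))
y_balanced = np.hstack((y_imbalanced[y_imbalanced == 0], y_undersampled))

# Split, fit and score
X_train, X_test, y_train, y_test = train_test_split(X_balanced, y_balanced, test_size=0.3,
                                                    random_state=1, stratify=y_balanced)
pipeline = make_pipeline(StandardScaler(), LogisticRegression(random_state=1))
pipeline.fit(X_train, y_train)
test_accuracy = pipeline.score(X_test, y_test)
print('Test Accuracy: %0.3f' % test_accuracy)
```

Keep in mind that the balanced dataset has only 60 records here, so the test score rests on a small number of samples and will vary with the random seed.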
Conclusions
Here is what you learned about using the sklearn.utils resample method to create a balanced dataset from an imbalanced one:
- The sklearn.utils resample method can be used to tackle class imbalance in a dataset.
- It can be used both to undersample the majority class records and to oversample the minority class records.
Note that upsampling should be applied only to the training set, that is, after the train/test split. Otherwise, copies of the same record may appear in both the training and test datasets and inflate the test score.
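The point about resampling only the training data can be sketched as follows: split the imbalanced data first, then oversample the minority class within the training portion only. This is one possible arrangement under the same assumptions as the earlier code, not the only way to do it; names such as X_train_balanced are introduced here for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Rebuild the imbalanced dataset: 357 benign (1) and 30 malignant (0) records
bc = datasets.load_breast_cancer()
X, y = bc.data, bc.target
X_imbalanced = np.vstack((X[y == 1], X[y == 0][:30]))
y_imbalanced = np.hstack((y[y == 1], y[y == 0][:30]))

# Split first, so the test set contains only original, untouched records
X_train, X_test, y_train, y_test = train_test_split(X_imbalanced, y_imbalanced,
                                                    test_size=0.3, random_state=1,
                                                    stratify=y_imbalanced)

# Oversample the minority class using training records only
is_minority = y_train == 0
n_majority = int(np.sum(~is_minority))
X_min_over, y_min_over = resample(X_train[is_minority], y_train[is_minority],
                                  replace=True,
                                  n_samples=n_majority,
                                  random_state=123)

# Balanced training set; the test set keeps its original class ratio
X_train_balanced = np.vstack((X_train[~is_minority], X_min_over))
y_train_balanced = np.hstack((y_train[~is_minority], y_min_over))

print(np.bincount(y_train_balanced), np.bincount(y_test))
```

The model is then fitted on X_train_balanced and y_train_balanced and scored on the untouched X_test and y_test. Because the test set stays imbalanced, accuracy alone can look high even for a weak model.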