Python – Replace Missing Values with Mean, Median & Mode

In this post, you will learn how to impute (replace) missing values with the mean, median, or mode in one or more numeric feature columns of a Pandas DataFrame while building machine learning (ML) models with Python. You will also learn how to decide which of these central tendency measures to use for a given feature column. This matters for data scientists because handling missing values is one of the key aspects of data preprocessing when training ML models.

The dataset used for illustration is related to campus recruitment and is taken from the Kaggle page on Campus Recruitment. As a first step, the dataset is loaded. Here is the Python code for loading the dataset once you have downloaded it to your system.

import pandas as pd
import numpy as np

df = pd.read_csv("/Users/ajitesh/Downloads/Placement_Data_Full_Class.csv")

df.head()

Here is what the data looks like. Make a note of the NaN value under the salary column.

Fig 1. Placement dataset for handling missing values using mean, median or mode

Missing values are handled using different imputation techniques, which estimate the missing values from the other training examples. In the above dataset, the missing values are found in the salary column. A command such as df.isnull().sum() prints the number of missing values in each column. The missing values in the salary column in the above example can be replaced using the following techniques:

  • Mean value of other salary values
  • Median value of other salary values
  • Mode (most frequent) value of other salary values.
  • Constant value
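Before choosing any of these techniques, it helps to confirm where the missing values actually are. The following is a minimal sketch on a small, made-up DataFrame (the toy name and its values are assumptions for illustration, not the Kaggle data):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the placement data; values are made up.
toy = pd.DataFrame({
    "status": ["Placed", "Placed", "Not Placed", "Placed"],
    "salary": [270000.0, 200000.0, np.nan, 250000.0],
})

# Number of missing values in each column
missing_counts = toy.isnull().sum()
print(missing_counts)

# Names of the columns that contain at least one missing value
columns_with_missing = missing_counts[missing_counts > 0].index.tolist()
print(columns_with_missing)
```

On the real dataset, the same two calls can be run on df instead of toy.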

In this post, the fillna() method of the DataFrame is used for imputing missing values with the mean, median, mode, or a constant value. However, you may also want to check out the related post titled imputing missing data using Sklearn SimpleImputer, wherein sklearn.impute.SimpleImputer is used for the same purpose. The SimpleImputer class provides basic strategies for imputing missing values: they can be imputed with a provided constant value, or using the statistics (mean, median, or most frequent) of each column in which the missing values are located. You may also want to check out the Scikit-learn article, Imputation of missing values.
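For comparison with fillna(), here is a small sketch of the SimpleImputer route. It assumes scikit-learn is installed, and the toy DataFrame and the salary_imputed column name are made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical salary column with two missing entries
toy = pd.DataFrame({"salary": [200000.0, np.nan, 250000.0, 300000.0, np.nan]})

# strategy can be "mean", "median", "most_frequent" or "constant"
imputer = SimpleImputer(missing_values=np.nan, strategy="median")

# fit_transform expects 2-D input, hence the double brackets;
# the result is flattened back to one dimension before assignment
imputed = imputer.fit_transform(toy[["salary"]])
toy["salary_imputed"] = np.asarray(imputed).ravel()

# The value learned for each input column is stored in statistics_
print(imputer.statistics_)
print(toy)
```

One reason to prefer this route in a modelling workflow is that the fitted imputer can be reused to transform new data with the statistics learned from the training data.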

How to decide which imputation technique to use?

One of the key points is to decide which of the above-mentioned imputation techniques gives the most suitable replacement for the missing values. In this post, central tendency measures such as the mean, median, or mode are considered for imputation. The goal is to find out which is the better measure of the central tendency of the data and use that value to replace the missing values.

Plots such as box plots and distribution plots come in very handy when deciding which technique to use. You can use the following code to draw both kinds of plot.

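import seaborn as sns
import matplotlib.pyplot as plt
#
# Box plot (pass the column by keyword so it is drawn along the x-axis)
#
sns.boxplot(x=df.salary)
plt.show()
#
# Distribution plot (histplot replaces the deprecated distplot)
#
sns.histplot(df.salary, kde=True)
plt.show()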

Here is what the box plot looks like. You may note that the data is skewed: a large number of data points act as outliers. Outliers have a significant impact on the mean, so in such cases it is not recommended to use the mean for replacing the missing values; doing so may not produce a great model, and hence the mean gets ruled out. For a symmetric data distribution, one can use the mean value for imputing missing values.

Thus, one may want to use either median or mode. Here is a great page on understanding boxplots.

Fig 2. Boxplot for deciding whether to use mean, mode or median for imputation

A similar pattern can be observed in the distribution plot. There are several high-income individuals among the data points, and the data looks right-skewed (long tail on the right). Here is what the plot looks like.

Fig 3. Distribution plot for deciding imputation technique
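The visual check can be backed up with a few numbers. The sketch below uses a made-up salary series (an assumption for illustration, not the Kaggle data) to compare the mean with the median, compute the skewness, and flag outliers with the 1.5 × IQR rule that box plots conventionally use for their whiskers:

```python
import pandas as pd

# Hypothetical salaries: most are clustered, one is an extreme high value
salaries = pd.Series([200000, 220000, 240000, 250000, 260000, 280000, 940000])

mean_salary = salaries.mean()
median_salary = salaries.median()

# Positive skewness indicates a longer tail on the right
skewness = salaries.skew()

# 1.5 * IQR rule: points beyond the fences are treated as outliers
q1, q3 = salaries.quantile([0.25, 0.75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = salaries[(salaries < lower_fence) | (salaries > upper_fence)]

print("mean:", mean_salary, "median:", median_salary)
print("skewness:", skewness)
print("outliers:", outliers.tolist())
```

A mean that sits well above the median, together with positive skewness, points to the same conclusion as the plots: the mean is being pulled by the high values.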

The simplest technique of all is to replace missing data with some constant value. The value can be any number that seems appropriate.
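As a minimal sketch of constant-value imputation, the snippet below fills a made-up salary column with 0. Both the toy data and the choice of 0 are assumptions for illustration; the right constant depends on what a missing value means in your data.

```python
import numpy as np
import pandas as pd

# Hypothetical salary column with one missing entry
toy = pd.DataFrame({"salary": [270000.0, np.nan, 250000.0]})

# Placeholder constant chosen for illustration only
fill_value = 0

# fillna returns a new Series, so assign it back to the column
toy["salary"] = toy["salary"].fillna(fill_value)
print(toy)
```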

Impute / Replace Missing Values with Mean

One of the techniques is mean imputation in which the missing values are replaced with the mean value of the entire feature column. In the case of fields like salary, the data may be skewed as shown in the previous section. In such cases, it may not be a good idea to use mean imputation for replacing the missing values. Note that imputing missing data with mean values can only be done with numerical data.

# numeric_only=True skips text columns; fillna returns a new DataFrame
df.fillna(df.mean(numeric_only=True))
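To make the behaviour concrete, here is a small sketch on a made-up DataFrame that mixes text and numeric columns (the column names and values are assumptions for illustration). It shows that each numeric column is filled with its own mean and that the original DataFrame is left unchanged unless the result is assigned:

```python
import numpy as np
import pandas as pd

# Hypothetical mixed-type frame; values are made up for illustration
toy = pd.DataFrame({
    "gender": ["M", "F", "M", "F"],
    "test_score": [67.0, np.nan, 65.0, 56.0],
    "salary": [270000.0, 200000.0, np.nan, 250000.0],
})

# Means of the numeric columns only; text columns are skipped
column_means = toy.mean(numeric_only=True)

# Passing a Series to fillna fills each column with the value
# stored under the matching column label
filled = toy.fillna(column_means)

print(column_means)
print(filled)
```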

Impute / Replace Missing Values with Median

Another technique is median imputation in which the missing values are replaced with the median value of the entire feature column. When the data is skewed, it is good to consider using median value for replacing the missing values. Note that imputing missing data with median value can only be done with numerical data.

# numeric_only=True skips text columns; fillna returns a new DataFrame
df.fillna(df.median(numeric_only=True))
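The following sketch illustrates why the median is the safer choice for skewed data. It uses a made-up salary series with one very high value (an assumption for illustration) and fills the same gap once with the median and once with the mean:

```python
import numpy as np
import pandas as pd

# Hypothetical skewed salaries with one missing entry
salary = pd.Series([200000.0, 220000.0, np.nan, 250000.0, 940000.0], name="salary")

# mean() and median() ignore NaN by default
median_filled = salary.fillna(salary.median())
mean_filled = salary.fillna(salary.mean())

print("filled with median:", median_filled.iloc[2])
print("filled with mean:  ", mean_filled.iloc[2])
```

The mean-based fill is pulled towards the single high salary, while the median-based fill stays close to the typical values.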

Impute / Replace Missing Values with Mode

Yet another technique is mode imputation in which the missing values are replaced with the mode value or most frequent value of the entire feature column. When the data is skewed, it is good to consider using mode values for replacing the missing values. For data points such as the salary field, you may consider using mode for replacing the values. Note that imputing missing data with mode values can be done with numerical and categorical data.

Here is a Python code sample in which the missing values in the salary column are replaced with the mode of that column:

df['salary'] = df['salary'].fillna(df['salary'].mode()[0])

Here is what the dataframe looks like (df.head()) after replacing the missing values of the salary column with the mode value. Note the value of 30000 in the fourth row under the salary column. 30000 is the mode of the salary column, which can be found by executing a command such as df.salary.mode().

Fig 4. Mode value 30000 replaced NaN in 4th row under salary column
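Since mode imputation also works for categorical data, the same pattern can be applied to text columns. Here is a minimal sketch on a made-up DataFrame (the department column and all values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one categorical and one numeric column
toy = pd.DataFrame({
    "department": ["Finance", "HR", None, "Finance"],
    "salary": [300000.0, np.nan, 300000.0, 250000.0],
})

# mode() returns a Series because several values can tie for most frequent;
# [0] picks the first of them
for column in ["department", "salary"]:
    most_frequent = toy[column].mode()[0]
    toy[column] = toy[column].fillna(most_frequent)

print(toy)
```

When several values tie for most frequent, mode() returns them in sorted order, so [0] picks the smallest one; whether that is an acceptable tie-break depends on the data.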

Take a Quiz

Take a quick quiz to check your understanding of concepts related to imputing missing values with mean, median or mode.

Which of the following is not a recommended technique for imputing missing values when data distribution is skewed?

Which of the following plots can be used to identify most appropriate technique for missing values imputation?

For a skewed data distribution, which of the following technique(s) can be used?

For categorical features, which of the following techniques can be used?

Conclusion

In this post, you learned about some of the following:

  • You can use central tendency measures such as mean, median or mode of the numeric feature column to replace or impute missing values.
  • You can use mean value to replace the missing values in case the data distribution is symmetric.
  • Consider using median or mode with skewed data distribution.
  • The Pandas DataFrame method fillna() can be used to replace the missing values.
  • Methods such as mean(), median() and mode() can be called on a DataFrame to find these values.
Ajitesh Kumar
Posted in Data Science, Machine Learning, Python.