Missing values are common in real-world problems, particularly when data is aggregated over long periods from disparate sources, and reliable machine learning modeling demands careful handling of missing data. One strategy is to impute the missing values, and a wide variety of algorithms exist, spanning simple statistical imputation (mean, median, mode), matrix factorization methods like SVD, statistical models like Kalman filters, and deep learning methods. Missing value imputation, or replacement, techniques help machine learning models learn from incomplete data. The three simplest imputation techniques use the mean, the median and the mode. The mean is the average of all values in a set, the median is the middle value when the values are sorted by size, and the mode is the value that occurs most often in the set.
In this blog post, you will learn how to impute or replace missing values with the mean, median or mode in one or more numeric feature columns of a Pandas DataFrame while building machine learning (ML) models with Python. You will also learn how to decide which of these central tendency measures to use for a given feature column. Data scientists need to understand these techniques because handling missing values is one of the key aspects of data preprocessing when training ML models.
The dataset used for illustration is related to campus recruitment and is taken from the Kaggle page on Campus Recruitment. As a first step, the dataset is loaded. Here is the Python code for loading the dataset once you have downloaded it to your system.
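To make the three definitions concrete, here is a minimal sketch using Python's standard statistics module. The salary figures are made up for illustration and are not taken from the dataset used in this post.

```python
import statistics

# Hypothetical salaries; the last value is a deliberate outlier
salaries = [200000, 250000, 250000, 300000, 1000000]

mean_salary = statistics.mean(salaries)      # sum of values / number of values
median_salary = statistics.median(salaries)  # middle value after sorting
mode_salary = statistics.mode(salaries)      # most frequent value

print(mean_salary, median_salary, mode_salary)
```

Note how the single large value pulls the mean well above the median and the mode, which both stay at 250000.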
import pandas as pd
import numpy as np
df = pd.read_csv("/Users/ajitesh/Downloads/Placement_Data_Full_Class.csv")
df.head()
Here is what the data looks like. Make a note of the NaN value under the salary column.
Missing values are handled using different imputation techniques, which estimate the missing values from the other training examples. In the above dataset, the missing values are found in the salary column. A command such as df.isnull().sum() prints the number of missing values in each column. The missing values in the salary column can be replaced using the following techniques:
- Mean value of other salary values
- Median value of other salary values
- Mode (most frequent) value of other salary values.
- Constant value
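Before choosing one of the options above, it helps to confirm where the gaps are. The following is a small sketch of the df.isnull().sum() check on a hypothetical stand-in data frame (the name toy and its values are assumptions, not the actual placement data):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the placement data
toy = pd.DataFrame({
    "status": ["Placed", "Placed", "Not Placed", "Placed"],
    "salary": [270000.0, 200000.0, np.nan, 425000.0],
})

# Number of missing values in each column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Keep only the columns that actually contain missing values
print(missing_per_column[missing_per_column > 0])
```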
In this post, the fillna() method on the data frame is used for imputing missing values with the mean, median, mode or a constant value. However, you may also want to check out the related post titled imputing missing data using Sklearn SimpleImputer, wherein sklearn.impute.SimpleImputer is used for imputation with the same strategies. The SimpleImputer class provides basic strategies for imputing missing values. Missing values can be imputed with a provided constant value, or using the statistics (mean, median, or most frequent) of each column in which the missing values are located. You may also want to check out the Scikit-learn article, Imputation of missing values.
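For comparison with fillna(), here is a minimal sketch of the SimpleImputer route, assuming scikit-learn is installed. The toy data frame and its values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical salary column with one missing value
toy = pd.DataFrame({"salary": [270000.0, 200000.0, np.nan, 425000.0]})

# strategy can be "mean", "median", "most_frequent" or "constant"
imputer = SimpleImputer(strategy="median")

# fit_transform expects 2-D input, hence the double brackets
imputed = imputer.fit_transform(toy[["salary"]])
toy["salary"] = np.asarray(imputed).ravel()
print(toy)
```

One reason to prefer this route in a modeling pipeline is that the fitted imputer remembers the statistic learned from the training data and can apply the same value to new data through transform().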
How to decide which imputation technique to use?
A key decision is which of the above-mentioned imputation techniques will produce the most suitable replacement for the missing values. In this post, the central tendency measures (mean, median and mode) are considered for imputation. The goal is to find out which of them best represents the central tendency of the data and to use that value for replacing the missing values.
Plots such as box plots and distribution plots come in very handy when deciding which technique to use. You can use the following code to draw a box plot and a distribution plot.
import matplotlib.pyplot as plt
import seaborn as sns

# Box plot of the salary column
sns.boxplot(x=df["salary"])
plt.show()

# Distribution plot; histplot is used because distplot is deprecated in recent seaborn releases
sns.histplot(df["salary"], kde=True)
plt.show()
Here is what the box plot looks like. Note that the data is skewed: a large number of data points act as outliers. Outliers have a significant impact on the mean, so in such cases using the mean to replace missing values is not recommended, because the imputed value would be pulled toward the extreme values. For a symmetric data distribution, the mean is a reasonable choice for imputing missing values.
Thus, one may want to use either median or mode. Here is a great page on understanding boxplots.
You can observe a similar pattern in the distribution plot. There are several high-income individuals among the data points, and the data looks right-skewed (a long tail on the right). Here is what the plot looks like.
The simplest technique of all is to replace missing data with some constant value. The value can be any number that seems appropriate.
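The visual impression of skewness can also be checked numerically. The following is a rough sketch on hypothetical salary values. It relies on pandas' Series.skew(), and treating a positive result as a sign of a right tail is a rule of thumb rather than a formal test.

```python
import numpy as np
import pandas as pd

# Hypothetical salaries with one very high earner and one missing value
salary = pd.Series([200000, 220000, 240000, 250000, 260000, 300000, 940000, np.nan])

skewness = salary.skew()        # NaN values are skipped by default
mean_salary = salary.mean()
median_salary = salary.median()

# A positive skewness and a mean above the median both suggest a right tail
print(f"skewness={skewness:.2f}, mean={mean_salary:.0f}, median={median_salary:.0f}")
```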
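A minimal sketch of constant-value imputation with fillna() follows. The toy data and the choice of 0 as the constant are assumptions made for illustration; whether 0 is appropriate depends on what a missing salary means in your data.

```python
import numpy as np
import pandas as pd

# Hypothetical salary column with one missing value
toy = pd.DataFrame({"salary": [270000.0, np.nan, 425000.0]})

# Replace every missing salary with the chosen constant
toy["salary"] = toy["salary"].fillna(0)
print(toy)
```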
Impute / Replace Missing Values with Mean
One of the techniques is mean imputation in which the missing values are replaced with the mean value of the entire feature column. In the case of fields like salary, the data may be skewed as shown in the previous section. In such cases, it may not be a good idea to use mean imputation for replacing the missing values. Note that imputing missing data with mean values can only be done with numerical data.
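Mean imputation with fillna() can be sketched as follows. The values are hypothetical and deliberately close to symmetric, which is the situation where the mean is a sensible replacement.

```python
import numpy as np
import pandas as pd

# Hypothetical, roughly symmetric salary values with one gap
toy = pd.DataFrame({"salary": [200000.0, 250000.0, np.nan, 300000.0]})

# mean() skips NaN values when computing the average
mean_salary = toy["salary"].mean()
toy["salary"] = toy["salary"].fillna(mean_salary)
print(toy)
```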
Impute / Replace Missing Values with Median
Another technique is median imputation in which the missing values are replaced with the median value of the entire feature column. When the data is skewed, it is good to consider using the median value for replacing the missing values. Note that imputing missing data with median value can only be done with numerical data.
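Median imputation works the same way, and it can be applied to several numeric columns in one step. In this sketch the column names score and salary, and all values, are hypothetical. Passing a Series of per-column medians to DataFrame.fillna() fills each column with its own median.

```python
import numpy as np
import pandas as pd

# Hypothetical data frame with gaps in two numeric columns
toy = pd.DataFrame({
    "score": [60.0, np.nan, 80.0, 90.0],
    "salary": [200000.0, 250000.0, np.nan, 1000000.0],
})

numeric_cols = ["score", "salary"]

# One median per column; NaN values are skipped
medians = toy[numeric_cols].median()

# fillna matches the Series index against column names
toy[numeric_cols] = toy[numeric_cols].fillna(medians)
print(toy)
```

Because the median ignores how large the extreme value of 1000000 is, the imputed salary stays at a typical value rather than being dragged upward.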
Impute / Replace Missing Values with Mode
Yet another technique is mode imputation in which the missing values are replaced with the mode value or most frequent value of the entire feature column. When the data is skewed, it is good to consider using mode values for replacing the missing values. For data points such as the salary field, you may consider using mode for replacing the values. Note that imputing missing data with mode values can be done with numerical and categorical data.
Here is the Python code sample in which the missing values in the salary column are replaced with the mode of that column:
# mode() returns a Series (there can be ties), so take its first value
df['salary'] = df['salary'].fillna(df['salary'].mode()[0])
Here is what the data frame looks like (df.head()) after replacing the missing values of the salary column with the mode value. Note the value of 30000 in the fourth row under the salary column. 30000 is the mode of the salary column, which can be found by executing a command such as df.salary.mode().
You may want to check two other related posts on handling missing data:
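Since mode imputation also applies to categorical data, here is a short sketch on a text column. The column name specialisation and its values are hypothetical.

```python
import pandas as pd

# Hypothetical categorical column with one missing entry
toy = pd.DataFrame({
    "specialisation": ["Mkt&HR", "Mkt&Fin", "Mkt&Fin", None, "Mkt&HR", "Mkt&Fin"],
})

# mode() ignores missing values and returns a Series; take the first entry
most_frequent = toy["specialisation"].mode()[0]
toy["specialisation"] = toy["specialisation"].fillna(most_frequent)
print(toy)
```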
- Missing data imputation techniques in machine learning
- Imputing missing data using Sklearn SimpleImputer
In this post, you learned the following:
- You can use a central tendency measure such as the mean, median or mode of a numeric feature column to replace or impute its missing values.
- You can use mean value to replace the missing values in case the data distribution is symmetric.
- Consider using median or mode with skewed data distribution.
- The Pandas DataFrame method fillna() can be used to replace the missing values.
- Methods such as mean(), median() and mode() can be called on a DataFrame or a column to find these values; note that mode() returns a Series because there can be more than one mode.