In this post, you will learn how to impute (replace) missing values with the mean, median or mode in one or more numeric feature columns of a Pandas DataFrame when building machine learning (ML) models in Python. You will also learn how to decide which of these measures of central tendency to use for a given feature column. Data scientists need to understand this technique because handling missing values is one of the key aspects of data preprocessing when training ML models.
The dataset used for illustration is related to campus recruitment and is taken from the Kaggle page on Campus Recruitment. As a first step, the dataset is loaded. Here is the Python code for loading the dataset once you have downloaded it to your system.
import pandas as pd
import numpy as np

df = pd.read_csv("/Users/ajitesh/Downloads/Placement_Data_Full_Class.csv")
df.head()
Here is what the data looks like. Note the NaN value in the salary column.
Missing values are handled using imputation techniques, which estimate the missing values from the other training examples. In the above dataset, the missing values are in the salary column. The command df.isnull().sum() prints the number of missing values in each column, which shows which columns are affected. The missing values in the salary column can be replaced using the following techniques:
- Mean value of other salary values
- Median value of other salary values
- Mode (most frequent) value of other salary values
- Constant value
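Before choosing one of these techniques, it helps to confirm which columns actually contain missing values. The following is a minimal sketch that uses a small made-up DataFrame in place of the Kaggle data; the variable toy and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the placement data; the values are made up for illustration.
toy = pd.DataFrame({
    "degree_p": [58.0, 77.5, 64.0, 52.0, 73.3],
    "salary": [270000.0, 200000.0, 250000.0, np.nan, 425000.0],
})

# Number of missing values in each column
missing_counts = toy.isnull().sum()
print(missing_counts)

# Keep only the names of columns that have at least one missing value
cols_with_missing = missing_counts[missing_counts > 0].index.tolist()
print(cols_with_missing)
```

The same two statements can be run on df once the real dataset is loaded.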
How to Decide Which Imputation Technique to Use?
A key decision is which of the above imputation techniques gives the most suitable replacement for the missing values. In this post, the measures of central tendency (mean, median and mode) are considered for imputation. The goal is to find out which of them best represents the typical value of the data and to use that value to replace the missing values.
Plots such as box plots and distribution plots come in handy when deciding which technique to use. You can use the following code to draw a box plot and a distribution plot.
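import seaborn as sns

# Box plot of the salary column
sns.boxplot(x=df["salary"])

# Distribution plot of the salary column
# (histplot replaces distplot, which is deprecated in recent seaborn releases)
sns.histplot(df["salary"], kde=True)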
Here is what the box plot looks like. Note that the data is skewed: a large number of data points act as outliers. Outliers have a significant impact on the mean, so in such cases it is not recommended to use the mean for replacing the missing values. A mean pulled towards the outliers is a poor stand-in for a typical value and may hurt the model, so it is ruled out here. For a symmetric data distribution, the mean can be used for imputing missing values.
Thus, one may want to use either the median or the mode. Here is a great page on understanding boxplots.
You can observe a similar pattern in the distribution plot. There are several high-income individuals among the data points, and the data looks right-skewed (long tail on the right). Here is what the plot looks like.
The simplest technique of all is to replace missing data with a constant value. The value can be any number that seems appropriate for the feature.
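The visual impression of skew can also be checked numerically. The sketch below uses made-up salaries rather than the Kaggle data: it compares the mean with the median and computes the sample skewness with Series.skew(). A mean well above the median together with a positive skewness points to a right-skewed column.

```python
import numpy as np
import pandas as pd

# Made-up salaries with a few very high values (right skew); not the Kaggle data.
salary = pd.Series(
    [200000, 220000, 240000, 250000, 260000, 300000, 650000, 940000, np.nan]
)

mean_val = salary.mean()      # NaN is skipped by default
median_val = salary.median()  # NaN is skipped by default
skewness = salary.skew()      # greater than 0 suggests a longer right tail

print("mean:", mean_val)
print("median:", median_val)
print("skewness:", skewness)
```

Here the two largest salaries pull the mean far above the median, which is the same effect the box plot shows as outliers.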
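Constant-value imputation can be sketched with fillna as follows. The toy DataFrame and the placeholder value of 0 are assumptions for illustration; which constant is appropriate depends on what the feature means.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({"salary": [270000.0, np.nan, 250000.0, np.nan]})

# Replace every missing salary with one fixed value.
# Zero is only an example; pick a constant that makes sense for the feature.
PLACEHOLDER = 0
toy["salary"] = toy["salary"].fillna(PLACEHOLDER)

print(toy)
```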
Impute / Replace Missing Values with Mean
One technique is mean imputation, in which the missing values are replaced with the mean of the non-missing values in the feature column. For fields like salary, the data may be skewed, as shown in the previous section. In such cases, mean imputation may not be a good choice for replacing the missing values. Note that imputing missing data with the mean can only be done with numerical data.
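A minimal sketch of mean imputation with fillna, using a small made-up column rather than the real dataset:

```python
import numpy as np
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({"salary": [200000.0, 300000.0, np.nan, 400000.0]})

# mean() ignores NaN, so this is the mean of the three observed values
mean_salary = toy["salary"].mean()

# Replace the missing entry with that mean
toy["salary"] = toy["salary"].fillna(mean_salary)

print(toy)
```

On the real data, the same pattern would apply to df['salary'].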
Impute / Replace Missing Values with Median
Another technique is median imputation, in which the missing values are replaced with the median of the non-missing values in the feature column. When the data is skewed, the median is a good candidate for replacing the missing values because it is much less affected by outliers than the mean. Note that imputing missing data with the median can only be done with numerical data.
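Median imputation follows the same pattern. The sketch below uses made-up values that include one very high salary, so the difference between the median and the mean is visible:

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; 900000 plays the role of an outlier
toy = pd.DataFrame({"salary": [200000.0, 250000.0, np.nan, 300000.0, 900000.0]})

median_salary = toy["salary"].median()  # middle of the observed values
mean_salary = toy["salary"].mean()      # pulled upwards by the outlier

# Replace the missing entry with the median
toy["salary"] = toy["salary"].fillna(median_salary)

print("median:", median_salary, "mean:", mean_salary)
print(toy)
```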
Impute / Replace Missing Values with Mode
Yet another technique is mode imputation, in which the missing values are replaced with the mode, that is, the most frequent value in the feature column. When the data is skewed, the mode is worth considering for replacing the missing values. For a field such as salary, you may consider using the mode. Note that imputing missing data with the mode can be done with both numerical and categorical data.
Here is a Python code sample in which the missing values in the salary column are replaced with the mode of that column:
# mode() returns a Series (there can be ties), so take its first element
df['salary'] = df['salary'].fillna(df['salary'].mode()[0])
Here is what the DataFrame looks like (df.head()) after replacing the missing values of the salary column with the mode. Note the value of 30000 in the fourth row under the salary column. 30000 is the mode of the salary column, which can be found by executing df.salary.mode().
You may want to check out two other related posts on handling missing data:
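Because the mode also works for categorical data, the same approach can be sketched on a text column. The column name stream and its values below are hypothetical and are not taken from the placement dataset. Note that Series.mode() returns a Series, since several values can tie for most frequent, so a single value is picked with iloc[0] before it is passed to fillna.

```python
import pandas as pd

# Hypothetical categorical column; name and values are invented for illustration.
toy = pd.DataFrame(
    {"stream": ["Science", "Commerce", None, "Commerce", "Arts", None]}
)

# mode() returns a Series of the most frequent value(s), ignoring missing entries
modes = toy["stream"].mode()

# Take one value from that Series to use as the replacement
most_frequent = modes.iloc[0]

toy["stream"] = toy["stream"].fillna(most_frequent)
print(toy)
```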
- Missing data imputation techniques in machine learning
- Imputing missing data using Sklearn SimpleImputer
In this post, you learned the following:
- You can use central tendency measures such as the mean, median or mode of a numeric feature column to replace (impute) missing values.
- You can use the mean to replace the missing values when the data distribution is symmetric.
- Consider using the median or mode when the data distribution is skewed.
- The Pandas DataFrame method fillna can be used to replace the missing values.
- Methods such as mean(), median() and mode() can be used on a DataFrame column to find these values.