Correlation Concepts, Matrix & Heatmap using Seaborn

In this post, you will learn about the concepts of correlation and how to draw a correlation heatmap for the columns of a Pandas DataFrame using the Python Seaborn library. The following topics are covered in this post:

  • Introduction to Correlation
  • What is correlation heatmap?
  • Correlation heatmap Pandas / Seaborn Python example

Introduction to Correlation

Correlation is a statistical measure of the linear relationship between two variables. More loosely, it can be described as a measure of dependence between two variables. When there are multiple variables and the goal is to find the correlation between every pair of them, the results are stored in a matrix. Such a matrix is called a correlation matrix.
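
As a rough illustration of what a correlation matrix looks like, here is a minimal sketch using Pandas. The data is synthetic and the column names x, y and z are made up for this example; they are not part of the housing data set used later in this post.

```python
import numpy as np
import pandas as pd

# Synthetic data (illustrative only): three made-up variables.
rng = np.random.default_rng(seed=0)
x = rng.normal(size=200)
df = pd.DataFrame({
    "x": x,
    "y": 2 * x + rng.normal(scale=0.5, size=200),  # strongly tied to x
    "z": rng.normal(size=200),                     # generated independently of x
})

# Pairwise Pearson correlation of every column with every other column.
corr_matrix = df.corr()
print(corr_matrix.round(2))
```

The matrix is symmetric and has 1 on the diagonal, because every variable is perfectly correlated with itself.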

Dependence between two variables, also termed correlation, can be measured using the following:

  • Pearson correlation coefficient, often simply called the correlation coefficient, which measures how strongly the values of two variables vary together along a straight line. The formula given below (Fig 1) represents the Pearson correlation coefficient.
  • Rank correlation coefficients, such as the Spearman correlation coefficient, which measure the extent to which one variable tends to increase or decrease as the other variable increases, whether or not the relationship is linear.
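
The difference between the two measures can be seen with a small sketch. The data here is made up: y always increases with x, but along a curve rather than a straight line. The Spearman coefficient is computed as the Pearson correlation of the ranks, which is its standard definition and avoids assuming any extra library is installed.

```python
import numpy as np
import pandas as pd

# Made-up data: y grows with x, but exponentially rather than linearly.
x = pd.Series(np.linspace(0.0, 5.0, 50))
y = np.exp(x)

# Pearson correlation (the Pandas default method).
pearson = x.corr(y)

# Spearman correlation is the Pearson correlation of the ranks.
spearman = x.rank().corr(y.rank())

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Because the ordering of y matches the ordering of x exactly, the rank correlation is at its maximum, while the Pearson coefficient stays below 1 since the points do not fall on a straight line.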

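The Pearson correlation coefficient between two variables X and Y can be calculated using the following formula. \(\bar{X}\) is the mean value of X and \(\bar{Y}\) is the mean value of Y. \(X_i\) and \(Y_i\) represent the individual values of X and Y.

\[ r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2} \, \sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}} \]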

Fig 1. Pearson correlation coefficient formula
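
The Pearson formula can be sketched directly in NumPy and checked against NumPy's own np.corrcoef. The function name pearson_r and the hours / score values are made up for this illustration.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: sum of products of deviations from the means,
    divided by the square root of the product of the summed squared deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Made-up sample values, for illustration only.
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 70, 72]

r_manual = pearson_r(hours, score)
r_numpy = np.corrcoef(hours, score)[0, 1]  # off-diagonal entry of the 2x2 matrix
print(round(r_manual, 4), round(r_numpy, 4))
```

Both values should agree up to floating-point rounding.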

The correlation coefficient can take any value from -1 to 1.

  • If the value is 1, there is a perfect positive correlation between the two variables. This means that when one variable increases, the other variable also increases. Values close to 1 indicate a strong positive correlation.
  • If the value is -1, there is a perfect negative correlation between the two variables. This means that when one variable increases, the other variable decreases. Values close to -1 indicate a strong negative correlation.
  • If the value is 0, there is no linear correlation between the two variables. The variables may still be related in a non-linear way, but they do not move together along a straight line.
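
These three cases can be reproduced with a few lines of NumPy. This is only a sketch on made-up data: two exact straight lines, and one curve (x squared over a range that is symmetric around zero) in which y depends entirely on x but not linearly.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 101)

r_pos = np.corrcoef(x, 2 * x + 1)[0, 1]   # exact increasing line
r_neg = np.corrcoef(x, -3 * x)[0, 1]      # exact decreasing line
r_zero = np.corrcoef(x, x ** 2)[0, 1]     # dependent on x, but not linearly

print(f"{r_pos:.2f} {r_neg:.2f} {abs(r_zero):.2f}")
```

The last case shows why a coefficient of 0 should be read as "no linear relationship" rather than "no relationship at all".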

Correlation between two variables can also be assessed visually using a scatter plot of the two variables. The diagram below represents correlation as scatter plots. The plot in the top-left has a correlation close to 1, the plots in the middle row have a correlation close to 0, and the plot in the bottom-right has a correlation close to -1.


Fig 2. Correlation represented using the Scatterplot
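
A figure of this kind can be approximated with Matplotlib. The sketch below is not the code behind Fig 2; it draws three panels from synthetic data, and the panel titles and noise levels are assumptions chosen for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=1)
x = rng.normal(size=300)
noise = rng.normal(size=300)

# Hypothetical panels: (title, y values)
panels = [
    ("strong positive", x + 0.3 * noise),
    ("near zero", noise),
    ("strong negative", -x + 0.3 * noise),
]

fig, axes = plt.subplots(1, 3, figsize=(12, 4), sharey=True)
coefficients = []
for ax, (title, y) in zip(axes, panels):
    r = np.corrcoef(x, y)[0, 1]          # Pearson correlation for this panel
    coefficients.append(r)
    ax.scatter(x, y, s=8)
    ax.set_title(f"{title} (r = {r:.2f})")
fig.tight_layout()
# In an interactive session, call plt.show() to display the figure.
```

The tighter the points cluster around a rising or falling line, the closer the coefficient in the panel title gets to 1 or -1.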

Correlation between two random variables or bivariate data does not necessarily imply a causal relationship.

Why must one understand correlation concepts?

As a data scientist or machine learning enthusiast, it is important to understand the concept of correlation because it helps achieve the following objectives:

  • Understand the predictive relationship between response and predictor variables. If a predictor variable has a strong positive or negative correlation with the response variable, it can be considered as a feature for training the model.
  • Understand the linear relationship between predictor variables to detect multicollinearity. As a rule of thumb, if the correlation between two predictor variables is greater than 0.7 or less than -0.7, one of them can be removed as a predictor when training the model. In the presence of multicollinearity, the coefficients of the predictor variables in the model can be unreliable.
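
The multicollinearity check described above can be sketched as a small helper that lists every pair of columns whose absolute correlation exceeds a threshold. The function name highly_correlated_pairs, the column names a, b and c, and the synthetic data are all made up for this example; 0.7 is the rule-of-thumb threshold from the text, not a fixed standard.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(df, threshold=0.7):
    """Return (col_a, col_b, r) for column pairs with |r| above threshold."""
    corr = df.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):   # visit each pair once, skip the diagonal
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs

# Synthetic predictors (illustrative only).
rng = np.random.default_rng(seed=2)
a = rng.normal(size=500)
predictors = pd.DataFrame({
    "a": a,
    "b": a + 0.2 * rng.normal(size=500),   # nearly a copy of a
    "c": rng.normal(size=500),             # generated independently of a and b
})

pairs = highly_correlated_pairs(predictors)
print(pairs)
```

For each pair reported, one of the two columns is a candidate for removal before training a model.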

What is Correlation Heatmap?

A correlation heatmap is a graphical representation of a correlation matrix, in which each cell is colored according to the correlation between a pair of variables. Here is a sample correlation heatmap created to understand the linear relationships between different variables in the housing data set. The code is discussed in a later section.

Fig 3. Correlation Heatmap for Housing Dataset

Correlation Heatmap Pandas / Seaborn Code Example

Here is the Python code that can be used to draw a correlation heatmap for the housing data set, showing the correlation between different variables including predictor and response variables. Pay attention to the following:

  • The Pandas read_table function is used to read the tabular data.
  • The corr() method is invoked on the Pandas DataFrame to compute the correlation between different variables, including predictor and response variables.
  • The Seaborn heatmap() function is used to create the heatmap representing the correlation matrix.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the housing data set (whitespace-separated, no header row)
df = pd.read_table('/Users/apple/Downloads/housing.data', header=None, sep=r'\s+')
df.columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM',
              'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B',
              'LSTAT', 'MEDV']

# Pairwise correlation between all variables
corr = df.corr()

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(12, 10))

# Generate a mask for the upper triangle (including the diagonal)
mask = np.triu(np.ones_like(corr, dtype=bool))

# Configure a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)

# Draw the heatmap on the axes created above
sns.heatmap(corr, annot=True, mask=mask, cmap=cmap, ax=ax)
plt.show()
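
To see what the mask does, it helps to build one for a tiny matrix. The 3 x 3 matrix below is made up for illustration. Seaborn does not draw cells where the mask is True, so the heatmap above shows only the cells below the diagonal.

```python
import numpy as np

# Made-up 3 x 3 correlation matrix, for illustration only.
demo = np.array([[1.0, 0.8, 0.1],
                 [0.8, 1.0, -0.4],
                 [0.1, -0.4, 1.0]])

# True on the diagonal and above it; these cells are hidden in the heatmap.
mask = np.triu(np.ones_like(demo, dtype=bool))

# Variant with k=1: True only strictly above the diagonal,
# which would keep the diagonal visible.
mask_keep_diagonal = np.triu(np.ones_like(demo, dtype=bool), k=1)

print(mask)
print(mask_keep_diagonal)
```

Hiding the upper triangle loses no information, because a correlation matrix is symmetric and every off-diagonal value appears twice.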

Here is what the correlation heatmap looks like:

Fig 4. Correlation heatmap with mask for upper triangle

From the above correlation heatmap, one can get the following information:

  • Variable pairs such as NOX & INDUS, AGE & NOX, TAX & RAD and MEDV & RM have a strong positive correlation. Generally speaking, a Pearson correlation coefficient with an absolute value greater than 0.7 between two predictor variables indicates the presence of multicollinearity.
  • Variable pairs such as MEDV & LSTAT, DIS & INDUS, DIS & NOX and DIS & AGE have a strong negative correlation.
  • There are several pairs of variables whose correlation value is near 0, meaning they have little or no linear correlation.
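
The same kind of reading can be done programmatically by taking the response column of the correlation matrix and sorting it. The sketch below uses synthetic data, since the housing file path is specific to the author's machine; the column names rooms, poverty, unrelated and price are hypothetical stand-ins and are not the housing variables.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (illustrative only); "price" plays the role of the response.
rng = np.random.default_rng(seed=3)
n = 400
rooms = rng.normal(size=n)
poverty = rng.normal(size=n)
data = pd.DataFrame({
    "rooms": rooms,
    "poverty": poverty,
    "unrelated": rng.normal(size=n),
    "price": 3 * rooms - 3 * poverty + rng.normal(size=n),
})

corr = data.corr()

# Correlation of every predictor with the response, strongest positive first.
with_response = corr["price"].drop("price").sort_values(ascending=False)
print(with_response.round(2))
```

With the housing DataFrame from the code above, the equivalent call would select the MEDV column instead of the hypothetical price column.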

Conclusions

Here is a summary of what you learned about the correlation heatmap in this post:

  • A correlation heatmap is a graphical representation of a correlation matrix showing the correlation between different variables.
  • The correlation coefficient can take any value from -1 to 1.
  • Correlation between two random variables or bivariate data does not necessarily imply a causal relationship.
  • Correlation between two variables can also be assessed visually using a scatter plot of the two variables.

Ajitesh Kumar