In this post, you will learn about the concept of correlation and how to draw a correlation heatmap for the columns of a Pandas DataFrame using the Python Seaborn library. The following topics are covered in this post:
- Introduction to Correlation
- What is correlation heatmap?
- Correlation heatmap Pandas / Seaborn Python example
Introduction to Correlation
Correlation is a statistical measure of the linear relationship between two variables. More loosely, it is also described as a measure of dependence between two variables. When there are several variables and the goal is to find the correlation between every pair of them, the results are stored in a matrix. Such a matrix is called a correlation matrix.
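As a minimal sketch of what a correlation matrix looks like, the snippet below builds one with Pandas. The data and column names are hypothetical and serve only as an illustration; `DataFrame.corr()` computes pairwise Pearson correlations by default.

```python
import pandas as pd

# Hypothetical toy data; the column names are made up for illustration.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "exam_score": [52, 58, 65, 70, 78],
    "hours_of_tv": [5, 4, 4, 2, 1],
})

# One row and one column per variable; each cell holds the correlation
# between the row variable and the column variable.
corr_matrix = df.corr()
print(corr_matrix.round(2))
```

The matrix is symmetric and its diagonal is all ones, because every variable is perfectly correlated with itself.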
Dependence between two variables, also termed correlation, can be measured using the following:
- The Pearson correlation coefficient (often simply called the correlation coefficient), which measures how the values of two variables vary linearly with respect to each other. The formula given below (Fig 1) represents the Pearson correlation coefficient.
- A rank correlation coefficient, such as the Spearman correlation coefficient, which measures the extent to which one variable tends to increase or decrease as the other variable increases or decreases, whether or not that relationship is linear.
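The difference between the two measures shows up when a relationship is monotonic but not linear. The sketch below uses made-up data in which `y` always increases with `x`, but exponentially. It relies on the `method` parameter of `DataFrame.corr()` to switch between the two coefficients.

```python
import numpy as np
import pandas as pd

# Hypothetical data: y grows with x, but exponentially rather than linearly.
x = np.arange(1, 11, dtype=float)
df = pd.DataFrame({"x": x, "y": np.exp(x)})

pearson = df.corr(method="pearson").loc["x", "y"]
spearman = df.corr(method="spearman").loc["x", "y"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

The Spearman coefficient comes out at 1 because the ordering of the values is perfectly preserved, while the Pearson coefficient is noticeably below 1 because the points do not lie on a straight line.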
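The Pearson correlation coefficient between two variables X and Y can be calculated using the following formula, where \(\bar{X}\) is the mean value of X, \(\bar{Y}\) is the mean value of Y, and \(X_i\) and \(Y_i\) are the individual values of X and Y.
\[ r = \frac{\sum_{i} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i} (X_i - \bar{X})^2} \, \sqrt{\sum_{i} (Y_i - \bar{Y})^2}} \]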
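The Pearson calculation can be written out directly: sum the products of the deviations of X and Y from their means, then divide by the square root of the product of the summed squared deviations. The helper name `pearson_r` and the sample values below are hypothetical, and the result is checked against NumPy's `np.corrcoef`.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation computed from deviations about the means."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # X_i minus the mean of X
    dy = y - y.mean()  # Y_i minus the mean of Y
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Made-up sample values
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r_manual = pearson_r(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]  # off-diagonal entry of the 2x2 matrix
print(r_manual, r_numpy)
```

Both values should agree up to floating-point rounding.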
The value of the correlation coefficient ranges from -1 to 1.
- If the value is 1, there is a perfect positive correlation between the two variables. This means that when one variable increases, the other variable also increases. Values close to 1 indicate a strong positive correlation.
- If the value is -1, there is a perfect negative correlation between the two variables. This means that when one variable increases, the other variable decreases. Values close to -1 indicate a strong negative correlation.
- If the value is 0, there is no linear correlation between the two variables. This means that the variables show no linear relationship with each other.
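The three cases above can be reproduced with a few lines of NumPy. The data is made up for illustration. The last case is chosen deliberately: `x ** 2` depends entirely on `x`, yet its Pearson correlation with `x` is zero here because the relationship is not linear.

```python
import numpy as np

x = np.arange(-5, 6, dtype=float)  # -5, -4, ..., 5 (symmetric around 0)

r_pos = np.corrcoef(x, 2 * x + 3)[0, 1]      # increasing straight line
r_neg = np.corrcoef(x, -0.5 * x + 1)[0, 1]   # decreasing straight line
r_zero = np.corrcoef(x, x ** 2)[0, 1]        # symmetric parabola

print(round(r_pos, 6), round(r_neg, 6), round(r_zero, 6))
```

A coefficient of 0 therefore rules out a linear relationship only. It does not show that the two variables are unrelated.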
Correlation between two variables can also be assessed visually using a scatter plot of the two variables. Here is a diagram representing correlation as scatter plots. The plot in the top-left has a correlation near 1. The middle row shows correlation near 0. The plot in the bottom-right has a correlation near -1.
Correlation between two random variables or bivariate data does not necessarily imply a causal relationship.
Why must one understand correlation concepts?
- To understand the predictive relationship between the response and predictor variables. If a predictor has a strong positive or negative correlation with the response, it can be considered as a feature for training the model.
- To understand the linear relationship between predictor variables and so detect multicollinearity. As a rule of thumb, if the correlation between two predictor variables is greater than 0.7 or less than -0.7, one of them can be removed as a predictor when training the model. When predictor variables are multicollinear, the coefficients estimated for them in the model can be unreliable.
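Both uses can be sketched on synthetic data. Everything in the snippet is hypothetical: the variables `x1`, `x2`, `x3` and `y` are generated only for illustration, and the 0.7 cut-off is the rule of thumb mentioned above rather than a fixed standard.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)      # nearly a copy of x1
x3 = rng.normal(size=n)                      # unrelated to the others
y = 3 * x1 + rng.normal(scale=0.5, size=n)   # response driven by x1

df = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3, "y": y})
corr = df.corr()

# 1. Correlation of each predictor with the response
with_response = corr["y"].drop("y")
print(with_response.round(2))

# 2. Predictor pairs whose absolute correlation exceeds the threshold
predictors = corr.drop(index="y", columns="y")
threshold = 0.7
collinear_pairs = [
    (a, b, predictors.loc[a, b])
    for i, a in enumerate(predictors.columns)
    for b in predictors.columns[i + 1:]
    if abs(predictors.loc[a, b]) > threshold
]
print(collinear_pairs)
```

Here `x1` and `x2` are flagged as a collinear pair, so one of them would be a candidate for removal, while `x3` shows little correlation with the response.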
What is Correlation Heatmap?
A correlation heatmap is a graphical representation of a correlation matrix, showing the correlation between different variables. Here is a sample correlation heatmap created to understand the linear relationship between different variables in the housing data set. The code is discussed in a later section.
Correlation Heatmap Pandas / Seaborn Code Example
Here is the Python code that can be used to draw a correlation heatmap for the housing data set, representing the correlation between different variables, including predictor and response variables. Pay attention to the following:
- The Pandas read_table function is used to read the tabular data.
- The corr() method is invoked on the Pandas DataFrame to determine the correlation between different variables, including predictor and response variables.
- The Seaborn heatmap() function is used to create the heatmap representing the correlation matrix.
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the housing data set (whitespace-separated, no header row).
# A raw string is used for the regex separator to avoid an invalid escape sequence.
df = pd.read_table('/Users/apple/Downloads/housing.data', header=None, sep=r'\s+')
df.columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS',
              'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']

# Correlation between different variables
corr = df.corr()

# Set up the matplotlib plot configuration
f, ax = plt.subplots(figsize=(12, 10))

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

# Configure a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)

# Draw the heatmap
sns.heatmap(corr, annot=True, mask=mask, cmap=cmap)
```
Here is what the correlation heatmap looks like:
From the above correlation heatmap, one can draw the following information:
- Variable pairs such as NOX & INDUS, AGE & NOX, TAX & RAD, and MEDV & RM have a strong positive correlation. Generally speaking, a Pearson correlation coefficient greater than 0.7 between predictor variables indicates the presence of multicollinearity.
- Variable pairs such as MEDV & LSTAT, DIS & INDUS, DIS & NOX, and DIS & AGE have a strong negative correlation.
- There are several variable pairs with little or no correlation, whose correlation value is near 0.
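Pairs like these can also be pulled out of the correlation matrix directly instead of being read off the plot. The sketch below runs on hypothetical synthetic data, not the housing data set, and reuses the upper-triangle mask built with `np.triu` so that each pair is listed once.

```python
import numpy as np
import pandas as pd

# Hypothetical synthetic data standing in for a real data set.
rng = np.random.default_rng(42)
n = 300
a = rng.normal(size=n)
df = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.3, size=n),    # tracks a
    "c": -a + rng.normal(scale=0.3, size=n),   # moves against a
    "d": rng.normal(size=n),                   # unrelated noise
})
corr = df.corr()

# True on and above the diagonal: the cells the masked heatmap hides.
mask = np.triu(np.ones_like(corr, dtype=bool))

# Blank out the masked cells, then flatten to one entry per variable pair,
# sorted from the most negative to the most positive correlation.
pairs = corr.mask(mask).stack().dropna().sort_values()
print(pairs.round(2))
```

The top of the sorted series holds the strongest negative pairs, the bottom holds the strongest positive pairs, and values near 0 sit in the middle.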
Here is a summary of what you learned about the correlation heatmap in this post:
- A correlation heatmap is a graphical representation of a correlation matrix, showing the correlation between different variables.
- The value of the correlation coefficient ranges from -1 to 1.
- Correlation between two random variables or bivariate data does not necessarily imply a causal relationship.
- Correlation between two variables can also be assessed visually using a scatter plot of the two variables.