Histogram Plots using Matplotlib & Pandas: Python

Side by side histogram plots using Matplotlib and Pandas library in Python

Histograms are a graphical representation of the distribution of data. In Python, there are several ways to create histograms, and one popular method is to use the Matplotlib library. In this tutorial, we will show you how to create different types of histogram plots in Python using Matplotlib and Pandas. As data scientists, it is important to learn how to create visualizations to communicate our findings, and histograms are one effective way to do this.

What are Histogram plots?

Histogram plots are a way of representing the distribution of data. A histogram is made up of bars, with each bar representing a certain range of data values. The height of the bar indicates how many data points fall within that range. Histograms can be used to compare two different data sets, or to see how the distribution of data changes over time.

Beyond comparing groups or tracking change over time, histograms are also useful for spotting outliers and for checking whether the data is skewed, that is, whether most values pile up on one side with a long tail on the other.
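The bar heights of a histogram are simply per-bin counts. The following is a minimal sketch of that counting step using NumPy's histogram() function; the sample values and the choice of five bins over the range 0 to 10 are assumptions made only for illustration.

```python
import numpy as np

# Hypothetical sample values, chosen only for illustration
values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]

# Split the range 0..10 into 5 equal-width bins and count the values in each bin
counts, bin_edges = np.histogram(values, bins=5, range=(0, 10))

# Each bin includes its left edge and excludes its right edge,
# except the last bin, which also includes its right edge
for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{left:.0f} to {right:.0f}: {count} value(s)")
```

Plotting these counts as bars over the bin ranges gives the histogram; Matplotlib performs the same counting internally when it draws one.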

Plotting Histogram using Matplotlib & Pandas

There are a few things to keep in mind when creating histograms in Matplotlib:

  • We can create a histogram from a Pandas DataFrame column using the Matplotlib hist() function.
  • We can specify the number of bins using the bins parameter.
  • We can specify the range of values to include in the histogram using the range parameter.
  • We can make our histogram look nicer by using colors and adding title and labels.
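The points above can be sketched in a few lines. This is an illustrative example under stated assumptions, not the tutorial's housing data: the DataFrame, its "value" column, the bin count, the range and the color are all made up for the demonstration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic data, used only for illustration (not the housing dataset)
rng = np.random.default_rng(seed=0)
df = pd.DataFrame({"value": rng.normal(loc=50, scale=10, size=500)})

fig, ax = plt.subplots(figsize=(7, 5))

# bins sets the number of bars; range limits which values are counted
counts, bin_edges, _ = ax.hist(
    df["value"],
    bins=20,
    range=(20, 80),
    color="steelblue",
    edgecolor="black",
)

# Title and axis labels make the plot easier to read
ax.set_title("Distribution of synthetic values")
ax.set_xlabel("Value")
ax.set_ylabel("Count")
```

Values outside the given range are ignored, so the counts may add up to fewer than the number of rows. In a Jupyter notebook the figure renders automatically; when running as a script, call plt.show() at the end.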

To create a histogram plot using the Matplotlib and Pandas libraries, you first need to import matplotlib.pyplot and pandas.

import matplotlib.pyplot as plt
import pandas as pd

Once the modules are imported, you can create a histogram by passing the data that you want to plot to the plt.hist() function. The data can be passed as a single DataFrame column or as a list of data sets, depending on the type of histogram you want to plot. In this section you will learn how to create a histogram on a DataFrame column, multiple side-by-side histograms for different classes of data, and a stacked histogram. The Boston housing prices dataset is used throughout this section. Note that load_boston() was deprecated in scikit-learn 1.0 and removed in version 1.2, so the code in this section needs an older scikit-learn release.

Plotting Histogram using Matplotlib on one column of a Pandas DataFrame

To plot a histogram using Matplotlib, call the hist() function of the matplotlib.pyplot module. The hist() function accepts a DataFrame column (a Pandas Series) as its first argument and plots the distribution of the values in that column.

In the code below, the Boston housing price dataset is loaded from sklearn.datasets and a Pandas DataFrame is created from it. Thereafter, a histogram is plotted using Matplotlib on the target column 'MEDV'.

import pandas as pd
from sklearn import datasets

import matplotlib.pyplot as plt
import matplotlib
# Jupyter notebook magic; remove this line when running as a plain script
%matplotlib inline

matplotlib.rcParams['font.size'] = 10
matplotlib.rcParams['figure.dpi'] = 100

# Load boston housing price dataset
# (load_boston() was removed in scikit-learn 1.2; an older release is required)
housing_prices = datasets.load_boston()

# Create Pandas dataframe with one column per feature
df_hp = pd.DataFrame(housing_prices.data, columns=housing_prices.feature_names)
# Add the target column (median home value)
df_hp['MEDV'] = housing_prices.target

# Set the figure size (width, height in inches)
plt.figure(figsize=(7, 5))

# Create histogram on MEDV column (target column) with 45 bins
plt.hist(df_hp['MEDV'], color='blue', edgecolor='black', bins=45)

plt.xlabel('Median value of owner-occupied homes in $1000')
plt.ylabel('No. of houses')
plt.title('Housing prices frequencies')
plt.show()

Executing the above code displays a histogram of the MEDV column with 45 bins.
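Pandas also exposes histogram plotting directly on a column through Series.plot.hist(), which uses Matplotlib under the hood and returns the Matplotlib Axes object. The sketch below assumes a made-up DataFrame with a hypothetical "price" column filled with synthetic values; it is not the housing dataset.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic, right-skewed data in a hypothetical "price" column
rng = np.random.default_rng(seed=1)
df = pd.DataFrame({"price": rng.gamma(shape=4.0, scale=5.0, size=400)})

# Plot the column through the Pandas plotting API; an Axes object is returned
ax = df["price"].plot.hist(bins=30, color="blue", edgecolor="black", figsize=(7, 5))

# The returned Axes can be customized like any other Matplotlib plot
ax.set_xlabel("Price (synthetic units)")
ax.set_ylabel("Frequency")
ax.set_title("Histogram drawn through the Pandas plotting API")
```

Both routes produce the same kind of plot. Calling plt.hist() directly gives access to the returned counts and bin edges, while the Pandas method is convenient when the data already lives in a DataFrame.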

Plotting multiple Histograms Side-by-Side

When you want to understand how data is distributed across different categories, you can plot multiple histograms side by side on the same plot. For example, to understand the distribution of housing prices for different values of the index of accessibility to radial highways (the RAD column), you can plot one histogram per RAD value on the same axes. Here is the code for plotting histograms side by side on the same plot:

import pandas as pd
from sklearn import datasets

import matplotlib.pyplot as plt
import matplotlib
# Jupyter notebook magic; remove this line when running as a plain script
%matplotlib inline

matplotlib.rcParams['font.size'] = 10
matplotlib.rcParams['figure.dpi'] = 100

# Load boston housing price dataset
# (load_boston() was removed in scikit-learn 1.2; an older release is required)
housing_prices = datasets.load_boston()

# Create Pandas dataframe with one column per feature, then add the target
df_hp = pd.DataFrame(housing_prices.data, columns=housing_prices.feature_names)
df_hp['MEDV'] = housing_prices.target

# Set the figure size (width, height in inches)
plt.figure(figsize=(6, 4))

# Create list of MEDV values for each accessibility index (RAD) value
#
x1 = list(df_hp[df_hp['RAD'] == 1]['MEDV'])
x2 = list(df_hp[df_hp['RAD'] == 2]['MEDV'])
x3 = list(df_hp[df_hp['RAD'] == 3]['MEDV'])

# Setting colors and names
#
colors=['blue', 'green', 'orange']
names=['RAD-1', 'RAD-2', 'RAD-3']

# Creating plot with list values, colors and names (labels)
# Note density=True, which normalizes each histogram separately
# so that the area of its bars sums to 1
#
plt.hist([x1, x2, x3], color=colors, label=names, density=True)

# Set the legend and labels
#
plt.legend()
plt.title('Side-by-side Histogram for housing prices')
plt.xlabel('Median value of owner-occupied homes in $1000')
plt.ylabel('Density')
plt.show()

The resulting plot shows the three histograms side by side, with the bars for each RAD value grouped together within every bin.
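When there are more than a few categories, filtering the DataFrame once per category becomes repetitive. The sketch below builds the per-group lists with groupby() instead and checks what density=True does in a side-by-side plot. The DataFrame, its "group" and "value" columns and the labels are hypothetical and filled with synthetic data.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic data: three groups of 100 values each (illustration only)
rng = np.random.default_rng(seed=2)
df = pd.DataFrame({
    "group": np.repeat([1, 2, 3], 100),
    "value": rng.normal(loc=25, scale=8, size=300),
})

# One array of values per group, keyed by a display label
groups = {
    f"group-{key}": part["value"].to_numpy()
    for key, part in df.groupby("group")
}

fig, ax = plt.subplots(figsize=(6, 4))

# Passing a list of arrays draws one set of bars per group in every bin
n, bins, _ = ax.hist(list(groups.values()), label=list(groups.keys()), density=True)
ax.legend()
ax.set_title("Side-by-side histogram (synthetic data)")
ax.set_xlabel("Value")
ax.set_ylabel("Density")

# Without stacking, density=True normalizes each group on its own,
# so the bar area of every group adds up to 1
areas = [float(np.sum(row * np.diff(bins))) for row in n]
print(areas)
```

Because each group is normalized separately, the plot compares the shapes of the distributions rather than the group sizes. Drop density=True to compare raw counts instead.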

Creating Stacked Histogram Plots

Another requirement can be to view the histograms stacked on top of each other. The goal is still to understand the data distribution for different attribute values, but within every bin the bars of the different groups are stacked rather than placed next to each other. The only change to the side-by-side code is one additional parameter, stacked=True, in the call to plt.hist(). Note this in the code given below:

import pandas as pd
from sklearn import datasets

import matplotlib.pyplot as plt
import matplotlib
# Jupyter notebook magic; remove this line when running as a plain script
%matplotlib inline

matplotlib.rcParams['font.size'] = 10
matplotlib.rcParams['figure.dpi'] = 100

# Load boston housing price dataset
# (load_boston() was removed in scikit-learn 1.2; an older release is required)
housing_prices = datasets.load_boston()

# Create Pandas dataframe with one column per feature, then add the target
df_hp = pd.DataFrame(housing_prices.data, columns=housing_prices.feature_names)
df_hp['MEDV'] = housing_prices.target

# Set the figure size (width, height in inches)
plt.figure(figsize=(7, 5))

# Create list of MEDV values for each accessibility index (RAD) value
#
x1 = list(df_hp[df_hp['RAD'] == 1]['MEDV'])
x2 = list(df_hp[df_hp['RAD'] == 2]['MEDV'])
x3 = list(df_hp[df_hp['RAD'] == 3]['MEDV'])

# Setting colors and names
#
colors=['blue', 'green', 'orange']
names=['RAD-1', 'RAD-2', 'RAD-3']

# Creating plot with list values, colors and names (labels)
# Note the parameter stacked=True, which results in a stacked histogram plot
# With density=True as well, the stacked total is normalized so that
# the combined area of all bars sums to 1
#
plt.hist([x1, x2, x3], color=colors, label=names, density=True, stacked=True)

# Set the legend and labels
#
plt.legend()
plt.title('Stacked Histogram for housing prices')
plt.xlabel('Median value of owner-occupied homes in $1000')
plt.ylabel('Density')
plt.show()

In the resulting plot, the bars for the three RAD values are stacked on top of each other within every bin.
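Combining stacked=True with density=True changes what is normalized: the stacked total, not each group, is scaled to an area of 1. The following sketch checks this on the values returned by hist(). The two groups, their sizes and their labels are assumptions made for illustration, using synthetic data.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data: two groups of different sizes (illustration only)
rng = np.random.default_rng(seed=3)
x1 = rng.normal(loc=20, scale=5, size=200)
x2 = rng.normal(loc=30, scale=5, size=100)

fig, ax = plt.subplots(figsize=(7, 5))
n, bins, _ = ax.hist(
    [x1, x2],
    label=["group-1", "group-2"],
    density=True,
    stacked=True,
)
ax.legend()
ax.set_title("Stacked histogram (synthetic data)")
ax.set_xlabel("Value")
ax.set_ylabel("Density")

# With stacked=True the returned heights are cumulative:
# the last row is the top of the full stack in every bin
n = np.asarray(n)
total_area = float(np.sum(n[-1] * np.diff(bins)))
print(total_area)
```

In a stacked density plot the larger group therefore takes up a larger share of the total area, so the plot shows both the overall distribution and each group's contribution to it.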

We hope you found this introduction to histogram plots helpful. If you have any questions, please don’t hesitate to reach out to us. And be sure to check out our other tutorials for more in-depth looks at data science and Python programming.

Ajitesh Kumar