Last updated: 16th Nov, 2023
Histograms are a graphical representation of the distribution of data. In Python, there are several ways to create histograms, and one popular method is to use the Matplotlib library. In this tutorial, we cover the basics of histogram plots and show how to create different types of histograms using the popular Python libraries Matplotlib and Pandas. We will also explore some real-world examples to demonstrate the usefulness of histogram plots in various industries and applications. As data scientists, we need to create visualizations that communicate our findings, and histograms are one effective way to do this.
A histogram plot represents the distribution of data. It is an estimate of the probability distribution of a continuous or discrete variable. To construct a histogram, the first step is to “bin” the range of values (that is, divide the entire range of values into a series of intervals) and then count how many values fall into each interval. The bins are usually specified as consecutive, non-overlapping intervals of a variable. The histogram plot displays these bins as adjacent bars, with the height of each bar representing the frequency (or count) of data points within that bin.
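The binning step described above can be sketched with NumPy. This is a minimal illustration, not part of the original tutorial: the sample values and the bin edges are made up for the example.

```python
import numpy as np

# Hypothetical sample values, chosen only to illustrate binning.
values = np.array([1, 2, 2, 3, 5, 6, 6, 6, 8, 9])

# Four consecutive, non-overlapping bins. Each bin includes its left edge
# and excludes its right edge, except the last bin, which includes both.
bin_edges = np.array([0.0, 2.5, 5.0, 7.5, 10.0])

# np.histogram counts how many values fall into each bin.
counts, edges = np.histogram(values, bins=bin_edges)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left} to {right}: {count} value(s)")
```

The `counts` array holds the bar heights a histogram plot would draw, and every value lands in exactly one bin.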
The following represents key components of a Histogram:
Bins: These are the intervals into which the data range is divided. The choice of the number of bins and their width can greatly affect the appearance of the histogram, and thus the interpretation of the data. Generally, more bins lead to a more detailed view of the data distribution, while fewer bins provide a more generalized view.
Frequency: The frequency is the number of data points that fall into each bin. In a Histogram Plot, the height of each bar represents the frequency of the corresponding bin.
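Density: In some cases, it’s useful to normalize the histogram so that the total area of the bars equals 1. Each bar height is then the bin count divided by both the total number of data points and the bin width. This gives an estimate of the probability density, which allows you to compare histograms with different sample sizes or bin widths.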
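The density normalization can be checked by hand. The sketch below uses a synthetic, randomly generated sample (an assumption for illustration only) and compares a manual calculation with NumPy's `density=True` option.

```python
import numpy as np

# Synthetic sample, generated only for illustration.
rng = np.random.default_rng(seed=0)
sample = rng.normal(loc=50, scale=10, size=1000)

# Raw frequencies per bin.
counts, edges = np.histogram(sample, bins=20)
widths = np.diff(edges)

# Manual density: count / (total number of points * bin width).
density_manual = counts / (counts.sum() * widths)

# The same normalization as computed by NumPy.
density_numpy, _ = np.histogram(sample, bins=20, density=True)

# The bar areas (height * width) of a density histogram add up to 1.
area = np.sum(density_manual * widths)
print("Manual and NumPy densities match:", np.allclose(density_manual, density_numpy))
print("Total area under the bars:", area)
```

Because the heights are divided by the bin width as well as the sample size, two density histograms remain comparable even when their bins differ.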
Let’s understand the concept of histogram plots with the help of an example. Say we want to understand how the ages of all the passengers at an airport are distributed on a particular day. On that day, hypothetically, 1,500 passengers visited the airport. The goal is to use this data to decide how to use advertisement space at different places in the airport. This is what the histogram would look like:
Note that histograms are generated by binning the data, so their visual appearance depends on the choice of bin width. It is therefore important to select an appropriate bin width rather than relying on the default bin width chosen by the visualization program. If the bin width is too small, the histogram becomes overly peaky and visually busy, as in the above diagram where the bin width is set to 5. If the bin width is too large, smaller features in the distribution of the data may disappear.
Histograms are often confused with bar plots, which are used to represent categorical data. Histograms represent the distribution of continuous or discrete numeric data, while bar plots (bar charts) represent comparisons between categories. Here are a few differences to keep in mind: in a histogram the x-axis is a numeric scale divided into bins, the bars are adjacent, and their order is fixed by the values; in a bar chart each bar stands for a separate category, the bars are usually drawn with gaps between them, and they can be reordered freely.
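To see the effect of bin width in numbers, here is a small sketch. The ages are randomly generated as a stand-in for the hypothetical 1,500 passengers, and `edges_for_width` is a helper invented for this example, not a library function.

```python
import numpy as np

# Synthetic stand-in for the ages of the 1,500 hypothetical passengers.
rng = np.random.default_rng(seed=42)
ages = rng.integers(low=1, high=91, size=1500)  # integer ages from 1 to 90


def edges_for_width(data, width):
    """Return bin edges that cover `data` in steps of `width` (hypothetical helper)."""
    start = np.floor(data.min() / width) * width
    stop = np.ceil(data.max() / width) * width
    # Add half a step so that `stop` itself is included as the last edge.
    return np.arange(start, stop + width / 2, width)


# Narrow bins (width 5) versus wide bins (width 30) on the same data.
narrow_counts, narrow_edges = np.histogram(ages, bins=edges_for_width(ages, 5))
wide_counts, wide_edges = np.histogram(ages, bins=edges_for_width(ages, 30))

# A data-driven alternative: let NumPy choose the edges with its "auto" rule.
auto_edges = np.histogram_bin_edges(ages, bins="auto")

print("width 5 :", len(narrow_counts), "bins")
print("width 30:", len(wide_counts), "bins")
print("auto    :", len(auto_edges) - 1, "bins")
```

Any of these edge arrays can be passed to `plt.hist()` through its `bins` argument, which makes the bin width an explicit choice instead of a default.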
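As a rough illustration of the distinction, the sketch below draws a histogram and a bar chart side by side. The ages and the airline counts are hypothetical values invented for this example.

```python
import matplotlib.pyplot as plt

# Hypothetical numeric data (for the histogram) and categorical counts
# (for the bar chart), invented for illustration.
ages = [23, 25, 31, 35, 35, 42, 47, 51, 58, 64]
airline_counts = {"Airline A": 4, "Airline B": 3, "Airline C": 3}

fig, (ax_hist, ax_bar) = plt.subplots(1, 2, figsize=(9, 4))

# Histogram: the x-axis is numeric and the data is binned before plotting.
counts, edges, _ = ax_hist.hist(ages, bins=[20, 30, 40, 50, 60, 70], edgecolor="black")
ax_hist.set_xlabel("Age")
ax_hist.set_ylabel("No. of passengers")
ax_hist.set_title("Histogram (numeric data)")

# Bar chart: one bar per category, with heights supplied directly.
bars = ax_bar.bar(list(airline_counts.keys()), list(airline_counts.values()))
ax_bar.set_ylabel("No. of passengers")
ax_bar.set_title("Bar chart (categorical data)")

fig.tight_layout()
```

In a notebook the figure renders inline; in a script, call `plt.show()` or `fig.savefig(...)` afterwards.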
Creating histogram plots can sometimes pose challenges, especially for those new to data visualization or working with complex datasets. Here are some of the most common issues encountered:
There are a few things to keep in mind when creating histograms using the Matplotlib and Pandas packages:
To create a histogram plot using Matplotlib and Pandas, you first need to import matplotlib.pyplot and pandas.
import matplotlib.pyplot as plt
import pandas as pd
Once the modules are imported, you can create a histogram by passing the data that you want to plot to the plt.hist() function. The data can be passed as one column of a dataframe or as a list of values, depending on the type of histogram you want to plot. In this section you will learn how to create a histogram plot on a dataframe column, multiple histogram plots representing data for different classes, and a stacked histogram. The Boston housing prices dataset is used for plotting histograms in this section.
Matplotlib has no separate Histogram class to import. Histograms are drawn with the plt.hist() function (or the equivalent Axes.hist() method). plt.hist() accepts a dataframe column, a list, or a NumPy array as its first argument and plots the histogram of those values.
In the code below, the Boston housing prices dataset is loaded from sklearn.datasets and a Pandas dataframe is created. Thereafter, a histogram is plotted using Matplotlib on the target column ‘MEDV’. Note that load_boston() was removed in scikit-learn 1.2, so the code requires an older scikit-learn version.
import pandas as pd
import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline
matplotlib.rcParams['font.size'] = 10
matplotlib.rcParams['figure.dpi'] = 100
from IPython.core.pylabtools import figsize
# Load the Boston housing price dataset.
# Note: load_boston() was removed in scikit-learn 1.2, so this call
# requires scikit-learn < 1.2.
housing_prices = datasets.load_boston()
# Create Pandas dataframe with one column per feature
df_hp = pd.DataFrame(housing_prices.data, columns=housing_prices.feature_names)
df_hp['MEDV'] = housing_prices.target
# Create histogram on MEDV column (target column)
figsize(7, 5)
plt.hist(df_hp['MEDV'], color='blue', edgecolor='black', bins=45)
plt.xlabel('Median value of owner-occupied homes in $1000')
plt.ylabel('No. of houses')
plt.title('Housing prices frequencies')
Executing the above code produces the following histogram.
When you want to understand the distribution of data with respect to different characteristics, you can plot multiple histograms side by side on the same plot. For example, to understand the distribution of housing prices for different values of the index of accessibility to radial highways (the RAD column), you would plot one histogram per RAD value on the same plot. Here is the code for plotting the histograms side by side on the same plot:
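The same kind of plot can also be drawn through the Pandas plotting interface, which calls Matplotlib underneath. The sketch below is an illustration under assumptions: it uses a small synthetic dataframe with a made-up `price` column as a stand-in for the MEDV column, so that it runs without the Boston dataset.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for a housing price column (values in $1000),
# generated only for illustration.
rng = np.random.default_rng(seed=1)
df = pd.DataFrame({"price": rng.normal(loc=22, scale=9, size=500).clip(min=5, max=50)})

fig, ax = plt.subplots(figsize=(7, 5))

# Series.plot.hist() draws the histogram of one dataframe column.
df["price"].plot.hist(bins=10, color="blue", edgecolor="black", ax=ax)
ax.set_xlabel("Median value of owner-occupied homes in $1000")
ax.set_ylabel("No. of houses")
ax.set_title("Housing prices frequencies")
```

Passing `ax=ax` keeps the plot on a figure you control, so labels and titles can be set with the usual Matplotlib methods.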
import pandas as pd
import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline
matplotlib.rcParams['font.size'] = 10
matplotlib.rcParams['figure.dpi'] = 100
from IPython.core.pylabtools import figsize
# Load the Boston housing price dataset.
# Note: load_boston() was removed in scikit-learn 1.2, so this call
# requires scikit-learn < 1.2.
housing_prices = datasets.load_boston()
# Create Pandas dataframe with one column per feature
df_hp = pd.DataFrame(housing_prices.data, columns=housing_prices.feature_names)
df_hp['MEDV'] = housing_prices.target
# Set the default figure size
figsize(6, 4)
# Create list of data according to different accessibility index
#
x1 = list(df_hp[df_hp['RAD'] == 1]['MEDV'])
x2 = list(df_hp[df_hp['RAD'] == 2]['MEDV'])
x3 = list(df_hp[df_hp['RAD'] == 3]['MEDV'])
# Setting colors and names
#
colors=['blue', 'green', 'orange']
names=['RAD-1', 'RAD-2', 'RAD-3']
# Creating plot with list values, colors and names (labels)
# Note density=True: each dataset is normalized separately so that
# its bars have a total area of 1
#
plt.hist([x1, x2, x3], color=colors, label=names, density=True)
# Set the legend and labels
#
plt.legend()
plt.title('Side-by-side Histogram for housing prices')
plt.xlabel('Median value of owner-occupied homes in $1000')
Here is what the side-by-side histogram plot looks like:
Another requirement can be to view the histograms stacked on top of each other. The goal is still to understand the data distribution for different attribute values, but the bars for each group are stacked within each bin instead of being drawn next to each other. The only difference in the code is one additional parameter, stacked=True, in the plt.hist() call used to draw the side-by-side histograms. Note the same in the code given below:
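Writing one filter per RAD value gets tedious as the number of groups grows. One possible alternative is to build the groups with `groupby`. The sketch below assumes a synthetic dataframe whose `RAD` and `MEDV` columns are randomly generated stand-ins, not the real Boston data.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in data with a group column and a value column.
rng = np.random.default_rng(seed=7)
df = pd.DataFrame({
    "RAD": rng.choice([1, 2, 3], size=300),
    "MEDV": rng.normal(loc=25, scale=8, size=300),
})

# One array of values per RAD group, keyed by its legend label.
groups = {f"RAD-{rad}": grp["MEDV"].to_numpy() for rad, grp in df.groupby("RAD")}

fig, ax = plt.subplots(figsize=(6, 4))
heights, edges, _ = ax.hist(
    list(groups.values()), label=list(groups.keys()), density=True
)
ax.legend()
ax.set_title("Side-by-side Histogram for housing prices")
ax.set_xlabel("Median value of owner-occupied homes in $1000")

# With density=True and no stacking, each group's bars have a total area of 1.
areas = [np.sum(h * np.diff(edges)) for h in heights]
print(areas)
```

All groups share the same bin edges, which is what makes the bars within a bin directly comparable.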
import pandas as pd
import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline
matplotlib.rcParams['font.size'] = 10
matplotlib.rcParams['figure.dpi'] = 100
from IPython.core.pylabtools import figsize
# Load the Boston housing price dataset.
# Note: load_boston() was removed in scikit-learn 1.2, so this call
# requires scikit-learn < 1.2.
housing_prices = datasets.load_boston()
# Create Pandas dataframe with one column per feature
df_hp = pd.DataFrame(housing_prices.data, columns=housing_prices.feature_names)
df_hp['MEDV'] = housing_prices.target
# Set the default figure size
figsize(7, 5)
# Create list of data according to different accessibility index
#
x1 = list(df_hp[df_hp['RAD'] == 1]['MEDV'])
x2 = list(df_hp[df_hp['RAD'] == 2]['MEDV'])
x3 = list(df_hp[df_hp['RAD'] == 3]['MEDV'])
# Setting colors and names
#
colors=['blue', 'green', 'orange']
names=['RAD-1', 'RAD-2', 'RAD-3']
# Creating plot with list values, colors and names (labels)
# Note density=True, which normalizes the histogram; combined with
# stacked=True, the stacked total (not each dataset) has an area of 1
# Note the parameter stacked=True which results in stacked histogram plot
#
plt.hist([x1, x2, x3], color=colors, label=names, density=True, stacked=True)
# Set the legend and labels
#
plt.legend()
plt.title('Stacked Histogram for housing prices')
plt.xlabel('Median value of owner-occupied homes in $1000')
The stacked histogram plot would look like the following:
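To make the stacking behaviour concrete, the sketch below inspects the bar heights that `hist()` returns. The three samples are randomly generated for illustration. The cumulative nature of the returned heights is how current Matplotlib versions appear to behave, so treat it as an assumption to verify against your installed version.

```python
import numpy as np
import matplotlib.pyplot as plt

# Three synthetic samples standing in for the RAD-1, RAD-2 and RAD-3 groups.
rng = np.random.default_rng(seed=3)
x1 = rng.normal(20, 5, size=60)
x2 = rng.normal(25, 6, size=80)
x3 = rng.normal(30, 7, size=100)

fig, ax = plt.subplots(figsize=(7, 5))
tops, edges, _ = ax.hist(
    [x1, x2, x3],
    bins=12,
    label=["RAD-1", "RAD-2", "RAD-3"],
    density=True,
    stacked=True,
)
ax.legend()

# Each row of `tops` is the top of that group's stacked bars, so the rows
# never decrease from one group to the next.
tops = np.asarray(tops)
widths = np.diff(edges)

# With density=True and stacked=True, the full stack has a total area of 1.
total_area = np.sum(tops[-1] * widths)
print("Area of the full stack:", total_area)
```

This also shows why a stacked density histogram is read differently from the side-by-side version: the individual groups no longer integrate to 1, only their sum does.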
We hope you found this introduction to histogram plots helpful, along with the walkthrough of creating them using the Matplotlib and Pandas packages in Python. If you have any questions, please don’t hesitate to reach out to us. And be sure to check out our other tutorials for more in-depth looks at data science and Python programming.