Histogram and Density Plots in Python & R

In the world of data science, visualizing data is crucial to make sense of the information at hand. One of the most popular ways to visualize data is by using histograms and density plots. These visualizations help us understand the distribution of data and identify patterns that may not be apparent from raw numbers alone. In this blog, we will explore how to create histograms and density plots in two popular programming languages, Python and R.

As a data scientist, it is important to have a good understanding of these visualizations because they allow you to communicate your findings effectively. Histograms and density plots help you see the spread of the data and check for normality. Both are important steps in the data analysis process, and being able to create and interpret these visualizations will make you a more effective and efficient data scientist. So, whether you are new to data science or an experienced practitioner, understanding histograms and density plots is an essential skill.

What are Histograms & Density Plots? When to use them?

A histogram and a density plot are both visualizations that show the distribution of data. However, they differ in the way they display this information.

What’s a Histogram?

A histogram is a graph that displays the frequency of data in equal intervals, or bins. It consists of a series of bars, where each bar represents a range of values and the height of the bar corresponds to the number of data points that fall within that range. Histograms are commonly used to show the distribution of a single variable, such as age, income, or test scores. The bin width of a histogram is the distance between the two edges of each bin on the x-axis. If the bin width is too small, the histogram looks visually “busy”: it reveals random noise in the data rather than the underlying pattern, and the main trends get obscured. Conversely, if the bin width is too large, the histogram oversimplifies the data and hides important details, such as small peaks and valleys in the distribution.
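To make the effect of bin width concrete, here is a minimal sketch in plain Python (standard library only). The helper name bin_counts and the sample values are made up for illustration. Plotting libraries do similar bookkeeping internally, though their conventions for bin edges may differ.

```python
import math
import random


def bin_counts(values, bin_width):
    """Count how many values fall into each equal-width bin.

    Bins start at min(values); each bin covers [left, left + bin_width).
    """
    start = min(values)
    n_bins = math.floor((max(values) - start) / bin_width) + 1
    counts = [0] * n_bins
    for v in values:
        counts[math.floor((v - start) / bin_width)] += 1
    return counts


# Hypothetical sample: 200 draws from a standard normal distribution
random.seed(123)
sample = [random.gauss(0, 1) for _ in range(200)]

# Narrow bins give many short bars, wide bins give a few tall ones
for width in (0.1, 0.5, 2.0):
    counts = bin_counts(sample, width)
    print(f"bin width {width}: {len(counts)} bins, tallest bar = {max(counts)}")
```

Running this with different widths shows the trade-off described above: the same 200 values are spread over many sparse bins or packed into a handful of coarse ones.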

To determine a suitable bin width, data scientists use rules of thumb such as the Sturges rule, the Scott rule, and the Freedman-Diaconis rule. Each rule estimates the bin width from summary statistics of the data set: the Sturges rule uses the number of observations and the range of values, the Scott rule uses the number of observations and the standard deviation, and the Freedman-Diaconis rule uses the number of observations and the interquartile range.
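As a rough sketch, the three rules can be written in a few lines of standard-library Python. The function names and the sample data are my own, and the formulas are the commonly quoted textbook versions, so treat the results as starting points rather than definitive answers. If you use NumPy, numpy.histogram_bin_edges() accepts bins="sturges", bins="scott" and bins="fd" for the same purpose.

```python
import math
import random
import statistics


def sturges_width(values):
    """Sturges: use ceil(log2(n)) + 1 bins across the data range."""
    n_bins = math.ceil(math.log2(len(values))) + 1
    return (max(values) - min(values)) / n_bins


def scott_width(values):
    """Scott: width = 3.49 * standard deviation * n^(-1/3)."""
    return 3.49 * statistics.stdev(values) * len(values) ** (-1 / 3)


def freedman_diaconis_width(values):
    """Freedman-Diaconis: width = 2 * interquartile range * n^(-1/3)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return 2 * (q3 - q1) * len(values) ** (-1 / 3)


# Hypothetical sample: 100 draws from a standard normal distribution
random.seed(123)
sample = [random.gauss(0, 1) for _ in range(100)]

print("Sturges bin width:          ", round(sturges_width(sample), 3))
print("Scott bin width:            ", round(scott_width(sample), 3))
print("Freedman-Diaconis bin width:", round(freedman_diaconis_width(sample), 3))
```

The rules rarely agree exactly, which is a good reminder to try a couple of bin widths before settling on one.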

The following picture shows histograms of the same data drawn with different bin widths.

What’s a Density Plot?

A density plot shows an estimate of the probability density function of a variable. It is a smoothed version of a histogram, where the bars are replaced by a continuous curve. Density plots are useful for showing the shape of a distribution and identifying its modes, skewness, and kurtosis. The shape and smoothness of a density plot depend on two main factors: the choice of kernel and the bandwidth used for the density estimation.

The kernel function is a mathematical function that determines how the density estimate is calculated at each point. The choice of kernel function can have a significant impact on the shape of the density plot, as different kernel functions will give different weights to the observations in the data.
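To illustrate how a kernel assigns weights, here is a minimal hand-written kernel density estimate in plain Python. The function names, the data and the choice of the Epanechnikov kernel as a second example are my own assumptions for illustration. Libraries such as Seaborn and SciPy provide tuned implementations that you would normally use instead.

```python
import math


def gaussian_kernel(u):
    """Standard normal density: every observation gets some weight."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)


def epanechnikov_kernel(u):
    """Parabolic kernel: zero weight beyond one bandwidth from x."""
    return 0.75 * (1 - u * u) if abs(u) <= 1 else 0.0


def kde(x, data, bandwidth, kernel=gaussian_kernel):
    """Density estimate at x: the average of scaled kernels centred on each observation."""
    total = sum(kernel((x - xi) / bandwidth) for xi in data)
    return total / (len(data) * bandwidth)


# Hypothetical observations
data = [1.0, 1.5, 2.0, 4.0, 4.2]

for x in (0.0, 2.0, 6.0):
    g = kde(x, data, bandwidth=1.0)
    e = kde(x, data, bandwidth=1.0, kernel=epanechnikov_kernel)
    print(f"x = {x}: gaussian = {g:.4f}, epanechnikov = {e:.4f}")
```

The Gaussian kernel never drops to exactly zero, so it produces smooth tails, while the Epanechnikov kernel ignores observations that are more than one bandwidth away from the point being evaluated.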

The bandwidth controls the amount of smoothing applied to the density plot. A larger bandwidth leads to a smoother density plot, while a smaller bandwidth leads to a more visually busy one. The choice of bandwidth is important: too small a bandwidth shows random noise, while too large a bandwidth can smooth away real features such as separate peaks.
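One way to see the effect of bandwidth without drawing anything is to count the peaks of the estimated curve. The sketch below is plain Python with a Gaussian kernel. The helper names, the grid step and the sample values are assumptions made for this illustration.

```python
import math


def gaussian_kde(x, data, bandwidth):
    """Gaussian kernel density estimate at a single point x."""
    norm = len(data) * bandwidth * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data) / norm


def count_peaks(data, bandwidth, step=0.05):
    """Count local maxima of the estimated density on an evenly spaced grid."""
    lo = min(data) - 4 * bandwidth
    hi = max(data) + 4 * bandwidth
    n_points = int((hi - lo) / step) + 1
    density = [gaussian_kde(lo + i * step, data, bandwidth) for i in range(n_points)]
    return sum(
        1
        for i in range(1, n_points - 1)
        if density[i - 1] < density[i] >= density[i + 1]
    )


# Hypothetical sample with two clusters of values
data = [1.0, 1.2, 1.9, 2.1, 2.4, 6.8, 7.0, 7.5, 7.7, 8.1]

for bandwidth in (0.5, 2.0):
    print(f"bandwidth = {bandwidth}: {count_peaks(data, bandwidth)} peak(s)")
```

As the bandwidth grows, neighbouring bumps blend into each other, so the number of peaks can only stay the same or shrink for a Gaussian kernel. Two observations that are 3 units apart, for example, give two peaks with a bandwidth of 0.5 but only one with a bandwidth of 2.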

Together, the kernel function and the bandwidth determine the smoothness and shape of the density plot. The best choice of kernel and bandwidth depends on the data being analyzed and the goals of the analysis. Generally, a Gaussian kernel is a good choice for most datasets, as it is smooth and has a continuous derivative, making it easier to differentiate and integrate.

The following picture shows sample density plots drawn with a Gaussian kernel and two different bandwidths (bandwidth=0.5 and bandwidth=2).

What’s the difference between a Histogram & a Density Plot?

In general, histograms are best suited for visualizing the distribution of a single variable, while density plots are more suitable for comparing the distribution of two or more variables. However, it is important to note that both visualizations are complementary and can be used together to gain a deeper understanding of the data.
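One reason densities are convenient for comparisons is that they do not depend on sample size. The sketch below, in plain Python with made-up data and helper names, rescales histogram counts so that the bars have a total area of 1, which puts groups of different sizes on the same scale.

```python
import math


def counts_in_bins(values, start, bin_width, n_bins):
    """Count values in n_bins equal-width bins beginning at start."""
    counts = [0] * n_bins
    for v in values:
        idx = math.floor((v - start) / bin_width)
        if not 0 <= idx < n_bins:
            raise ValueError(f"{v} lies outside the binned range")
        counts[idx] += 1
    return counts


def to_density(counts, bin_width):
    """Rescale counts so that the bars' total area equals 1."""
    n = sum(counts)
    return [c / (n * bin_width) for c in counts]


# Hypothetical groups: same shape, but group_a has five times as many values
group_b = [1.2, 2.1, 2.7, 3.5]
group_a = group_b * 5

for name, group in (("group_a", group_a), ("group_b", group_b)):
    counts = counts_in_bins(group, start=1.0, bin_width=1.0, n_bins=3)
    print(name, "counts:", counts, "density:", to_density(counts, 1.0))
```

The raw counts differ by a factor of five, while the density-scaled heights can be compared directly. A density plot gives you this scaling by construction.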

Code Sample – Draw Histogram and Density Plot

Histograms and density plots are very useful for examining the spread of a variable. The R example below uses the ggplot2 package with randomly generated data, and the Python example uses the Seaborn library.

Histogram Plots using Python

Check out my blog on Histogram plots using Matplotlib & Pandas: Python.
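If you just want something to run right away, here is a minimal Matplotlib sketch. It is my own illustration rather than the code from that post. The marks data, the bin edges and the colors are assumptions chosen to mirror the other examples in this blog.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical sample: 100 random integer marks between 35 and 100
rng = np.random.default_rng(123)
marks = rng.integers(35, 101, size=100)

# Bin edges every 5 marks, from 35 up to 105
bin_edges = np.arange(35, 106, 5)

# Draw the histogram with blue bars outlined in black
fig, ax = plt.subplots()
counts, edges, _ = ax.hist(marks, bins=bin_edges, color="blue", edgecolor="black")
ax.set_title("Histogram of Student Marks")
ax.set_xlabel("Marks Scored")
ax.set_ylabel("Frequency")
```

Call plt.show() to display the figure, or fig.savefig() with a file name of your choice to write it to disk. Changing the step in np.arange() is the quickest way to experiment with the bin width.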

Histogram Plot using R

The code in this section does the following:

• First, the ggplot2 library is loaded. Then, a set of 100 random values is generated using the rnorm() function, with set.seed() making the sample reproducible.
• Next, a data frame called df is created that contains the random data in a single column called “Data”.
• The ggplot() function is used to create the plot. The data parameter specifies the data frame to be used, and the aes() function specifies the variable to be plotted on the x-axis.
• The geom_histogram() function creates the histogram plot, with the binwidth parameter set to 0.5 to specify the width of the bins. The color parameter is set to “black” to outline the bars and the fill parameter is set to “blue” to fill the bars with blue color.
• Finally, a title is added to the plot using the ggtitle() function, a label for the x-axis using the xlab() function, and a label for the y-axis using the ylab() function.

The code below creates a plot with a histogram of the random data, with the bars colored blue and outlined in black. The ggplot2 package provides a lot of flexibility for customizing the appearance of the plot, such as adjusting the bin width, changing the bar color and style, and adding additional layers to the plot.

# Load ggplot2 library
library(ggplot2)

# Generate sample data
set.seed(123)
data <- rnorm(100)

# Create data frame
df <- data.frame(Data = data)

# Create histogram plot
ggplot(data = df, aes(x = Data)) +
  geom_histogram(binwidth = 0.5, color = "black", fill = "blue") +
  ggtitle("Histogram Plot") +
  xlab("Data") +
  ylab("Frequency")


Density Plot using Python

The following code creates a density plot using Python. The plot represents the marks scored by students in a school. Note the following:

• A sample data set of 100 records is generated using the np.random.randint() function from NumPy, which produces random integer marks between 35 and 100.
• A data frame called df is created to hold the random marks in a single column called “Marks”.
• The sns.kdeplot() function from Seaborn creates the density plot. The df[‘Marks’] column specifies the variable to be plotted. The fill parameter (named shade in older Seaborn releases) is set to True to fill the area under the curve with color. The color parameter is set to ‘blue’ to set the color of the plot.
• The Seaborn library is also used to customize the plot style. The plot style is set to ‘darkgrid’ using sns.set_style(), the color palette is set to ‘pastel’ using sns.set_palette(), and the font scale is set using sns.set(). Because sns.set() also resets the style and palette to their defaults, it should be called before the other two, and all three should run before the plot is drawn.
# Import necessary libraries
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

# Generate sample data
np.random.seed(123)
data = np.random.randint(35, 101, 100)

# Create data frame
df = pd.DataFrame({'Marks': data})

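# Set font scale, plot style and color palette before drawing
# (sns.set() goes first because it resets style and palette)
sns.set(font_scale=1.2)
sns.set_style("darkgrid")
sns.set_palette("pastel")

# Create density plot (fill=True shades the area under the curve;
# older Seaborn releases used shade=True for this)
sns.kdeplot(x=df["Marks"], fill=True, color="blue")

# Add axis labels and plot title
plt.xlabel("Marks Scored")
plt.ylabel("Density")
plt.title("Density Plot of Student Marks")
plt.show()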


Conclusion

Histograms and density plots are essential visualization tools for exploring and understanding the distribution of data. A histogram displays the distribution by dividing the data into equal intervals and representing the frequency of data points in each interval using bars. A density plot, on the other hand, displays the distribution by estimating the probability density function of the data and plotting it as a curve. In Python, we can use the matplotlib library to create histograms and the seaborn library to create density plots. In R, we can use the ggplot2 library to create both. By using the code examples provided in this blog, you can start creating your own histograms and density plots in Python and R. So, go ahead and explore your data with these powerful visualization tools!