**histograms** and **density plots**. These visualizations help us understand the distribution of data and identify patterns that may not be apparent from raw numbers alone. In this blog, we will explore how to create histograms and density plots in two popular programming languages, Python and R.

## What are Histograms & Density Plots? When to Use Them?

### What is a Histogram?

A **histogram** is a graph that displays the frequency of data in equal intervals, or bins. It consists of a series of bars, where each bar represents a range of values, and the height of the bar corresponds to the number of data points that fall within that range. Histograms are commonly used to show the distribution of a single variable, such as age, income, or test scores. The bin width of a histogram is the distance on the x-axis between the two edges of a bin.

**If the bin width is too small**, the histogram looks visually “busy” and shows random noise in the data rather than the underlying pattern, so the main trends get obscured. Conversely, **if the bin width is too large**, the histogram oversimplifies the data and hides important details, such as small peaks and valleys that carry information about the distribution.
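To make the effect of bin width concrete, here is a minimal sketch that counts the same sample with three different bin widths. The sample, the helper name `histogram_counts`, and the widths 0.1, 0.5, and 2.0 are assumptions chosen for illustration only.

```python
import numpy as np

# Hypothetical sample: 200 draws from a standard normal distribution
rng = np.random.default_rng(123)
sample = rng.normal(loc=0.0, scale=1.0, size=200)

def histogram_counts(values, bin_width):
    """Count how many values fall into equal-width bins of the given width."""
    # Align the outer edges to multiples of bin_width so every value is covered
    start = np.floor(values.min() / bin_width) * bin_width
    stop = np.ceil(values.max() / bin_width) * bin_width
    n_bins = max(1, int(round((stop - start) / bin_width)))
    edges = np.linspace(start, stop, n_bins + 1)
    counts, edges = np.histogram(values, bins=edges)
    return counts, edges

for width in (0.1, 0.5, 2.0):
    counts, edges = histogram_counts(sample, width)
    print(f"bin width {width}: {len(counts)} bins, tallest bar = {counts.max()}")
```

With a narrow width the points are spread over many short bars, while a wide width collects them into a few tall bars. Comparing the printed bin counts is a quick way to check whether a chosen width is likely to look noisy or oversimplified before plotting.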

### What is a Density Plot?

A **density plot** shows an estimate of the probability density function of a variable. It is a smoothed version of a histogram, where the bars are replaced by a continuous curve. Density plots are useful for showing the shape of a distribution and identifying its mode, skewness, and kurtosis. The **shape and smoothness of a density plot** depend on **two main factors**: the **choice of kernel** and the **bandwidth** used for the density estimation.

The kernel function is a weighting function centered on each observation; the density estimate at a point is the sum of the weights that all observations contribute to that point. The choice of kernel function affects the shape of the density plot, as different kernel functions give different weights to the observations depending on how far they are from the point being estimated.
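As a rough illustration of how kernels weight observations differently, the sketch below evaluates three commonly used kernels at a few distances. The function names are hypothetical helpers written for this example, not part of any library.

```python
import numpy as np

def gaussian_kernel(u):
    """Gaussian kernel: every observation gets some weight, decaying smoothly."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def epanechnikov_kernel(u):
    """Epanechnikov kernel: parabolic weights, zero beyond one bandwidth."""
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def uniform_kernel(u):
    """Uniform (box) kernel: equal weight within one bandwidth, zero outside."""
    return np.where(np.abs(u) <= 1.0, 0.5, 0.0)

# u is the distance from an observation, measured in units of the bandwidth
u = np.array([0.0, 0.5, 1.0, 2.0])
for name, kernel in [("gaussian", gaussian_kernel),
                     ("epanechnikov", epanechnikov_kernel),
                     ("uniform", uniform_kernel)]:
    print(f"{name:>12}: {np.round(kernel(u), 3)}")
```

Each of these functions integrates to 1, which is what keeps the resulting density estimate a valid probability density. The Gaussian kernel is the only one of the three that still gives a non-zero weight at a distance of two bandwidths.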

The bandwidth controls the amount of smoothing applied to the density plot. A larger bandwidth leads to a smoother density plot that may hide real features such as separate peaks, while a smaller bandwidth leads to a more visually busy density plot that may show noise. The choice of bandwidth is important, as it can have a significant impact on the shape and accuracy of the density plot.

Together, the kernel function and the bandwidth determine the smoothness and shape of the density plot. The optimal choice of kernel and bandwidth depends on the data being analyzed and the goals of the analysis. Generally, a Gaussian kernel is a good choice for most datasets, as it is smooth and has a continuous derivative, making it easier to differentiate and integrate.

Here is a sample density plot with a Gaussian kernel and two different bandwidths (bandwidth=0.5 and bandwidth=2):
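The same comparison can be sketched numerically. The code below is a minimal, hand-written Gaussian kernel density estimate, assuming a made-up two-cluster sample; `gaussian_kde_curve` is a hypothetical helper for this post and not a library function.

```python
import numpy as np

def gaussian_kde_curve(sample, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate at the points in `grid`."""
    # Scaled distance from every grid point (rows) to every observation (columns)
    u = (grid[:, None] - sample[None, :]) / bandwidth
    kernel_values = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    # Sum the kernels and rescale so that the curve integrates to 1
    return kernel_values.sum(axis=1) / (len(sample) * bandwidth)

# Hypothetical sample with two clusters centered at -1.5 and 1.5
rng = np.random.default_rng(123)
sample = np.concatenate([rng.normal(-1.5, 0.7, 150), rng.normal(1.5, 0.7, 150)])

grid = np.linspace(-10.0, 10.0, 1001)
step = grid[1] - grid[0]

for bandwidth in (0.5, 2.0):
    density = gaussian_kde_curve(sample, grid, bandwidth)
    # Count interior local maxima of the estimated curve
    peaks = np.sum((density[1:-1] > density[:-2]) & (density[1:-1] > density[2:]))
    area = density.sum() * step
    print(f"bandwidth={bandwidth}: peaks={peaks}, "
          f"highest density={density.max():.3f}, area={area:.3f}")
```

The printed peak count and maximum height give a rough sense of how much structure survives each level of smoothing: the larger bandwidth spreads every observation over a wider range, which flattens the curve. The area stays close to 1 in both cases because each kernel is rescaled by the bandwidth.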

### What’s the Difference between a Histogram & a Density Plot?

## Code Sample – Draw Histogram and Density Plot

Histograms and density plots are very useful for examining the spread of a data variable. The sections below show how to draw them in Python and in R with the ggplot2 package. The ggplot2 package comes with the diamonds dataset, but the examples here use randomly generated data so that they are self-contained.

### Histogram Plots using Python

Check out my blog on Histogram plots using Matplotlib & Pandas: Python.
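As a quick, self-contained sketch in the meantime, the following draws a histogram with Matplotlib. The random marks data, the bin edges every 5 marks, and the colors are assumptions made for this illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical sample data: 100 random marks between 35 and 100 (inclusive)
rng = np.random.default_rng(123)
marks = rng.integers(35, 101, size=100)

# Bin edges every 5 marks, from 35 up to and including 100
bin_edges = np.arange(35, 101, 5)

# Draw the histogram with blue bars outlined in black
fig, ax = plt.subplots()
counts, edges, _ = ax.hist(marks, bins=bin_edges, color="blue", edgecolor="black")
ax.set_xlabel("Marks Scored")
ax.set_ylabel("Frequency")
ax.set_title("Histogram of Student Marks")
plt.show()
```

Passing explicit bin edges to `bins` fixes the bin width at 5 marks; passing an integer instead would ask Matplotlib for that many equal-width bins across the range of the data.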

### Histogram Plot using R

The code below does the following:

- First, the ggplot2 library is loaded. Then, a set of random data is generated using the **rnorm()** function.
- Next, a data frame called **df** is created that contains the random data in a single column called “Data”.
- The **ggplot()** function is used to create the plot. The **data** parameter specifies the data frame to be used, and the **aes()** function specifies the variable to be plotted on the x-axis.
- The **geom_histogram()** function creates the histogram plot, with the **binwidth** parameter set to 0.5 to specify the width of the bins. The **color** parameter is set to “black” to outline the bars and the **fill** parameter is set to “blue” to fill the bars with blue color.
- Finally, a title is added to the plot using the **ggtitle()** function, a label for the x-axis using the **xlab()** function, and a label for the y-axis using the **ylab()** function.

The code below creates a plot with a histogram of the random data, with the bars colored blue and outlined in black. The ggplot2 package provides a lot of flexibility for customizing the appearance of the plot, such as adjusting the bin width, changing the bar color and style, and adding additional layers to the plot.

```
# Load ggplot2 library
library(ggplot2)

# Generate sample data
set.seed(123)
data <- rnorm(100)

# Create data frame
df <- data.frame(Data = data)

# Create histogram plot
ggplot(data = df, aes(x = Data)) +
  geom_histogram(binwidth = 0.5, color = "black", fill = "blue") +
  ggtitle("Histogram Plot") +
  xlab("Data") +
  ylab("Frequency")
```

### Density plot using Python

The following code creates a density plot in Python that represents the marks scored by students in a school. Note the following:

- A sample data set of 100 records is generated using the **np.random.randint()** function from NumPy, which produces random marks between 35 and 100.
- A data frame called **df** is created to contain the random marks data in a single column called “Marks”.
- The **sns.kdeplot()** function from Seaborn is used to create the density plot. **df[‘Marks’]** specifies the variable to be plotted. The area under the curve is filled with color by setting **shade** to **True**; newer Seaborn releases deprecate **shade** in favor of **fill**, which works the same way. The **color** parameter is set to ‘blue’ to set the color of the plot.
- The Seaborn library is also used to customize the plot style. The plot style is set to ‘darkgrid’ using **sns.set_style()**, the color palette is set to ‘pastel’ using **sns.set_palette()**, and the font scale of the text is set using **sns.set()**.

```
# Import necessary libraries
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

# Generate sample data: 100 random marks between 35 and 100 (inclusive)
np.random.seed(123)
data = np.random.randint(35, 101, 100)

# Create data frame
df = pd.DataFrame({'Marks': data})

# Set the plot style before drawing so that it applies to this figure.
# sns.set() resets the style and palette, so it is called first.
sns.set(font_scale=1.2)
sns.set_style("darkgrid")
sns.set_palette("pastel")

# Create density plot (fill=True replaces the deprecated shade=True)
sns.kdeplot(x=df['Marks'], fill=True, color='blue')

# Add axis labels and plot title
plt.xlabel("Marks Scored")
plt.ylabel("Density")
plt.title("Density Plot of Student Marks")
plt.show()
```

## Conclusion

Histograms and density plots are essential visualization tools used to explore and understand the distribution of data. A histogram displays the distribution of data by dividing it into equal intervals and representing the frequency of data points in each interval using bars. A density plot, on the other hand, displays the distribution of data by estimating the probability density function of the data and plotting it as a curve.

In Python, we can use the matplotlib library to create histograms and the seaborn library to create density plots. In R, we can use the ggplot2 library to create both histograms and density plots. In summary, histograms and density plots are powerful tools that can help you gain insights into your data. By using the code examples provided in this blog, you can start creating your own histograms and density plots in Python and R. So, go ahead and explore your data with these powerful visualization tools!
