Descriptive statistics is the branch of statistics concerned with summarizing and describing the characteristics of a dataset. It is one of the most fundamental tools data scientists use to understand data when they start working on a dataset. In this blog post, I will cover the key concepts of descriptive statistics, including measures of central tendency, measures of spread, and statistical moments.
Descriptive statistics summarizes a dataset in terms of its mean and related measures, the spread or dispersion of the data around the mean, and so on. The following are some of the important reasons why we need descriptive statistics in the first place:
Now that we understand what descriptive statistics is and why it is important, let’s take a closer look at some of its key components. These components include measures of central tendency, measures of dispersion, and statistical moments. By understanding these components, we can gain a more detailed understanding of the characteristics of a dataset, which can be used to make informed decisions and predictions based on data.
Measures of central tendency are statistical measures that provide information about the typical or central value of a dataset. They summarize the data by identifying the central point around which the data is clustered. The three most commonly used measures of central tendency are the mean (the arithmetic average of the values), the median (the middle value of the sorted data), and the mode (the most frequently occurring value).
The following examples help illustrate the concepts of mean, median, and mode:
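As a minimal sketch, the three measures can be computed with Python's standard `statistics` module. The sample values below are made up purely for illustration and are not taken from any real dataset.

```python
import statistics

# Hypothetical sample values, chosen only for illustration
data = [2, 3, 3, 5, 7, 10]

# Mean: sum of the values divided by the number of values
mean_value = statistics.mean(data)

# Median: middle value of the sorted data; with an even number of
# values it is the average of the two middle values (here 3 and 5)
median_value = statistics.median(data)

# Mode: the value that occurs most often (3 appears twice)
mode_value = statistics.mode(data)

print("mean:", mean_value)
print("median:", median_value)
print("mode:", mode_value)
```

Note that the three measures need not agree: here the larger values pull the mean above both the median and the mode.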
It is important to note that measures of central tendency can be affected by outliers or skewed data, which can lead to misleading results. Therefore, it is often useful to use measures of dispersion in combination with measures of central tendency to gain a more complete understanding of the dataset.
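The effect of an outlier can be seen in a small sketch like the one below. The numbers are hypothetical; the point is only that a single extreme value moves the mean a great deal while the median barely changes.

```python
import statistics

# Hypothetical data, and the same data with one extreme value appended
values = [10, 12, 13, 14, 15]
with_outlier = values + [200]

mean_before = statistics.mean(values)
mean_after = statistics.mean(with_outlier)

median_before = statistics.median(values)
median_after = statistics.median(with_outlier)

# The mean shifts substantially; the median moves by only half a unit
print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```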
Measures of spread or dispersion are statistical measures that provide information about the variability of a dataset. They describe how far apart the values in a dataset are from each other and from the central value. Measures of spread are important because they provide insight into the level of uncertainty or variability in the data. Some of the commonly used measures of spread include range, interquartile range, variance, and standard deviation.
Variance is one of the most important measures of spread. It is defined as the average of the squared differences between each value in the dataset and the mean of the dataset. In other words, it measures how far the values lie from the mean, on average, in squared units. For a sample, the sum of squared differences is divided by [latex]n-1[/latex] rather than [latex]n[/latex] (Bessel's correction), which is the form used in the formula below.
The formula for variance is:
[latex]\operatorname{Var}(X) = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2[/latex]
Where [latex]\operatorname{Var}(X)[/latex] represents the variance of the variable [latex]X[/latex], [latex]n[/latex] represents the sample size, [latex]X_i[/latex] represents the [latex]i[/latex]th observation in the sample, [latex]\bar{X}[/latex] represents the sample mean, and the symbol [latex]\sum[/latex] denotes summation over all [latex]n[/latex] observations. The variance is always non-negative, and it is zero only when all values are identical. A higher variance indicates that the data points are more spread out from the mean, while a lower variance indicates that the data points are closer to the mean.
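The variance formula can be sketched step by step in Python. The data values are hypothetical, and the result is cross-checked against `statistics.variance` (which divides by n - 1) and `statistics.pvariance` (which divides by n) from the standard library.

```python
import statistics

# Hypothetical sample, for illustration only
data = [4, 8, 6, 5, 3, 7]

n = len(data)
mean = sum(data) / n

# Squared difference of each observation from the sample mean
squared_deviations = [(x - mean) ** 2 for x in data]

# Sample variance: divide by n - 1, as in the formula above
sample_variance = sum(squared_deviations) / (n - 1)

# Population variance: divide by n instead
population_variance = sum(squared_deviations) / n

print("sample variance:    ", sample_variance)
print("population variance:", population_variance)
print("statistics.variance:", statistics.variance(data))
```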
Standard deviation is another key measure of spread. It is defined as the square root of the variance. In other words, it measures the amount of dispersion of the data points from the mean. Standard deviation is expressed in the same units as the original data, which makes it easier to interpret than the variance. The formula for the standard deviation of a sample is:
Standard Deviation = [latex]\sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}[/latex]
where [latex]x_i[/latex] represents each value in the dataset, [latex]\bar{x}[/latex] represents the mean of the dataset, and [latex]n[/latex] represents the number of values in the dataset. To find the standard deviation of a population rather than a sample, divide by the population size [latex]N[/latex] instead of [latex]n-1[/latex]. Standard deviation is a commonly used measure of spread in statistical analysis and modeling, and is often used in conjunction with the mean to summarize and describe the characteristics of a dataset.
Range is another measure of spread; it is the difference between the largest and smallest values in a dataset. It provides a simple measure of how spread out the data is.
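A short sketch of the sample and population standard deviations, using the standard library's `statistics.stdev` (divisor n - 1) and `statistics.pstdev` (divisor N). The data values are hypothetical.

```python
import math
import statistics

# Hypothetical sample, for illustration only
data = [4, 8, 6, 5, 3, 7]

# Sample standard deviation: square root of the sample variance
sample_sd = statistics.stdev(data)

# Population standard deviation: divisor N instead of n - 1
population_sd = statistics.pstdev(data)

# The standard deviation is the square root of the corresponding variance
sd_from_variance = math.sqrt(statistics.variance(data))

print("sample sd:    ", sample_sd)
print("population sd:", population_sd)
```

Because the population form divides by a larger number, it is always slightly smaller than the sample form for the same data.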
The formula for range is:
Range = Maximum value – Minimum value
Where the maximum value is the largest value in the dataset and the minimum value is the smallest value in the dataset. Range is a useful measure of spread for datasets with a small number of values or for providing a quick overview of the variability in the data. However, it does not provide as much information as other measures of spread such as variance or standard deviation.
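As a small illustration with made-up values, the range is a one-line computation, and the sketch also shows why it carries limited information: two datasets with very different shapes can share the same range.

```python
# Two hypothetical datasets with the same minimum and maximum
evenly_spread = [1, 3, 5, 7, 9]
clustered = [1, 5, 5, 5, 9]

# Range = maximum value - minimum value
range_even = max(evenly_spread) - min(evenly_spread)
range_clustered = max(clustered) - min(clustered)

print(range_even, range_clustered)
```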
Interquartile range (IQR) is another measure of spread which is used to describe the spread of the middle 50% of the data in a dataset. It is calculated by finding the difference between the third quartile (Q3) and the first quartile (Q1), which represent the 75th and 25th percentiles of the dataset, respectively. The formula for IQR is:
IQR = Q3 – Q1
The IQR is a useful measure of spread for datasets that contain outliers or are skewed, as it focuses on the middle portion of the data and is less affected by extreme values. The IQR can also be used to identify potential outliers: values that fall below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are often considered to be outliers.
Statistical moments are another key concept in descriptive statistics; they describe the shape and characteristics of a data distribution. A moment is a quantitative measure of a distribution, and the moments of a distribution are used to calculate statistical properties such as the mean, variance, skewness, and kurtosis.
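The IQR and the 1.5 × IQR outlier rule can be sketched as follows. The data is hypothetical, and the quartiles are computed with `statistics.quantiles` using `method="inclusive"`; this is one of several quartile conventions, so other tools may report slightly different Q1 and Q3 values for the same data.

```python
import statistics

# Hypothetical data with one suspiciously large value
data = [1, 3, 4, 5, 6, 7, 8, 9, 30]

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

iqr = q3 - q1

# Fences at 1.5 * IQR below Q1 and above Q3
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are flagged as potential outliers
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("fences:", lower_fence, upper_fence)
print("potential outliers:", outliers)
```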
The formula for the kth statistical moment of a continuous probability distribution is:
In the above formula, k denotes the order of the moment. Setting k to 1, 2, 3, and 4 gives the first, second, third, and fourth moments. These correspond to the mean (first raw moment), the variance (second central moment), the skewness (third standardized moment), and the kurtosis (fourth standardized moment).
We have already covered the mean and the variance as measures of central tendency and spread, respectively. The concepts of skewness and kurtosis are described below:
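One standard way to write this, offered here as an assumption about the intended formula, uses [latex]f(x)[/latex] for the probability density function of [latex]X[/latex] and [latex]c[/latex] for the point about which the moment is taken (both symbols are introduced only for this illustration):

[latex]\mu_k = E\left[(X - c)^k\right] = \int_{-\infty}^{\infty} (x - c)^k f(x)\,dx[/latex]

With [latex]c = 0[/latex] this is the raw moment, and with [latex]c[/latex] equal to the mean of [latex]X[/latex] it is the central moment.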
Skewness is the statistical moment that represents the degree of asymmetry of a probability distribution around its mean. Specifically, skewness measures the lack of symmetry in the tails of the distribution, with positive skewness indicating a longer or fatter tail on the right (positive) side and negative skewness indicating a longer or fatter tail on the left (negative) side. The following picture shows right-tailed (positive) and left-tailed (negative) skewed distributions.
The value of skewness can be positive or negative. A positive value means that the tail of the distribution extends towards the right, or the positive side of the distribution, as per the above diagram. This indicates that there are more extreme values on the positive side than on the negative side, and the mean is likely to be larger than the median. On the other hand, a negative value of skewness means that the tail of the distribution extends towards the left, or the negative side of the distribution. This indicates that there are more extreme values on the negative side than on the positive side, and the mean is likely to be smaller than the median. A perfectly symmetrical distribution, such as the normal distribution, has a skewness of zero.
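A minimal sketch of computing skewness from data, assuming the simple moment-based coefficient (third central moment divided by the second central moment raised to the power 1.5, with no small-sample bias correction). The helper names `central_moment` and `skewness` and the datasets are hypothetical.

```python
import statistics


def central_moment(data, k):
    """k-th central moment of the data, using divisor n."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** k for x in data) / n


def skewness(data):
    """Moment-based skewness: m3 / m2**1.5 (no bias correction)."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5


# Hypothetical datasets
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]      # long tail on the right
left_skewed = [-x for x in right_skewed]      # mirror image: tail on the left
symmetric = [1, 2, 3, 4, 5]

print("right-skewed:", skewness(right_skewed))
print("left-skewed: ", skewness(left_skewed))
print("symmetric:   ", skewness(symmetric))

# For the right-skewed data the mean exceeds the median
print(statistics.mean(right_skewed), statistics.median(right_skewed))
```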
Kurtosis is the statistical moment that represents the degree of tail heaviness of a probability distribution. It is usually reported relative to a normal distribution (excess kurtosis), so it measures the extent to which the tails of a distribution differ from those of a normal distribution.
A distribution with high kurtosis has heavier tails than a normal distribution, meaning there are more extreme values in the tails than would be expected for a normal distribution. Such a distribution is often referred to as leptokurtic. Examples of leptokurtic distributions include the Student’s t-distribution, Rayleigh distribution, Laplace distribution, exponential distribution, Poisson distribution, etc.
On the other hand, a distribution with low kurtosis has lighter tails than a normal distribution, meaning there are fewer extreme values in the tails than would be expected for a normal distribution. Such a distribution is often referred to as platykurtic. Examples of platykurtic distributions include the continuous and discrete uniform distributions, which have a negative excess kurtosis.
A normal distribution (mesokurtic) has an excess kurtosis of 0 (equivalently, a kurtosis of 3), so any distribution with an excess kurtosis greater than 0 is considered to be leptokurtic, while any distribution with an excess kurtosis less than 0 is considered to be platykurtic.
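As a rough sketch, excess kurtosis can be computed from data as the fourth central moment divided by the squared second central moment, minus 3, again without any small-sample bias correction. The helper names and datasets below are hypothetical.

```python
def central_moment(data, k):
    """k-th central moment of the data, using divisor n."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** k for x in data) / n


def excess_kurtosis(data):
    """Moment-based excess kurtosis: m4 / m2**2 - 3 (no bias correction)."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3


# Discrete uniform values 1..10: light tails, so the result is negative
uniform_like = list(range(1, 11))

# Mostly small values plus two extreme ones: heavy tails, positive result
heavy_tailed = [-10, -1, -1, 0, 0, 0, 0, 1, 1, 10]

print("uniform-like:", excess_kurtosis(uniform_like))
print("heavy-tailed:", excess_kurtosis(heavy_tailed))
```

The sign of the result is what matters here: negative for the uniform-like data (platykurtic) and positive for the heavy-tailed data (leptokurtic).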
Descriptive statistics is an important tool for summarizing and describing the characteristics of a dataset. It provides a set of measures that are used to describe the central tendency, spread, and shape of the data. In this article, we have covered some of the key concepts of descriptive statistics, including measures of central tendency, measures of dispersion or spread, and statistical moments.
Measures of central tendency, such as mean, median, and mode, provide information about the center of the data and are useful for summarizing the overall characteristics of the dataset. Measures of dispersion or spread, such as range, interquartile range, variance, and standard deviation, provide information about the variability or spread of the data and are useful for identifying any potential outliers or patterns in the data. Statistical moments, such as skewness and kurtosis, provide information about the shape of the distribution and are useful for characterizing the properties of the data.
Understanding these key concepts of descriptive statistics is important for data scientists, as it enables them to analyze and interpret data more effectively. By using measures of central tendency, dispersion, and statistical moments, data scientists can gain valuable insights into the data and make informed decisions based on the characteristics of the dataset. Whether you are working in finance, engineering, or any other field that involves data analysis, a solid understanding of descriptive statistics is essential for success.
View Comments
Low kurtosis does not imply a flatter peak. For example, the beta(.5,1) distribution is infinitely peaked but has low kurtosis.
Thank you Peter. The comment was indeed very insightful. Made the changes to the content.