Standard Deviation of Population vs Sample

Have you ever wondered what the difference is between the standard deviation of a population and that of a sample? Or why it’s important to measure both? In this blog post, we will explore what standard deviation is, how the standard deviation of a population differs from that of a sample, and how to calculate both with the help of Python code examples. By the end of this post, you should have a better understanding of standard deviation in general and why it’s important to calculate it for both populations and samples.

What is Standard Deviation?

The standard deviation (SD) is a measure of the variability of a data distribution, that is, of how spread out the values in a data set are around their mean. It is used as part of the z-score, confidence intervals, hypothesis testing, etc. Standard deviation is used as the variability measure for both population and sample data. In statistics, a population is the entire set of objects or individuals about which we want to draw conclusions. A sample is a subset of that population, which allows us to draw conclusions about the overall population based on just one smaller group. Read about the concepts of population and sample in this post: Population and Samples in Statistics: Examples.

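The formula for the standard deviation of a population is as follows, where N is the population size and \mu is the population mean:

\[ \sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2} \]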

The formula for the standard deviation of a sample is as follows, where N is the sample size and \bar{x} is the sample mean:

\[ s = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2} \]
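As a worked illustration of the two formulas, the arithmetic below uses the four values 10, 16, 8, 22 (the first data set in the example that follows). The symbols \sigma (population SD), s (sample SD) and \bar{x} (mean of the four values) are notation assumed here. When the four values are treated as a whole population, the population mean equals \bar{x}.

```latex
\[
\bar{x} = \frac{10 + 16 + 8 + 22}{4} = 14, \qquad
\sum_{i=1}^{4} (x_i - \bar{x})^2 = (-4)^2 + 2^2 + (-6)^2 + 8^2 = 120
\]
\[
\sigma = \sqrt{\frac{120}{4}} = \sqrt{30} \approx 5.48, \qquad
s = \sqrt{\frac{120}{4 - 1}} = \sqrt{40} \approx 6.32
\]
```

The same sum of squared deviations feeds both formulas; only the divisor changes, so the sample value is always the larger of the two.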

Take a look at the following example, which uses two different samples of 4 numbers each. Their means are the same, but their standard deviations (data spread) are different.

arr1 = [10, 16, 8, 22]

arr2 = [12, 18, 12, 14]

Here is the code for calculating the mean of the above samples. One can either write custom Python code for calculating the mean or use a statistics library method such as mean. The mean of both samples comes out to be 14.

from statistics import mean
#
# Calculate the mean with the statistics library
#
mean(arr1), mean(arr2)
#
# Custom code in Python for calculating the mean
# (this definition replaces the imported mean from here on)
#
def mean(arr):
    total = 0
    for value in arr:
        total += value
    return total / len(arr)

Here is the Python code for calculating the standard deviation. Note the following aspects of the code given below:

  • To calculate the standard deviation of a sample (the default in the following method), Bessel’s correction is applied: the sum of squared deviations is divided by N – 1 rather than by the sample size N. The reason is that the deviations are measured from the sample mean rather than the population mean, which makes the spread look slightly smaller than it really is. Dividing by N – 1 removes this bias from the variance estimate (the standard deviation, being its square root, still carries a small residual bias).
  • To calculate the standard deviation of the population (passing dist = ‘population’ to the stddev method), the size of the data N is used as the divisor. Here is the code:
import math

def stddev(arr, dist='sample'):
    '''
    Calculate the standard deviation of arr, treating it as a
    sample (divide by N - 1) or as a population (divide by N)
    '''
    squaredSum = 0.0
    meanArr = mean(arr)
    for value in arr:
        squaredSum += math.pow(value - meanArr, 2)
    if dist == 'sample':
        #
        # Bessel's correction (unbiased variance estimate):
        # SQRT(SUM((Xi - Xmean)**2)/(N-1))
        #
        sdVal = math.sqrt(squaredSum/(len(arr) - 1))
    elif dist == 'population':
        #
        # No correction (biased if applied to a sample):
        # SQRT(SUM((Xi - Xmean)**2)/N)
        #
        sdVal = math.sqrt(squaredSum/len(arr))
    else:
        # Reject unknown modes instead of returning a sentinel value
        raise ValueError("dist must be 'sample' or 'population'")
    return sdVal

stddev(arr1), stddev(arr2)

When arr1 and arr2 are passed to the stddev method, the standard deviation values come out to be 6.32 and 2.83 respectively. Note that although the means are the same, the standard deviations differ, reflecting how differently the values in the two data sets are spread around the mean.
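The bias that Bessel’s correction addresses can be checked by brute force on a tiny example. The sketch below is only an illustration under stated assumptions: the toy population [1, 2, 3, 4] is hypothetical, and samples of size 2 are drawn with replacement so that every possible sample can be enumerated. It averages the variance estimates over all samples and compares them with the true population variance.

```python
from itertools import product
from statistics import mean, pvariance, variance

population = [1, 2, 3, 4]              # hypothetical toy population
true_variance = pvariance(population)  # divides by N over the whole population

# Every ordered sample of size 2, drawn with replacement (16 in total)
samples = list(product(population, repeat=2))

# Average each estimator over all possible samples
avg_n_minus_1 = mean([variance(s) for s in samples])   # divides by N - 1
avg_n = mean([pvariance(s) for s in samples])          # divides by N

print(true_variance, avg_n_minus_1, avg_n)  # 1.25 1.25 0.625
```

On average, dividing by N – 1 recovers the population variance, while dividing by N underestimates it. The result is exact for the variance; taking the square root to get a standard deviation reintroduces a small bias, which is why the correction is usually described in terms of variance.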

Different techniques for calculating Standard Deviation

Standard deviation can be calculated using any of the following techniques:

  • Using a custom Python method as shown in the previous section
  • Using statistics library methods such as stdev and pstdev
  • Using the Numpy library method std

Statistics Library for calculating Standard Deviation

Standard deviation can be calculated using the statistics library in the following manner. Note that stdev calculates the standard deviation of the sample while pstdev calculates the standard deviation of the population.

from statistics import stdev, pstdev

stdev(arr1), stdev(arr2)
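To see both statistics library methods side by side, here is a small self-contained sketch that reuses the two arrays from earlier in the post. The variable names sample_sd and population_sd are introduced here for illustration only.

```python
from statistics import stdev, pstdev

arr1 = [10, 16, 8, 22]
arr2 = [12, 18, 12, 14]

# stdev divides by N - 1 (sample); pstdev divides by N (population)
sample_sd = (stdev(arr1), stdev(arr2))
population_sd = (pstdev(arr1), pstdev(arr2))

print([round(v, 2) for v in sample_sd])      # [6.32, 2.83]
print([round(v, 2) for v in population_sd])  # [5.48, 2.45]
```

The sample values match the output of the custom stddev method, and the population values are smaller because the same sum of squared deviations is divided by 4 instead of 3.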

Numpy Library for calculating Standard Deviation

One can also use the Numpy library to calculate the standard deviation. The std() method calculates the standard deviation of the population by default. To calculate the standard deviation of the sample instead, set the ddof (delta degrees of freedom) parameter to 1.

import numpy as np

narr1 = np.array(arr1)
narr2 = np.array(arr2)
#
# Calculates the standard deviation taking arr1 and arr2 as population
#
narr1.std(), narr2.std()
#
# Calculates the standard deviation taking arr1 and arr2 as sample
#
narr1.std(ddof=1), narr2.std(ddof=1)
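As a quick cross-check, the sketch below compares the Numpy results with the statistics library for one of the arrays. It assumes Numpy is installed; the variable names are illustrative.

```python
import numpy as np
from statistics import stdev, pstdev

arr1 = [10, 16, 8, 22]
narr1 = np.array(arr1)

# ddof=0 (the default) divides by N; ddof=1 divides by N - 1
population_sd = narr1.std()
sample_sd = narr1.std(ddof=1)

# Each Numpy result should agree with its statistics counterpart
print(np.isclose(population_sd, pstdev(arr1)))  # True
print(np.isclose(sample_sd, stdev(arr1)))       # True
```

The practical point is that the defaults differ between the two libraries: statistics.stdev is the sample version, while Numpy's std() is the population version unless ddof=1 is passed.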

Standard deviation of Population vs Sample

In this section, you will learn when to use the population formula and when to use the sample formula for standard deviation.

Use the formula with Bessel’s correction (N-1 instead of N) when the data is a sample drawn from a larger population; the correction matters most when the data size is small. With the statistics package, use the stdev method. With the Numpy std() method, pass the parameter ddof as 1. When the data covers the entire population, or when the data size is large enough that the difference between dividing by N and by N-1 is negligible, one could use the default std() method of Numpy or the pstdev() method of the statistics package.
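How much the choice of formula matters depends only on the data size: for the same data, the sample value equals the population value multiplied by sqrt(N / (N - 1)). The sketch below illustrates this with one small and one larger data set. The helper sd_ratio and the larger data set are hypothetical, chosen only to show the effect of N.

```python
import math
from statistics import stdev, pstdev

def sd_ratio(data):
    """Ratio of the sample SD to the population SD for the same data."""
    return stdev(data) / pstdev(data)

small = [10, 16, 8, 22]       # N = 4
large = list(range(1000))     # N = 1000, illustrative values only

print(round(sd_ratio(small), 4))   # 1.1547
print(round(sd_ratio(large), 4))   # 1.0005
```

With 4 values the two formulas differ by about 15%, while with 1000 values the difference is about 0.05%, which is why the choice matters most for small data sets.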

Conclusion

Here is what you learned in this post:

  • Standard deviation measures the spread of a given data set (sample or population).
  • When calculating the standard deviation of a sample, Bessel’s correction is applied: the sum of squared differences of the data points from their mean is divided by N-1 instead of N.
  • You can calculate the standard deviation of a population and a sample using the pstdev() and stdev() methods respectively of the statistics library.
  • You can calculate the standard deviation using the std() method of the Numpy library. To calculate the standard deviation of a sample, pass the ddof parameter as 1.
  • Use the sample formula when the data is a sample from a larger population, especially when the data size is small; use the population formula when the data covers the whole population or the data size is large enough that the two results barely differ.

Ajitesh Kumar
