In this post, you will learn about confidence intervals and related concepts in the context of machine learning models, with the help of an example and Python code.
When you get a hypothesis function by training a machine learning classification model, you evaluate the hypothesis/model by calculating its classification error. The classification error is calculated on a sample of the data, such as the training set or, preferably, a held-out test set. However, is this classification error for the sample (the sample error) the same as the classification error of the hypothesis/model for the entire population (the true error)? How can the true error be expressed as a function of the sample error? This is where the concept of the confidence interval comes into the picture. You may want to check one of my related posts on true and sample error: The difference between true and sample error.
What is a Confidence Interval?
A confidence interval is a range of values, computed from sample data, that is expected to contain a population parameter (such as a mean, median, or proportion) at a stated confidence level. Recall that statistics is about estimation. When estimating a population parameter, it is considered good practice to report the estimate as a confidence interval rather than as a single number. The parameter being estimated is typically a mean or median, and the confidence level is expressed as a percentage such as 98% or 95%.
A confidence interval is tied to a confidence level N and is called an N% confidence interval, where N takes values such as 95, 90, etc. An N% confidence interval means the following: if an experiment, say measuring the average height of a sample of 100 men, is repeated many times and an interval is computed each time, about N% of those intervals will contain the true average height. For example, if the experiment is repeated 100 times, about 95 of the resulting 95% intervals would be expected to contain the true mean. If one such interval runs from 173 to 179 cm, it is reported as the 95% confidence interval for the average height.
Confidence intervals are used to estimate population statistics such as the mean or median (for example, the mean height above) as well as population proportions. Here is an example of a population proportion.
The error of a model's predictions is a classic example of a proportion: it is the fraction of examples that the model misclassifies. The error measured on sample data is termed the sample error. The objective is to estimate the true error of the model over the entire population, and a confidence interval expresses that estimate as a function of the sample error.
Here is a diagram which can be used to understand the confidence interval concepts. The diagram is taken from the website, spss-tutorials.com.
In the above diagram, the goal is to estimate the mean salary of the population. The mean salary is calculated from several different samples (sample 1, sample 2, sample 3, and sample 4).
Why is the confidence interval measurement needed?
Simply speaking, a confidence interval is needed to state the range in which the population parameter is likely to fall, based on the outcomes of one or more experiments performed on different samples taken from the population. It communicates how precise the estimate of the population parameter is. Whatever the outcome of the experiments, you are never going to be 100% confident about the population parameter. Thus, you need a confidence interval to represent the range in which the population parameter is expected to fall. If you're 95% confident, or 98% confident, that's usually considered "good enough" in statistics. That percentage is the confidence level. For an N% confidence interval, we are saying that if the experiment were repeated many times, about N% of the resulting intervals would contain the population parameter. Each interval runs from E - m to E + m, where E is the estimate computed from the sample and m is the margin of error. The value of m changes with N: a higher confidence level requires a wider interval. Confidence intervals are usually reported together with a margin of error, though the two are different things: the margin of error is half the width of the interval.
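To make the link between the confidence level and the margin of error concrete, here is a minimal sketch using only the Python standard library. It assumes a normal (z-based) interval for a mean. The function name `margin_of_error`, the standard deviation of 7 cm and the sample size of 1000 are illustrative assumptions, not values from a real study.

```python
from math import sqrt
from statistics import NormalDist


def margin_of_error(std_dev, n, confidence):
    """Half-width m of a z-based interval running from estimate - m to estimate + m."""
    # Two-sided critical value, e.g. roughly 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * std_dev / sqrt(n)


# Hypothetical numbers: standard deviation of 7 cm, sample of 1000 people.
margins = {level: margin_of_error(7.0, 1000, level) for level in (0.90, 0.95, 0.98)}
for level, m in margins.items():
    print(f"{level:.0%} confidence: margin of error = {m:.2f} cm")
```

The margin grows as the confidence level goes up, which is the trade-off between being more certain and being more precise.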
Let's say we want to estimate the mean height of the male population in the 20-30 age group in India. Measuring the height of every individual in that age group is a herculean task. Here the population parameter is the mean height. Is there a way to get a fair estimate of this parameter? One way is to take a sample of 1000 male individuals from the key cities, gather their heights, and calculate the mean. The objective is to estimate the mean height of the population based on the mean height calculated from the sample, and this estimate is expressed as a confidence interval. Let's say the procedure is repeated 50 times with different samples of 1000 male individuals, and the following is observed:
- In 48 of the 50 runs (96%, roughly 95%), the sample mean fell between 173 and 179 cm. Thus, at roughly a 95% confidence level, one could say that the mean height of the population lies between 173 and 179 cm.
- In 45 of the 50 runs (90%), the sample mean fell in the narrower range of 175 to 178 cm. Thus, at a 90% confidence level, one could say that the mean height of the population lies between 175 and 178 cm.
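The repeated-sampling idea in this example can be simulated. The sketch below is an illustration under stated assumptions, not a real survey: it pretends the population of heights is normal with a true mean of 176 cm and a standard deviation of 7 cm (both made up), draws many samples of 1000, builds a 95% z-based interval from each, and counts how often the interval contains the true mean. The function name `coverage` is hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean, stdev


def coverage(true_mean, true_sd, n, confidence, repeats, seed=0):
    """Fraction of repeated-sample intervals that contain the true mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    hits = 0
    for _ in range(repeats):
        # Draw one sample from the assumed (simulated) population.
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        x_bar = fmean(sample)
        m = z * stdev(sample) / sqrt(n)
        if x_bar - m <= true_mean <= x_bar + m:
            hits += 1
    return hits / repeats


rate = coverage(true_mean=176.0, true_sd=7.0, n=1000, confidence=0.95, repeats=500)
print(f"{rate:.1%} of the intervals contained the true mean")
```

The printed fraction should land close to the chosen confidence level, though it will vary a little with the random seed and the number of repeats.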
What affects the width of the confidence interval?
The width of the confidence interval depends upon the following:
- Variation: The greater the variation in the population, the wider the confidence interval, and vice versa.
- Sample size: The smaller the sample size, the wider the confidence interval, and vice versa. A smaller sample contains less information about the population, so the estimate is less precise.
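Both effects can be seen numerically. The following sketch assumes a z-based interval for a mean at 95% confidence. The function name `interval_width`, the standard deviations and the sample sizes are hypothetical values chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def interval_width(std_dev, n, confidence=0.95):
    """Full width (upper bound minus lower bound) of a z-based interval for a mean."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return 2 * z * std_dev / sqrt(n)


# Hypothetical standard deviations (cm) and sample sizes.
for std_dev in (5.0, 10.0):
    for n in (25, 100, 400):
        print(f"sd={std_dev:>4}, n={n:>3}: width = {interval_width(std_dev, n):.2f} cm")
```

Because the width scales with the standard deviation divided by the square root of n, doubling the variation doubles the width, while the sample size has to be quadrupled to halve it.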
How to calculate a confidence interval?
A confidence interval can be calculated using the normal distribution (Z-distribution) or the t-distribution. The t-distribution is typically used when the sample size is small (less than 30) or when the population standard deviation is not known and has to be estimated from the sample.
For calculating the confidence interval for a statistic such as the population mean, the following formula can be used:

$$\bar{x} \pm t \cdot \frac{s}{\sqrt{n}}$$

Here, s represents the sample standard deviation and n represents the size of the sample. X bar ($\bar{x}$) represents the sample mean and t represents the critical value from the t-distribution. The t-distribution is used because in most cases the population parameters are not known in advance. If the population is known in advance to be normally distributed, with a known standard deviation, one can use z in place of t.
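The interval for the mean (sample mean plus or minus the critical value times s divided by the square root of n) can be sketched in a few lines of standard-library Python. The heights below are made up for illustration, and the function name `mean_confidence_interval` is hypothetical. The critical value 2.262 is the two-sided 95% t value for 9 degrees of freedom taken from a standard t-table; if SciPy is available, `scipy.stats.t.ppf(0.975, df=9)` should return the same value.

```python
from math import sqrt
from statistics import mean, stdev


def mean_confidence_interval(sample, critical_value):
    """Return (lower, upper) for sample mean +/- critical_value * s / sqrt(n)."""
    n = len(sample)
    x_bar = mean(sample)
    s = stdev(sample)  # sample standard deviation (n - 1 in the denominator)
    margin = critical_value * s / sqrt(n)
    return x_bar - margin, x_bar + margin


# Made-up heights in cm, for illustration only.
heights = [172.0, 175.5, 178.0, 169.5, 181.0, 176.0, 174.5, 177.0, 173.0, 179.5]

# 2.262 is the two-sided 95% t critical value for n - 1 = 9 degrees of freedom,
# read from a standard t-table.
lower, upper = mean_confidence_interval(heights, critical_value=2.262)
print(f"95% CI for the mean height: {lower:.1f} to {upper:.1f} cm")
```

Passing the critical value in as an argument keeps the sketch free of third-party dependencies. For a large sample, a z value such as 1.96 could be passed instead.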
The following formula can be used to calculate the confidence interval for estimating a population proportion:

$$p \pm z \cdot \sqrt{\frac{p(1-p)}{n}}$$

For the population proportion, the normal distribution is used and, thus, z. Here, p represents the sample proportion and n represents the size of the sample.
Taking the above into consideration, one can estimate a range for the true error of a classification model once its sample error is known.
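As a rough sketch of that last step, the snippet below treats the sample error as a proportion p and applies the normal-approximation interval p plus or minus z times the square root of p(1 - p) / n. The function name `true_error_interval` and the counts (60 misclassified out of 500) are hypothetical. The approximation is usually considered reasonable only when n is fairly large and p is not too close to 0 or 1, and the interval is most meaningful when the n examples were not used to fit the model.

```python
from math import sqrt
from statistics import NormalDist


def true_error_interval(sample_error, n, confidence=0.95):
    """Normal-approximation interval for a classifier's true error rate."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * sqrt(sample_error * (1 - sample_error) / n)
    # An error rate is a proportion, so clip the bounds to [0, 1].
    return max(0.0, sample_error - margin), min(1.0, sample_error + margin)


# Hypothetical evaluation: 60 misclassified out of 500 examples.
sample_error = 60 / 500
lower, upper = true_error_interval(sample_error, n=500)
print(f"sample error = {sample_error:.3f}, 95% CI for true error: {lower:.3f} to {upper:.3f}")
```

With these made-up numbers, the sample error of 12% translates into a true error somewhere between roughly 9% and 15% at the 95% confidence level.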