In this post, you will learn how to evaluate a discrete-valued hypothesis learned (that is, a model built) with a machine learning algorithm. A discrete-valued hypothesis can be understood as a classification model that is used to classify an instance drawn at random. The post covers the following:

• What is a true error or true risk?
• What is a sample error or empirical risk?
• Difference between true error and sample error
• How to estimate the true error?

If you are a data scientist, you will want to understand the concepts of true error and sample error, because they are key to evaluating a hypothesis.

## What is a True Error or True Risk?

The true error or true risk of a hypothesis is the probability (or proportion) that the learned hypothesis will misclassify a single instance drawn at random from the population. The population here means the complete set of instances the hypothesis could be applied to, not just the data collected so far. Let’s say the hypothesis learned from the given data is used to predict whether a person suffers from a disease. Note that this is a discrete-valued hypothesis, meaning that it produces a discrete outcome (the person suffers from the disease or does not).

Mathematically, if the target function is f(x) and the learned hypothesis is h(x), then the true error can be represented as the following:

$$\text{True Error} = P\left[f(x) \neq h(x)\right]$$

for a single instance x drawn at random from the population.

In other words, the true error can be represented as the proportion of misclassified instances in the entire population.
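The definition can be made concrete with a small sketch. It assumes a toy population that can be fully enumerated (the integers 0 to 999), a hypothetical target function `f`, and a hypothetical learned hypothesis `h`; none of these come from a real data set, and a real population is rarely available in full.

```python
def true_error(population, f, h):
    """Proportion of the whole population on which h disagrees with f."""
    mistakes = sum(1 for x in population if f(x) != h(x))
    return mistakes / len(population)


# Hypothetical toy population and functions, chosen only for illustration.
population = range(1000)
f = lambda x: x >= 500   # assumed target function
h = lambda x: x >= 450   # assumed learned hypothesis (threshold slightly off)

# h disagrees with f on the 50 instances from 450 to 499.
print(true_error(population, f, h))  # 0.05
```

Because every instance in the population is checked, the result is the true error itself rather than an estimate of it.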

Hypothesis h(x) represents a machine learning model. Many different hypotheses can be learned by using different hyperparameter settings, different training data sets, or different machine learning algorithms such as logistic regression, decision trees, random forests, and gradient boosted trees. All possible hypotheses together form what is called the hypothesis space. Learn about these terms in my post, ML Terminologies for beginners. Let’s say that the hypothesis is a function h that predicts whether a person suffers from a disease given the input x, and that it was learned with the random forest algorithm using a particular set of hyperparameters and a particular training data set.

The true error then represents the probability that the random forest-based hypothesis h(x) misclassifies a person drawn at random from the entire population, either by predicting the disease for a person who does not have it or by missing it for a person who does. True error is also termed true risk.

The question is how to calculate the true error or true risk. Since the entire population is rarely available, this is where the sample error or empirical risk comes into the picture. The goal is to understand how good an estimate of the true error the sample error provides.

## What is a Sample Error or Empirical Risk?

The sample error or empirical risk of a hypothesis with respect to some sample S of instances drawn from the population is the fraction of S that it misclassifies. It should not be confused with the sampling error of statistics, which is the difference between a sample statistic (such as the mean) and the corresponding population parameter that arises because only a sample is observed.

Let’s say that a sample S consists of 50 instances. Out of 50 instances, 15 are misclassified. Thus, the sample error could be calculated as the following:

Sample error = (count of instances misclassified) / (total count of instances) = 15/50 = 0.3 (30%)
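The same calculation can be sketched in Python. The label lists below are made up so that exactly 15 of 50 predictions are wrong, and `sample_error` is a helper name introduced here for illustration.

```python
def sample_error(y_true, y_pred):
    """Fraction of instances in the sample that are misclassified."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    mistakes = sum(t != p for t, p in zip(y_true, y_pred))
    return mistakes / len(y_true)


# Made-up sample of 50 instances: the first 15 predictions are wrong.
y_true = [1] * 50
y_pred = [0] * 15 + [1] * 35

print(sample_error(y_true, y_pred))  # 0.3
```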

The sample error can also be represented in terms of the following:

$$\text{Sample Error} = \frac{\text{False Positive} + \text{False Negative}}{\text{True Positive} + \text{False Positive} + \text{True Negative} + \text{False Negative}}$$

The above can also be represented as the following:

$$\text{Sample Error} = 1 - \frac{\text{True Positive} + \text{True Negative}}{\text{True Positive} + \text{False Positive} + \text{True Negative} + \text{False Negative}}$$

The above can further be represented as the following:

$$\text{Sample Error} = 1 - \text{Accuracy}$$
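As a quick check of these identities, here is a minimal sketch that computes the sample error and the accuracy from confusion-matrix counts. The counts are made up for illustration, and the function names are not from any library.

```python
import math


def sample_error_from_counts(tp, fp, tn, fn):
    """Sample error = (FP + FN) / (TP + FP + TN + FN)."""
    return (fp + fn) / (tp + fp + tn + fn)


def accuracy_from_counts(tp, fp, tn, fn):
    """Accuracy = (TP + TN) / (TP + FP + TN + FN)."""
    return (tp + tn) / (tp + fp + tn + fn)


# Made-up confusion-matrix counts for a sample of 50 instances.
counts = dict(tp=20, fp=5, tn=15, fn=10)

error = sample_error_from_counts(**counts)
accuracy = accuracy_from_counts(**counts)

print(error)                               # 0.3
print(math.isclose(error, 1 - accuracy))   # True
```

The comparison uses `math.isclose` because `1 - accuracy` is computed in floating point and may differ from the sample error in the last digit.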

## Difference between True Error & Sample Error

The following represents the differences between true error and sample error:

• The true error represents the probability that an instance drawn at random from the population (distribution) is misclassified, while the sample error is the fraction of the sample that is misclassified.
• The true error is defined for the population, while the sample error is defined for a sample.
• The true error is difficult to calculate because the entire population is rarely available. Thus, the true error is estimated as a function of the sample error. This is where the confidence interval comes into the picture. The confidence interval for the true error is the range in which the true error is expected to lie, at a stated confidence level, when the sample error is some value X.
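The gap between the two can be illustrated with a small simulation. It is only a sketch under stated assumptions: the true error is fixed at a made-up value of 0.05, each instance is misclassified independently with that probability, and the seed, sample size, and number of samples are arbitrary choices.

```python
import random


def simulate_sample_errors(true_error, sample_size, n_samples, seed=0):
    """Draw repeated samples and return the sample error of each one."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    errors = []
    for _ in range(n_samples):
        # Each instance is misclassified with probability true_error.
        mistakes = sum(rng.random() < true_error for _ in range(sample_size))
        errors.append(mistakes / sample_size)
    return errors


errors = simulate_sample_errors(true_error=0.05, sample_size=100, n_samples=200)

print("smallest sample error:", min(errors))
print("largest sample error: ", max(errors))
print("mean sample error:    ", sum(errors) / len(errors))
```

Individual sample errors scatter around the true error, while their average stays close to it, which is why a single sample error is reported together with an interval.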

## Confidence Interval – How to Estimate the True Error?

The true error is usually too difficult to calculate directly. However, it can be estimated as a function of the sample error for a discrete-valued hypothesis, given the following assumptions:

• The sample S contains n examples drawn independently of one another and independently of the hypothesis
• The sample size n is greater than or equal to 30
• Hypothesis h misclassifies r of the n instances, so the sample error is r/n

Given the above assumptions, statistical theory allows the following assertions:

• Given no other information, the most probable value of the true error is the sample error
• With approximately 95% probability, the true error lies in the following interval
$$\text{SampleError} \pm 1.96 \times \sqrt{\frac{\text{SampleError} \times (1 - \text{SampleError})}{\text{SampleSize}}}$$

The above means that if the experiment is repeated over and over again, each time with a new sample, then for approximately 95% of the experiments the interval calculated with the above formula will contain the true error. Thus, this interval is called a 95% confidence interval estimate for the true error.
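The interval can be computed with a few lines of Python. This is a minimal sketch that reuses the earlier example of 15 misclassified instances out of 50; the function name and its parameters are illustrative, not part of any library.

```python
import math


def true_error_interval(misclassified, sample_size, z=1.96):
    """Approximate confidence interval for the true error.

    The default z=1.96 corresponds to an approximate 95% confidence level.
    """
    sample_error = misclassified / sample_size
    margin = z * math.sqrt(sample_error * (1 - sample_error) / sample_size)
    return sample_error - margin, sample_error + margin


low, high = true_error_interval(misclassified=15, sample_size=50)
print(f"95% interval: ({low:.3f}, {high:.3f})")  # 95% interval: (0.173, 0.427)
```

So with a sample error of 0.3 on 50 instances, the true error is estimated to lie between roughly 17% and 43%. The sketch does not clip the bounds to the range 0 to 1, which may matter when the sample error is near either end.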

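For other confidence levels, the following table can be used to replace 1.96 with the appropriate value (two-sided values of the standard normal distribution):

| Confidence level | 80% | 90% | 95% | 98% | 99% |
|---|---|---|---|---|---|
| Constant | 1.28 | 1.64 | 1.96 | 2.33 | 2.58 |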

You may note that as the confidence level increases, the interval becomes wider. Intuitively, a wider interval is more likely to capture the true error given the sample error.
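The constant for any confidence level can also be derived from the standard normal distribution. The sketch below uses `NormalDist` from Python's standard `statistics` module (available from Python 3.8) and assumes a two-sided interval; `z_for_confidence` is a helper name introduced here.

```python
from statistics import NormalDist


def z_for_confidence(level):
    """Two-sided z value for a confidence level such as 0.95."""
    # A two-sided interval leaves (1 - level) / 2 in each tail.
    return NormalDist().inv_cdf((1 + level) / 2)


for level in (0.80, 0.90, 0.95, 0.98, 0.99):
    print(f"{level:.0%}: z = {z_for_confidence(level):.2f}")
```

Higher confidence levels give larger constants, which is what makes the interval wider.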

The following rule of thumb indicates whether the true error can be estimated from the sample error in this way:

• The sample error is not close to zero (0) or one (1)
• SampleSize * SampleError * (1 - SampleError) >= 5
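The second condition is easy to check in code. This is a minimal sketch; the function name is made up, and it tests only the numeric condition, leaving the judgment of whether the sample error is "close" to 0 or 1 to the reader.

```python
def normal_approximation_ok(sample_size, sample_error, threshold=5):
    """Check whether n * p * (1 - p) >= threshold for sample error p."""
    return sample_size * sample_error * (1 - sample_error) >= threshold


print(normal_approximation_ok(50, 0.30))  # True  (50 * 0.3 * 0.7 = 10.5)
print(normal_approximation_ok(50, 0.02))  # False (50 * 0.02 * 0.98 = 0.98)
```

In the second case the sample error is close to zero, so either a larger sample or a different interval method would be needed.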
