In this post, you will learn how to evaluate a discrete-valued hypothesis learned (that is, a model built) using different machine learning algorithms. A discrete-valued hypothesis can be understood as a classification model that is used to classify an instance drawn at random from the population. The post covers the following:

• What is a true error or true risk?
• What is a sample error or empirical risk?
• Difference between true error and sample error
• How to estimate the true error?

If you are a data scientist, you will want to understand the concepts of true error and sample error, as both are key to evaluating a hypothesis.

## What is a True Error or True Risk?

The true error or true risk of a hypothesis is the probability that the learned hypothesis will misclassify a single instance drawn at random from the population. Let’s say the hypothesis learned from the given data is used to predict whether a person suffers from a disease. Note that this is a discrete-valued hypothesis, meaning that it produces a discrete outcome (the person either suffers from the disease or does not).

Mathematically, if the target function is f(x) and the learned hypothesis is h(x), then the true error can be represented as the following:

$$\text{True Error} = \Pr[f(x) \neq h(x)]$$

where x is a single instance drawn at random from the population.

The hypothesis h(x) represents a machine learning model. Let’s say that the function h predicts whether a person suffers from a disease given the input x. To learn the hypothesis (function h(x)), we can use different machine learning algorithms such as logistic regression, decision trees, random forests, gradient boosted trees, etc. Let’s say that the random forest algorithm is used to learn the hypothesis.

The true error then represents the probability that the random forest based hypothesis h(x) misclassifies a randomly drawn person, either by predicting the disease for a person who does not have it or by missing it in a person who does. True error is also termed true risk.

The question is how to calculate the true error or true risk. Since the whole population is rarely available, the true error cannot be computed directly. This is where the sample error or empirical risk comes into the picture. The goal is to understand how good an estimate of the true error the sample error provides.

## What is a Sample Error or Empirical Risk?

The sample error or empirical risk of a hypothesis with respect to some sample S of instances drawn from the population is the fraction of S that it misclassifies.

Let’s say that a sample S consists of 50 instances. Out of 50 instances, 15 are misclassified. Thus, the sample error could be calculated as the following:

Sample error = (count of instances misclassified) / (total count of instances) = 15/50 = 0.3 (30%)

The sample error can also be represented in terms of the following:

$$\text{Sample Error} = \frac{\text{False Positive} + \text{False Negative}}{\text{True Positive} + \text{False Positive} + \text{True Negative} + \text{False Negative}}$$

The above can also be represented as the following:

$$\text{Sample Error} = 1 - \frac{\text{True Positive} + \text{True Negative}}{\text{True Positive} + \text{False Positive} + \text{True Negative} + \text{False Negative}}$$

The above can further be represented as the following:

$$\text{Sample Error} = 1 - \text{Accuracy}$$
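The formulas above can be sketched in a few lines of Python. This is only an illustration: the function names and the label lists are hypothetical, and the labels are made up so that 15 of 50 instances are misclassified, matching the example above.

```python
import math


def sample_error(y_true, y_pred):
    """Fraction of instances whose predicted label differs from the true label."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    misclassified = sum(t != p for t, p in zip(y_true, y_pred))
    return misclassified / len(y_true)


def sample_error_from_counts(tp, fp, tn, fn):
    """Sample error from confusion-matrix counts: (FP + FN) / total."""
    return (fp + fn) / (tp + fp + tn + fn)


# Hypothetical sample: 25 positives and 25 negatives (1 = disease, 0 = no disease).
y_true = [1] * 25 + [0] * 25
# Made-up predictions: 17 TP, 8 FN among the positives; 18 TN, 7 FP among the negatives.
y_pred = [1] * 17 + [0] * 8 + [0] * 18 + [1] * 7

error = sample_error(y_true, y_pred)
error_from_counts = sample_error_from_counts(tp=17, fp=7, tn=18, fn=8)
accuracy = (17 + 18) / 50

print(f"Sample error from labels: {error:.2f}")
print(f"Sample error from counts: {error_from_counts:.2f}")
print(f"1 - accuracy:             {1 - accuracy:.2f}")
```

All three values come out to 0.30, which shows that the three representations of the sample error agree.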

## Difference between True Error & Sample Error

The following represents the differences between true error and sample error:

• The true error is the probability that an instance drawn at random from the population (distribution) is misclassified, while the sample error is the fraction of a sample that is misclassified.
• The true error is defined over the population, while the sample error is defined over a sample.
• The true error is difficult to calculate directly. Thus, it is estimated as a function of the sample error. This is where the confidence interval comes into the picture.
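The difference can be illustrated with a small simulation. The sketch below assumes, purely for illustration, a population in which the hypothesis misclassifies each randomly drawn instance with probability 0.3 (in practice this value is unknown). The names `TRUE_ERROR`, `SAMPLE_SIZE` and `draw_sample_error` are hypothetical.

```python
import random

TRUE_ERROR = 0.3   # assumed true error; unknown in a real problem
SAMPLE_SIZE = 50   # instances per sample
NUM_SAMPLES = 1000

rng = random.Random(42)  # fixed seed so the run is repeatable


def draw_sample_error(rng, true_error, n):
    """Draw n independent instances and return the fraction misclassified."""
    misclassified = sum(rng.random() < true_error for _ in range(n))
    return misclassified / n


sample_errors = [
    draw_sample_error(rng, TRUE_ERROR, SAMPLE_SIZE) for _ in range(NUM_SAMPLES)
]
mean_error = sum(sample_errors) / len(sample_errors)

print(f"Smallest sample error: {min(sample_errors):.2f}")
print(f"Largest sample error:  {max(sample_errors):.2f}")
print(f"Mean of sample errors: {mean_error:.3f}")
```

Each sample gives a somewhat different sample error, while the true error stays fixed; the sample errors scatter around the true error.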

## How to Estimate the True Error?

The true error is too difficult to calculate directly. However, for a discrete-valued hypothesis it can be estimated as a function of the sample error, given the following assumptions:

• The sample S contains n examples which are drawn independently of one another and independently of the hypothesis
• The sample size n is greater than or equal to 30
• Hypothesis h misclassifies r instances out of the total n instances, so that the sample error is r/n

Given the above assumptions, statistical theory allows the following assertions:

• Given no other information, the most probable value of the true error is the sample error
• With approximately 95% probability, the value of the true error lies in the following interval:
$$\text{SampleError} \pm 1.96 \sqrt{\frac{\text{SampleError} \, (1 - \text{SampleError})}{\text{SampleSize}}}$$

The above means that if the experiment is repeated over and over again, each time drawing a new sample and calculating the interval with the above formula, the interval will contain the true error in about 95% of the experiments. Thus, this interval is called the 95% confidence interval estimate for the true error.
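As a worked example, the interval formula can be applied to the earlier sample of 50 instances with 15 misclassified. The following is a minimal sketch; the function name `true_error_interval` is hypothetical, and the bounds are not clipped to the range 0 to 1.

```python
import math


def true_error_interval(r, n, z=1.96):
    """Approximate confidence interval for the true error.

    r: number of misclassified instances in the sample
    n: sample size
    z: constant for the chosen confidence level (1.96 for about 95%)
    """
    if n <= 0 or not 0 <= r <= n:
        raise ValueError("need n > 0 and 0 <= r <= n")
    error = r / n
    margin = z * math.sqrt(error * (1 - error) / n)
    return error - margin, error + margin


low, high = true_error_interval(r=15, n=50)
print(f"Sample error: {15 / 50:.2f}")
print(f"95% interval for the true error: {low:.3f} to {high:.3f}")
```

For this sample, the margin is 1.96 * sqrt(0.3 * 0.7 / 50), which is about 0.127, so the interval runs from roughly 0.173 to 0.427.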

For other confidence levels, substitute 1.96 with the corresponding two-sided critical value of the standard normal distribution, for example 1.64 for 90% or 2.58 for 99%.

You may note that as the confidence level increases, the interval becomes wider. Intuitively, the more certain we want to be that the interval contains the true error, the more possible values around the sample error it has to cover.
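The constant for a given confidence level can also be computed rather than looked up. The sketch below uses `NormalDist` from Python's standard `statistics` module (Python 3.8 or later); the helper name `z_for_confidence` and the list of levels are illustrative choices.

```python
from statistics import NormalDist


def z_for_confidence(level):
    """Two-sided standard normal critical value for a confidence level in (0, 1)."""
    if not 0 < level < 1:
        raise ValueError("level must be strictly between 0 and 1")
    # A two-sided interval leaves (1 - level) / 2 in each tail.
    return NormalDist().inv_cdf((1 + level) / 2)


for level in (0.50, 0.68, 0.80, 0.90, 0.95, 0.98, 0.99):
    print(f"{level:.0%} confidence: z = {z_for_confidence(level):.2f}")
```

The value returned for 0.95 rounds to 1.96, the constant used in the formula above, and the values grow with the confidence level.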

The following rule of thumb indicates whether the true error can be reliably estimated from the sample error in this way:

• The sample error is not too close to zero (0) or one (1)
• SampleCount * SampleError * (1 - SampleError) >= 5
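The rule of thumb can be turned into a quick check. In this sketch the function name `normal_approximation_ok` and its default parameters are hypothetical, and combining the rule with the earlier assumption of at least 30 instances is a choice made here for convenience.

```python
def normal_approximation_ok(r, n, min_sample_size=30, threshold=5):
    """Check whether the interval formula is reasonable to apply.

    r: number of misclassified instances, n: sample size.
    Requires n >= min_sample_size and n * error * (1 - error) >= threshold.
    """
    if n <= 0 or not 0 <= r <= n:
        raise ValueError("need n > 0 and 0 <= r <= n")
    error = r / n
    return n >= min_sample_size and n * error * (1 - error) >= threshold


print(normal_approximation_ok(r=15, n=50))  # 50 * 0.3 * 0.7 = 10.5
print(normal_approximation_ok(r=1, n=50))   # 50 * 0.02 * 0.98 = 0.98
print(normal_approximation_ok(r=10, n=20))  # fewer than 30 instances
```

Only the first case passes; the second has a sample error too close to zero, and the third has too few instances.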

### Ajitesh Kumar

Ajitesh has been recently working in the area of AI and machine learning. Currently, his research area includes Safe & Quality AI. In addition, he is also passionate about various different technologies including programming languages such as Java/JEE, Javascript and technologies such as Blockchain, mobile computing, cloud-native technologies, application security, cloud computing platforms, big data etc.

He has also authored the book, Building Web Apps with Spring 5 and Angular.