# Hypothesis Testing Explained with Real-life Examples

Hypothesis testing is a statistical technique that helps researchers test the validity of their theories. It is widely used in statistics and data science to assess whether sample data provide enough evidence to support a claim about a population.

This blog post covers some of the key statistical concepts, with examples of how to formulate a hypothesis for hypothesis testing. Knowledge of hypothesis formulation and hypothesis testing is key to building various machine learning models. Later articles will explain hypothesis formulation for machine learning algorithms such as linear regression and logistic regression.

## What is a Hypothesis?

Simply speaking, hypothesis testing is a statistical framework that can be used to answer simple “yes” and “no” questions about the data. For example:

• As part of a linear regression model, is there a relationship between the response variables and predictor variables? Let’s say, the housing price depends upon the average income of people already staying in the locality. Is this true?
• Take a real-world scenario: it is claimed that a 500 gm sugar packet of a particular brand, say XYZA, actually contains only around 480 gm. Is this true?
• Is it true that quitting smoking increases lifespan?

As per the Dictionary page on hypothesis, a hypothesis is a proposition or set of propositions, set forth as an explanation for the occurrence of some specified group of phenomena, either asserted merely as a provisional conjecture to guide investigation (working hypothesis) or accepted as highly probable in the light of established facts.

Based on this definition, there are two kinds of statements to pay attention to when formulating a hypothesis. Both represent scenarios that could be put to hypothesis testing:

• Well-established fact or a statement assumed to be true: The case in which a fact is well-established or accepted as truth. In other words, the fact is given. For example, when you buy a packet of 500 gm of sugar, you assume that the packet contains at least 500 gm of sugar and not any less, based on the label of 500 gm on the packet. In this case, the fact is given or assumed to be the truth. Such cases could be considered for hypothesis testing if it is claimed that the assumption, or the default state of being, is not true, for example, that the sugar packet is found to contain only around 480 gm of sugar.
• A statement that is claimed to be true: The case in which some claim is made, or in other words, the statement cannot yet be considered true or proven. For example, the statement that the housing price depends upon the average income of people already staying in the locality can be considered a claim and not assumed to be true. Another example could be the claim that running 5 miles a day would result in a reduction of 10 kg of weight within a month. Any such claim has to go through hypothesis testing before it can be accepted.

The first step in hypothesis testing is stating a hypothesis. Once the hypothesis is stated, the next step is to formulate the null and alternate hypotheses. Based on the above considerations, the following hypotheses can be stated for hypothesis testing:

• The packet labeled 500 gm contains only about 480 gm of sugar.
• The housing price depends upon the average income of the people staying in the locality.
• Running 5 miles a day results in a reduction of 10 kg of weight within a month.

Now that the hypotheses are stated, let’s go ahead and formulate each of them as a null and an alternate hypothesis.

## How to Formulate Hypothesis as Null or Alternate Hypothesis?

Given the above information, one could formulate the hypothesis accordingly and call it the null hypothesis or the alternate hypothesis. In the case where the given statement is a well-established fact or the default state of being in the real world, one can call it the null hypothesis (in simpler words, nothing new). In case the given statement is a claim (an unexpected event in the real world) that is not yet proven, one can formulate it as the alternate hypothesis and define the null hypothesis accordingly. One should note that the null and alternate hypotheses are mutually exclusive and, at the same time, asymmetric: the null hypothesis is retained unless the data provide sufficient evidence against it. The following are some examples of the null hypothesis and the alternate hypothesis.

• Take the example of the sugar packet with the label 500 gm. As per the above, this represents the scenario in which the statement made is assumed to be true. Thus, it is assumed to be true (based on the given label) that the sugar packet weighs 500 gm. However, we want to do hypothesis testing to ascertain whether the label of 500 gm is true, because there is a claim that the sugar packets contained only 480 gm. Thus, the null hypothesis is formulated as the statement that the weight of the sugar packet is equal to 500 gm. The alternate hypothesis is formulated as the statement that the weight of the sugar packet is less than 500 gm.

| Null hypothesis | Alternate hypothesis |
| --- | --- |
| The weight of the sugar packet is 500 gm. | The weight of the sugar packet is less than 500 gm. |

• Take the example of a claim that running 5 miles a day will lead to a reduction of 10 kg of weight within a month. Now, this is only a claim which is required to be proved. Thus, the alternate hypothesis will be formulated first as the statement that “running 5 miles a day will lead to a reduction of 10 kg of weight within a month”. Hence, the null hypothesis will be the opposite of the alternate hypothesis and stated as the fact that “running 5 miles a day does not lead to a reduction of 10 kg of weight within a month”.

| Null hypothesis | Alternate hypothesis |
| --- | --- |
| Running 5 miles a day does not result in the reduction of 10 kg of weight within a month. | Running 5 miles a day results in the reduction of 10 kg of weight within a month. |

• Take another example of a claim that the housing price depends upon the average income of people staying in the locality. This is only a claim which is required to be proved. Thus, Alternate Hypothesis will be formulated first as the statement that “housing price depends upon the average income of people staying in the locality”. Hence, the null hypothesis will be formulated as the statement that housing price does NOT depend upon the average income of people staying in the locality.

| Null hypothesis | Alternate hypothesis |
| --- | --- |
| The housing price does not depend upon the average income of people staying in the locality. | The housing price depends upon the average income of people staying in the locality. |
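
Two of these formulations can also be written in symbols. This is only a sketch: the symbols are assumptions introduced here, with $\mu$ standing for the mean weight of a sugar packet and $\beta_1$ for the coefficient of average income in a simple linear regression of housing price on income. The two-sided form for the housing example is also an assumption, since the claim does not state a direction.

$$
\begin{aligned}
\text{Sugar packet:}\quad & H_0: \mu = 500 \text{ gm} &\quad H_a&: \mu < 500 \text{ gm} \\
\text{Housing price:}\quad & H_0: \beta_1 = 0 &\quad H_a&: \beta_1 \neq 0
\end{aligned}
$$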

## Perform Hypothesis Testing with P-Value

Once you formulate the hypotheses, you need to test them. For example, if the null hypothesis is the statement that the housing price does not depend upon the average income of people staying in the locality, it would be tested by taking samples of housing prices and, based on the test results, the null hypothesis would either be rejected or fail to be rejected. In hypothesis testing, there are two possible outcomes:

• Reject the Null hypothesis
• Fail to Reject the Null hypothesis

Take the above example of the sugar packet weighing 500 gm. The null hypothesis is the statement that the sugar packet weighs 500 gm. After taking a sample of 20 sugar packets and weighing them, the average weight was found to be 490 gm. The test statistic (t-statistic) was calculated for this sample and the P-value was determined. Let’s say the P-value was found to be 15%. Assuming that the level of significance is selected to be 5%, the result is not statistically significant (P-value > 5%) and thus the null hypothesis fails to get rejected. In other words, the sample does not provide enough evidence to conclude that the packets weigh less than 500 gm; this is not the same as proving that they weigh exactly 500 gm. However, if the average weight of the sugar packets had been found to be 475 gm, which is much further from the hypothesized mean of 500 gm, one could have ended up rejecting the null hypothesis based on the P-value.
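
The calculation in this example can be sketched in Python using only the standard library. The sample size (20), sample mean (490 gm), hypothesized mean (500 gm) and significance level (5%) come from the example; the sample standard deviation of 42 gm is an assumed value, chosen only so that the P-value lands near the 15% mentioned above. The helpers `t_pdf` and `t_cdf` are illustrative names that approximate the t-distribution by numerical integration; a statistics library would normally be used instead.

```python
import math

def t_pdf(x: float, df: int) -> float:
    """Density of Student's t-distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x: float, df: int, steps: int = 2000) -> float:
    """P(T <= x), using Simpson's rule on [0, |x|] and symmetry around 0."""
    a = abs(x)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # integral of the density from 0 to |x|
    return 0.5 + area if x >= 0 else 0.5 - area

n = 20               # number of sugar packets sampled
sample_mean = 490.0  # average weight of the sample, in gm
mu0 = 500.0          # weight stated by the null hypothesis, in gm
sample_sd = 42.0     # ASSUMED sample standard deviation, in gm
alpha = 0.05         # level of significance

standard_error = sample_sd / math.sqrt(n)
t_stat = (sample_mean - mu0) / standard_error
# Left-tailed test, because the alternate hypothesis is "less than 500 gm".
p_value = t_cdf(t_stat, df=n - 1)

decision = "reject H0" if p_value <= alpha else "fail to reject H0"
print(f"t = {t_stat:.3f}, p = {p_value:.3f}: {decision}")
```

With these assumed inputs the P-value is well above 5%, so the null hypothesis is not rejected. Replacing `sample_mean` with 475.0 gives a much more negative t-statistic and a P-value below 5%.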

Figure 1. Hypothesis Testing Workflow

Based on the above, the following are some of the steps to be taken when doing hypothesis testing:

• State the hypothesis: First and foremost, the hypothesis needs to be stated. The hypothesis could either be a statement that is assumed to be true or a claim that is yet to be proven.
• Formulate the hypothesis: This step requires one to identify the null and alternate hypotheses or, in simple words, formulate the hypothesis. Take the example of the sugar packet weighing 500 gm as the null hypothesis.
• Set the criteria for a decision: Identify the test statistic that could be used to assess the null hypothesis. In the above example, the quantity of interest is the average weight of the sugar packets, and the t-statistic would be used to determine the P-value.
• Identify the level of significance (alpha): Before starting the hypothesis test, one is required to set the significance level (also called alpha), which is the threshold at or below which a P-value is considered statistically significant. Typical values of alpha are 0.1, 0.05, and 0.01. If the P-value is less than or equal to alpha, the result is statistically significant and the null hypothesis is rejected. If the P-value is more than alpha, the null hypothesis fails to be rejected.
• Compute the test statistic: The next step is to calculate the test statistic (z-statistic or t-statistic) used to determine the P-value. As a rule of thumb, if the sample size is more than 30, the z-statistic can be used; otherwise, the t-statistic is used. In the current example, where 20 sugar packets are selected for hypothesis testing, the t-statistic is calculated for the sample mean of 490 gm. The t-statistic is the difference between the sample mean (490 gm) and the hypothesized population mean (500 gm), divided by the standard error, which is the sample standard deviation divided by the square root of the sample size (20).
• Calculate the P-value of the test statistic: Once the test statistic has been calculated, find the P-value using either a t-table or a z-table. The P-value is the probability of obtaining a test statistic (t-score or z-score) equal to or more extreme than the result obtained from the sample data, given that the null hypothesis H0 is true.
• Compare the P-value with the level of significance: The significance level splits the possible outcomes into a rejection region and a non-rejection region. If the P-value is greater than alpha, the test statistic falls in the non-rejection region and one fails to reject the null hypothesis. If the P-value is less than or equal to alpha, the result is statistically significant and hence the null hypothesis is rejected.
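
The steps above can be sketched end to end for the larger-sample case, where the z-statistic applies. The numbers below are hypothetical and not taken from the example: a sample of 36 packets, a sample mean of 490 gm, and a population standard deviation of 30 gm that is assumed to be known. The function name `left_tailed_z_test` is likewise only illustrative. `statistics.NormalDist` from the Python standard library stands in for the z-table.

```python
import math
from statistics import NormalDist

def left_tailed_z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Return (z, p_value, reject) for H0: mean = mu0 vs Ha: mean < mu0."""
    # Compute the test statistic: distance from mu0 in standard errors.
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    # Calculate the P-value: probability of a z-score this low or lower
    # under the null hypothesis (the standard normal CDF replaces a z-table).
    p_value = NormalDist().cdf(z)
    # Compare the P-value with the level of significance.
    return z, p_value, p_value <= alpha

# Hypothetical inputs (assumptions for illustration only).
z, p_value, reject = left_tailed_z_test(
    sample_mean=490.0,  # average weight of the sampled packets, in gm
    mu0=500.0,          # weight stated by the null hypothesis, in gm
    sigma=30.0,         # ASSUMED known population standard deviation, in gm
    n=36,               # sample size above 30, so the z-statistic is used
    alpha=0.05,
)
outcome = "reject H0" if reject else "fail to reject H0"
print(f"z = {z:.3f}, p = {p_value:.4f}: {outcome}")
```

Because alpha is fixed before the data are examined, the decision reduces to a single comparison at the end; only the choice of distribution (normal here, t for small samples) changes between the two cases.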
