Data Science – Hypothesis Testing Explained with Examples

This article covers some key statistical concepts, with examples, relating to how to formulate a hypothesis for hypothesis testing. Knowledge of hypothesis formulation and hypothesis testing is key to building machine learning models. Later articles will explain hypothesis formulation for machine learning algorithms such as linear regression and logistic regression. Please feel free to comment or suggest if I missed one or more important points.

The following key points are covered in this article:

  • What is a hypothesis?
  • How to formulate a hypothesis as Null or Alternate Hypothesis?
  • What is hypothesis testing?

What is a Hypothesis?

As per the Dictionary page on Hypothesis, Hypothesis means a proposition or set of propositions, set forth as an explanation for the occurrence of some specified group of phenomena, either asserted merely as a provisional conjecture to guide investigation (working hypothesis) or accepted as highly probable in the light of established facts.

As per the above definition, there are two important cases to pay attention to when formulating a hypothesis. The following are the two types of statements that can be put to hypothesis testing:

  • Well-established fact or statement assumed to be true: The case in which a fact is well established or accepted as the truth. In other words, the fact is given. For example, when you buy a 500 gm packet of sauce, you assume, based on the label, that the packet contains at least 500 gm of sauce and not any less. In this case, the fact is given or assumed to be true. Such statements can be put to hypothesis testing when they need to be verified.
  • Statement claimed to be true: The case in which a claim is made, but the claim cannot yet be considered true or proven. For example, the statement that the housing price depends upon the average income of people already staying in the locality is a claim and cannot be assumed to be true. Another example is the claim that running 5 miles a day results in a weight reduction of 10 kg within a month. There can be many such claims, and each has to go through hypothesis testing before it can be considered supported by evidence.

The first step in hypothesis testing is to define or state the hypothesis. Once the hypothesis is stated, the next step is to formulate the null and alternate hypotheses in order to begin hypothesis testing. This is described in the next section. Based on the above considerations, the following hypotheses can be stated for hypothesis testing:

  • The packet of 500 gm of sauce contains at least 500 gm of sauce and no less.
  • The housing price depends upon the average income of the people staying in the locality.
  • Running 5 miles a day results in a reduction of 10 kg of weight within a month.

Now that the hypotheses are stated, let’s go ahead and formulate each of them as a null and alternate hypothesis.

How to Formulate Hypothesis as Null or Alternate Hypothesis?

Given the above information, one can formulate the hypothesis as either the Null Hypothesis or the Alternate Hypothesis. When the given statement is a well-established fact that is assumed to be true, it is called the Null Hypothesis (in simpler words, nothing new). When the given statement is a claim that is not yet proven, it is formulated as the Alternate Hypothesis, and the Null Hypothesis is defined accordingly. Note that the Null and Alternate Hypotheses are mutually exclusive. The following are some examples of the Null Hypothesis and Alternate Hypothesis.

  • Take the example of canned sauce with a label of 500 gm. As per the above, this is the scenario in which the statement is assumed to be true. Based on the label, it is assumed that the canned sauce weighs 500 gm. However, we want to use hypothesis testing to check whether the 500 gm on the label holds. Thus, the Null Hypothesis is formulated as the statement that the weight of the canned sauce is equal to 500 gm. The Alternate Hypothesis is formulated as the statement that the weight of the canned sauce is NOT EQUAL to 500 gm.
    Null hypothesis: The weight of the canned sauce is 500 gm.
    Alternate hypothesis: The weight of the canned sauce is not equal to 500 gm.
  • Take the example of the claim that running 5 miles a day will lead to a reduction of 10 kg of weight within a month. This is only a claim, which still needs to be proved. Thus, the Alternate Hypothesis is formulated first as the statement that “running 5 miles a day will lead to a reduction of 10 kg of weight within a month”. The Null Hypothesis is the opposite of the Alternate Hypothesis and is stated as “running 5 miles a day does not lead to a reduction of 10 kg of weight within a month”.
    Null hypothesis: Running 5 miles a day does not result in a reduction of 10 kg of weight within a month.
    Alternate hypothesis: Running 5 miles a day results in a reduction of 10 kg of weight within a month.
  • Take another example: the claim that the housing price depends upon the average income of people staying in the locality. This is only a claim, which still needs to be proved. Thus, the Alternate Hypothesis is formulated first as the statement that “housing price depends upon the average income of people staying in the locality”. The Null Hypothesis is then formulated as the statement that the housing price does NOT depend upon the average income of people staying in the locality.
    Null hypothesis: The housing price does not depend upon the average income of people staying in the locality.
    Alternate hypothesis: The housing price depends upon the average income of people staying in the locality.
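
In symbols, the canned sauce and housing price examples above could be written as follows. This is only a sketch: the symbols μ (the true mean weight of a can, in gm) and β₁ (the slope in an assumed linear model of housing price on average income) are notation introduced here for illustration and are not part of the original examples.

```latex
% Canned sauce: two-sided test on the mean weight (in gm)
\[ H_0: \mu = 500 \qquad H_1: \mu \neq 500 \]

% Housing price, assuming the model: price = \beta_0 + \beta_1 \cdot income + \varepsilon
\[ H_0: \beta_1 = 0 \qquad H_1: \beta_1 \neq 0 \]
```

In both cases the null and alternate hypotheses are mutually exclusive, which matches the requirement stated above.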

What is Hypothesis Testing?

Once the hypotheses are formulated, they need to be tested. For example, if the null hypothesis is the statement that the housing price does not depend upon the average income of people staying in the locality, it would be tested by taking samples of housing prices and average incomes. Based on the test results, the null hypothesis would either be rejected or fail to be rejected. Hypothesis testing has the following two possible outcomes:

  • Reject the Null hypothesis
  • Fail to Reject the Null hypothesis
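
As a rough illustration of how the housing price hypothesis might be tested, the sketch below computes a t-statistic for the correlation between average income and housing price and compares it with a critical value. Several things here are assumptions made for illustration: the income and price figures are made up, the choice of test (a t-test on the Pearson correlation coefficient) is only one possible option, and 2.306 is the two-sided 5% critical value read from a t-table for 8 degrees of freedom.

```python
import math

# Hypothetical data (not from a real survey): average income and housing
# price, both in thousands, for 10 localities.
incomes = [30, 35, 40, 45, 50, 55, 60, 65, 70, 75]
prices = [150, 160, 185, 190, 210, 230, 225, 260, 270, 290]


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_t_statistic(r, n):
    """t-statistic for the null hypothesis of no linear relationship."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


n = len(incomes)
r = pearson_r(incomes, prices)
t_stat = correlation_t_statistic(r, n)

# Two-sided critical value from a t-table: alpha = 0.05, df = n - 2 = 8.
T_CRITICAL = 2.306
reject_null = abs(t_stat) > T_CRITICAL

print(f"r = {r:.3f}, t = {t_stat:.2f}")
print("Reject the null hypothesis" if reject_null
      else "Fail to reject the null hypothesis")
```

With real data, the outcome would be one of the two listed above, depending on whether the t-statistic falls beyond the critical value.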

Take the above example of canned sauce weighing 500 gm. The null hypothesis is the statement that the canned sauce weighs 500 gm. After taking a sample of 20 cans and weighing them, the average weight was found to be 505 gm. The test statistic (t-statistic) was calculated for this sample and the P-value was determined. Let’s say the P-value was found to be 15%. Assuming that the level of significance is selected to be 5%, the result is not statistically significant (P-value > 5%) and thus the null hypothesis fails to get rejected. This does not prove that the canned sauce weighs exactly 500 gm; it only means that the sample does not provide sufficient evidence to conclude that the weight differs from 500 gm. However, if the average weight had been found to be 575 gm, which is far away from the hypothesized mean of 500 gm, one would most likely have ended up rejecting the null hypothesis based on the P-value.
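
The canned sauce example can be sketched in a few lines of Python. The sample standard deviation is not given in the example, so the value of 15 gm below is purely an assumption for illustration, and the function names are hypothetical. The critical value 2.093 is the two-sided 5% value from a t-table for 19 degrees of freedom (sample size 20).

```python
import math

HYPOTHESIZED_MEAN = 500.0  # gm, taken from the label
SAMPLE_SIZE = 20
ASSUMED_SAMPLE_SD = 15.0   # gm, hypothetical: the example does not state it
T_CRITICAL = 2.093         # two-sided t-table value, alpha = 0.05, df = 19


def one_sample_t_statistic(sample_mean, hypothesized_mean, sample_sd, n):
    """t = (sample mean - hypothesized mean) / (sample SD / sqrt(n))."""
    standard_error = sample_sd / math.sqrt(n)
    return (sample_mean - hypothesized_mean) / standard_error


def decision_for(sample_mean):
    """Return 'reject' or 'fail to reject' for the given sample mean."""
    t_stat = one_sample_t_statistic(
        sample_mean, HYPOTHESIZED_MEAN, ASSUMED_SAMPLE_SD, SAMPLE_SIZE
    )
    return "reject" if abs(t_stat) > T_CRITICAL else "fail to reject"


for mean_weight in (505.0, 575.0):
    print(f"sample mean {mean_weight} gm: "
          f"{decision_for(mean_weight)} the null hypothesis")
```

Under the assumed standard deviation, a sample mean of 505 gm stays inside the critical value while 575 gm falls far outside it. A different standard deviation could change the first decision, which is why the decision should always be based on the test statistic and not on the raw difference in means.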

Here is the diagram which represents the workflow of Hypothesis Testing.

Figure 1. Hypothesis Testing Workflow

Based on the above, the following are some of the common steps to be taken when doing hypothesis testing:

  • State the hypothesis: First and foremost, the hypothesis needs to be stated. The hypothesis could either be a statement which is assumed to be true or a claim which is yet to be proven.
  • Formulate the hypothesis: This step requires one to identify the Null and Alternate hypotheses. For example, the statement that the canned sauce weighs 500 gm is taken as the Null Hypothesis.
  • Set the criteria for a decision: Identify the test statistic that will be used to assess the Null Hypothesis. In the above example, the sample statistic is the average weight of the canned sauce, and the t-statistic computed from it is used to determine the P-value.
  • Identify the level of significance (alpha): Before starting the hypothesis testing, one is required to set the significance level (also called alpha), the threshold at or below which a P-value is considered statistically significant. Typical values of alpha are 0.1, 0.05, and 0.01. If the P-value is less than or equal to alpha, the result is statistically significant and the null hypothesis is rejected. If the P-value is greater than alpha, one fails to reject the null hypothesis.
  • Compute the test statistic: The next step is to calculate the test statistic (z-statistic or t-statistic) that will be used to determine the P-value. As a rule of thumb, the z-statistic is used when the sample size is more than 30 (or when the population standard deviation is known); otherwise, the t-statistic is used. In the current example, where 20 cans of sauce are selected for hypothesis testing, the t-statistic is calculated for the sample mean of 505 gm. It is the difference between the sample mean (505 gm) and the hypothesized population mean (500 gm), divided by the standard error, where the standard error is the sample standard deviation divided by the square root of the sample size (20).
  • Calculate the P-value of the test statistic: Once the test statistic has been calculated, find the P-value using either a t-table or a z-table. The P-value is the probability of obtaining a test statistic (t-score or z-score) equal to or more extreme than the result obtained from the sample data, given that the null hypothesis H0 is true.
  • Compare the P-value with the level of significance: The significance level determines the rejection region, that is, the range of test statistic values for which the Null Hypothesis is rejected. The remaining range, within which one fails to reject the Null Hypothesis, is called the non-rejection region. The P-value is compared with alpha. If the P-value is less than or equal to the significance level, the result is statistically significant and the null hypothesis is rejected; otherwise, one fails to reject the null hypothesis.
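
The steps above can be sketched end to end in Python using only the standard library. The 20 weights are made-up values chosen so that the sample mean is 505 gm, and the function names are hypothetical. The P-value is approximated by numerically integrating the density of the t-distribution with Simpson's rule in place of a t-table lookup; in practice one would more likely use a statistics library (for example, SciPy's `scipy.stats.ttest_1samp`).

```python
import math
import statistics

# Hypothetical sample of 20 can weights in gm (made up for illustration).
weights = [489, 521, 497, 512, 530, 484, 503, 518, 495, 509,
           526, 491, 500, 515, 478, 522, 507, 493, 511, 499]

HYPOTHESIZED_MEAN = 500.0  # null hypothesis: mean weight is 500 gm
ALPHA = 0.05               # level of significance


def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom.

    Uses math.gamma, so it is only suitable for moderate df.
    """
    coef = math.gamma((df + 1) / 2) / (
        math.sqrt(df * math.pi) * math.gamma(df / 2)
    )
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p_value(t_stat, df, steps=2000):
    """Approximate two-sided P-value with Simpson's rule over [0, |t|].

    steps must be even for Simpson's rule.
    """
    upper = abs(t_stat)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = total * h / 3  # P(0 <= T <= |t|)
    return max(0.0, 1 - 2 * central_area)


n = len(weights)
sample_mean = statistics.mean(weights)
sample_sd = statistics.stdev(weights)  # sample SD (n - 1 in the denominator)

# Test statistic: difference in means divided by the standard error.
t_stat = (sample_mean - HYPOTHESIZED_MEAN) / (sample_sd / math.sqrt(n))
p_value = two_sided_p_value(t_stat, df=n - 1)

reject_null = p_value <= ALPHA
print(f"mean = {sample_mean:.1f}, t = {t_stat:.3f}, p = {p_value:.3f}")
print("Reject the null hypothesis" if reject_null
      else "Fail to reject the null hypothesis")
```

For this made-up sample the P-value is well above 5%, so the null hypothesis fails to get rejected, which mirrors the outcome described in the canned sauce example.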

Summary

In this post, you learned about hypothesis testing and related topics such as how to formulate a hypothesis and how to go about testing it. In data science, one of the reasons to understand hypothesis testing is the need to verify the relationship between the dependent (response) and independent (predictor) variables. One would thus need to understand the related concepts, such as formulating the null and alternate hypotheses, the level of significance, calculation of the test statistic, and the P-value. Given that the relationship between dependent and independent variables is a sort of claim, the null hypothesis could be set as the scenario where there is no relationship between the dependent and independent variables.

 

Ajitesh Kumar


Ajitesh has been recently working in the area of AI and machine learning. Currently, his research area includes Safe & Quality AI. In addition, he is also passionate about various different technologies including programming languages such as Java/JEE, Javascript and technologies such as Blockchain, mobile computing, cloud-native technologies, application security, cloud computing platforms, big data etc.

He has also authored the book, Building Web Apps with Spring 5 and Angular.