To understand why data scientists need to be very good at statistical concepts, one first needs a clear answer to the following:
- Who are data scientists?
- What is the need for statistics in data scientists’ day-to-day work?
Who are Data Scientists?
Data scientists are primarily scientists who run experiments to find out some of the following:
- Whether there exists a relationship between data
- Whether the function approximated (machine learning or statistical learning model) from a given sample of data can be generalized to the entire population
- When there are multiple function approximations for predicting outcomes from a given set of inputs, which of them is the most appropriate
A scientist becomes a data scientist when he or she learns the fundamentals of statistics in order to analyze experiment outcomes and conclude whether those outcomes are statistically significant enough to support the hypothesis.
Need for Statistics in Scientists’ Day-to-Day Work
In their day-to-day work, scientists perform one or more experiments involving hypothesis testing. The following are some of the steps they perform:
- Determine the experiment requirements: First and foremost, the scientist sets the goal of the hypothesis test by deciding what he or she is out to prove: either that a relationship exists between the data (inference), or that the function approximated for making predictions is appropriate enough.
- Once the goal is set, the next step is to determine the null and alternate hypotheses. For example, suppose a group of scientists has created a medicine intended to cure diabetes once and for all. Once the medicine is created, they need to prove its effectiveness in curing diabetes by testing it on different sets of patients.
- Determine Null & Alternate Hypothesis: Before testing the effect of the medicine on diabetic patients, the scientists define the null and the alternate hypothesis. As a best practice, the alternate hypothesis is the claim the scientists are out to prove, and the null hypothesis is set to its opposite. In the current example, the two hypotheses are the following:
- \(H_0\): The medicine has no effect on diabetic patients. They do not get cured.
- \(H_a\): The medicine has an effect on diabetic patients such that they get cured.
- Perform the experiment: Once the hypotheses are defined, the next step is to perform the experiment and collect the evidence needed to test the null hypothesis. Here, the experiment is to have a set of diabetic patients take the medicine for a given period of time. For example, take 5 samples of 100 patients each from different age groups and give them the medicine for 90 days. At regular intervals, such as every 30 days, perform a blood sugar test and record the result.
- Analyze the results: Suppose the test results show that, on average, only 60 patients out of 100 had their blood sugar under control. The question is whether this outcome happened by chance or is statistically significant. To answer it, the scientists need to perform statistical analysis (statistical tests) and conclude whether to reject the null hypothesis or fail to reject it.
- Knowledge of Statistics is a Must: Given the above, the scientists need help from statisticians to conclude whether there is insufficient evidence that the medicine has an effect on diabetic patients (fail to reject the null hypothesis) or the medicine does have a positive effect on diabetic patients (reject the null hypothesis). As statisticians are hard to come by, it becomes key for scientists to learn some fundamental concepts of statistics themselves in order to judge the experiment outcomes. The following are some of the statistical concepts data scientists need to know:
- Probability distributions to understand the data
- Test statistics and statistical tests such as Z-statistics, T-statistics, F-statistics, ANOVA, and chi-square tests
- Concepts such as one-tailed and two-tailed tests and the p-value
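To make the "analyze the results" step concrete, here is a minimal sketch of a one-tailed test on the figure from the example (60 out of 100 patients with blood sugar under control). It is only an illustration under stated assumptions: the post does not say how many patients would show control without the medicine, so the baseline rate of 50% (`BASELINE_RATE`) is hypothetical, as are the 5% significance level (`ALPHA`) and the helper function names. It also treats the result as a single sample of 100 patients rather than five samples.

```python
from math import comb, erfc, sqrt


def binomial_p_value(successes, n, p0):
    """One-tailed exact binomial test: P(X >= successes) for X ~ Binomial(n, p0)."""
    return sum(
        comb(n, k) * p0**k * (1 - p0) ** (n - k)
        for k in range(successes, n + 1)
    )


def z_test_p_value(successes, n, p0):
    """One-tailed z-test for a proportion (normal approximation)."""
    p_hat = successes / n
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    # Upper-tail probability of the standard normal distribution.
    p_value = 0.5 * erfc(z / sqrt(2))
    return z, p_value


# Assumptions, not taken from the article:
BASELINE_RATE = 0.5  # hypothetical share of patients showing control under H0
ALPHA = 0.05         # hypothetical significance level

exact_p = binomial_p_value(60, 100, BASELINE_RATE)
z, approx_p = z_test_p_value(60, 100, BASELINE_RATE)

# Reject H0 only if the observed result would be unlikely under H0.
decision = "reject H0" if exact_p < ALPHA else "fail to reject H0"

print(f"exact one-tailed p-value: {exact_p:.4f}")
print(f"z statistic: {z:.2f}, approximate p-value: {approx_p:.4f}")
print(f"decision at alpha={ALPHA}: {decision}")
```

The test is one-tailed because the alternate hypothesis claims an improvement, not merely a difference. The decision depends entirely on the assumed baseline: with a higher baseline rate, the same 60 out of 100 would no longer be enough to reject the null hypothesis.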
If you are planning to become a great data scientist, it is important to learn statistical concepts in order to analyze the relationship between the data (independent and dependent variables), assess the suitability of statistical models, select the best statistical model, and so on.