Data Science – Key Probability & Statistics Topics to Master

This article lists key probability and statistics topics that one may need to master when aiming to become a data scientist. These are the topics that have worked for me so far while working on data science problems. One could also treat the list below as a table of contents for key probability and statistics topics for data science. I may well have left out some topics, so please feel free to comment or suggest any important points I missed.
Probability & Statistics Topics

The following are key topics, grouped under Probability and Statistics, that one would want to master to get good at data science.

  • Probability: The following probability topics, once mastered, prove very helpful when working with various machine learning algorithms:
    • Introduction to Probability: This topic covers the basic concepts of probability, including the basic formulae.
    • Probability concepts: This topic covers the fundamentals of the following concepts, which are especially handy for classification algorithms such as logistic regression and naïve Bayes classification:
      • Union (Probability of union of two or more events)
      • Intersection (Probability of the intersection of the two or more events)
      • Complement (Probability that the event does not occur)
      • Conditional probability and Bayes' rule (Probability that an event occurs given that another event has occurred, and how to reverse the direction of the conditioning)
    • Random variables: This topic defines random variables and covers the following related concepts:
      • Types of variables (Discrete, Continuous)
      • Mean, Median, Variance
    • Probability Distributions: This topic, one of the most important, covers the fundamentals of the different probability distributions that prove handy when working with different machine learning algorithms.
      • Probability distribution types (Discrete, Continuous)
      • Discrete Probability Distributions: The following are some of the key discrete probability distributions:
        • Binomial, Negative Binomial
        • Poisson
      • Continuous Probability Distributions: The following are some of the key continuous probability distributions, which help in evaluating machine learning models such as linear regression (t-value, F-value) and logistic regression (z-value, chi-square):
        • Normal distribution
        • Z-distribution (standard normal distribution)
        • T-distribution
        • F-distribution
        • Chi-Square distribution
        • Gamma distribution
    • Sampling theory (Sampling methods such as SRS/Stratified/Cluster, Sampling distribution)
  • Statistics: The following key statistics topics prove very helpful when working with different machine learning algorithms:
    • Quantitative data analysis: The following are key concepts that one cannot do without in statistical analysis:
      • Mean, Median, Mode
      • Variance
    • Plots: The following plots are useful for understanding patterns in data in terms of center, spread, shape, etc.
      • Histogram
      • Boxplot
      • Scatterplot
    • Estimation (Standard error, Margin of error, Confidence intervals)
    • Hypothesis testing: The following sub-topics are covered here. Understanding them is key to understanding the evaluation techniques for machine learning models such as linear regression and logistic regression.
      • Null hypothesis, Alternative hypothesis
      • Type I & Type II error
      • Region of acceptance, Statistical significance, P-value
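
To make a few of the topics above concrete, here is a minimal Python sketch covering Bayes' rule, a confidence interval, and a two-sided p-value. It is an illustration only, not part of the original list: the function names, the sample data, and the input probabilities are all hypothetical, and the interval and test use the normal (z) approximation, whereas a t-distribution would usually be preferred for a sample this small.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def bayes_posterior(prior, likelihood, false_positive_rate):
    """Bayes' rule: P(A|B) = P(B|A) * P(A) / P(B).

    P(B) is expanded with the law of total probability:
    P(B) = P(B|A) * P(A) + P(B|not A) * P(not A).
    """
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


def z_confidence_interval(sample, confidence=0.95):
    """Confidence interval for the mean using the normal approximation."""
    standard_error = stdev(sample) / sqrt(len(sample))
    # Critical z-value that leaves (1 - confidence) / 2 in each tail.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin_of_error = z * standard_error
    center = mean(sample)
    return center - margin_of_error, center + margin_of_error


def two_sided_p_value(sample, null_mean):
    """Two-sided p-value for H0: population mean == null_mean (z-test)."""
    standard_error = stdev(sample) / sqrt(len(sample))
    z = (mean(sample) - null_mean) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


if __name__ == "__main__":
    # Hypothetical inputs: 1% prior, 90% true-positive rate, 5% false-positive rate.
    print("Posterior:", bayes_posterior(0.01, 0.9, 0.05))

    # Hypothetical measurements, made up for illustration.
    data = [4.8, 5.1, 5.0, 5.3, 4.9, 5.2, 5.0, 5.1]
    print("95% CI:", z_confidence_interval(data))
    print("p-value for H0: mean = 5.0:", two_sided_p_value(data, 5.0))
```

A small p-value (commonly below 0.05) is taken as evidence against the null hypothesis. The standard error, margin of error, and confidence interval computed in `z_confidence_interval` correspond to the Estimation topics listed above.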
Ajitesh Kumar
