Mastering f-statistics in Linear Regression: Formula, Examples


In this blog post, we will take a look at the concepts and formula of the F-test and F-statistic in linear regression models, and understand them with the help of examples. The F-test and F-statistic are very important concepts to understand if you want to be able to properly interpret the summary results of training linear regression machine learning models. We will start by discussing the importance of the F-statistic in building linear regression models, understand how it is calculated based on its formula, and then work through the concept with some real-world examples. As data scientists, it is very important to understand both the F-statistic and the t-statistic and how they help in arriving at the most appropriate linear regression model.

Linear Regression Model & Need for F-test / F-statistics 

Linear regression is a fundamental statistical technique used to model the relationship between a dependent variable (also known as the response or target variable) and one or more independent variables (also known as predictors or features). The main goal of linear regression is to find the best-fitting straight line through the data points, known as the regression line, which minimizes the sum of squared differences between the observed values and the predicted values. Different types of hypothesis tests, such as t-tests and the F-test, are used for assessing the suitability of the linear regression model. You may want to check this blog to learn more: linear regression hypothesis testing example.
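
To make the "minimize the sum of squared differences" idea concrete, here is a minimal sketch that fits a one-predictor model by ordinary least squares; the data is simulated purely for illustration:

```python
import numpy as np

# Simulated data, purely for illustration: y ≈ 2 + 1.5x plus noise
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 1.5 * x + rng.normal(0, 1.0, size=50)

# Ordinary least squares: choose the coefficients that minimize the
# sum of squared differences between observed and fitted values
X = np.column_stack([np.ones_like(x), x])     # design matrix with intercept column
beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # [beta0, beta1]
residuals = y - X @ beta

print("estimated coefficients:", beta)
print("sum of squared residuals:", np.sum(residuals**2))
```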

The question that needs to be asked, or the hypothesis that needs to be tested, is whether a linear regression model exists at all, that is, whether the response variable can be represented as a linear function of the predictor variables. This is tested by setting up the null hypothesis that the response variable cannot be represented as a function of any of the predictor variables. Thus, if the following is a linear regression model or function:

y = β0 + β1x1 + β2x2 + β3x3,

Where

  • y is the response variable
  • x1, x2, and x3 are predictor variables
  • β0 is the intercept, and β1, β2, β3 are the coefficients or parameters to be estimated for the predictor variables x1, x2, and x3

Then, the null and alternate hypotheses can be written as:

H0: β1 = β2 = β3 = 0 (Regression model does not exist)

Ha: At least one βi is not equal to 0 (i.e., at least one of the coefficients is non-zero)

The above hypothesis can be tested using a statistical test such as the F-test, and the corresponding test statistic is called the F-statistic. The F-statistic helps assess the significance of the entire regression model. In other words, it tests whether the model as a whole (including all the predictor variables) explains a significant amount of the variation in the dependent variable, compared to a model with no predictors (known as the null model).

The F-statistic is based on the ratio of two variances: the explained variance (due to the model) and the unexplained variance (residuals). By comparing these variances, the F-statistic helps us determine whether the regression model significantly explains the variation in the dependent variable or whether that variation can be attributed to random chance. A larger F-statistic indicates that the model accounts for a substantial portion of the total variance, while a smaller F-statistic suggests that the model does not explain much of the variance and thus may not be a useful model. The F-statistic is calculated from the following formula:

f = MSR / MSE

= Mean sum of squares regression / Mean sum of squares error

The F-statistic follows an F-distribution, and its value helps to determine the probability (p-value) of observing such a statistic if the null hypothesis is true (i.e., no relationship between the dependent and independent variables). If the p-value is smaller than a predetermined significance level (e.g., 0.05), the null hypothesis is rejected, and we conclude that the regression model is statistically significant.
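
A minimal sketch of this p-value calculation using scipy (the F value and degrees of freedom below are taken from the worked example later in this post):

```python
from scipy import stats

f_value = 32.325   # observed F-statistic
dfn, dfd = 3, 196  # degrees of freedom: regression (p) and error (N - p - 1)

# Survival function gives P(F >= f_value) under the null hypothesis
p_value = stats.f.sf(f_value, dfn, dfd)
print(f"p-value = {p_value:.2e}")

if p_value < 0.05:
    print("Reject H0: the regression model is statistically significant")
```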

Let’s learn the concept of mean sum of squares regression (MSR) and mean sum of squares error / residual (MSE) in terms of explained and unexplained variance using the diagram shown below:

[Diagram: decomposition of the total variance into the variance explained by the regression model (SSR) and the unexplained variance or residuals (SSE)]

In the above diagram, the variance explained by the regression model is represented by the sum of squares for the model, or sum of squares regression (SSR). The variance not explained by the regression model is the sum of squares for error (SSE), also called the sum of squares for residuals. The F-statistic is defined as a function of SSR and SSE in the following manner:

[latex]f = (SSR/DF_{ssr}) / (SSE/DF_{sse})[/latex]

[latex]DF_{ssr}[/latex] = Degrees of freedom for the regression model; the value is equal to the number of predictor variables, i.e., the number of coefficients excluding the intercept

[latex]DF_{ssr}[/latex] = p

[latex]DF_{sse}[/latex] = Degrees of freedom for error; the value is equal to the total number of records (N) minus the number of predictors (p) minus one (for the intercept)

[latex]DF_{sse}[/latex] = N – p – 1

Thus, the formula for the F-statistic can be written as the following:

f = (SSR/p) / (SSE/(N - p - 1))
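
This formula translates directly into code; a minimal sketch:

```python
def f_statistic(ssr, sse, p, n):
    """Overall F-statistic: (SSR / p) / (SSE / (N - p - 1))."""
    msr = ssr / p            # mean sum of squares regression
    mse = sse / (n - p - 1)  # mean sum of squares error
    return msr / mse
```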

Importance of understanding F-statistics vis-a-vis Linear Regression Model

Understanding F-statistics is crucial for anyone working with linear regression models for several reasons:

  • Model significance: F-statistics allows you to assess the overall significance of the model, which helps determine whether the model is worth interpreting further or needs improvement.
  • Variable selection: F-statistics can be used as a criterion for variable selection, helping you identify the most important predictor variables and build a parsimonious model.
  • Model comparison: F-statistics can be employed to compare the performance of different models, especially when adding or removing predictor variables (see the sketch just after this list).
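
As an illustration of the model comparison use case, libraries such as statsmodels provide a partial F-test for nested models via anova_lm; a minimal sketch on simulated data (all variable names and numbers here are made up for demonstration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Simulated data, purely for illustration
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["x1", "x2", "x3"])
df["y"] = 1.0 + 2.0 * df["x1"] + 0.5 * df["x2"] + rng.normal(size=200)

reduced = smf.ols("y ~ x1", data=df).fit()
full = smf.ols("y ~ x1 + x2 + x3", data=df).fit()

# Partial F-test: do x2 and x3 jointly improve the fit over the reduced model?
print(anova_lm(reduced, full))
```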

Example: f-statistics & Linear Regression Model

Let’s say we have a problem of estimating sales in terms of the household income, the age of the head of the house, and the household size. We have a data set of 200 records. The following is the linear regression model:

y = β0 + β1*Income + β2*HH.size + β3*Age

Where y is the estimated sales, Income is the household income (in $1000s), HH.size is the household size (number of people in the household), and Age is the age of the head of the house (in years).
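
Before doing the calculation by hand, note that a fitted regression model in a library such as statsmodels reports this overall F-statistic directly. A minimal sketch, assuming a hypothetical DataFrame sales_df holding the 200 records with the columns above (HH.size is written as HH_size because the dot has special meaning in formula syntax):

```python
import statsmodels.formula.api as smf

# sales_df is a hypothetical pandas DataFrame (not defined here) with
# columns Sales, Income, HH_size, and Age for the 200 records
model = smf.ols("Sales ~ Income + HH_size + Age", data=sales_df).fit()

print(model.fvalue)     # overall F-statistic
print(model.f_pvalue)   # p-value of the F-test
print(model.summary())  # full summary, including the F-statistic line
```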

The following represents the hypothesis test for the linear regression model:

H0: β1 = β2 = β3 = 0

Ha: At least one of the coefficients is not equal to zero.

Now, let’s perform the hypothesis testing by calculating f-statistics for this problem.

DFssr = p = 3 (number of predictor variables)

SSR is calculated as 770565.1

MSR = SSR/DFssr = 770565.1 / 3 = 256855.033

DFsse = N – p – 1 = 200 – 3 – 1 = 196

SSE is calculated as 1557415.4

MSE = SSE/DFsse = 1557415.4 / 196 = 7945.99

The f-statistic can be calculated using the following formula:

f = MSR / MSE

= 256855.033 / 7945.99

= 32.325

The F-statistic can be represented as the following:

f = 32.325 with degrees of freedom (3, 196).
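
The same arithmetic can be reproduced in a few lines of Python:

```python
SSR, SSE = 770565.1, 1557415.4
N, p = 200, 3

MSR = SSR / p            # 256855.033
MSE = SSE / (N - p - 1)  # 7945.99
f = MSR / MSE

print(round(f, 3), (p, N - p - 1))  # 32.325 (3, 196)
```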

The next step is to find the critical value of the F-statistic at the 0.05 level of significance with degrees of freedom (3, 196).

f (critical value) = 2.651.
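
This critical value can be verified with scipy's F-distribution quantile function; a minimal sketch:

```python
from scipy import stats

alpha = 0.05
dfn, dfd = 3, 196
f_critical = stats.f.ppf(1 - alpha, dfn, dfd)
print(round(f_critical, 3))  # approximately 2.651

f_observed = 32.325
print("reject H0" if f_observed > f_critical else "fail to reject H0")  # reject H0
```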

As the F-statistic of 32.325 is greater than the critical value of 2.651, there is statistical evidence for rejecting H0: β1 = β2 = β3 = 0. We can reject the null hypothesis that all of the coefficients are equal to 0. Thus, the alternate hypothesis holds good, which means that at least one of the coefficients related to the predictor variables, such as income, age, and HH.size, is non-zero.

Summary

The F-statistic is used to test the significance of regression coefficients in linear regression models. It can be calculated as MSR/MSE, where MSR represents the mean sum of squares regression and MSE represents the mean sum of squares error. MSR can be calculated as SSR/DFssr, where SSR is the sum of squares regression and DFssr represents the degrees of freedom for the regression model. MSE can be calculated as SSE/DFsse, where SSE is the sum of squares error and DFsse represents the degrees of freedom for error. The critical value of the F-statistic can be looked up from the F-distribution at the chosen significance level and the corresponding degrees of freedom. If the value of the F-statistic is greater than the critical value, we can reject the null hypothesis and conclude that there is a significant relationship between the predictor variables and the response variable.

