In this blog post, we will look at the concept and formula of the f-statistic in linear regression models and understand it with the help of examples. The F-test and the f-statistic are important concepts to understand if you want to properly interpret the summary output of a trained linear regression model. We will start by discussing why the f-statistic matters when building linear regression models and how it is calculated from its formula. We will then work through a real-world example. As data scientists, it is important to understand both the f-statistic and the t-statistic and how they help in arriving at the most appropriate linear regression model.
How are f-statistics used in the linear regression model?
The linear regression model represents the response variable as a function of one or more predictor variables. Different hypothesis tests, such as the t-test and the F-test, are used to assess the suitability of a linear regression model. You may want to check this blog to learn more – linear regression hypothesis testing example. One of the hypotheses that needs to be tested is whether the linear regression model is valid at all. This is done by setting up the null hypothesis that the response variable cannot be represented as a function of any of the predictor variables. Thus, if the following is the linear regression model:
y = β0 + β1x1 + β2x2 + β3x3,
- y is the response variable
- x1, x2, and x3 are predictor variables
- β1, β2, β3 are coefficients or parameters to be estimated for x1, x2, and x3 predictor variables
Then, the null and alternate hypotheses can be written as:
H0: β1 = β2 = β3 = 0 (Regression model does not exist)
Ha: At least one βi is not equal to 0
The above hypothesis can be tested using the F-test, and the corresponding test statistic is called the f-statistic. The f-statistic is calculated from the following formula:
f = MSR / MSE
= Mean sum of squares regression / Mean sum of squares error
Let's understand MSR and MSE in terms of explained and unexplained variance. The variance explained by the regression model is represented by the sum of squares regression (SSR), also called the sum of squares for the model. The variance not explained by the regression model is the sum of squares error (SSE), also called the sum of squares for residuals. The f-statistic is defined as a function of SSR and SSE in the following manner:
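To make the SSR/SSE decomposition concrete, here is a small sketch in pure Python (the data and the simple one-predictor fit are illustrative, not taken from the article's example). For a least-squares fit that includes an intercept, SSR and SSE add up exactly to the total sum of squares (SST):

```python
# Toy illustration of explained vs. unexplained variance.
# We fit a simple one-predictor least-squares line, then split the total
# variation of y into SSR (explained by the model) and SSE (residual).

def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line for one predictor."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    slope = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) \
        / sum((xi - x_mean) ** 2 for xi in x)
    return y_mean - slope * x_mean, slope

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.0, 12.0, 15.0, 19.0, 24.0]
b0, b1 = fit_simple_ols(x, y)
y_pred = [b0 + b1 * xi for xi in x]
y_mean = sum(y) / len(y)

ssr = sum((yp - y_mean) ** 2 for yp in y_pred)          # explained variation
sse = sum((yi - yp) ** 2 for yi, yp in zip(y, y_pred))  # unexplained (residual)
sst = sum((yi - y_mean) ** 2 for yi in y)               # total variation

print(ssr, sse, sst)  # 122.5 3.5 126.0 — note SSR + SSE = SST for an OLS fit
```

The identity SSR + SSE = SST is what lets us speak of the model "explaining" a share of the total variance; the f-statistic then compares the explained and unexplained parts after adjusting each for its degrees of freedom.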
f = (SSR/DFssr) / (SSE/DFsse)
DFssr = Degrees of freedom for the regression model; the value is equal to the number of predictor variables (coefficients excluding the intercept)
DFssr = p
DFsse = Degrees of freedom for error; the value is equal to the total number of records (N) minus the number of predictors (p) minus 1 (for the intercept)
DFsse = N – p – 1
Thus, the formula for f-statistics can be written as the following:
f = (SSR/p) / (SSE/(N – p – 1))
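The formula above is straightforward to transcribe into code. The following is a minimal sketch in pure Python, applied to the sums of squares from the worked example later in the post:

```python
# Direct transcription of f = (SSR/p) / (SSE/(N - p - 1)).

def f_statistic(ssr, sse, p, n):
    """F-statistic from sums of squares.

    ssr: sum of squares regression, sse: sum of squares error,
    p: number of predictor variables, n: number of records.
    """
    msr = ssr / p            # mean sum of squares regression
    mse = sse / (n - p - 1)  # mean sum of squares error
    return msr / mse

# Figures from the household-sales example (SSR, SSE, p = 3, N = 200):
print(round(f_statistic(770565.1, 1557415.4, 3, 200), 3))  # 32.325
```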
Example: f-statistics with a linear regression model
Let's say we have a problem of estimating sales in terms of the household income, the age of the head of the house, and the household size. We have a data set of 200 records. The following is the linear regression model:
y = β0 + β1*Income + β2*HH.size + β3*Age
Where y is the estimated sales, Income is the household income (in $1000s), Age is the age of head of house (in years) and HH.size is the household size (number of people in the household).
The following represents the hypothesis test for the linear regression model:
H0: β1 = β2 = β3 = 0
Ha: At least one of the coefficients is not equal to zero.
Now, let’s perform the hypothesis testing by calculating f-statistics for this problem.
DFssr = p = 3 (number of predictor variables)
SSR is calculated as 770565.1
MSR = SSR/DFssr = 770565.1 / 3 = 256855.033
DFsse = N – p – 1 = 200 – 3 – 1 = 196
SSE is calculated as 1557415.4
MSE = SSE/DFsse = 1557415.4 / 196 = 7945.99
The f-statistic can be calculated using the following formula:
f = MSR / MSE
= 256855.033 / 7945.99
The f-statistic can therefore be stated as:
f = 32.325 with 3 and 196 degrees of freedom.
The next step is to find the critical value of the F-distribution at the 0.05 significance level with 3 and 196 degrees of freedom:
f (critical value) = 2.651
Since the f-statistic of 32.325 is greater than the critical value of 2.651, there is statistical evidence for rejecting H0: β1 = β2 = β3 = 0. We reject the null hypothesis that all coefficients are zero. Thus, the alternate hypothesis holds: at least one of the coefficients of the predictor variables (income, age, HH.size) is non-zero.
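As a sketch (assuming SciPy is available), the critical-value lookup and the decision rule for this example can be reproduced as follows:

```python
# Critical value of the F-distribution via SciPy, and the reject/fail-to-reject
# decision for the household-sales example. Assumes scipy is installed.
from scipy.stats import f

alpha = 0.05
df_regression, df_error = 3, 196  # p and N - p - 1 from the example

# ppf is the inverse CDF: the value below which 1 - alpha of the mass lies.
f_critical = f.ppf(1 - alpha, df_regression, df_error)
print(round(f_critical, 3))  # approximately 2.651

f_observed = 32.325
if f_observed > f_critical:
    print("Reject H0: at least one coefficient is non-zero")
else:
    print("Fail to reject H0")
```

Equivalently, one could compute the p-value with `f.sf(f_observed, df_regression, df_error)` and reject H0 when it falls below alpha; the two formulations always agree.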
The f-statistic is used to test the overall significance of the regression coefficients in a linear regression model. It is calculated as MSR/MSE, where MSR is the mean sum of squares regression and MSE is the mean sum of squares error. MSR can be calculated as SSR/DFssr, where SSR is the sum of squares regression and DFssr is the degrees of freedom for the regression model. MSE can be calculated as SSE/DFsse, where SSE is the sum of squares error and DFsse is the degrees of freedom for error. The critical value is looked up from the F-distribution at the chosen significance level and the two degrees of freedom. If the f-statistic is greater than the critical value, we reject the null hypothesis and conclude that there is a significant relationship between the predictor variables and the response variable.