In this blog post, we will take a look at the **concepts** and **formula** of the **F-statistic** in linear regression models and understand them with the help of **examples**. The F-test and F-statistic are very important concepts to understand if you want to be able to properly **interpret** the summary results of training linear regression machine learning models. We will start by discussing the importance of the F-statistic in building linear regression models and understand how it is calculated based on its **formula**. We will then understand the concept with some real-world **examples**. As data scientists, it is very important to understand both the F-statistic and the t-statistic and how they help in arriving at the most appropriate linear regression model.

## Linear Regression Model & Need for F-test / F-statistics

Linear regression is a fundamental statistical technique used to model the relationship between a dependent variable (also known as the response or target variable) and one or more independent variables (also known as predictors or features). The main goal of linear regression is to find the best-fitting straight line through the data points, known as the regression line, which minimizes the sum of squared differences between the observed values and the predicted values. There are different types of hypothesis tests, such as the t-test and the F-test, which are used for assessing the suitability of the linear regression model. You may want to check this blog to learn more – linear regression hypothesis testing example.
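The least-squares fit described above can be sketched in a few lines of Python (a minimal illustration assuming NumPy is available; the data is synthetic and the coefficients 2.0 and 3.0 are made up for the example):

```python
import numpy as np

# Synthetic data: y ≈ 2.0 + 3.0 * x plus random noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 3.0 * x + rng.normal(scale=1.0, size=x.size)

# Design matrix with an intercept column; lstsq finds the coefficients
# that minimize the sum of squared differences between observed and
# predicted values of y
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ beta
sse = np.sum((y - y_hat) ** 2)  # unexplained (residual) variation
print(beta)  # estimated intercept and slope, close to 2.0 and 3.0
```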

The question that needs to be asked, or the hypothesis that needs to be tested, is whether a linear regression model exists at all, i.e., whether the response variable can be represented as a linear function of the predictor variables. This is tested by setting up the null hypothesis that the response variable cannot be represented as a function of any of the predictor variables. Thus, if the following is a linear regression model or function:

y = β0 + β1x1 + β2x2 + β3x3,

Where

- y is the response variable
- x1, x2, and x3 are predictor variables
- β0 is the intercept
- β1, β2, and β3 are the coefficients or parameters to be estimated for the predictor variables x1, x2, and x3

Then, the null and alternate hypotheses can be written as:

H0: β1 = β2 = β3 = 0 (Regression model does not exist)

Ha: At least one βi is not equal to zero (regression model exists)

The above hypothesis can be tested using a statistical test such as the **F-test**, and the **test statistic is called the F-statistic**. The F-statistic helps assess the significance of the entire regression model. In other words, it tests whether the model as a whole (including all the predictor variables) explains a significant amount of the variation in the dependent variable, compared to a model with no predictors (known as the null model).

The F-statistic is based on the ratio of two variances: the **explained variance (due to the model)** and the **unexplained variance (residuals)**. *By comparing these variances, the F-statistic helps us determine whether the regression model significantly explains the variation in the dependent variable or whether the variation can be attributed to random chance.* A larger F-statistic indicates that the model accounts for a substantial portion of the total variance, while a smaller F-statistic suggests that the model does not explain much of the variance and thus may not be a useful model. The F-statistic is calculated from the following formula:

f = MSR / MSE

= Mean sum of squares regression / Mean sum of squares error

The F-statistic follows an **F-distribution**, and its value helps to **determine the probability (p-value) of observing such a statistic if the null hypothesis is true** (i.e., no relationship between the dependent and independent variables). If the p-value is smaller than a predetermined significance level (e.g., 0.05), the null hypothesis is rejected, and we conclude that the regression model is statistically significant.
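As a sketch of this p-value computation (assuming SciPy is available; the F value of 5.0 and the degrees of freedom (3, 96) are hypothetical numbers chosen for illustration):

```python
from scipy.stats import f as f_dist

# Hypothetical F-statistic and degrees of freedom (p = 3 predictors,
# N - p - 1 = 96 error degrees of freedom)
f_value = 5.0
dfn, dfd = 3, 96

# p-value: probability of observing an F at least this large if H0 is true
p_value = f_dist.sf(f_value, dfn, dfd)

if p_value < 0.05:
    print("Reject H0: the regression model is statistically significant")
```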

Let’s learn the concept of the mean sum of squares regression (MSR) and the mean sum of squares error / residual (MSE) in terms of explained and unexplained variance:

The variance explained by the regression model is represented by the sum of squares for the model, or sum of squares regression (SSR). The variance not explained by the regression model is the sum of squares for error (SSE), or the sum of squares for residuals. The F-statistic is defined as a function of SSR and SSE in the following manner:

**[latex]f = (SSR/DF_{ssr}) / (SSE/DF_{sse})[/latex]**

[latex]DF_{ssr}[/latex] = Degrees of freedom for the regression model; the value is equal to the number of predictor variables (not counting the intercept)

[latex]DF_{ssr}[/latex] = p

[latex]DF_{sse}[/latex] = Degrees of freedom for error; the value is equal to the total number of records (N) minus the number of predictors (p) minus 1 (for the intercept)

[latex]DF_{sse}[/latex] = N – p – 1

Thus, the **formula** for the F-statistic can be written as the following:

**f = (SSR/p) / (SSE/(N - p - 1))**
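The formula above translates directly into a small Python helper (a plain-Python sketch; the function name is my own):

```python
def f_statistic(ssr: float, sse: float, p: int, n: int) -> float:
    """Compute f = (SSR / p) / (SSE / (N - p - 1)).

    ssr: sum of squares regression (explained variation)
    sse: sum of squares error (unexplained variation)
    p:   number of predictor variables (excluding the intercept)
    n:   total number of records
    """
    msr = ssr / p            # mean sum of squares regression (MSR)
    mse = sse / (n - p - 1)  # mean sum of squares error (MSE)
    return msr / mse
```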

### Importance of understanding F-statistics vis-a-vis Linear Regression Model

Understanding F-statistics is crucial for anyone working with linear regression models for several reasons:

- **Model significance**: F-statistics allows you to assess the overall significance of the model, which helps determine whether the model is worth interpreting further or needs improvement.
- **Variable selection**: F-statistics can be used as a criterion for variable selection, helping you identify the most important predictor variables and build a parsimonious model.
- **Model comparison**: F-statistics can be employed to compare the performance of different models, especially when adding or removing predictor variables.

## Example: f-statistics & Linear Regression Model

Let’s say we have a problem of estimating sales in terms of the household income, the age of the head of the house, and the household size. We have a data set of 200 records. The following is the linear regression model:

y = β0 + β1*Income + β2*HH.size + β3*Age

Where y is the estimated sales, Income is the household income (in $1000s), Age is the age of head of house (in years) and HH.size is the household size (number of people in the household).

The following represents the hypothesis test for the linear regression model:

H0: β1 = β2 = β3 = 0

Ha: At least one of the coefficients is not equal to zero.

Now, let’s perform the hypothesis testing by calculating f-statistics for this problem.

DFssr = p = 3 (number of predictor variables)

SSR is calculated as 770565.1

**MSR = SSR/DFssr = 770565.1 / 3 = 256855.033**

DFsse = N – p – 1 = 200 – 3 – 1 = 196

SSE is calculated as 1557415.4

**MSE = SSE/DFsse = 1557415.4 / 196 = 7945.99**

The f-statistic can be calculated using the following formula:

f = MSR / MSE

= **256855.033 / 7945.99**

= **32.325**
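The arithmetic above can be checked in a few lines of plain Python:

```python
ssr, sse = 770565.1, 1557415.4   # values from the example above
n, p = 200, 3

msr = ssr / p            # mean sum of squares regression ≈ 256855.033
mse = sse / (n - p - 1)  # mean sum of squares error ≈ 7945.99
f = msr / mse

print(round(f, 3))  # → 32.325
```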

The F-statistic can be represented as the following:

f = 32.325 with (3, 196) degrees of freedom.

The next step is to find the critical value of the F-statistic at the 0.05 significance level with (3, 196) degrees of freedom.

f (critical value) = 2.651.

As the F-statistic of 32.325 is greater than the critical value of 2.651, there is statistical evidence for rejecting H0: β1 = β2 = β3 = 0. We can reject the null hypothesis that all the coefficients are equal to 0. Thus, the alternate hypothesis holds, which means that at least one of the coefficients related to the predictor variables (income, age, and HH.size) is non-zero.
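Assuming SciPy is available, the critical value lookup and the decision above can be reproduced as follows:

```python
from scipy.stats import f as f_dist

f_value = 32.325     # F-statistic from the example
dfn, dfd = 3, 196    # degrees of freedom (p, N - p - 1)

# Critical value of F at the 0.05 significance level
critical = f_dist.ppf(0.95, dfn, dfd)
print(round(critical, 3))  # ≈ 2.651

# Since the F-statistic exceeds the critical value, reject H0
if f_value > critical:
    print("Reject H0: at least one coefficient is non-zero")
```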

## Summary

The F-statistic is used to test the significance of regression coefficients in linear regression models. It can be calculated as MSR/MSE, where MSR represents the mean sum of squares regression and MSE represents the mean sum of squares error. MSR can be calculated as SSR/DFssr, where SSR is the sum of squares regression and DFssr represents the degrees of freedom for the regression model. MSE can be calculated as SSE/DFsse, where SSE is the sum of squares error and DFsse represents the degrees of freedom for error. The critical value of the F-statistic can be found from F-distribution tables or statistical software for the chosen significance level and degrees of freedom. If the value of the F-statistic is greater than the critical value, we can reject the null hypothesis and conclude that there is a significant relationship between the predictor variables and the response variable.
