R-squared and adjusted R-squared are two common measures of how well a linear regression model fits the data. Both are useful, but they answer slightly different questions. In this blog post, we will discuss the differences between adjusted R-squared and R-squared and provide some examples to illustrate their meanings. As a data scientist, you need to understand these differences in order to select the most appropriate model from several candidate linear regression models.
What is R-squared?
R-squared is a measure of what proportion of the variance in the value of the dependent or response variable is explained by the regression model built using one or more independent or predictor variables. Mathematically, the value of R-squared can be calculated as the following:
R-squared = sum of squares regression (SSR) / sum of squares total (SST)
It can also be calculated from the residuals using the following formula:
R-squared = 1 – (sum of squares residual error (SSE) / sum of squares total (SST))
Note that SST = SSR + SSE for a least-squares fit that includes an intercept, which is why the two formulas give the same value.
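The two formulas above can be checked numerically. The following is a minimal sketch, assuming an ordinary least-squares fit with an intercept done with NumPy; the helper name `sums_of_squares` and the synthetic data are made up for illustration.

```python
import numpy as np

def sums_of_squares(X, y):
    """Fit least squares with an intercept and return (SSR, SSE, SST)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if X.ndim == 1:
        X = X.reshape(-1, 1)
    # Prepend a column of ones so the model includes an intercept.
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    y_hat = A @ coef
    ssr = np.sum((y_hat - y.mean()) ** 2)  # explained by the regression
    sse = np.sum((y - y_hat) ** 2)         # left in the residuals
    sst = np.sum((y - y.mean()) ** 2)      # total variation around the mean
    return ssr, sse, sst

# Synthetic data, assumed purely for illustration.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 * x + rng.normal(size=50)

ssr, sse, sst = sums_of_squares(x, y)
r2_from_ssr = ssr / sst
r2_from_sse = 1 - sse / sst
print(r2_from_ssr, r2_from_sse)
```

Both printed values should agree up to floating-point error, because SSR + SSE equals SST for this kind of fit.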
For more detail on R-squared, check out this blog post: R-squared in Linear Regression: Concepts, Examples.
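For a least-squares model with an intercept, evaluated on the data it was fitted to, the value of R-squared ranges from 0 to 1. As a rough rule of thumb, a value greater than 0.50 indicates that the model explains a substantial proportion of the variance in the response variable, although what counts as a good value depends on the problem domain.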
The value of R-squared never decreases, and usually increases, as more independent variables are added to the regression model. However, more is not always better: an additional variable might increase R-squared without actually improving the model's performance. This is where the concept of adjusted R-squared comes into the picture. In the next section, we will learn about adjusted R-squared.
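To see this behaviour in practice, here is a small sketch that keeps adding pure-noise predictors to a model and records R-squared each time. It assumes a least-squares fit with an intercept; the helper `r_squared`, the seed, and the data are hypothetical choices for illustration only.

```python
import numpy as np

def r_squared(X, y):
    """R-squared of a least-squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    return 1 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(42)
n = 40
X = rng.normal(size=(n, 1))             # one genuinely useful predictor
y = 3.0 * X[:, 0] + rng.normal(size=n)

r2_values = [r_squared(X, y)]
for _ in range(5):
    # Append a predictor that is unrelated to y by construction.
    X = np.column_stack([X, rng.normal(size=n)])
    r2_values.append(r_squared(X, y))

print(r2_values)
```

The printed list should be non-decreasing even though none of the added columns carries any information about y.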
What is Adjusted R-squared?
Adjusted r-squared can be defined as the proportion of variance explained by the model while taking into account both the number of predictor variables and the number of samples used in the regression analysis. The adjusted r-squared increases only when adding an additional variable to the model improves its predictive capability more than expected by chance alone.
Mathematically, adjusted R-squared can be calculated in the following manner:
Adjusted R-squared = 1 – ((RSS / DFmodel) / (TSS / DFmean_model))
where:
RSS represents the residual sum of squares or sum of squares residual error (SSE)
TSS represents the total sum of squares or sum of squares total (SST)
DFmodel = Residual degrees of freedom of the regression model = N – P – 1, where P is the number of predictor variables and N is the number of records
DFmean_model = Residual degrees of freedom of the model that predicts the mean of the response variable for every record = N – 1, where N is the number of records.
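Substituting these degrees of freedom, the formula for adjusted R-squared becomes a function of R-squared:
Adjusted R-squared = 1 – ((1 – R-squared) × (N – 1) / (N – P – 1))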
Note that (1 – R-squared) is the same as RSS/TSS or SSE / SST.
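As a quick sanity check, the sketch below computes adjusted R-squared both from the sums of squares and from R-squared, and confirms that the two routes agree. The function names and the numbers (RSS = 20, TSS = 100, N = 50, P = 3) are made up for illustration.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R-squared from R-squared, n records and p predictors."""
    if n - p - 1 <= 0:
        raise ValueError("need more records than predictors plus one")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

def adjusted_r_squared_from_sums(rss, tss, n, p):
    """Adjusted R-squared from RSS and TSS, each divided by its degrees of freedom."""
    if n - p - 1 <= 0:
        raise ValueError("need more records than predictors plus one")
    return 1 - (rss / (n - p - 1)) / (tss / (n - 1))

# Hypothetical values chosen only to exercise the formulas.
rss, tss, n, p = 20.0, 100.0, 50, 3
r2 = 1 - rss / tss                      # 0.8

adj_from_r2 = adjusted_r_squared(r2, n, p)
adj_from_sums = adjusted_r_squared_from_sums(rss, tss, n, p)
print(r2, adj_from_r2, adj_from_sums)
```

With these numbers, adjusted R-squared comes out a little below the R-squared of 0.8, since the factor (N – 1) / (N – P – 1) is greater than 1 whenever there is at least one predictor.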
Difference between R-squared and Adjusted R-squared
The following are the key differences between R-squared and adjusted R-squared:
- Adjusted R-squared takes into account the number of predictor variables and the number of records, whereas R-squared does not. R-squared can therefore be misleading when many predictor variables are included in the regression, and adjusted R-squared is the better measure for comparing models with different numbers of predictors. It is still useful to report both measures when assessing a linear regression model.
- Adding a predictor variable never decreases the value of R-squared, but the value of adjusted R-squared does not necessarily increase. In fact, adjusted R-squared decreases when the added predictor does not improve the fit enough to offset the lost degree of freedom.
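- For a least-squares model with an intercept, the value of R-squared can never be negative. The value of adjusted R-squared, however, can be negative. This happens when R-squared is smaller than P / (N – 1), that is, when the model explains very little variance relative to the number of predictors it uses.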
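The last point can be illustrated with a few lines of arithmetic. This is a minimal sketch using made-up values (N = 20 records, P = 3 predictors); the threshold P / (N – 1) follows from setting the adjusted R-squared formula to zero and solving for R-squared.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R-squared from R-squared, n records and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

n, p = 20, 3                 # hypothetical sample size and predictor count
threshold = p / (n - 1)      # adjusted R-squared is zero at this R-squared

adj_weak = adjusted_r_squared(0.05, n, p)    # R-squared below the threshold
adj_better = adjusted_r_squared(0.20, n, p)  # R-squared above the threshold
adj_at_threshold = adjusted_r_squared(threshold, n, p)

print(threshold, adj_weak, adj_better, adj_at_threshold)
```

Here an R-squared of 0.05 falls below the threshold of roughly 0.158, so its adjusted value is negative, while an R-squared of 0.20 stays positive after the adjustment.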
Adjusted R-squared is generally a more honest measure than R-squared of how much variance in the response or dependent variable (Y) is explained by the regression model. It takes into account both the number of predictor variables and the number of records, whereas R-squared ignores both. For example, if you add a new independent variable that has no real relationship with Y, R-squared will stay the same or increase slightly, while adjusted R-squared will typically decrease because the small gain in fit does not make up for the lost degree of freedom. Such a variable makes the model more complex without adding information about the response variable, and it can reduce predictive accuracy on new data.