In linear regression, R-squared (R2) is a measure of how close the data points are to the fitted line. It is also known as the coefficient of determination. In this post, you will learn about the concept of R-Squared in relation to assessing the performance of multilinear regression machine learning model with the help of some real-world examples explained in a simple manner.
Before doing a deep dive, you may want to access some of the following blog posts in relation to concepts of linear regression:
- Linear regression explained with real-world examples
- Linear regression hypothesis testing: concepts, examples
- Linear regression t-test: formula, examples
- Interpreting f-statistics in linear regression: formula, examples
What is R-Squared?
R-squared or R2 or coefficients of determination is defined as the proportion of variation of data points explained by the regression line or model. It can be determined as a ratio of total variation of data points explained by the regression line (Sum of squared regression) and total variation of data points from the mean (also termed as sum of squares total or total sum of squares). The following formula represents the ratio. y_hat represents the prediction or a point on the regression line, y_bar represents the mean of all the values and y_i represents the actual values or the points.
It can also be calculated as a function of total variation of data points from the regression line (also termed as sum of square residual) and total variation of data points from the mean. The following represents the formula. y_hat represents the prediction or a point on the regression line, y_i represents the actual values or the points and y_bar represents the mean of all the values
Let’s understand the concepts and formulas in detail. Once we have built a multilinear regression model, the next thing is to determine the model performance. The model predictability performance can be evaluated in terms of R-squared or coefficient of determination, although the more suitable measure is adjusted R-squared. The concepts of adjusted R-squared and how it is different from R-squared will be dealt in another blog. Let’s look at the following diagram to understand the concepts of R-squared.
Note some of the following in the above diagram in relation to learning the concepts of R-squared / R2.
- The horizontal red line represents the mean of all the values of the response variable of the regression model. In the diagram, it is represented as mean of actual / response variable value.
- The variation of the actual values from the mean or the horizontal line is represented as a function of variance of the points from the mean. Thus, the variation of the values from the mean is calculated as the sum of squared distance of individual points from the mean. This is also called sum of squares total (SST). Recall that the variance (σ2) is used to measurehow data points in a specific population or sample are spread outand is calculated as the squared sum of distance of individual points from the mean divided by N or N – 1 depending upon whether the population variance or the sample variance needs to be calculated respectively. At times, SST is also referred to as the total error. Mathematically, SST is represented as the following where y_bar represents the mean value and the y_i represents the actual value.
- The total variation of actual values from the regression line represents the prediction error in terms of sum of squared distance between the actual values and the prediction made by regression line. It is also termed as sum of squared residuals error (SSE). Mathematically, SSE is represented as the following where y_hat represents the prediction and the y_i represents the actual value.
- The total variation of prediction (represented using regression line) from the mean is represented as sum of squared distance between the prediction and the mean. It is also termed as sum of squared regression (SSR). At times, the SSR is also termed as the explained error or explained variance. Mathematically, SSR is represented as the following where y_hat represents the prediction and the y_bar represents the mean value.
R-Squared Concepts & Best-fit Regression Line
The following are important concepts to be understood in relation to the value of R-squared and how is it used to determine the best-fit line or regression model performance.
- Greater the value of SSR or sum of squared regression, better is the regression line. In other words, closer the value of SSR to SST (sum of squared total), better is the regression line. That would mean the value of R-squared to be closer to 1 as R-squared = SSR / SST
- Lesser the value of SSE of sum of squared residuals, better is the regression line. In other words, closer the value of SSE to zero (0), better is the regression line. That would mean that the value of R–squared is closer to 1 as R-squared = 1 – (SSE/SST).
When you fit the linear regression model using R programming, the following gets printed out as summary of regression model. Note the value of R-squared as 0.6929. We can look for more predictor variables in order to appropriately increase the value of R-squared and adjusted R-squared. The data below represents the regression model built to predict the housing price in terms of predictor variables such as crim, chas, rad, lstat. You can load the BostonHousing data as part of mlbench package in R.
In this post, you learned about the concept of R-Squared and how it is used to determine how well the multilinear regression model fit the data. The value of R-Squared lies in the range of 0 and 1. Closer the value of R-Squared to 1, better is the regression model. The value of R-Squared increases with the addition of features. However, one should consider the value of adjusted R-Squared for deciding whether to add the features or not. The concept of adjusted R-squared will be dealt in the next blog.