In linear regression, R-squared (R²) is a measure of how close the data points are to the fitted line. It is also known as the coefficient of determination. Understanding R-squared is crucial for data scientists, as it helps in evaluating the goodness of fit of linear regression models, comparing the explanatory power of different models on the same dataset, and communicating model performance to stakeholders. In this post, you will learn about the concept of R-squared in relation to assessing the performance of a multilinear regression machine learning model, with the help of some real-world examples explained in a simple manner.
Before doing a deep dive, you may want to access some of the following blog posts in relation to concepts of linear regression:
 Linear regression explained with real-world examples
 Linear regression hypothesis testing: concepts, examples
 Linear regression t-test: formula, examples
 Interpreting f-statistics in linear regression: formula, examples
What is R-Squared?
R-squared, R², or the coefficient of determination is a statistical measure that represents the proportion of the variance in the dependent variable that is predictable from the independent variable(s) in a regression model. Let's understand the concept of R-squared visually using scatter plots with linear regression lines and different R-squared values.
The plots above visually represent three different scenarios of R-squared in linear regression:

High R-squared (Left Plot):
 The data points (blue) are closely aligned with the regression line (red).
 This indicates a strong linear relationship between the independent and dependent variables.
 The R-squared value is high (close to 1), suggesting that the model explains a significant proportion of the variance in the dependent variable.

Moderate R-squared (Middle Plot):
 Here, the data points (green) show more dispersion around the regression line.
 The linear relationship is still evident but not as strong as in the high R-squared scenario.
 The R-squared value is moderate, indicating that the model explains a decent but not overwhelming portion of the variance.

Low R-squared (Right Plot):
 The data points (purple) are widely scattered around the regression line.
 This indicates a weak linear relationship between the variables.
 The low R-squared value suggests that the model does not explain much of the variance in the dependent variable.
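The three scenarios above can be reproduced with a short sketch: fit a line to synthetic data at increasing noise levels and watch R-squared fall. The variable names, seed, and noise levels below are illustrative assumptions, not values taken from the plots.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=(100, 1))

# Fit the same linear model at three noise levels and record R-squared
results = {}
for label, noise_sd in [("high", 1.0), ("moderate", 5.0), ("low", 15.0)]:
    y = 2 * x + rng.normal(0, noise_sd, size=(100, 1))
    model = LinearRegression().fit(x, y)
    results[label] = r2_score(y, model.predict(x))
    print(f"{label} R-squared scenario: {results[label]:.2f}")
```

As the noise standard deviation grows, the share of variance the line can explain shrinks, which is exactly what the three plots show.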
Mathematical Explanation of R-Squared
Mathematically, R-squared can be determined as the ratio of the total variation of the data points explained by the regression line (sum of squares regression) to the total variation of the data points from the mean (also termed the sum of squares total, or total sum of squares). The following formula represents the ratio, where y_hat_i represents the prediction (a point on the regression line), y_bar represents the mean of all the values, and y_i represents the actual values:

R-squared = SSR / SST = Σ (y_hat_i − y_bar)² / Σ (y_i − y_bar)²
It can also be calculated as a function of the total variation of the data points from the regression line (also termed the sum of squares residual) and the total variation of the data points from the mean. The following represents the formula, where y_hat_i represents the prediction, y_i represents the actual values, and y_bar represents the mean of all the values:

R-squared = 1 − SSE / SST = 1 − Σ (y_i − y_hat_i)² / Σ (y_i − y_bar)²
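Both formulas can be checked by hand on a small example. For an ordinary least-squares line with an intercept, the two expressions agree exactly, since in that case SST = SSR + SSE. A minimal sketch (the toy data points are made up for illustration):

```python
import numpy as np

# Toy data, roughly following y = 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Ordinary least-squares fit: slope and intercept
slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept
y_bar = y.mean()

ssr = np.sum((y_hat - y_bar) ** 2)  # sum of squares regression
sse = np.sum((y - y_hat) ** 2)      # sum of squares residual
sst = np.sum((y - y_bar) ** 2)      # total sum of squares

r2_ratio = ssr / sst        # R-squared = SSR / SST
r2_resid = 1 - sse / sst    # R-squared = 1 - SSE / SST
print(r2_ratio, r2_resid)
```

Note that the equality of the two computations relies on the least-squares fit; for an arbitrary set of predictions that did not come from such a fit, SSR / SST and 1 − SSE / SST can differ.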
Let’s understand the concepts and formulas in detail. Once we have built a multilinear regression model, the next step is to determine the model's performance. The model's predictive performance can be evaluated in terms of R-squared or the coefficient of determination, although the more suitable measure is adjusted R-squared. The concept of adjusted R-squared and how it differs from R-squared will be dealt with in another blog. Let’s look at the following diagram to understand the concepts of R-squared.
Note some of the following in the above diagram in relation to learning the concepts of R-squared / R².
 The horizontal red line represents the mean of all the values of the response variable of the regression model. In the diagram, it is labeled as the mean of the actual / response variable values.
 The variation of the actual values from the mean, or the horizontal line, is represented as a function of the variance of the points from the mean. Thus, the variation of the values from the mean is calculated as the sum of squared distances of individual points from the mean. This is also called the sum of squares total (SST). Recall that the variance (σ²) is used to measure how data points in a specific population or sample are spread out, and is calculated as the sum of squared distances of individual points from the mean divided by N or N − 1, depending upon whether the population variance or the sample variance is being calculated, respectively. At times, SST is also referred to as the total error. Mathematically, SST is represented as the following, where y_bar represents the mean value and y_i represents the actual value: SST = Σ (y_i − y_bar)²
 The total variation of the actual values from the regression line represents the prediction error, in terms of the sum of squared distances between the actual values and the predictions made by the regression line. It is also termed the sum of squared residuals error (SSE). Mathematically, SSE is represented as the following, where y_hat_i represents the prediction and y_i represents the actual value: SSE = Σ (y_i − y_hat_i)²
 The total variation of the predictions (represented using the regression line) from the mean is represented as the sum of squared distances between the predictions and the mean. It is also termed the sum of squares regression (SSR). At times, SSR is also termed the explained error or explained variance. Mathematically, SSR is represented as the following, where y_hat_i represents the prediction and y_bar represents the mean value: SSR = Σ (y_hat_i − y_bar)²
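The three sums of squares above can be verified numerically. A minimal sketch, assuming synthetic data and an ordinary least-squares fit with an intercept, for which the decomposition SST = SSR + SSE holds:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(50, 1))
y = 3 * x.ravel() + rng.normal(0, 2, size=50)

model = LinearRegression().fit(x, y)
y_hat = model.predict(x)
y_bar = y.mean()

sst = np.sum((y - y_bar) ** 2)      # total variation about the mean
sse = np.sum((y - y_hat) ** 2)      # unexplained (residual) variation
ssr = np.sum((y_hat - y_bar) ** 2)  # explained variation

# For OLS with an intercept, the total error decomposes: SST = SSR + SSE
print(sst, ssr + sse)
```

The decomposition is why the two R-squared formulas, SSR / SST and 1 − SSE / SST, give the same number for a fitted regression line.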
R-Squared Concepts & Best-fit Regression Line
The following are important concepts to be understood in relation to the value of R-squared and how it is used to determine the best-fit line or regression model performance.
 The greater the value of SSR (sum of squares regression), the better the regression line. In other words, the closer the value of SSR is to SST (sum of squares total), the better the regression line. That would mean the value of R-squared is closer to 1, as R-squared = SSR / SST.
 The lesser the value of SSE (sum of squared residuals), the better the regression line. In other words, the closer the value of SSE is to zero (0), the better the regression line. That would mean the value of R-squared is closer to 1, as R-squared = 1 − (SSE / SST).
When you fit a linear regression model using R programming, the following summary of the regression model gets printed out. Note the R-squared value of 0.6929. We can look for more predictor variables in order to appropriately increase the value of R-squared and adjusted R-squared. The output below represents the regression model built to predict the housing price in terms of predictor variables such as crim, chas, rad, and lstat. You can load the BostonHousing data as part of the mlbench package in R.
R-Squared for Regression Models: Python Code Example
In this section, we will look into a Python code example that demonstrates how to use R-squared in the context of linear regression. In this example, we'll:
 Fit a linear regression model to the data.
 Calculate the R-squared value to assess the model's performance.
```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import matplotlib.pyplot as plt

# Generate synthetic data
np.random.seed(0)
x = np.random.rand(100, 1) * 10              # Independent variable
y = 2 * x + np.random.randn(100, 1) * 2      # Dependent variable with some noise

# Fit regression model
model = LinearRegression()
model.fit(x, y)

# Predict and calculate R-squared
y_pred = model.predict(x)
r2 = r2_score(y, y_pred)

# Plot the results
plt.scatter(x, y, color='blue', label='Data Points')
plt.plot(x, y_pred, color='red', label='Regression Line')
plt.title(f'R-squared: {r2:.2f}')
plt.xlabel('Independent Variable')
plt.ylabel('Dependent Variable')
plt.legend()
plt.show()
```
Here is the plot representing the regression model for a specific R-squared value.
The executed Python code produces a scatter plot that visualizes the linear regression model’s fit to the synthetic data. Here’s a breakdown of the result:
 Data Points (Blue Dots): Represent the synthetic data generated for the independent variable (X-axis) and the dependent variable (Y-axis). The dependent variable has been constructed to have a linear relationship with the independent variable, plus some random noise.
 Regression Line (Red Line): This is the line of best fit determined by the linear regression model. It represents the model’s prediction of the dependent variable based on the independent variable.
Summary
In this post, you learned about the concept of R-squared and how it is used to determine how well a multilinear regression model fits the data. The value of R-squared lies in the range of 0 to 1. The closer the value of R-squared is to 1, the better the regression model. The value of R-squared increases with the addition of features. However, one should consider the value of adjusted R-squared when deciding whether or not to add features. The concept of adjusted R-squared will be dealt with in the next blog.
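The claim that R-squared increases with the addition of features can be demonstrated directly: with ordinary least squares, training R-squared never decreases when a new column is appended, even a pure-noise column that carries no real signal. A small sketch, where the data, seed, and noise features are synthetic assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
n = 100
x_useful = rng.uniform(0, 10, size=(n, 1))
y = 2 * x_useful.ravel() + rng.normal(0, 3, size=n)

# Start with the one useful feature, then keep appending noise columns
r2_values = []
X = x_useful
for _ in range(4):
    model = LinearRegression().fit(X, y)
    r2_values.append(r2_score(y, model.predict(X)))
    X = np.hstack([X, rng.normal(size=(n, 1))])  # pure-noise feature

print([round(v, 4) for v in r2_values])
```

The sequence of R-squared values is non-decreasing even though the extra columns are useless, which is exactly why adjusted R-squared, which penalizes added features, is the better guide for feature selection.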