In linear regression, R-squared (R²) is a measure of how close the data points are to the fitted line. It is also known as the coefficient of determination. Understanding R-squared is crucial for data scientists because it helps them evaluate the goodness of fit of linear regression models, compare the explanatory power of different models on the same dataset, and communicate model performance to stakeholders. In this post, you will learn how R-squared is used to assess the performance of a multiple linear regression model, with the help of some real-world examples explained in a simple manner.
R-squared (also written R² and called the coefficient of determination) is a statistical measure that represents the proportion of the variance in the dependent variable that is predictable from the independent variable(s) in a regression model. Let's understand R-squared visually using scatter plots with linear regression lines and different R-squared values.
The plots above visually represent three different scenarios of R-squared in linear regression:
High R-squared (Left Plot):
Moderate R-squared (Middle Plot):
Low R-squared (Right Plot):
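The three scenarios listed above can be reproduced numerically. The following is a minimal sketch, not the code behind the original plots: the noise levels, the sample size and the helper name `r_squared` are assumptions chosen for illustration. It fits a straight line to the same underlying relationship under increasing amounts of noise and prints the resulting R-squared values.

```python
import numpy as np

def r_squared(y, y_hat):
    """R-squared computed as 1 - (residual sum of squares / total sum of squares)."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)

# Same true line (y = 2x) in every scenario; only the noise level changes.
# The noise standard deviations are illustrative assumptions.
scenario_r2 = {}
for name, noise_sd in [("high", 1.0), ("moderate", 5.0), ("low", 20.0)]:
    y = 2.0 * x + rng.normal(0.0, noise_sd, size=x.size)
    slope, intercept = np.polyfit(x, y, deg=1)  # least squares straight-line fit
    scenario_r2[name] = r_squared(y, slope * x + intercept)
    print(f"{name:>8} R-squared scenario (noise sd = {noise_sd}): {scenario_r2[name]:.3f}")
```

The more the points scatter around the fitted line, the smaller the share of the total variation the line explains, and the lower the R-squared value.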
Mathematically, it can be determined as the ratio of the variation explained by the regression line (the regression sum of squares) to the total variation of the data points about the mean (also termed the sum of squares total or total sum of squares). The following formula represents the ratio:

$$R^2 = \frac{SS_{\mathrm{reg}}}{SS_{\mathrm{tot}}} = \frac{\sum_i (\hat{y}_i - \bar{y})^2}{\sum_i (y_i - \bar{y})^2}$$

Here $\hat{y}_i$ (y_hat) represents the prediction, that is, a point on the regression line, $\bar{y}$ (y_bar) represents the mean of all the values and $y_i$ represents the actual values or the points.
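It can also be calculated from the variation of the data points about the regression line (also termed the residual sum of squares) and the total variation of the data points about the mean. The following represents the formula:

$$R^2 = 1 - \frac{SS_{\mathrm{res}}}{SS_{\mathrm{tot}}} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

Here $\hat{y}_i$ (y_hat) represents the prediction, that is, a point on the regression line, $y_i$ represents the actual values or the points and $\bar{y}$ (y_bar) represents the mean of all the values. For a least squares fit that includes an intercept, the total sum of squares equals the regression sum of squares plus the residual sum of squares, so this formula and the ratio of explained to total variation give the same value.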
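The two ways of computing R-squared described above can be checked against each other in a few lines. This is a minimal sketch on synthetic data; the data-generating line, the noise level and the variable names (`ss_reg`, `ss_res`, `ss_tot`) are assumptions made for illustration. It relies on the fit being an ordinary least squares line with an intercept, which is what makes the two values coincide.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=50)
y = 3.0 + 2.0 * x + rng.normal(0.0, 2.0, size=x.size)  # illustrative data

# Ordinary least squares straight line (slope and intercept)
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept   # predictions on the regression line
y_bar = y.mean()                # mean of the actual values

ss_reg = np.sum((y_hat - y_bar) ** 2)  # variation explained by the line
ss_res = np.sum((y - y_hat) ** 2)      # variation left in the residuals
ss_tot = np.sum((y - y_bar) ** 2)      # total variation about the mean

r2_ratio = ss_reg / ss_tot             # explained / total
r2_residual = 1.0 - ss_res / ss_tot    # 1 - unexplained / total

print(f"explained / total       : {r2_ratio:.6f}")
print(f"1 - unexplained / total : {r2_residual:.6f}")
```

If the model is fitted without an intercept, or is evaluated on data it was not fitted to, the decomposition no longer holds and the two numbers can differ; the second form is the one usually reported in that case.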
Let’s understand the concepts and formulas in detail. Once we have built a multiple linear regression model, the next step is to determine how well it performs. Its predictive performance can be evaluated in terms of R-squared or the coefficient of determination, although the more suitable measure is adjusted R-squared. Adjusted R-squared and how it differs from R-squared will be dealt with in another blog post. Let’s look at the following diagram to understand R-squared.
Note some of the following in the above diagram in relation to learning the concepts of R-squared / R2.
The following are important concepts to understand about the value of R-squared and how it is used to determine the best-fit line or regression model performance.
When you fit a linear regression model in R, the following is printed as the summary of the model. Note that the value of R-squared is 0.6929. We can look for more predictor variables that appropriately increase both R-squared and adjusted R-squared. The summary below is for a regression model built to predict the housing price from predictor variables such as crim, chas, rad and lstat. You can load the BostonHousing data from the mlbench package in R.
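In this section, we will look at a Python code example that demonstrates how to use R-squared in the context of linear regression. In this example, we’ll generate synthetic data, fit a linear regression model, predict and calculate R-squared, and plot the results.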
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import matplotlib.pyplot as plt

# Generate synthetic data
np.random.seed(0)
x = np.random.rand(100, 1) * 10  # Independent variable
y = 2 * x + np.random.randn(100, 1) * 2  # Dependent variable with some noise

# Fit regression model
model = LinearRegression()
model.fit(x, y)

# Predict and calculate R-squared
y_pred = model.predict(x)
r2 = r2_score(y, y_pred)

# Plot the results
plt.scatter(x, y, color='blue', label='Data Points')
plt.plot(x, y_pred, color='red', label='Regression Line')
plt.title(f'R-squared: {r2:.2f}')
plt.xlabel('Independent Variable')
plt.ylabel('Dependent Variable')
plt.legend()
plt.show()
Here is the plot of the regression model, with its R-squared value shown in the title.
The executed Python code produces a scatter plot that visualizes the linear regression model’s fit to the synthetic data. Here’s a breakdown of the result:
In this post, you learned about R-squared and how it is used to determine how well a multiple linear regression model fits the data. For a least squares model with an intercept, evaluated on the data it was fitted to, the value of R-squared lies between 0 and 1. The closer the value is to 1, the better the model fits the data. R-squared never decreases when features are added, even if those features carry no real information, so it should not be used on its own to decide whether to add a feature. For that decision, consider adjusted R-squared instead. Adjusted R-squared will be dealt with in the next blog post.
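The effect of adding features on R-squared can be seen with a small experiment. This is a rough sketch under stated assumptions: the data is synthetic, the helper names `fit_r2` and `adjusted_r2` are hypothetical, and the adjusted R-squared formula used here is the standard textbook one, 1 - (1 - R²)(n - 1)/(n - p - 1) for n observations and p predictors, which this post itself leaves to a later article. The second model adds a feature of pure random noise that has nothing to do with the target.

```python
import numpy as np

def fit_r2(X, y):
    """Fit least squares with an intercept and return R-squared on the same data."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    return 1.0 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)

def adjusted_r2(r2, n, p):
    """Standard adjusted R-squared for n observations and p predictors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

rng = np.random.default_rng(1)
n = 100
x1 = rng.uniform(0, 10, size=n)
y = 2.0 * x1 + rng.normal(0.0, 2.0, size=n)
noise_feature = rng.normal(size=n)  # unrelated to y by construction

r2_base = fit_r2(x1.reshape(-1, 1), y)
r2_more = fit_r2(np.column_stack([x1, noise_feature]), y)

print(f"1 feature : R2 = {r2_base:.4f}, adjusted R2 = {adjusted_r2(r2_base, n, 1):.4f}")
print(f"2 features: R2 = {r2_more:.4f}, adjusted R2 = {adjusted_r2(r2_more, n, 2):.4f}")
```

R-squared for the larger model cannot be lower than for the smaller one, because the smaller model is a special case of the larger one with the extra coefficient set to zero. Adjusted R-squared applies a penalty for each added predictor, so it can go down when the new feature contributes little.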