R-Squared Explained for Indian Grandma


In this post, you will learn about the concept of R-Squared and how it is used to assess the performance of a multilinear regression machine learning model, with the help of a real-world example explained in a simple manner.

Background

Once we have built a multilinear regression model, the next step is to evaluate the model performance. The model performance can be assessed by calculating the value of the Residual Standard Error (RSE) or the value of R-Squared. The Residual Standard Error is an estimate of the standard deviation of the residuals, that is, the average amount by which the model's predictions deviate from the observed values. In this article, we will learn the technique of evaluating model performance using the value of R-Squared.

Deep-dive on understanding R-Squared

Let’s take an example of a real-world scenario where we went out shopping for sarees, for the upcoming festival, with our Grandma. We went to a local shop and asked the shopkeeper to show us some sarees. He showed us Banarasi Sarees. We had doubts about whether these were authentic Banarasi Sarees and what their probable price should be, irrespective of the price quoted by the shopkeeper, as we thought he was quoting a very high price.

Figure: Banarasi Saree Price Prediction vis-a-vis R-Squared

Notice that we are talking about two problems related to making predictions using machine learning / statistical learning methods. Pay attention to the fact that both are examples of supervised machine learning. They are as follows:

  • Whether the saree is an authentic and genuine Banarasi Saree (Binary Classification – Predict Yes or No for Banarasi Saree)
  • What should be the probable price of the Saree? (Regression – Predicting price)

Grandma will use her past experience of several years to answer the above questions by using different criteria (features). In order for a machine to make predictions, we will need to transform the problem into mathematical functions. The challenge is to approximate mathematical functions which would help us classify whether or not the saree is a Banarasi Saree, and predict the price of the saree. In this post, we will deal with the problem of predicting the price of the saree.

Here is how Grandma predicts the price of the Saree. The criteria below could also be seen as the features, in machine learning terminology (a toy encoding of these criteria as a feature matrix is sketched after the list).

  • She might consider the number of years back she bought the Banarasi sarees and the related prices. 
  • She might consider the place (Geography) from where she bought the sarees and proximity of the place to Banaras from where these Sarees are manufactured.
  • She might consider the season in which she bought the Sarees (or came to know the pricing) and whether it was a festival season vis-a-vis pricing and related discounts.
  • She might consider the shops from which she bought the Sarees.
  • She might consider the Saree trends at different points in time when she bought the Sarees or came to know their pricing.
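
To connect Grandma's criteria with the regression setting, here is a minimal sketch of how they might be encoded as a feature matrix. The column names, values, and prices below are hypothetical, chosen only to illustrate the shape of the data, not Grandma's actual experience.

```python
import pandas as pd

# Hypothetical training data: each row is a saree Grandma remembers buying
data = pd.DataFrame({
    "years_since_purchase": [2, 5, 1, 8, 3],
    "distance_from_banaras_km": [100, 400, 50, 800, 250],
    "festival_season": [1, 0, 1, 0, 1],           # 1 = bought during a festival
    "shop_rating": [4, 3, 5, 2, 4],               # proxy for the shop she bought from
    "trend_index": [7, 5, 9, 3, 6],               # proxy for saree trends at that time
    "price_inr": [8500, 6000, 9500, 4000, 7000],  # the response (target) variable
})

X = data.drop(columns="price_inr")  # features (Grandma's criteria)
y = data["price_inr"]               # response (the price to be predicted)
```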

After all the above considerations, she predicted a couple of prices for the Sarees. We wanted to buy the Saree at the price predicted by our Grandma. Given this scenario, the shopkeeper might choose to reject the predicted prices based on the following:

  • There is a significantly large error between the Banarasi Sarees’ market price (the population mean value) and the probable prices quoted (predicted) by Grandma. This error could be thought of in terms of the Residual Standard Error (RSE). Mathematically, RSE is defined as follows:

    \(\sqrt{\frac{RSS}{N - P - 1}}\)

    In the above equation, RSS is the residual sum of squares, N is the number of observations, and P is the number of predictors (coefficients excluding the intercept). A sketch of this computation is shown after the list below.

  • Alternatively, the shopkeeper could explain to Grandma that her probable prices (predictions) were off because she did not consider some important criteria, such as recent advancements in how the Saree is manufactured, the overall cost of procuring raw materials, transportation, and so on. Had she considered these criteria, she would likely have quoted a somewhat higher probable price. In other words, her quote would have taken into account all of the important criteria (features) and thus would have landed fairly close to the shopkeeper's price point (the market price, i.e., the population mean value). A decision-making process that considers all of the key attributes results in a probable price with a smaller error, or higher accuracy, relative to the market price. In other words, the prediction would have explained a greater portion of the variability of the market price. The higher the portion of the variability of the market price explained by the prediction, the better the decision process (the multilinear regression model). This is exactly what the R-Squared value represents.
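
As a rough illustration of the RSE formula above, here is a minimal sketch using NumPy and scikit-learn. The saree features, prices, and the use of LinearRegression are assumptions made purely for demonstration, not part of the original example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical features (e.g., years since purchase, distance from Banaras in km)
X = np.array([[1, 50], [3, 200], [5, 400], [2, 100], [8, 800], [4, 300]])
y = np.array([9000, 7500, 6000, 8500, 4000, 7000])  # hypothetical prices in INR

model = LinearRegression().fit(X, y)
y_hat = model.predict(X)

rss = np.sum((y - y_hat) ** 2)      # residual sum of squares
n, p = X.shape                      # N observations, P predictors
rse = np.sqrt(rss / (n - p - 1))    # residual standard error

print(f"RSE: {rse:.2f} (in the same units as the price)")
```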

Mathematically speaking, the value of R-Squared represents the proportion of variability explained by the regression model (recall Grandma’s decision-making process, where she needed to consider additional criteria as suggested by the shopkeeper). R-Squared is the ratio of the variability explained by the regression model to the total sum of squares. R-Squared is also termed the coefficient of determination. The value of R-Squared lies in the range of 0 to 1. An \(R^2\) value close to 1 indicates that the model explains a large portion of the variance in the response variable. The greater the value of R-Squared, the better the regression model. One may note that adding predictors to the regression model would result in an increase in the value of R-Squared. Thus, the more criteria (features) Grandma considers in predicting the probable price, the better the chance that she will predict a price acceptable to the shopkeeper.

For simple linear regression, R-Squared can also be defined as the square of the correlation between the response and the predictor variable. For a multilinear regression model, R-Squared can be defined as the square of the correlation between the response and the fitted values, \(Cor(Y, \hat{Y})^2\).

The variability explained by the regression model can also be represented in terms of the difference between the total sum of squares (TSS) and the residual sum of squares (RSS). 

R-Squared = \(\frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS}\)

In the above equation, TSS is the Total sum of squares, RSS is the Residual sum of squares. 

TSS = \(\sum{(Y - \bar{Y})^2}\)

RSS = \(\sum{(Y - \hat{Y})^2}\)

In the above equations (a short computational sketch follows the list below),

  • Y is the observed value (the price quoted by the shopkeeper).
  • \(\bar{Y}\) is the mean value of Y (which could be thought of as the market price, i.e., the population mean value).
  • \(\hat{Y}\) is the predicted value (the price predicted by Grandma).
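
Putting the above definitions together, here is a minimal sketch (again with hypothetical saree data and scikit-learn, used only for illustration) that computes TSS and RSS directly and checks that \(1 - \frac{RSS}{TSS}\) matches both the R-Squared reported by the fitted model and \(Cor(Y, \hat{Y})^2\).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical features and observed prices (Y), for illustration only
X = np.array([[1, 50], [3, 200], [5, 400], [2, 100], [8, 800], [4, 300]])
y = np.array([9000, 7500, 6000, 8500, 4000, 7000])

model = LinearRegression().fit(X, y)
y_hat = model.predict(X)                     # fitted values (Y_hat)

tss = np.sum((y - y.mean()) ** 2)            # total sum of squares
rss = np.sum((y - y_hat) ** 2)               # residual sum of squares
r2_manual = 1 - rss / tss                    # R-Squared from the definition

r2_model = model.score(X, y)                 # R-Squared reported by the model
corr_sq = np.corrcoef(y, y_hat)[0, 1] ** 2   # Cor(Y, Y_hat)^2

print(r2_manual, r2_model, corr_sq)          # all three values agree
```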

One may note that adding predictors to the regression model will never decrease the value of R-Squared, even if those variables are only weakly associated with the response. Thus, there is a related concept known as Adjusted R-Squared, which increases only if the added predictor improves the model by more than would be expected by chance. Otherwise, the value of Adjusted R-Squared decreases with the addition of predictor variables.
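
For reference, Adjusted R-Squared is commonly computed from R-Squared, the number of observations N, and the number of predictors P as follows:

Adjusted R-Squared = \(1 - \frac{(1 - R^2)(N - 1)}{N - P - 1}\)

The \(N - P - 1\) term in the denominator penalizes the model for every additional predictor, so a weak predictor can lower the adjusted value even though the plain R-Squared creeps up.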

Summary

In this post, you learned about the concept of R-Squared and how it is used to determine how well a multilinear regression model fits the data. The value of R-Squared lies in the range of 0 to 1. The closer the value of R-Squared is to 1, the better the regression model. The value of R-Squared increases with the addition of features. However, one should consider the value of Adjusted R-Squared when deciding whether or not to add features.
