In this post, you will learn about the concepts of mean squared error (MSE) and R-squared, the difference between them, and which one to use when working with regression models such as the linear regression model. Python examples are included to help you understand the concepts better. The following topics are covered:
- Introduction to Mean Squared Error (MSE) and R-Squared
- Difference between MSE and R-Squared
- MSE or R-Squared – Which one to use?
- MSE and R-Squared Python code example
Introduction to Mean Squared Error (MSE) and R-Squared
In this section, you will learn about the concepts of mean squared error and R-squared. Both are used for evaluating the performance of regression models such as the linear regression model.
What is Mean Squared Error (MSE)?
Mean squared error (MSE) is the average of the squared differences between the actual values and the predicted (estimated) values. It is also termed the mean squared deviation (MSD). Mathematically, it is represented as:

MSE = (1/n) * Σ (Yᵢ − Y'ᵢ)²

where n is the number of observations, Yᵢ represents the actual value, and Y'ᵢ is the predicted value. The value of MSE is always greater than or equal to zero. A value close to zero indicates a better estimator / predictor (regression model), and an MSE of exactly zero means the predictor is perfect. Taking the square root of the MSE gives the root mean squared error (RMSE).
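As a minimal sketch of the formula above (the numbers are illustrative, not from this post), MSE and RMSE can be computed by hand with NumPy:

```python
import numpy as np

# Actual and predicted values (made-up illustrative numbers)
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# MSE: average of the squared differences between actual and predicted values
mse = np.mean((y_true - y_pred) ** 2)

# RMSE: square root of the MSE
rmse = np.sqrt(mse)

print(mse)   # 0.375
print(rmse)
```

The same value is returned by `mean_squared_error` from `sklearn.metrics` when given the same two arrays.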
What is R-Squared?
R-Squared is the ratio of the Sum of Squares Regression (SSR) to the Sum of Squares Total (SST). The Sum of Squares Regression is the amount of variance explained by the regression line. The R-Squared value is used to measure the goodness of fit: the greater the value of R-Squared, the better the regression model. However, a caution is needed here, which is where the concept of adjusted R-Squared comes into the picture; it will be discussed in a later post. R-Squared is also termed the coefficient of determination. For the training dataset, R-Squared is bounded between 0 and 1, but it can become negative for the test dataset if the SSE is greater than the SST. If the value of R-Squared is 1, the model fits the data perfectly, with a corresponding MSE of 0.
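As a quick illustration (with made-up numbers, not from this post) of how R-Squared goes negative on held-out data when the SSE exceeds the SST:

```python
import numpy as np
from sklearn.metrics import r2_score

# A small held-out set and a deliberately poor predictor:
# SSE (29.0) is much larger than SST (2.0), so R-Squared is negative
y_test = np.array([1.0, 2.0, 3.0])
bad_pred = np.array([5.0, 5.0, 5.0])

r2 = r2_score(y_test, bad_pred)
print(r2)   # -13.5
```

A constant prediction equal to the mean of `y_test` would give R-Squared of exactly 0; anything that does worse than the mean goes below zero.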
Visually, the greater the value of SSR, the more of the total variance (SST) is covered by the regression / best-fit line. R-Squared can also be represented using the following formula:
R-Squared = 1 – (SSE/SST)
Note that the smaller the value of SSE, the smaller the ratio SSE/SST, and hence the greater the value of R-Squared.
R-Squared can also be expressed as a function of the mean squared error (MSE):

R-Squared = 1 − MSE / Var(Y)

where Var(Y) = SST / n is the variance of the response variable. This holds because MSE = SSE / n, so the factor of n cancels in the ratio SSE/SST.
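A short sketch (again with illustrative numbers) showing that the two formulas, 1 − SSE/SST and 1 − MSE/Var(Y), give the same R-Squared value:

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

sse = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
sst = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
r2_from_sums = 1 - sse / sst

mse = sse / len(y_true)
var_y = sst / len(y_true)                      # variance of the response
r2_from_mse = 1 - mse / var_y                  # the n cancels, same value

print(r2_from_sums, r2_from_mse)
```

Both expressions agree with the value returned by `r2_score` from `sklearn.metrics` on the same arrays.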
Difference between Mean Squared Error & R-Squared
Mean squared error and R-Squared are similar in that both are metrics used for evaluating the performance of regression models, especially statistical models such as the linear regression model. The difference is that the magnitude of MSE depends on the scale of the data. For example, if the response variable is housing price expressed in multiples of 10K, the MSE will be different (lower) than when the housing price is left unscaled (actual values). This is where R-Squared comes to the rescue: it can be seen as a standardized version of MSE. R-Squared represents the fraction of the variance of the response variable captured by the regression model, whereas MSE captures the residual error.
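The scale dependence can be demonstrated with a small synthetic example (the data and variable names are illustrative, not from the post): rescaling the response shrinks the MSE by the square of the scale factor, while R-Squared is unchanged.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic near-linear data for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X, y)
pred = model.predict(X)

# Rescale the response, e.g. price in units of 10K instead of raw units
y_scaled = y / 10_000
pred_scaled = pred / 10_000

mse_raw = mean_squared_error(y, pred)
mse_scaled = mean_squared_error(y_scaled, pred_scaled)
print(mse_raw, mse_scaled)   # scaled MSE is smaller by a factor of 10_000 ** 2

r2_raw = r2_score(y, pred)
r2_scaled = r2_score(y_scaled, pred_scaled)
print(r2_raw, r2_scaled)     # identical: R-Squared is scale-invariant
```

This is why comparing MSE values across datasets with different units is misleading, while R-Squared values remain comparable.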
MSE or R-Squared – Which one to Use?
It is recommended to use R-Squared, or rather adjusted R-Squared, for evaluating the performance of regression models. This is primarily because R-Squared captures the fraction of response variance explained by the regression and tends to give a better picture of the quality of the model. MSE values, by contrast, differ based on whether the values of the response variable are scaled or not. Root mean squared error (RMSE) is often easier to interpret than MSE because it is expressed in the same units as the response variable, although, like MSE, it still depends on the scale of the response.
Alternatively, one can use MSE or R-Squared based on what is appropriate for the problem at hand.
MSE or R-Squared Python Code Example
Here is the Python code showing how to calculate the mean squared error and R-Squared values while working with regression models. Pay attention to some of the following in the code given below:
- The mean_squared_error and r2_score methods from sklearn.metrics are used for measuring the MSE and R-Squared values. The inputs to these methods are the actual values and the predicted values.
- The Sklearn Boston housing dataset is used for training a multiple linear regression model using LinearRegression from sklearn.linear_model.
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error, r2_score
from sklearn import datasets

#
# Load the Sklearn Boston Dataset
# (note: load_boston was removed in scikit-learn 1.2; on newer
# versions, substitute e.g. datasets.fetch_california_housing())
#
boston_ds = datasets.load_boston()
X = boston_ds.data
y = boston_ds.target

#
# Create a training and test split
#
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

#
# Fit a pipeline using the training dataset and related labels
#
pipeline = make_pipeline(StandardScaler(), LinearRegression())
pipeline.fit(X_train, y_train)

#
# Calculate the predicted values for the training and test datasets
#
y_train_pred = pipeline.predict(X_train)
y_test_pred = pipeline.predict(X_test)

#
# Mean Squared Error
#
print('MSE train: %.3f, test: %.3f' % (
    mean_squared_error(y_train, y_train_pred),
    mean_squared_error(y_test, y_test_pred)))

#
# R-Squared
#
print('R^2 train: %.3f, test: %.3f' % (
    r2_score(y_train, y_train_pred),
    r2_score(y_test, y_test_pred)))
```
Here is the summary of what you learned in this post regarding mean squared error (MSE), R-Squared, and which one to use:
- MSE represents the residual error, which is the average of the squared differences between the actual values and the predicted / estimated values.
- R-Squared represents the fraction of response variance captured by the regression model.
- The disadvantage of using MSE is that its value varies based on whether the values of the response variable are scaled or not. If the response is scaled down, the MSE will be lower than with the unscaled values.