In this post, you will learn the concepts of linear regression along with Python Sklearn examples for training linear regression models. Linear regression belongs to the class of parametric models and is used to train supervised models.
The following topics are covered in this post:
- Introduction to linear regression
- Linear regression concepts / terminologies
- Linear regression Python code example
Introduction to Linear Regression
Linear regression is a machine learning algorithm used to predict the value of a continuous response variable. The predictive analytics problems solved using linear regression models are called supervised learning problems because the values of the response / target variable must be present and are used for training the models. Also, recall that “continuous” means the response variable is numerical in nature and can take infinitely many different values. Linear regression models belong to the class of parametric models: a trained model is fully described by a fixed set of parameters, namely the coefficients and the intercept.
Linear regression models work well for data that are linear in nature. In other words, the predictor / independent variables in the data set have a linear relationship with the target / response / dependent variable. The following represents the linear relationship between the response and the predictor variable in a simple linear regression model.
The red line in the above diagram is termed the best-fit line and is found by training a model of the form Y = mX + c, where m is the slope and c is the intercept.
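To make the best-fit line concrete, here is a minimal sketch that computes m and c for Y = mX + c using the standard least-squares formulas. The five data points are made up for illustration and are not taken from the diagram; only the Python standard library is used.

```python
from statistics import mean

# Hypothetical toy data lying close to the line y = 2x + 1 (made up for illustration).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.0]

x_bar, y_bar = mean(x), mean(y)

# Least-squares slope: sum of cross-deviations divided by sum of squared deviations of x.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
m = s_xy / s_xx

# The least-squares line passes through the point (x_bar, y_bar), which fixes the intercept.
c = y_bar - m * x_bar

print(f"m = {m:.3f}, c = {c:.3f}")
```

The slope and intercept recovered this way should be close to the 2 and 1 used to invent the data, with the gap coming from the small deviations added to each point.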
Linear regression models are of two kinds: simple linear regression and multiple linear regression.
- Simple linear regression: When there is just one independent or predictor variable, as in Y = mX + c, the linear regression is termed simple linear regression.
- Multiple linear regression: When there is more than one independent or predictor variable, as in \(Y = w_1x_1 + w_2x_2 + … + w_nx_n\), the linear regression is called multiple linear regression.
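As a rough illustration of multiple linear regression, the sketch below fits Sklearn's LinearRegression on synthetic data with two predictors. The data, the weights (3 and -2) and the intercept (5) are assumptions chosen for the example. Note that LinearRegression also fits an intercept by default, a term the formula above leaves out.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noise-free data generated from y = 3*x1 - 2*x2 + 5.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(50, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0

# One weight is learned per predictor column, plus an intercept (fit_intercept=True by default).
model = LinearRegression().fit(X, y)

print("weights:", model.coef_)
print("intercept:", model.intercept_)
```

Because the data contain no noise, the learned weights and intercept should match the values used to generate them up to floating-point error. With a single column in X, the same code performs simple linear regression.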
Linear Regression Concepts / Terminologies
In this section, you will learn about some of the key concepts related to training linear regression models.
- Residual Error: Residual error is the difference between the actual value and the predicted value. In terms of the best-fit line, if the actual value is above the line, the residual error is positive, and if the actual value is below the line, the residual error is negative. The figure below represents the same.
- SST, SSE, SSR: These are key concepts when dealing with a linear regression model. The following diagram is a representation of SST, SSE and SSR.
- Sum of Squares Total (SST): Sum of Squares Total is the sum of squared differences between the actual values of the response variable and the mean of the actual values. It measures the total variation in the response and is closely related to the variance: dividing SST by the number of observations minus one gives the sample variance of the response. It is also termed Total Sum of Squares (TSS).
- Sum of Squares Error (SSE): Sum of Squares Error, or Sum of Squares Residual Error, is the sum of squared differences between the actual values and the predicted values of the response variable, that is, the sum of squared residual errors. It is also termed Residual Sum of Squares.
- Sum of Squares Regression (SSR): Sum of Squares Regression is the sum of squared differences between the predicted values and the mean of the actual values. It is also termed Explained Sum of Squares (ESS).
- How are SST, SSR and SSE related?
Here is how SST, SSR and SSE are related for a least-squares fit that includes an intercept. The same can be understood using the diagram in fig 3.
SST = SSR + SSE
- R-Squared: R-squared is a measure of how well the regression or best-fit line fits the data. It is also termed the coefficient of determination. Mathematically, it is the ratio of Sum of Squares Regression (SSR) to Sum of Squares Total (SST).
R-Squared = SSR / SST = 1 – (SSE / SST)
The greater the value of R-squared, the better the regression line, because a larger share of the variance in the response is explained by the regression line. However, caution is needed when interpreting it, which will be discussed in later posts. In other words, R-squared is a statistical measure of goodness of fit for a linear regression model: the closer it is to 1, the closer the predictions are to the actual values.
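The quantities in this section can be computed by hand in a few lines. The following is a minimal sketch on made-up data (not the data behind the figures): it fits a least-squares line with an intercept, then computes the residual errors, SST, SSE, SSR and R-squared exactly as defined above.

```python
from statistics import mean

# Hypothetical toy data (made up for illustration).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.0]

# Fit the least-squares line y = m*x + c.
x_bar, y_bar = mean(x), mean(y)
m = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum((xi - x_bar) ** 2 for xi in x)
c = y_bar - m * x_bar
y_pred = [m * xi + c for xi in x]

# Residual error: actual minus predicted (positive when the point lies above the line).
residuals = [yi - pi for yi, pi in zip(y, y_pred)]

sst = sum((yi - y_bar) ** 2 for yi in y)       # total variation around the mean
sse = sum(r ** 2 for r in residuals)           # variation left unexplained by the line
ssr = sum((pi - y_bar) ** 2 for pi in y_pred)  # variation explained by the line

r_squared = 1 - sse / sst

print(f"SST = {sst:.4f}, SSR + SSE = {ssr + sse:.4f}")
print(f"R-squared = {r_squared:.4f}, SSR / SST = {ssr / sst:.4f}")
```

Both printed pairs should agree up to floating-point error, which is the relationship SST = SSR + SSE in action. The identity relies on the line being a least-squares fit with an intercept; for an arbitrary line the two forms of R-squared can differ.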
Linear Regression Python Code Example
Here is the Python code for linear regression, where a regression model is trained on a housing dataset to predict housing prices. Pay attention to the following in the code given below:
- LinearRegression from sklearn.linear_model is used to create an instance of the linear regression algorithm.
- The Boston dataset from sklearn.datasets is used as the housing dataset. Note that load_boston was removed in scikit-learn 1.2, so the code requires an earlier version.
- make_pipeline from sklearn.pipeline is used to create a pipeline whose steps standardize the dataset (StandardScaler) and fit the model using the linear regression algorithm (LinearRegression).
- The model performance evaluation metrics used are Mean Squared Error (MSE) and R-Squared.
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load the Sklearn Boston dataset
# (load_boston was removed in scikit-learn 1.2; this call needs an earlier version)
boston_ds = datasets.load_boston()
X = boston_ds.data
y = boston_ds.target

# Create a training and test split (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit a pipeline (standardization followed by linear regression) on the training data
pipeline = make_pipeline(StandardScaler(), LinearRegression())
pipeline.fit(X_train, y_train)

# Calculate the predicted values for the training and test datasets
y_train_pred = pipeline.predict(X_train)
y_test_pred = pipeline.predict(X_test)

# Mean Squared Error
print('MSE train: %.3f, test: %.3f' % (mean_squared_error(y_train, y_train_pred),
                                       mean_squared_error(y_test, y_test_pred)))

# R-Squared
print('R^2 train: %.3f, test: %.3f' % (r2_score(y_train, y_train_pred),
                                       r2_score(y_test, y_test_pred)))
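Because load_boston is no longer available in scikit-learn 1.2 and later, the code above fails on recent versions. The following sketch runs the same pipeline on the diabetes dataset bundled with Sklearn (load_diabetes). This is a swapped-in dataset chosen only because it ships with the library and needs no download; it is not a housing dataset, so the numbers it prints are not comparable to the Boston results.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Substitute dataset (assumption): 442 samples, 10 numeric predictors, continuous target.
X, y = load_diabetes(return_X_y=True)

# Same split settings as in the post: 30% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize the predictors, then fit ordinary least squares.
pipeline = make_pipeline(StandardScaler(), LinearRegression())
pipeline.fit(X_train, y_train)

y_train_pred = pipeline.predict(X_train)
y_test_pred = pipeline.predict(X_test)

mse_train = mean_squared_error(y_train, y_train_pred)
mse_test = mean_squared_error(y_test, y_test_pred)
r2_train = r2_score(y_train, y_train_pred)
r2_test = r2_score(y_test, y_test_pred)

print(f"MSE train: {mse_train:.3f}, test: {mse_test:.3f}")
print(f"R^2 train: {r2_train:.3f}, test: {r2_test:.3f}")

# make_pipeline names each step after its lowercased class name.
coefficients = pipeline.named_steps["linearregression"].coef_
print("Number of learned weights:", coefficients.shape[0])
```

Comparing the train and test values of each metric gives a first indication of overfitting: a test R-squared far below the train R-squared suggests the model does not generalize well.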
In this post, you learned some of the following concepts in relation to linear regression:
- Linear regression is a supervised machine learning algorithm used to predict the value of a continuous response variable.
- When there is just one predictor or independent variable, it is called simple linear regression.
- When there are two or more predictor or independent variables, it is called multiple linear regression.
- R-Squared is a metric that can be used to evaluate linear regression model performance. It represents the proportion of the variability of the response variable that is explained by the regression model. The higher the R-squared value, the more of the variability is explained by the model. However, the value needs to be interpreted with caution.
- R-Squared can be expressed as a function of SSE (Sum of Squares Residual Error) and SST (Sum of Squares Total).