In this post, you will learn about the concepts of linear regression, along with Python Sklearn examples for training linear regression models. Linear regression belongs to the class of parametric models and is used for supervised learning.
The following topics are covered in this post:
- Introduction to linear regression
- Linear regression concepts / terminologies
- Linear regression Python code example
Introduction to Linear Regression
Linear regression is a machine learning algorithm used to predict the value of a continuous response variable. Predictive analytics problems solved with linear regression are supervised learning problems, because training requires the value of the response (target) variable to be present for every training record. Here, “continuous” means that the response variable is numerical and can take infinitely many values. Linear regression models belong to the class of parametric models: the model is fully described by a fixed set of parameters (the coefficients and the intercept) learned from the data.
Linear regression models work well for data that are linear in nature. In other words, the predictor (independent) variables in the data set have a linear relationship with the target (response or dependent) variable. The following represents the linear relationship between the response and the predictor variable in a simple linear regression model.
The red line in the above diagram is termed the best-fit line. It is found by training a model of the form Y = mX + c, where m is the slope and c is the intercept.
Linear regression models are of two kinds: simple linear regression and multiple linear regression.
- Simple linear regression: When there is just one independent (predictor) variable, as in Y = mX + c, the model is termed simple linear regression.
- Multiple linear regression: When there is more than one independent (predictor) variable, as in \(Y = w_1x_1 + w_2x_2 + \dots + w_nx_n + b\) where \(b\) is the intercept (bias), the model is called multiple linear regression.
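For the simple case, ordinary least squares gives the slope and intercept of the best-fit line in closed form: m is the co-variation of X and Y divided by the variation of X, and c is chosen so the line passes through the point of means. The following is a minimal sketch in plain Python; the function name `fit_simple_linear_regression` and the toy data are illustrative assumptions, not part of the original post.

```python
from statistics import fmean


def fit_simple_linear_regression(xs, ys):
    """Return (m, c) minimising the sum of squared residuals for Y = mX + c."""
    x_mean, y_mean = fmean(xs), fmean(ys)
    # Slope: co-variation of X and Y divided by the variation of X.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    m = sxy / sxx
    # Intercept: the best-fit line always passes through (x_mean, y_mean).
    c = y_mean - m * x_mean
    return m, c


# Hypothetical toy data lying exactly on Y = 2X + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]
m, c = fit_simple_linear_regression(xs, ys)
print(f"slope m = {m:.2f}, intercept c = {c:.2f}")
```

Because the toy points sit exactly on a line, the fit recovers a slope of 2 and an intercept of 1. With noisy data the same formulas return the line that minimises the squared vertical distances to the points.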
Linear Regression Concepts / Terminologies
In this section, you will learn about some of the key concepts related to training linear regression models.
- Residual Error: The residual error is the difference between the actual value and the predicted value (actual minus predicted). In terms of the best-fit line, if the actual value lies above the line, the residual error is positive; if it lies below the line, the residual error is negative. The figure below represents the same.
- SST, SSE, SSR: These three sums of squares are key concepts when dealing with linear regression models. The following diagram represents SST, SSE, and SSR.
- Sum of Squares Total (SST): The sum of the squared differences between the actual values of the response variable and the mean of the actual values. It measures the total variation in the response; dividing SST by n - 1, where n is the number of observations, gives the sample variance of the response. It is also termed the Total Sum of Squares (TSS).
- Sum of Squares Error (SSE): Also called the Sum of Squares Residual Error, this is the sum of the squared differences between the actual values and the predicted values. It is also termed the Residual Sum of Squares (RSS).
- Sum of Squares Regression (SSR): The sum of the squared differences between the predicted values and the mean of the actual values. It is also termed the Explained Sum of Squares (ESS).
- How are SST, SSR, and SSE related?
Here is how SST, SSR, and SSE are related for a least-squares fit that includes an intercept term. The same can be seen in the diagram in fig 3.
SST = SSR + SSE
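The definitions above can be checked numerically. The sketch below fits a least-squares line to a small, made-up data set (the numbers are assumptions chosen for easy arithmetic), computes the residuals, and then confirms that the three sums of squares satisfy SST = SSR + SSE.

```python
from statistics import fmean

# Hypothetical data: five observations scattered around a line.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

# Ordinary least squares fit of Y = mX + c (closed form).
x_mean, y_mean = fmean(xs), fmean(ys)
m = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum(
    (x - x_mean) ** 2 for x in xs
)
c = y_mean - m * x_mean
preds = [m * x + c for x in xs]

# Residual = actual - predicted; positive when the point is above the line.
residuals = [y - p for y, p in zip(ys, preds)]

sst = sum((y - y_mean) ** 2 for y in ys)     # total variation
sse = sum(r ** 2 for r in residuals)         # unexplained variation
ssr = sum((p - y_mean) ** 2 for p in preds)  # explained variation

print("residuals:", [round(r, 2) for r in residuals])
print(f"SST = {sst:.2f}, SSE = {sse:.2f}, SSR = {ssr:.2f}")
print("SST == SSR + SSE:", abs(sst - (ssr + sse)) < 1e-9)
```

For this toy data SST is 6, SSE is about 2.4, and SSR is about 3.6. The residuals also sum to (approximately) zero, which is a property of a least-squares line with an intercept.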
R-Squared: R-squared is a measure of how well the regression (best-fit) line fits the data. It is also termed the coefficient of determination. Mathematically, it is the ratio of the Sum of Squares Regression (SSR) to the Sum of Squares Total (SST).
R-Squared = SSR / SST = (SST - SSE) / SST = 1 - (SSE / SST)
The greater the value of R-squared, the better the regression line, because a larger share of the variation in the response is explained by the model. However, caution is needed: for example, R-squared on the training data never decreases when more predictors are added, even if they are irrelevant. This will be discussed in later posts. In other words, R-squared is a statistical measure of goodness of fit for a linear regression model; it indicates how close the predictions are to the actual values, relative to always predicting the mean.
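To make the formula concrete, here is a small sketch that computes R-squared as 1 - SSE / SST for three sets of predictions. The helper name `r_squared` and all numbers are illustrative assumptions. Note that the SSR / SST form agrees with 1 - SSE / SST only for a least-squares fit with an intercept evaluated on its own training data; the 1 - SSE / SST form used here applies to any predictions.

```python
from statistics import fmean


def r_squared(actual, predicted):
    """Coefficient of determination computed as 1 - SSE / SST."""
    mean_actual = fmean(actual)
    sst = sum((y - mean_actual) ** 2 for y in actual)
    sse = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    return 1 - sse / sst


actual = [2.0, 4.0, 5.0, 4.0, 5.0]

# Predictions from the least-squares line Y = 0.6X + 2.2 at X = 1..5.
ols_preds = [2.8, 3.4, 4.0, 4.6, 5.2]
# Baseline that ignores X and always predicts the mean of the actual values.
mean_preds = [fmean(actual)] * len(actual)
# A deliberately poor model, to show R-squared can fall below zero.
bad_preds = [5.0, 5.0, 2.0, 6.0, 2.0]

print(f"OLS line : {r_squared(actual, ols_preds):.3f}")
print(f"mean only: {r_squared(actual, mean_preds):.3f}")
print(f"bad model: {r_squared(actual, bad_preds):.3f}")
```

The least-squares line explains about 60% of the variation, the mean-only baseline scores 0, and the poor model scores below 0. This is one reason a single R-squared value should be read alongside other diagnostics.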
Linear Regression Python Code Example
Here is the Python code for linear regression where a regression model is trained on a housing dataset for predicting housing prices. Pay attention to some of the following in the code given below:
- sklearn.linear_model.LinearRegression is used to create an instance of a linear regression estimator.
- The Boston dataset from sklearn.datasets (load_boston) is used as the housing dataset.
- sklearn.pipeline.make_pipeline is used to create a pipeline whose steps standardize the features (StandardScaler) and then fit the model (LinearRegression).
- The model performance evaluation metrics used are Mean Squared Error (MSE) and R-squared.
```python
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load the Sklearn Boston dataset.
# Note: load_boston was removed in scikit-learn 1.2, so this call needs an
# older release. On newer releases, another regression dataset such as
# datasets.fetch_california_housing() exposes the same .data / .target fields.
boston_ds = datasets.load_boston()
X = boston_ds.data
y = boston_ds.target

# Create a training and test split (70% / 30%).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit a pipeline (standardize, then regress) on the training data and labels.
pipeline = make_pipeline(StandardScaler(), LinearRegression())
pipeline.fit(X_train, y_train)

# Calculate the predicted values for the training and test datasets.
y_train_pred = pipeline.predict(X_train)
y_test_pred = pipeline.predict(X_test)

# Mean Squared Error
print('MSE train: %.3f, test: %.3f' % (mean_squared_error(y_train, y_train_pred),
                                       mean_squared_error(y_test, y_test_pred)))

# R-Squared
print('R^2 train: %.3f, test: %.3f' % (r2_score(y_train, y_train_pred),
                                       r2_score(y_test, y_test_pred)))
```
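To connect the two metrics back to the sums of squares defined earlier, the sketch below runs the same kind of pipeline and recomputes MSE and R-squared by hand. It uses `make_regression` to produce a synthetic stand-in dataset; that dataset, its size, and its noise level are assumptions made so the example does not depend on any particular housing data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a housing dataset (an assumption for this sketch).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

pipeline = make_pipeline(StandardScaler(), LinearRegression())
pipeline.fit(X_train, y_train)
y_test_pred = pipeline.predict(X_test)

# Recompute both metrics by hand from their definitions.
sse = np.sum((y_test - y_test_pred) ** 2)
sst = np.sum((y_test - y_test.mean()) ** 2)
manual_mse = sse / len(y_test)   # MSE is SSE averaged over the records
manual_r2 = 1 - sse / sst        # R-squared as 1 - SSE / SST

sk_mse = mean_squared_error(y_test, y_test_pred)
sk_r2 = r2_score(y_test, y_test_pred)
print(f"MSE sklearn: {sk_mse:.3f}, manual: {manual_mse:.3f}")
print(f"R^2 sklearn: {sk_r2:.3f}, manual: {manual_r2:.3f}")
```

The library values and the hand-computed values should agree, which shows that `r2_score` follows the 1 - SSE / SST form, with SST taken around the mean of the data being scored.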
You can also generate synthetic datasets if you want to work with regression models. The code below uses PyTorch to generate a synthetic dataset.
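```python
import torch


def synthetic_data(w, b, num_records):
    # Features drawn from a standard normal distribution.
    X = torch.normal(0, 1, (num_records, len(w)))
    # Labels: linear combination of the features plus the bias.
    y = torch.matmul(X, w) + b
    # Add normally distributed noise (mean 0, standard deviation 0.01).
    y += torch.normal(0, 0.01, y.shape)
    return X, y.reshape((-1, 1))


w = torch.tensor([2.5, 3])
b = 1.5
num_records = 100
features, labels = synthetic_data(w, b, num_records)
```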
Note some of the following in the code given above:
- There are two features whose parameters (weights) are set to 2.5 and 3. The bias term has a value of 1.5.
- There are 100 records.
- The function synthetic_data is used to generate the feature data and the associated labels.
- A noise (error) term is added, assuming the error is normally distributed with mean 0 and standard deviation 0.01 (see torch.normal(0, 0.01, y.shape) in the code).
- Matrix multiplication (torch.matmul) is used to calculate the labels, to which the bias and the noise term are then added.
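A useful property of synthetic data is that the true parameters are known, so a fitted model can be checked against them. The sketch below is a NumPy re-creation of the same setup (swapped in so the example does not depend on PyTorch; the seed and variable names are assumptions) followed by a least-squares fit using `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Same setup as described above: weights [2.5, 3], bias 1.5, 100 records,
# Gaussian noise with standard deviation 0.01.
w_true = np.array([2.5, 3.0])
b_true = 1.5
num_records = 100

X = rng.normal(0.0, 1.0, size=(num_records, len(w_true)))
y = X @ w_true + b_true + rng.normal(0.0, 0.01, size=num_records)

# Append a column of ones so the intercept is estimated with the weights.
X_design = np.column_stack([X, np.ones(num_records)])
params, *_ = np.linalg.lstsq(X_design, y, rcond=None)
w_hat, b_hat = params[:-1], params[-1]

print("estimated weights:", np.round(w_hat, 2))
print("estimated bias:", round(float(b_hat), 2))
```

Because the noise is small relative to the signal, the estimates should land very close to 2.5, 3, and 1.5. Increasing the noise standard deviation or reducing the number of records makes the estimates drift further from the true values.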
In this post, you learned some of the following concepts in relation to linear regression:
- Linear regression is a supervised machine learning algorithm used to predict the value of a continuous response variable.
- When there is just one predictor or independent variable, it is called simple linear regression.
- When there are two or more predictors or independent variables, it is called multiple linear regression.
- R-squared is a metric that can be used to evaluate linear regression model performance. It is the proportion of the variability of the response variable that is explained by the regression model. The higher the R-squared value, the more of the variability the model explains. However, a high value alone does not guarantee a good model, so it should be interpreted with caution.
- R-Squared can be expressed as a function of SSE (Sum of Squares Residual Error) and SST (Sum of Squares Total).
Read a detailed post on linear regression explained with real-world examples.