In this post, you will learn the concepts of **linear regression** along with **Python Sklearn** examples for training linear regression models. Linear regression belongs to the class of **parametric models** and is used to train **supervised machine learning models**.

## Introduction to Linear Regression

Linear regression is a machine learning algorithm used to predict the value of a continuous response variable. The predictive analytics problems solved with linear regression models are called supervised learning problems because the values of the response/target variable must be present and used for training the models. Also, recall that "continuous" means the response variable is numerical and can take infinitely many values. Linear regression models belong to the class of **parametric models**.

Linear regression models work well for data that are linear in nature. In other words, the predictor/independent variables in the data set have a linear relationship with the target/response/dependent variable. The following represents the linear relationship between the response and the predictor variable in a simple linear regression model.

The red line in the above diagram is termed the **best-fit line** and can be found by training a model of the form **Y = mX + c**, where m is the slope and c is the intercept.
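As a minimal sketch of how a best-fit line of the form Y = mX + c can be found, the snippet below fits a straight line by ordinary least squares using NumPy's `np.polyfit`. The data points and the values m = 2 and c = 5 are made up for illustration and are not taken from any real dataset.

```python
import numpy as np

# Hypothetical, noise-free data generated from Y = 2X + 5
X = np.array([6.0, 8.0, 10.0, 12.0, 14.0])
Y = 2.0 * X + 5.0

# np.polyfit with deg=1 returns [slope, intercept] of the least-squares line
m, c = np.polyfit(X, Y, deg=1)
print(f"m = {m:.3f}, c = {c:.3f}")
```

Because the data lie exactly on a line, the fitted slope and intercept should match the values used to generate them, up to floating-point error.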

Linear regression models are of two different kinds. They are simple linear regression and multiple linear regression.

**Simple linear regression**: When there is just one independent or predictor variable, as in Y = mX + c, the linear regression is termed simple linear regression. Suppose you want to predict the price of a pizza based on its diameter. In this case, the price of the pizza would be the dependent variable, **Y**, and the diameter would be the independent variable, **X**. The equation might look something like:

Pizza Price = 2 × (Diameter) + 5

In this equation, m = 2 represents the price increase per inch of the pizza diameter, and c = 5 represents the base cost of the pizza.

**Multiple linear regression**: When there is more than one independent or predictor variable, such as [latex]Y = w_1x_1 + w_2x_2 + \dots + w_nx_n + b[/latex], where b is the constant (intercept) term, the linear regression is called multiple linear regression. Suppose you're a hospital administrator trying to predict patient waiting times based on multiple factors like the number of incoming patients, the number of available doctors, and the time of day. The equation could look like:

Waiting Time = 0.5 × (Patient Inflow) − 3 × (Number of Doctors) + 0.2 × (Time of Day) + 10

Here, [latex]w_1 = 0.5[/latex], [latex]w_2 = -3[/latex], and [latex]w_3 = 0.2[/latex] are the weights or coefficients for the independent variables "Patient Inflow," "Number of Doctors," and "Time of Day," respectively. The constant term 10 represents the base waiting time irrespective of these variables.
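The waiting-time equation can be checked with a small sketch: generate made-up predictor values, compute the waiting time from the equation, and let Sklearn's `LinearRegression` recover the weights. The variable names, value ranges, and sample size below are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical predictor values (not real hospital data)
n = 200
patient_inflow = rng.uniform(10, 100, n)
num_doctors = rng.integers(1, 10, n).astype(float)
time_of_day = rng.uniform(0, 24, n)
X = np.column_stack([patient_inflow, num_doctors, time_of_day])

# Noise-free waiting times computed from the equation in the text
y = 0.5 * patient_inflow - 3.0 * num_doctors + 0.2 * time_of_day + 10.0

# Fit a multiple linear regression model
model = LinearRegression().fit(X, y)
print("weights:", np.round(model.coef_, 3))
print("intercept:", round(float(model.intercept_), 3))
```

Since no noise is added, the learned weights and intercept should agree with 0.5, −3, 0.2, and 10 up to numerical precision. With real data, the estimates would only approximate the underlying relationship.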

## Linear Regression Concepts / Terminologies

In this section, you will learn about some of the key concepts related to linear regression models.

**Residual Error**: Residual error is the difference between the actual value and the predicted value. When visualized in terms of the best-fit line, if the actual value is above the best-fit line, it is called a **positive residual error**, and if the actual value is below the best-fit line, it is called a **negative residual error**. The figure below represents the same.
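The sign convention can be illustrated with a few numbers. The actual and predicted values below are hypothetical and serve only to show how positive and negative residuals arise.

```python
import numpy as np

# Hypothetical observed values and the corresponding values on the best-fit line
actual = np.array([12.0, 15.0, 20.0, 22.0])
predicted = np.array([13.0, 14.0, 20.5, 21.0])

# Residual = actual - predicted; positive means the point lies above the line
residuals = actual - predicted
for a, p, r in zip(actual, predicted, residuals):
    side = "above" if r > 0 else "below" if r < 0 else "on"
    print(f"actual={a}, predicted={p}, residual={r:+.1f} ({side} the line)")
```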

**SST, SSE, SSR**: The following are key concepts when dealing with the linear regression model. The following diagram is the representation of SST, SSE, and SSR.

**Sum of Squares Total (SST)**: Sum of Squares Total is the sum of the squared differences between the actual values of the response variable and the mean of the actual values. It is proportional to the variance of the response: recall that the variance is the sum of the squared differences between the observations and their mean, divided by the number of observations (or by one less than that number for the sample variance). It is also termed the Total Sum of Squares (TSS).

**Sum of Squares Error (SSE)**: Sum of Squares Error, or Sum of Squares Residual Error, is the sum of the squared differences between the actual values and the predicted values. It is also termed the Residual Sum of Squares.

**Sum of Squares Regression (SSR)**: Sum of Squares Regression is the sum of the squared differences between the predicted values and the mean of the actual values. It is also termed the **Explained Sum of Squares (ESS)**.

**How are SST, SSR, and SSE related?**

For a least-squares fit that includes an intercept term, SST, SSR, and SSE are related as follows. The same can be seen in the diagram in fig 3.

SST = SSR + SSE
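The identity can be verified numerically. The sketch below uses made-up noisy data and a least-squares line with an intercept (fitted via `np.polyfit`); the slope, intercept, and noise level are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical noisy data around the line y = 3x + 4
x = rng.uniform(0, 10, 50)
y = 3.0 * x + 4.0 + rng.normal(0, 2.0, 50)

# Least-squares line with an intercept
m, c = np.polyfit(x, y, deg=1)
y_pred = m * x + c
y_mean = y.mean()

sst = np.sum((y - y_mean) ** 2)        # total sum of squares
sse = np.sum((y - y_pred) ** 2)        # residual (error) sum of squares
ssr = np.sum((y_pred - y_mean) ** 2)   # regression (explained) sum of squares

print(f"SST = {sst:.3f}, SSR + SSE = {ssr + sse:.3f}")
```

The two printed numbers should agree up to floating-point error. The decomposition generally does not hold for models fitted without an intercept or by methods other than least squares.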

**R-Squared**: R-squared is a measure of how good the regression or best-fit line is. It is also termed the **coefficient of determination**. Mathematically, it is the ratio of the Sum of Squares Regression (SSR) to the Sum of Squares Total (SST).

R-Squared = SSR / SST = (SST – SSE) / SST = 1 – (SSE / SST)

The greater the value of R-Squared, the better the regression line, because a larger share of the variation in the response is explained by the regression line. However, a high R-squared should be interpreted with caution, which will be discussed in later posts. In other words, R-squared is a statistical measure of goodness of fit for a linear regression model: it indicates how close the predictions are to the actual values, relative to simply predicting the mean of the response.
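As a quick check of the formula, the following sketch computes R-squared as 1 − (SSE / SST) by hand and compares it with Sklearn's `r2_score`. The data are synthetic and the generating slope, intercept, and noise level are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(7)

# Hypothetical noisy data around the line y = 1.5x + 2
X = rng.uniform(0, 10, (60, 1))
y = 1.5 * X[:, 0] + 2.0 + rng.normal(0, 1.0, 60)

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

# R-squared from its definition
sse = np.sum((y - y_pred) ** 2)
sst = np.sum((y - y.mean()) ** 2)
r2_manual = 1.0 - sse / sst

# R-squared as computed by scikit-learn
r2_sklearn = r2_score(y, y_pred)
print(f"manual: {r2_manual:.4f}, sklearn: {r2_sklearn:.4f}")
```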

## Linear Regression Python Code Example

Here is the **Python** code for **linear regression** where a regression model is trained on a housing dataset for predicting housing prices. Pay attention to some of the following in the code given below:

- LinearRegression from sklearn.linear_model is used to create an instance of the linear regression algorithm.
- The Boston dataset from sklearn.datasets is used as the housing dataset. Note that load_boston was deprecated in scikit-learn 1.0 and removed in version 1.2, so the code below requires an older scikit-learn release.
- make_pipeline from sklearn.pipeline is used to create a pipeline whose steps standardize the dataset (StandardScaler) and fit the model using a linear regression algorithm (LinearRegression).
- The model performance evaluation metrics used are **Mean Squared Error (MSE)** and **R-Squared**.

```
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

#
# Load the Sklearn Boston Dataset
# (load_boston was removed in scikit-learn 1.2; this call needs an older version)
#
boston_ds = datasets.load_boston()
X = boston_ds.data
y = boston_ds.target
#
# Create a training and test split (70% train, 30% test)
#
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
#
# Fit a pipeline (standardization + linear regression) on the training data
#
pipeline = make_pipeline(StandardScaler(), LinearRegression())
pipeline.fit(X_train, y_train)
#
# Calculate the predicted values for the training and test datasets
#
y_train_pred = pipeline.predict(X_train)
y_test_pred = pipeline.predict(X_test)
#
# Mean Squared Error
#
print('MSE train: %.3f, test: %.3f' % (mean_squared_error(y_train, y_train_pred),
                                       mean_squared_error(y_test, y_test_pred)))
#
# R-Squared
#
print('R^2 train: %.3f, test: %.3f' % (r2_score(y_train, y_train_pred),
                                       r2_score(y_test, y_test_pred)))
```
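Since load_boston is not available in scikit-learn 1.2 and later, the same pipeline can be tried on synthetic data instead. The sketch below is one possible substitute, not the original experiment: it uses `make_regression` from sklearn.datasets, and the sample size, number of features, and noise level are arbitrary assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for a housing dataset (assumed sizes)
X, y = make_regression(n_samples=500, n_features=13, noise=10.0, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Same pipeline as above: standardize the features, then fit linear regression
pipeline = make_pipeline(StandardScaler(), LinearRegression())
pipeline.fit(X_train, y_train)

y_test_pred = pipeline.predict(X_test)
print('MSE test: %.3f' % mean_squared_error(y_test, y_test_pred))
print('R^2 test: %.3f' % r2_score(y_test, y_test_pred))

# make_pipeline names each step after its lowercased class name
lr = pipeline.named_steps["linearregression"]
print('Number of learned coefficients:', lr.coef_.shape[0])
```

The coefficients in `lr.coef_` apply to the standardized features, so they describe the change in the target per one standard deviation of each feature.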

You can also generate synthetic datasets if you want to work with regression models. The code below uses PyTorch to generate a synthetic dataset.

```
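import torch

def synthetic_data(w, b, num_records):
    # Features drawn from a standard normal distribution
    X = torch.normal(0, 1, (num_records, len(w)))
    # Labels from the linear model y = Xw + b
    y = torch.matmul(X, w) + b
    # Add normally distributed noise with standard deviation 0.01
    y += torch.normal(0, 0.01, y.shape)
    return X, y.reshape((-1, 1))

w = torch.tensor([2.5, 3.0])
b = 1.5
num_records = 100
features, labels = synthetic_data(w, b, num_records)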
```

Note some of the following in the code given above:

- There are two features whose weights are set to 2.5 and 3. The bias term has a value of 1.5.
- There are 100 records.
- The function synthetic_data is used to generate the feature data and the associated labels.
- A noise term is added under the assumption that the error is normally distributed (see torch.normal(0, 0.01, y.shape) in the code).
- Matrix multiplication (torch.matmul) is used to calculate the labels, to which the bias and the noise term are then added.
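To confirm that the synthetic data carry the intended relationship, the weights and bias can be estimated back from the generated features and labels. The sketch below is one way to do this, using a least-squares solve with `torch.linalg.lstsq`; the fixed seed and the trick of appending a column of ones for the bias are choices made here for illustration.

```python
import torch

torch.manual_seed(0)  # fixed seed so the run is reproducible

def synthetic_data(w, b, num_records):
    X = torch.normal(0, 1, (num_records, len(w)))
    y = torch.matmul(X, w) + b
    y += torch.normal(0, 0.01, y.shape)
    return X, y.reshape((-1, 1))

features, labels = synthetic_data(torch.tensor([2.5, 3.0]), 1.5, 100)

# Append a column of ones so the bias is estimated as a third weight
A = torch.cat([features, torch.ones(features.shape[0], 1)], dim=1)

# Least-squares solution of A @ params = labels; params has shape (3, 1)
params = torch.linalg.lstsq(A, labels).solution
w_hat, b_hat = params[:2, 0], params[2, 0]
print("estimated weights:", w_hat)
print("estimated bias:", b_hat)
```

Because the noise is small, the estimates should land close to 2.5, 3, and 1.5, though not exactly on them.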

## Conclusions

In this post, you learned some of the following concepts in relation to **linear regression**:

- Linear regression is a supervised machine learning algorithm used to predict the value of continuous response variables.
- When there is just one predictor or independent variable, it is called simple linear regression.
- When there are two or more predictors or independent variables, it is called multiple linear regression.
- R-Squared is a metric that can be used to evaluate the linear regression model performance. It measures the proportion of the variability of the response variable that is explained by the regression model. The higher the R-squared value, the more of the variability is explained by the model. However, a high value should be interpreted with caution.
- R-Squared can be expressed as a function of SSE (Sum of Squares Residual Error) and SST (Sum of Squares Total).

Read a detailed post on linear regression explained with real-world examples.
