Predicting house prices accurately is crucial in the real estate industry. However, it can be challenging to determine the factors that significantly impact house prices. Without a clear understanding of these factors, accurate predictions are difficult to achieve. The Boston Housing Dataset addresses this problem by providing a comprehensive set of variables that influence house prices in the Boston area. However, effectively utilizing this dataset and building robust predictive models require appropriate techniques and evaluation methods.
In this blog, we will provide an overview of the Boston Housing Dataset and explore linear regression, LASSO, and Ridge regression as candidate models for predicting house prices. Each model has properties that address specific challenges, and understanding the differences helps us choose the most appropriate one for production. We will use Mean Squared Error (MSE) to evaluate and compare the performance of each model.
Before building regression models, it is important to understand the data we will be working with. In this section, we look at the Boston Housing Dataset (https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv), including its structure and the meaning of its columns.
The Boston Housing Dataset is widely used in machine learning and predictive analytics. It contains information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts. The dataset has 506 rows, each representing a neighborhood (census tract) rather than an individual house, and 14 columns, each describing a different aspect of the housing and its surroundings. The target used in the code in this post is the medv column, the median value of owner-occupied homes in thousands of dollars; the remaining 13 columns serve as features.
In this section, we build a predictive linear regression model on the Boston Housing Dataset using Python. The code loads the dataset, separates the features from the target, splits the data into training and testing sets, trains the linear regression model, and evaluates it on the test set. The following is the code.
# Import the required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Load the dataset
url = "https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv"
data = pd.read_csv(url)
# Split the data into features (X) and target variable (y)
X = data.drop('medv', axis=1)
y = data['medv']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create an instance of the Linear Regression model
model = LinearRegression()
# Fit the model to the training data
model.fit(X_train, y_train)
# Make predictions on the test data
y_pred = model.predict(X_test)
# Calculate the mean squared error (MSE)
mse = mean_squared_error(y_test, y_pred)
# Print the MSE
print("Linear Test MSE:", mse)
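Note the following in the above code: the medv column is the target and all other columns are features; train_test_split holds out 20% of the rows for testing, and random_state=42 makes the split reproducible; and the MSE is computed on the held-out test set rather than on the training data.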
When building a predictive model, it is crucial to consider regularization techniques before settling on the final model. Regularization is a method used to prevent overfitting and improve the generalization ability of the model. It achieves this by adding a penalty term to the loss function, which helps control the complexity of the model.
In the context of linear regression, two commonly used regularization techniques are LASSO (Least Absolute Shrinkage and Selection Operator) and Ridge regression. Each introduces a regularization parameter (alpha in scikit-learn) that controls the strength of the penalty: LASSO can shrink some coefficients exactly to zero, while Ridge shrinks all coefficients toward zero without eliminating them. By adjusting this parameter, we can find a balance between fitting the training data well and avoiding overfitting.
Here is the code for training with LASSO and Ridge:
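For reference, the penalty terms described above can be written out explicitly. This is a sketch in standard textbook notation rather than something taken from the code in this post: $\beta_0$ is the intercept, $\beta$ the coefficient vector over $p$ features, $x_i$ the feature vector of row $i$, $n$ the number of training rows, and $\alpha \ge 0$ the regularization strength. scikit-learn scales the squared-error term differently for `Lasso` and `Ridge`, so the same numeric `alpha` does not mean the same amount of shrinkage in both.

```latex
\begin{aligned}
\text{Linear regression:}\quad
  & \min_{\beta_0,\,\beta}\ \sum_{i=1}^{n} \left(y_i - \beta_0 - x_i^{\top}\beta\right)^2 \\
\text{Ridge (L2 penalty):}\quad
  & \min_{\beta_0,\,\beta}\ \sum_{i=1}^{n} \left(y_i - \beta_0 - x_i^{\top}\beta\right)^2
    + \alpha \sum_{j=1}^{p} \beta_j^{2} \\
\text{LASSO (L1 penalty):}\quad
  & \min_{\beta_0,\,\beta}\ \sum_{i=1}^{n} \left(y_i - \beta_0 - x_i^{\top}\beta\right)^2
    + \alpha \sum_{j=1}^{p} \lvert \beta_j \rvert
\end{aligned}
```

Setting $\alpha = 0$ recovers ordinary linear regression in both cases; larger values of $\alpha$ shrink the coefficients more strongly.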
# Import the required libraries
import pandas as pd
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# Load the Boston Housing dataset
url = 'https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv'
data = pd.read_csv(url)
# Split the data into features (X) and target variable (y), selecting the target by name
X = data.drop('medv', axis=1)
y = data['medv']
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# LASSO regression
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
y_pred_lasso = lasso.predict(X_test)
lasso_mse = mean_squared_error(y_test, y_pred_lasso)
print('LASSO Test MSE:', lasso_mse)
# Ridge regression
ridge = Ridge(alpha=0.5)
ridge.fit(X_train, y_train)
y_pred_ridge = ridge.predict(X_test)
ridge_mse = mean_squared_error(y_test, y_pred_ridge)
print('Ridge Test MSE:', ridge_mse)
Executing the above code to train the linear, LASSO, and Ridge regression models prints the following MSE values:
Linear Test MSE: 24.2911
LASSO Test MSE: 25.1556
Ridge Test MSE: 24.3776
The model with the lowest MSE is generally considered the best performer, as it indicates a better fit to the test data. In this case, the Linear Regression model has the lowest test MSE of 24.2911, followed by Ridge Regression with a test MSE of 24.3776, and LASSO Regression with a test MSE of 25.1556.
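To make the metric concrete, here is a minimal sketch of what mean_squared_error computes. The helper name `mse_manual` and the numbers are made up for illustration; they are not taken from the dataset.

```python
def mse_manual(y_true, y_pred):
    """Return the average of the squared differences between actual and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical prices in thousands of dollars (illustrative values only)
actual = [24.0, 22.0, 35.0, 33.0]
predicted = [25.0, 21.0, 33.0, 33.0]

example_mse = mse_manual(actual, predicted)
print("Example MSE:", example_mse)  # prints "Example MSE: 1.5"
```

Because the errors are squared, MSE is expressed in squared units of the target. Taking the square root gives a value in the target's own units: for the Linear Test MSE of 24.2911 above, that is about 4.93, or roughly $4,900 given that medv is recorded in thousands of dollars.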
If the MSE of the Lasso model is lower, it suggests that some of the features are not informative and it’s better to set their coefficients to zero. If the MSE of the Ridge model is lower, it suggests that all features are somewhat informative, but their coefficients need to be regularized to prevent overfitting.
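The difference between the two penalties can be seen by counting the coefficients that end up exactly at zero. The sketch below uses synthetic data, not the Boston dataset, and the variable names and alpha values are arbitrary choices for illustration: only the first two of ten features actually influence the target.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (an assumption for illustration): 10 features,
# of which only the first two drive the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

# Fit both regularized models with the same (arbitrary) alpha
lasso_demo = Lasso(alpha=0.1).fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=0.1).fit(X_demo, y_demo)

# Count coefficients that were driven exactly to zero
n_zero_lasso = int(np.sum(lasso_demo.coef_ == 0))
n_zero_ridge = int(np.sum(ridge_demo.coef_ == 0))

print("Coefficients set exactly to zero by LASSO:", n_zero_lasso)
print("Coefficients set exactly to zero by Ridge:", n_zero_ridge)
```

On data like this, LASSO typically zeroes out most of the uninformative features, while Ridge keeps every coefficient non-zero, only smaller. Inspecting `lasso.coef_` on the Boston data in the same way would show which features, if any, the LASSO model discards at a given alpha.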
The test mean squared error (MSE) is lowest for the simple linear regression model, so if prediction accuracy on the current test set is the only concern, we would choose the simple linear regression model. However, it’s important to keep a couple of points in mind:
Selecting a model based solely on the test MSE may not provide a complete picture of its performance. It’s recommended to further evaluate the models using training, validation, and test datasets to make a more informed decision.
By splitting the data into training, validation, and test sets, you can assess how well the models generalize to unseen data. Training the models on the training set, tuning hyperparameters using the validation set, and evaluating their performance on the test set helps to simulate real-world scenarios and provides a more robust assessment of the models.
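As a rough sketch of that workflow, the snippet below tunes Ridge's alpha on a validation set and touches the test set only once at the end. It runs on synthetic stand-in data, and the candidate alpha grid and all variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the Boston dataset)
rng = np.random.default_rng(42)
X_syn = rng.normal(size=(300, 8))
true_coef = np.array([4.0, -3.0, 2.0, 0.0, 0.0, 1.0, 0.0, 0.0])
y_syn = X_syn @ true_coef + rng.normal(scale=1.0, size=300)

# 60% training, 20% validation, 20% test
X_tv, X_te, y_tv, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X_tv, y_tv, test_size=0.25, random_state=42)

# Hypothetical grid of regularization strengths to compare
candidate_alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
val_mse_by_alpha = {}
for alpha in candidate_alphas:
    candidate = Ridge(alpha=alpha).fit(X_tr, y_tr)
    val_mse_by_alpha[alpha] = mean_squared_error(y_va, candidate.predict(X_va))

# Pick the alpha with the lowest validation MSE
best_alpha = min(val_mse_by_alpha, key=val_mse_by_alpha.get)

# Refit on training + validation data, then evaluate once on the test set
final_model = Ridge(alpha=best_alpha).fit(X_tv, y_tv)
final_test_mse = mean_squared_error(y_te, final_model.predict(X_te))

print("Best alpha on validation set:", best_alpha)
print("Test MSE of the refit model:", final_test_mse)
```

The design choice here is that the test set plays no part in choosing alpha, so the final test MSE remains an honest estimate of performance on unseen data.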
Let’s train the models again, this time splitting the dataset into training, validation, and test sets to evaluate how well each model generalizes. We can then select the final model. Here is the Python code:
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import pandas as pd
# Load the Boston Housing dataset
url = 'https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv'
data = pd.read_csv(url)
# Split the data into features (X) and target variable (y)
X = data.drop('medv', axis=1)
y = data['medv']
# Split the data into training, validation, and test sets
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=42)
# Fit the Linear Regression model
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)
# Fit the LASSO Regression model
lasso_model = Lasso(alpha=0.1) # Adjust the alpha value as needed
lasso_model.fit(X_train, y_train)
# Fit the Ridge Regression model
ridge_model = Ridge(alpha=1.0) # Adjust the alpha value as needed
ridge_model.fit(X_train, y_train)
# Evaluate the models using mean squared error (MSE) on validation and test sets
linear_val_mse = mean_squared_error(y_val, linear_model.predict(X_val))
lasso_val_mse = mean_squared_error(y_val, lasso_model.predict(X_val))
ridge_val_mse = mean_squared_error(y_val, ridge_model.predict(X_val))
linear_test_mse = mean_squared_error(y_test, linear_model.predict(X_test))
lasso_test_mse = mean_squared_error(y_test, lasso_model.predict(X_test))
ridge_test_mse = mean_squared_error(y_test, ridge_model.predict(X_test))
# Print the MSE values
print("Linear Regression:")
print("Validation MSE:", linear_val_mse)
print("Test MSE:", linear_test_mse)
print()
print("LASSO Regression:")
print("Validation MSE:", lasso_val_mse)
print("Test MSE:", lasso_test_mse)
print()
print("Ridge Regression:")
print("Validation MSE:", ridge_val_mse)
print("Test MSE:", ridge_test_mse)
Executing the above code will print the following:
Linear Regression:
Validation MSE: 22.3651
Test MSE: 25.298
LASSO Regression:
Validation MSE: 22.6615
Test MSE: 26.1317
Ridge Regression:
Validation MSE: 22.1929
Test MSE: 25.4528
Based on the MSE values, the Ridge Regression model performs slightly better than the Linear Regression model on the validation set, where it has the lowest MSE (22.1929 versus 22.3651). On the test set, however, the Linear Regression model has the lowest MSE (25.298, versus 25.4528 for Ridge and 26.1317 for LASSO). The gaps between Linear and Ridge are small on both sets, so these results suggest, rather than prove, that the Linear Regression model generalizes at least as well as the regularized models on this data.
The choice of the best model depends on the specific problem and the trade-off between bias and variance. Although the Linear Regression model performs best on the test set here, a single split can be misleading, so it is recommended to confirm the ranking on different splits or additional data. Cross-validation can also be employed to assess the models’ stability and select the most robust one.
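A minimal sketch of such a cross-validated comparison is shown below. It uses synthetic data in place of the Boston dataset, and the alpha grid, fold count, and variable names are assumptions for illustration. LassoCV and RidgeCV choose their own alpha internally, and cross_val_score then estimates the MSE of each approach over five folds.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression, RidgeCV
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (not the Boston dataset)
rng = np.random.default_rng(7)
X_cv = rng.normal(size=(250, 6))
coef_cv = np.array([2.0, 0.0, -1.5, 0.0, 0.5, 0.0])
y_cv = X_cv @ coef_cv + rng.normal(scale=1.0, size=250)

# Shuffled 5-fold split, fixed seed for reproducibility
folds = KFold(n_splits=5, shuffle=True, random_state=42)

# Hypothetical grid of regularization strengths
alphas = [0.01, 0.1, 1.0, 10.0]

models = {
    "Linear": LinearRegression(),
    "LassoCV": LassoCV(alphas=alphas, cv=5),
    "RidgeCV": RidgeCV(alphas=alphas),
}

cv_mse = {}
for name, estimator in models.items():
    # scikit-learn maximizes scores, so MSE is reported as its negative
    scores = cross_val_score(
        estimator, X_cv, y_cv, cv=folds, scoring="neg_mean_squared_error"
    )
    cv_mse[name] = -scores.mean()
    print(f"{name}: mean CV MSE = {cv_mse[name]:.4f} (std {scores.std():.4f})")
```

Averaging over several folds gives a steadier estimate than a single train/test split, and the spread across folds indicates how much a ranking such as "Linear beats Ridge" depends on the particular split.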
Linear regression models can be built on the Boston Housing Dataset to estimate house prices. Using Python, we can load the data, split it into training, validation, and test sets, and train the models. Regularization techniques, namely LASSO and Ridge regression, are worth applying to address potential issues such as overfitting and multicollinearity. By evaluating the models with a metric such as Mean Squared Error (MSE), we can select the most appropriate model among those trained with LinearRegression, Lasso, Ridge, LassoCV, or RidgeCV. While the Linear Regression model showed promising results with the lowest MSE on the test set, further evaluation on additional data, or cross-validation (LassoCV, RidgeCV), is recommended to assess how well the models generalize.
If you have any further questions, need clarification on any aspect discussed, or would like to explore similar topics in more detail, please don’t hesitate to reach out. I will be happy to help and provide further guidance.