**Are you grappling with the complexities of choosing the right regression model for your data?** You are not alone. When working with **regression models**, **selecting the most appropriate machine learning model** is a critical step toward understanding the relationships between variables and making accurate predictions. With numerous regression models available, it becomes essential to employ robust criteria for **model selection**. This is where the two most widely used criteria come to the rescue. They are the **Akaike Information Criterion (AIC)** and the **Bayesian Information Criterion (BIC)**. In this blog, we will learn about the concepts of AIC, BIC and how they can be used to select the most appropriate machine learning regression models.

## AIC & BIC Concepts Explained with Formula

In model selection for regression analysis, we often face the challenge of choosing the most appropriate model that strikes a balance between model fit and complexity. Two widely used criteria for this purpose are the **Akaike Information Criterion (AIC)** and the **Bayesian Information Criterion (BIC)**. Both AIC and BIC provide a quantitative measure to evaluate and compare different models, enabling data scientists to make informed decisions. Let’s delve into the formulas and explanations of these criteria:

**Akaike Information Criterion (AIC)**

**AIC** is a **model selection criterion** developed by **Hirotugu Akaike** that aims to estimate the relative quality of different models while penalizing for model complexity. Akaike introduced the criterion in his 1974 paper, "A New Look at the Statistical Model Identification". The purpose of AIC is to find a model that maximizes the likelihood of the data while taking into account the number of parameters used. The formula for AIC is as follows:

**AIC = -2 * log(L) + 2 * k**

In the formula, L represents the maximized likelihood of the model, which measures how well the model fits the data. The term k represents the number of parameters in the model, including the intercept and any additional predictors.

By incorporating both the likelihood and the number of parameters, AIC strikes a balance between model fit and complexity. It encourages the selection of models that fit the data well but avoids excessive complexity, preventing overfitting and reducing the risk of capturing noise or irrelevant features in the data.
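As an illustration, the likelihood-based formula can be applied directly when we assume normally distributed errors. The following sketch uses synthetic data and an illustrative parameter count; it fits a simple linear model and computes AIC from the maximized Gaussian log-likelihood:

```python
import numpy as np
from scipy.stats import norm

# Toy data: y depends linearly on x plus Gaussian noise
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 50)
y = 3.0 * x + 2.0 + rng.normal(0, 1.5, 50)

# Fit a simple linear model by least squares
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# Maximum-likelihood estimate of the error standard deviation
sigma = np.sqrt(np.mean(residuals ** 2))

# Maximized log-likelihood under the Gaussian assumption
log_l = norm.logpdf(residuals, loc=0, scale=sigma).sum()

# k counts the slope and the intercept, as in the formula above
k = 2
aic = -2 * log_l + 2 * k
print("AIC:", round(aic, 2))
```

The same calculation with a different candidate model (say, a quadratic fit) would let you compare the two AIC values directly; the model with the lower value is preferred.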

Many times, it is difficult or impractical to calculate the likelihood of the model. In such cases, the formula for AIC can be modified to use the **sum of squared errors (SSE)** as a proxy for the likelihood. The formula for AIC with SSE is as follows:

**AIC = n * ln(SSE/n) + 2 * k**

In the formula, n represents the sample size, SSE represents the sum of squared errors (residuals) from the model, and k represents the number of parameters in the model.
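The SSE-based formula is equivalent to the likelihood-based one for Gaussian errors, up to an additive constant that is the same for every model fit to the same data, so model rankings are unaffected. A quick numerical check of this claim:

```python
import numpy as np

def aic_sse(n, sse, k):
    # SSE-based AIC: n * ln(SSE/n) + 2k
    return n * np.log(sse / n) + 2 * k

def aic_loglik(n, sse, k):
    # Gaussian log-likelihood evaluated at the MLE sigma^2 = SSE/n
    log_l = -0.5 * n * (np.log(2 * np.pi) + np.log(sse / n) + 1)
    return -2 * log_l + 2 * k

# The gap between the two formulations is constant across models:
# it equals n * (ln(2*pi) + 1), independent of SSE
n, k = 100, 4
for sse in (50.0, 80.0, 120.0):
    diff = aic_loglik(n, sse, k) - aic_sse(n, sse, k)
    print(sse, round(diff, 4))
```

Because the gap does not depend on SSE, whichever model minimizes one formulation also minimizes the other.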

For smaller samples, a corrected version of AIC known as AICc is used instead. Here are the details.

### AICc for Smaller Samples of Data

While the Akaike Information Criterion (AIC) provides a useful measure for model selection, it has a tendency to favor more complex models, especially when dealing with smaller samples. To address this issue, a corrected version of AIC called **AICc** (**AIC corrected**) was developed. AICc adjusts the AIC values to account for the potential bias introduced by limited data.

In situations where the sample size is relatively small compared to the number of parameters in the model, AICc becomes particularly relevant. The formula for AICc incorporates an additional correction term that penalizes model complexity to a greater extent than AIC. The formula for AICc is as follows:

**AICc = AIC + (2 * k * (k + 1)) / (n - k - 1)**

In the formula, AIC represents the original Akaike Information Criterion value, k represents the number of parameters in the model, and n represents the sample size. When the likelihood cannot be determined, the SSE-based version of AIC shown above can be used for the AIC term.

The additional correction term (2 * k * (k + 1)) / (n - k - 1) in AICc becomes larger as the number of parameters increases relative to the sample size, effectively penalizing more complex models. By incorporating this correction, AICc provides a more accurate measure of model fit, particularly for smaller sample sizes. *As the sample size increases, the correction term in AICc becomes negligible, and the original AIC can be used without the need for correction*.
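A minimal helper for the correction, with illustrative numbers showing how the correction shrinks as the sample size grows (the correction is only defined when n > k + 1):

```python
def calculate_aicc(aic, k, n):
    # AICc = AIC + 2k(k+1) / (n - k - 1); valid only when n > k + 1
    return aic + (2 * k * (k + 1)) / (n - k - 1)

# With a small sample the correction is substantial...
print(calculate_aicc(100.0, 5, 20))    # 100 + 60/14, about 104.29
# ...while with a large sample it is negligible
print(calculate_aicc(100.0, 5, 2000))  # about 100.03
```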

### Bayesian Information Criterion (BIC)

Similar to AIC, the Bayesian Information Criterion (BIC) is another model selection criterion that considers both model fit and complexity. BIC is grounded in Bayesian principles and imposes a stronger penalty for model complexity than AIC. Gideon Schwarz's foundational paper on BIC, "Estimating the Dimension of a Model", was published in 1978. The formula for BIC is as follows:

**BIC = -2 * log(L) + k * log(n)**

In the formula, the terms log(L) and k have the same meaning as in AIC. Additionally, the term log(n) represents the logarithm of the sample size (n). The log(n) term in BIC introduces a stronger penalty for model complexity compared to AIC, as the penalty term scales with the sample size.

The main difference between AIC and BIC lies in the penalty term for model complexity. While AIC penalizes complexity to some extent with the term 2 * k, BIC’s penalty increases logarithmically with the sample size, resulting in a more pronounced penalty. Therefore, BIC tends to favor simpler models compared to AIC, promoting a more parsimonious approach to model selection.
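The crossover is easy to verify numerically: the BIC penalty k * log(n) exceeds the AIC penalty 2 * k whenever log(n) > 2, i.e., for sample sizes above e² ≈ 7.4. A short sketch comparing the two penalties as n grows:

```python
import numpy as np

k = 5  # illustrative number of parameters
for n in (8, 50, 500, 5000):
    aic_penalty = 2 * k
    bic_penalty = k * np.log(n)
    print(f"n={n}: AIC penalty={aic_penalty}, BIC penalty={bic_penalty:.2f}")
```

For essentially any practical sample size, BIC's penalty is larger, which is why it tends to select simpler models than AIC.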

## AIC & BIC for Model Selection: Example

**AIC and BIC** serve as powerful metrics for **model selection** in **regression analysis**. When confronted with more than one regression model, these criteria aid in identifying the most suitable one. **AIC** considers the **trade-off between model fit and model complexity** by incorporating the **likelihood**, **number of parameters**, and **sample size**. On the other hand, **BIC** introduces a **stronger penalty** for model complexity, prioritizing simpler models. By evaluating AIC and BIC values, data scientists can strike a balance between model performance and complexity, thereby making informed decisions on which regression model to select.

The following is a sample dataset representing different regression models and their corresponding evaluation metrics, including RMSE, AIC, BIC, Adjusted R2, and Mallows' Cp. Let's understand how to go about selecting a model given all these metrics.

| Model Name | Algorithm | RMSE | Adjusted R2 | AIC | BIC | Cp |
|---|---|---|---|---|---|---|
| Model 1 | Linear | 5.23 | 0.75 | 1200.56 | 1250.89 | 10.23 |
| Model 2 | Polynomial | 4.92 | 0.78 | 1185.32 | 1220.45 | 8.91 |
| Model 3 | Ridge | 4.85 | 0.79 | 1180.24 | 1212.78 | 8.55 |
| Model 4 | Lasso | 5.10 | 0.77 | 1195.12 | 1235.67 | 9.34 |
| Model 5 | Random Forest | 4.97 | 0.78 | 1187.43 | 1223.89 | 8.98 |
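Given such a table, the selection step itself is mechanical; for example, using pandas (with the metric values transcribed from the table above):

```python
import pandas as pd

# Metrics transcribed from the comparison table above
models = pd.DataFrame({
    "Model": ["Model 1", "Model 2", "Model 3", "Model 4", "Model 5"],
    "AIC":   [1200.56, 1185.32, 1180.24, 1195.12, 1187.43],
    "BIC":   [1250.89, 1220.45, 1212.78, 1235.67, 1223.89],
})

# Select the model minimizing each criterion
best_aic = models.loc[models["AIC"].idxmin(), "Model"]
best_bic = models.loc[models["BIC"].idxmin(), "Model"]
print(best_aic, best_bic)  # Model 3 on both criteria
```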

Here is the analysis for evaluating and selecting a model out of different models shown above based on the different metrics including AIC and BIC.

- Comparing AIC values:
- Model 3 (Ridge regression) has the lowest AIC value of 1180.24, indicating a better fit compared to other models.
- Model 2 (Polynomial regression) and Model 5 (Random Forest) also have relatively low AIC values, suggesting a good model fit.

- Comparing BIC values:
- Model 3 (Ridge regression) has the lowest BIC value of 1212.78, indicating a better fit compared to other models.
- Model 2 (Polynomial regression) and Model 5 (Random Forest) also have relatively low BIC values, suggesting a good model fit.

Model 3 has lower AIC and BIC values than Model 2 and Model 5, so we can go with Model 3. In this example Model 3 also happens to have the highest Adjusted R2, so all the metrics agree; still, it is worth thinking through how AIC and BIC relate to Adjusted R2 before settling on Model 3.

More generally, preferring the model with the lowest AIC and BIC, even over a competitor with a slightly higher Adjusted R-squared (Adjusted R2), can be attributed to the trade-off between model fit and model complexity.

Adjusted R2 measures the proportion of the variation in the target variable that is explained by the predictors in the model, accounting for the number of predictors and sample size. A higher Adjusted R2 indicates a better fit, suggesting that the model captures more of the variability in the data.

On the other hand, AIC and BIC take into account not only the model fit but also the model complexity. Both criteria penalize models for having a larger number of parameters or predictors. They aim to strike a balance between the goodness of fit and model simplicity, favoring models that explain the data well without excessive complexity.

In the case of Model 3, its Adjusted R2 is slightly higher than that of Model 2 and Model 5, and its AIC and BIC values are the lowest. This implies that Model 3 achieves a good balance between model fit and complexity, outperforming the other models on these criteria.

The lower AIC and BIC values for Model 3 indicate that it provides a relatively better fit while using fewer parameters or predictors compared to the other models. This suggests that Model 3 is a more parsimonious choice, avoiding overfitting and reducing the risk of capturing noise or irrelevant features in the data.

Therefore, Model 3 is selected based on AIC and BIC because it offers a favorable compromise between model fit and complexity, providing a more reliable and interpretable model for the given data.

## AIC & BIC Calculation Python Example

The following Python code demonstrates how to calculate AIC and BIC values for a linear regression model. Note that the SSE-based versions of the formulas are used.

```python
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

# Method for calculating AIC (SSE-based formulation)
def calculate_aic(n, sse, k):
    aic = n * np.log(sse / n) + 2 * k
    return aic

# Method for calculating BIC (SSE-based formulation)
def calculate_bic(n, sse, k):
    bic = n * np.log(sse / n) + k * np.log(n)
    return bic

# Load the Boston Housing Pricing Data
url = "https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv"
data = pd.read_csv(url)

# Split the data into predictors (X) and target variable (y)
X = data.drop("medv", axis=1)  # Remove the target variable from predictors
y = data["medv"]

# Fit linear regression model using scikit-learn
reg = LinearRegression()
reg.fit(X, y)

# Calculate SSE from the in-sample residuals
y_pred = reg.predict(X)
sse = np.sum((y - y_pred) ** 2)

# Number of parameters (coefficients plus the intercept) and sample size
k = reg.coef_.size + 1
n = X.shape[0]

# Calculate AIC and BIC
aic = calculate_aic(n, sse, k)
bic = calculate_bic(n, sse, k)
print("AIC:", aic)
print("BIC:", bic)
```

## Conclusion

The **Akaike Information Criterion (AIC)** and **Bayesian Information Criterion (BIC)** are important metrics for **model selection** in **regression analysis**. By considering both model fit and complexity, AIC and BIC provide quantitative measures that help researchers choose the most appropriate model for their data. AIC strikes a balance between goodness of fit and model complexity, while BIC introduces a stronger penalty for model complexity. By evaluating AIC and BIC, data scientists can assess the trade-off between model performance and simplicity, ultimately selecting a model that optimally explains the data without overfitting. However, it is important to consider these criteria alongside other factors such as domain knowledge, interpretability, and specific research goals. The ultimate aim is to choose a regression model that not only provides a good fit but also yields meaningful insights and facilitates reliable predictions. AIC and BIC serve as reliable guides in this process, enabling data scientists to make informed decisions and enhance the overall quality of their regression modeling.


A common follow-up question: can we safely replace the maximum likelihood with the negative sum of squared errors for models such as a random forest regressor?

For linear regression models that assume normally distributed errors, using the SSE as part of the AIC or BIC calculation is considered to be a good solution because minimizing SSE is equivalent to maximizing the likelihood under these assumptions. The AIC and BIC are then calculated using the likelihood (or SSE in this case), penalized by a function of the number of parameters and the sample size to account for model complexity and overfitting.

However, for models like the random forest regressor, the situation is more complex. The random forest regressor is a non-parametric, ensemble learning method based on decision trees. It does not assume any specific distribution for the error terms and does not directly optimize a likelihood function during training. Instead, it minimizes the overall prediction error through bootstrapping and aggregation of decision trees (bagging).

Given the above, directly replacing the maximized likelihood with SSE in the calculation of AIC or BIC for random forest models is not straightforward. While you cannot make that substitution in the traditional sense for models like random forest, there are approximation methods that use error metrics as a stand-in for the likelihood. These methods require careful consideration of how model complexity is defined and penalized.