As a data scientist, have you ever struggled to choose the best logistic regression model for your data? As we all know, the difference between a good model and the best model can be subtle yet impactful. Whether it’s predicting the likelihood of an event occurring or classifying data into distinct categories, logistic regression provides a robust framework for analysts and researchers. However, the true power of logistic regression is harnessed not just by building models, but by selecting the right one. This is where the Akaike Information Criterion (AIC) comes into play.
In this blog, we’ll delve into different aspects of AIC, decode its formula, learn through real-world examples in both Python and R, and unveil the best practices and common pitfalls.
What is Akaike Information Criterion (AIC)?
The Akaike Information Criterion (AIC), named after its creator, Hirotugu Akaike, who introduced it in the early 1970s, is one of the most popular tools for comparing different models. Unlike traditional methods that might focus solely on goodness of fit, AIC introduces a balance, considering both the complexity of the model and how well it aligns with the observed data.
AIC is based on the concept of entropy, a measure of uncertainty or randomness. In simple terms, AIC evaluates how much “information” a model loses when it approximates reality. The lesser the information loss, the better the model. AIC embodies the idea that among models with a comparable fit, the simpler one is preferable. This principle is crucial in avoiding the trap of overfitting, where a model might perform well on the training data but poorly on new, unseen data.
When using AIC, it’s important to remember that it’s a relative measure. The absolute value of AIC is not as informative as the difference in AIC between models. There is no absolute “good” value of AIC in isolation. A smaller AIC value indicates a better model, but the “best” model is the one with the lowest AIC among the set of models being compared.
If we have two logistic regression models with different AIC values, say $AIC_1$ and $AIC_2$, and if $AIC_1 < AIC_2$, then the model with $AIC_1$ is selected. In fact, this holds true for models trained with any likelihood-based algorithm, not just logistic regression.
When comparing models, the difference in AIC values is important. A general guideline is that a difference of less than 2 might not be significant, while a difference of 2 to 6 suggests a substantial difference, and a difference of more than 10 indicates a strong difference between models.
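As a minimal sketch of how this guideline might be applied in code (the AIC values in the example call are purely hypothetical):

```python
def interpret_delta_aic(aic_1, aic_2):
    """Rough interpretation of an AIC difference, following the guideline above."""
    delta = abs(aic_1 - aic_2)
    if delta < 2:
        return "difference may not be meaningful"
    if delta > 10:
        return "strong difference; clearly prefer the lower-AIC model"
    return "substantial difference; prefer the lower-AIC model"

# Hypothetical AIC values for two candidate models
print(interpret_delta_aic(74.40, 78.21))
```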
AIC Formula
At its core, AIC is calculated using the following formula:
$AIC = -2 \times \text{log-likelihood} + 2 \times K$
Here, the log-likelihood represents the probability of the data given the model, essentially measuring how well the model fits the data. The second term, 2 x K (where K is the number of parameters), penalizes model complexity. The formula ensures that adding more parameters to improve the model fit is only justified if it significantly enhances the likelihood.
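To make the formula concrete, here is a minimal Python helper; the log-likelihood and parameter count in the example call are illustrative numbers only:

```python
def aic(log_likelihood, k):
    """AIC = -2 * log-likelihood + 2 * k, where k is the number of estimated parameters."""
    return -2 * log_likelihood + 2 * k

# e.g., a model with log-likelihood -28.10 and 11 parameters (10 coefficients + intercept)
print(aic(-28.10, 11))  # 78.2
```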
AIC in Logistic Regression
In logistic regression, where models can become complex rapidly, AIC helps with model selection. It aids in comparing different logistic models applied to the same dataset, helping to make informed decisions about which model to use. AIC is particularly well-suited for logistic regression for several reasons, some of which are unique to the nature of logistic regression as a statistical modeling tool:
- Likelihood-Based Model: The log-likelihood is a central concept in logistic regression: it represents the probability of observing the given data under the specified model, and it is precisely the quantity that logistic regression maximizes during fitting. Because AIC is computed directly from the log-likelihood, it uses a key output of logistic regression models and is therefore a natural fit for evaluating them (a short sketch of this follows the list).
- Model Comparison on Log-Likelihood Basis: Logistic regression models often involve comparing various combinations of predictors to find the most effective model. AIC facilitates this comparison by quantifying model quality in terms of log-likelihood, adjusted for the number of parameters. This allows for a direct comparison of different logistic regression models based on their likelihood estimates, considering both the fit and the complexity of the models.
- Fit vs. Complexity Balance: AIC helps to balance the fit of the model (how well the model explains the observed data) against its complexity (number of parameters). Logistic regression models, which are built around maximizing the likelihood, benefit from this balance. The AIC ensures that adding more predictors to the logistic model (thus increasing complexity) is only beneficial if it significantly improves the fit.
- Sensitivity to Overfitting: Logistic regression models can be prone to overfitting, especially with many predictors or complex interactions. Overfitting occurs when a model is too closely tailored to the training data and may not perform well on new data. AIC’s penalty for additional parameters naturally guards against overfitting by discouraging unnecessarily complex models.
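To illustrate how directly AIC builds on the likelihood, a library such as statsmodels exposes both quantities on a fitted logistic regression. The sketch below is a minimal illustration, using a standardized subset of the breast cancer features; it is separate from the worked examples that follow:

```python
import statsmodels.api as sm
from sklearn.datasets import load_breast_cancer

# Load the breast cancer data and keep the first five predictors, purely for illustration
data = load_breast_cancer()
X_raw = data.data[:, :5]
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)  # standardize to help convergence
X = sm.add_constant(X_std)                                 # add an intercept column
y = data.target

# Fit a logistic regression by maximum likelihood
result = sm.Logit(y, X).fit(disp=0)

print(result.llf)  # maximized log-likelihood of the fitted model
print(result.aic)  # AIC = -2 * llf + 2 * k, reported directly by statsmodels
```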
Evaluating Logistic Regression Models using AIC
In this section, we will demonstrate how AIC can be used to evaluate and compare two logistic regression models, with examples in both Python and R.
Python Code for Comparing Logistic Regression Models using AIC
I will work with the breast cancer dataset from sklearn. In the Python code below, I fit two logistic regression models on the same first ten predictors but with two different solvers, liblinear and newton-cg. I then calculate the log-likelihood of each model on the held-out test set and use it to compute and compare their AIC values. (Strictly speaking, AIC is conventionally computed from the training log-likelihood; the held-out log-likelihood is used here simply to illustrate the comparison.)
```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import numpy as np

# Load the dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create two logistic regression models with different solvers
model1 = LogisticRegression(solver='liblinear')
model1.fit(X_train[:, :10], y_train)

model2 = LogisticRegression(solver='newton-cg')
model2.fit(X_train[:, :10], y_train)

# Predict log probabilities for each model
log_prob1 = model1.predict_log_proba(X_test[:, :10])
log_prob2 = model2.predict_log_proba(X_test[:, :10])

# Calculate log-likelihood for each model
log_likelihood1 = log_prob1[np.arange(len(y_test)), y_test].sum()
log_likelihood2 = log_prob2[np.arange(len(y_test)), y_test].sum()

# Compare the models
print(f"Log-Likelihood for Model 1: {log_likelihood1}")
print(f"Log-Likelihood for Model 2: {log_likelihood2}")

# Calculate AIC for each model
k1 = 10 + 1  # Number of parameters in model1 (10 coefficients + intercept)
k2 = 10 + 1  # Number of parameters in model2 (10 coefficients + intercept)
aic1 = 2 * k1 - 2 * log_likelihood1
aic2 = 2 * k2 - 2 * log_likelihood2

# Compare the models
print(f"AIC for Model 1: {aic1}")
print(f"AIC for Model 2: {aic2}")
```
The following output is printed:
Log-Likelihood for Model 1: -28.103983686589725
Log-Likelihood for Model 2: -26.20160164908428
AIC for Model 1: 78.20796737317946
AIC for Model 2: 74.40320329816856
Based on the above output, here’s how to interpret the results for the selection of the model:
- Log-Likelihood Values:
  - Model 1: -28.10
  - Model 2: -26.20
- AIC Values:
  - Model 1: 78.21
  - Model 2: 74.40
Model Selection: Given these results, Model 2 is the better choice of the two. It has a better fit (higher log-likelihood) while still balancing fit against complexity (lower AIC). The AIC difference of roughly 3.8 also falls in the 2 to 6 range discussed earlier, suggesting a substantial difference in Model 2’s favor.
R Code for AIC in Logistic Regression
The following is the R code for evaluating logistic regression models using AIC. In the code below, both models use the same predictors, but model2 employs a different link function (probit instead of the default logit). The AIC values for both models are then calculated and compared; the model with the lower AIC is generally preferred.
```r
# Load the necessary libraries
library(MASS)

# Load the biopsy dataset (similar to the breast cancer dataset)
data(biopsy)

# Clean the dataset (remove NA values)
biopsy_clean <- na.omit(biopsy)

# Define the response variable
response <- as.factor(biopsy_clean$class)

# Define the same set of predictors for both models
predictors <- biopsy_clean[, c("V1", "V2", "V3", "V4", "V5")]

# Model 1: Standard logistic regression model (logit link)
model1 <- glm(response ~ ., data = predictors, family = binomial())

# Model 2: Logistic regression with a different link function (probit)
model2 <- glm(response ~ ., data = predictors, family = binomial(link = "probit"))

# Calculate AIC for each model
aic1 <- AIC(model1)
aic2 <- AIC(model2)

# Output the AIC values
print(paste("AIC for Model 1:", aic1))
print(paste("AIC for Model 2:", aic2))
```
The above code when executed prints the following output:
- AIC for Model 1: 160.66
- AIC for Model 2: 159.51
Based on AIC alone, we would select Model 2, since it has the lower value. Note, however, that the difference is only about 1.15, which is less than 2; by the guideline discussed earlier, the evidence favoring Model 2 over Model 1 is weak, and the two models fit the data almost equally well.