Have you, as a data scientist, ever struggled to choose the best logistic regression model for your data? The difference between a good model and the best model can be subtle yet impactful. Whether you are predicting the likelihood of an event occurring or classifying data into distinct categories, logistic regression provides a robust framework for analysts and researchers. However, the true power of logistic regression is harnessed not just by building models, but by selecting the right one. This is where the Akaike Information Criterion (AIC) comes into play.
In this blog, we’ll delve into different aspects of AIC, decode its formula, work through real-world examples in both Python and R, and cover best practices and common pitfalls.
The Akaike Information Criterion (AIC), named after its creator Hirotugu Akaike, who introduced it in the early 1970s, is one of the most popular tools for comparing different models. Unlike traditional methods that might focus solely on goodness of fit, AIC introduces a balance, considering both the complexity of the model and how well it aligns with the observed data.
AIC is based on the concept of entropy, a measure of uncertainty or randomness. In simple terms, AIC evaluates how much “information” a model loses when it approximates reality. The lesser the information loss, the better the model. AIC embodies the idea that among models with a comparable fit, the simpler one is preferable. This principle is crucial in avoiding the trap of overfitting, where a model might perform well on the training data but poorly on new, unseen data.
When using AIC, it’s important to remember that it’s a relative measure. The absolute value of AIC is not as informative as the difference in AIC between models. There is no absolute “good” value of AIC in isolation. A smaller AIC value indicates a better model, but the “best” model is the one with the lowest AIC among the set of models being compared.
If we have two logistic regression models with AIC values $AIC_1$ and $AIC_2$, and $AIC_1 < AIC_2$, then the model with $AIC_1$ is selected. The same rule applies to models trained with any other classification algorithm, as long as a likelihood can be computed for the model.
When comparing models, the difference in AIC values is important. A general guideline is that a difference of less than 2 might not be significant, while a difference of 2 to 6 suggests a substantial difference, and a difference of more than 10 indicates a strong difference between models.
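To make the guideline concrete, here is a minimal sketch in Python. The AIC values and the verdict labels are illustrative (my own, not taken from a fitted model); the thresholds follow the rule of thumb above.

# Minimal sketch: ranking candidate models by their AIC differences.
# The AIC values below are illustrative, not taken from a fitted model.
candidate_aics = {"model_a": 210.4, "model_b": 211.1, "model_c": 225.9}

best_aic = min(candidate_aics.values())

for name, aic in sorted(candidate_aics.items(), key=lambda kv: kv[1]):
    delta = aic - best_aic  # difference relative to the lowest-AIC model
    if delta == 0:
        verdict = "lowest AIC (preferred model)"
    elif delta < 2:
        verdict = "essentially as good as the preferred model"
    elif delta <= 6:
        verdict = "substantially worse than the preferred model"
    elif delta > 10:
        verdict = "strongly worse than the preferred model"
    else:
        verdict = "worse than the preferred model"
    print(f"{name}: AIC = {aic:.1f}, delta = {delta:.1f} -> {verdict}")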
At its core, AIC is calculated using the following formula:
$AIC = -2 \times \text{log-likelihood} + 2K$
Here, the log-likelihood represents the probability of the data given the model, essentially measuring how well the model fits the data. The second term, $2K$ (where K is the number of parameters), penalizes model complexity. The formula ensures that adding more parameters to improve the model fit is only justified if it significantly enhances the likelihood.
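As a quick worked example with made-up numbers: a model with a log-likelihood of -100 and K = 5 parameters has AIC = -2(-100) + 2(5) = 210, while a second model with a slightly better log-likelihood of -98 but K = 12 parameters has AIC = -2(-98) + 2(12) = 220, so the simpler model is preferred despite its slightly worse fit. A tiny helper for this calculation (the function name aic is my own) might look as follows:

def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike Information Criterion: -2 * log-likelihood + 2 * K."""
    return -2.0 * log_likelihood + 2.0 * n_params

# Worked example with made-up numbers: a better fit does not always win
# once the complexity penalty is applied.
print(aic(log_likelihood=-100.0, n_params=5))   # 210.0
print(aic(log_likelihood=-98.0, n_params=12))   # 220.0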
In logistic regression, where models can become complex rapidly, AIC helps with model selection. It aids in comparing different logistic models applied to the same dataset, helping to make informed decisions about which model to use. AIC is particularly well-suited for logistic regression, not least because logistic models are fitted by maximum likelihood, which directly provides the log-likelihood term that AIC requires.
In this section, we will demonstrate how AIC can be used to evaluate two logistic regression models, using both Python and R code.
I will work with the breast cancer dataset from sklearn. I will first create two logistic regression models, then calculate the log-likelihood for each model and use it to calculate and compare their AIC values. In the Python code below, both models are fitted on the same first ten predictors, but they use two different solvers, liblinear and newton-cg.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import numpy as np

# Load the dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create two logistic regression models with different solvers
model1 = LogisticRegression(solver='liblinear')
model1.fit(X_train[:, :10], y_train)

model2 = LogisticRegression(solver='newton-cg')
model2.fit(X_train[:, :10], y_train)

# Predict log probabilities for each model
log_prob1 = model1.predict_log_proba(X_test[:, :10])
log_prob2 = model2.predict_log_proba(X_test[:, :10])

# Calculate log-likelihood for each model
log_likelihood1 = log_prob1[np.arange(len(y_test)), y_test].sum()
log_likelihood2 = log_prob2[np.arange(len(y_test)), y_test].sum()

# Compare the models
print(f"Log-Likelihood for Model 1: {log_likelihood1}")
print(f"Log-Likelihood for Model 2: {log_likelihood2}")

# Calculate AIC for each model
k1 = 10 + 1  # Number of parameters in model1
k2 = 10 + 1  # Number of parameters in model2
aic1 = 2 * k1 - 2 * log_likelihood1
aic2 = 2 * k2 - 2 * log_likelihood2

# Compare the models
print(f"AIC for Model 1: {aic1}")
print(f"AIC for Model 2: {aic2}")
The following output is printed:
Log-Likelihood for Model 1: -28.103983686589725
Log-Likelihood for Model 2: -26.20160164908428
AIC for Model 1: 78.20796737317946
AIC for Model 2: 74.40320329816856
Based on the above output, here’s how to interpret the results for the selection of the model:

Log-Likelihood Comparison: Model 2’s log-likelihood (about -26.20) is higher than Model 1’s (about -28.10), which means Model 2 fits the data better.

AIC Comparison: Both models use the same number of parameters, so the complexity penalty is identical; Model 2’s AIC (about 74.40) is therefore lower than Model 1’s (about 78.21).

Model Selection: Given these results, Model 2 is the better choice between the two. It not only has a better fit (higher log-likelihood) but also maintains a balance between fitting the data well and not being overly complex (lower AIC).
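As an aside, if you prefer not to assemble the log-likelihood by hand, the statsmodels library fits logistic regression by maximum likelihood and exposes both the log-likelihood (llf) and AIC (aic) on the fitted result. The sketch below is only a cross-check under my own assumptions: it uses the same data split and the same first ten predictors as above, but note that statsmodels computes AIC from the training-data likelihood, so its numbers will differ from the test-set-based values printed earlier.

import statsmodels.api as sm
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Same data and split as in the earlier example
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42
)

# Fit a logistic regression by maximum likelihood on the first ten predictors;
# add_constant appends the intercept column explicitly.
X_train_sm = sm.add_constant(X_train[:, :10])
logit_result = sm.Logit(y_train, X_train_sm).fit(disp=0)

print(f"Log-likelihood (training data): {logit_result.llf}")
print(f"AIC reported by statsmodels: {logit_result.aic}")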
The following is the R code for evaluating logistic regression models using AIC. In the code below, both models use the same predictors, but model2 employs a different link function (probit instead of the default logit). The AIC values for both models are then calculated and compared. The model with the lower AIC is generally preferred.
# Load the necessary libraries
library(MASS)

# Load the biopsy dataset (similar to the breast cancer dataset)
data(biopsy)

# Clean the dataset (remove NA values)
biopsy_clean <- na.omit(biopsy)

# Define the response variable
response <- as.factor(biopsy_clean$class)

# Define the same set of predictors for both models
predictors <- biopsy_clean[, c("V1", "V2", "V3", "V4", "V5")]

# Model 1: Standard logistic regression model
model1 <- glm(response ~ ., data = predictors, family = binomial())

# Model 2: Logistic regression with a different link function (e.g., probit)
model2 <- glm(response ~ ., data = predictors, family = binomial(link = "probit"))

# Calculate AIC for each model
aic1 <- AIC(model1)
aic2 <- AIC(model2)

# Output the AIC values
print(paste("AIC for Model 1:", aic1))
print(paste("AIC for Model 2:", aic2))
When executed, the code prints the AIC values for both models. Based on that output, we can select Model 2, the probit model, since it has the lower AIC. The reason for this selection is grounded in the principle behind AIC: among models fitted to the same data with the same number of parameters, the one with the lower AIC loses less information when approximating reality and is therefore preferred.