Across industries such as telecommunications, finance, and e-commerce, the ability to predict and understand customer churn has become a critical part of strategic business management. Whether it’s a telecom giant grappling with subscriber turnover, a fintech company aiming to retain its user base, or an e-commerce platform trying to reduce shopping cart abandonment, the implications of churn are far-reaching. This is where logistic regression, a versatile statistical method, comes into play. This blog walks through training a logistic regression machine learning model for churn prediction, an approach that applies across industries. We will use Python and the scikit-learn (sklearn) package to train the model (Sklearn Logistic Regression).
The Telco Customer Churn dataset used for the churn prediction model is obtained from Kaggle. This dataset is particularly valuable for those interested in understanding customer behavior and retention at a telecommunications company. It comprises 7043 rows and 21 columns, offering a comprehensive overview of customer demographics, account information, and service details. Key attributes include customer ID, gender, whether the customer is a senior citizen, partner and dependents status, tenure, phone and internet service details, and monthly charges, among others. The target variable, ‘Churn’, indicates whether the customer left within the last month, making this dataset ideal for binary classification tasks and specifically suited for analyzing factors that contribute to customer churn.
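We want to build a customer churn prediction model using logistic regression. Starting from the dataset obtained from Kaggle, we’ll take the following steps: load and examine the data, preprocess it, split it into training and testing sets, train the logistic regression model, and evaluate its performance.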
As a first step, we load and examine the churn dataset to understand its features and format.
    import pandas as pd

    # Load the dataset
    file_path = '/file-path/WA_Fn-UseC_-Telco-Customer-Churn.csv'
    data = pd.read_csv(file_path)

    # Display the first few rows of the dataset and its summary
    data.head(), data.info(), data.describe()
The dataset consists of 7043 entries with 21 columns. The output of data.info() gives an overview of its structure, listing each column with its non-null count and data type.
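Before we proceed with building the model, we need to preprocess the data. The following preprocessing steps will be applied to the churn dataset:
1. Convert categorical variables to numeric (label encoding for binary columns, dummy variables for the others)
2. Check for missing values
3. Convert ‘TotalCharges’ to a numeric type and fill any resulting gaps with the column mean
4. Scale the numerical features
5. Drop irrelevant features such as the customer ID
The following Python code implements these preprocessing steps using scikit-learn’s LabelEncoder and StandardScaler together with the pandas functions get_dummies and to_numeric: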
    from sklearn.preprocessing import LabelEncoder, StandardScaler
    from sklearn.model_selection import train_test_split

    # 1. Convert categorical variables to numeric
    # Label encoding for binary categories and get dummies for others
    binary_cols = ['gender', 'Partner', 'Dependents', 'PhoneService', 'PaperlessBilling', 'Churn']
    label_encoder = LabelEncoder()
    for col in binary_cols:
        data[col] = label_encoder.fit_transform(data[col])

    # Get dummies for non-binary categorical variables
    categorical_cols = ['MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup',
                        'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies',
                        'Contract', 'PaymentMethod']
    data = pd.get_dummies(data, columns=categorical_cols, drop_first=True)

    # 2. Handle missing values (check for missing values first)
    # Note: blank strings in 'TotalCharges' are not NaN yet, so they are not counted here
    missing_values = data.isnull().sum()

    # 3. Convert 'TotalCharges' to numeric type
    # 'TotalCharges' might have some blank strings which need to be handled
    data['TotalCharges'] = pd.to_numeric(data['TotalCharges'], errors='coerce')
    data['TotalCharges'] = data['TotalCharges'].fillna(data['TotalCharges'].mean())

    # 4. Scale numerical features
    scaler = StandardScaler()
    numerical_cols = ['tenure', 'MonthlyCharges', 'TotalCharges']
    data[numerical_cols] = scaler.fit_transform(data[numerical_cols])

    # 5. Drop irrelevant features
    data = data.drop(['customerID'], axis=1)

    # Display missing values and the first few rows of the processed dataset
    missing_values, data.head()
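The preprocessing code converts ‘TotalCharges’ with errors='coerce' because blank strings in a text column are not reported as missing until the column is numeric. The following is a minimal sketch of that behavior on a hypothetical three-row sample (the values are made up for illustration and are not taken from the Telco dataset):

```python
import pandas as pd

# Hypothetical sample; the blank string mimics an empty TotalCharges cell.
sample = pd.DataFrame({"TotalCharges": ["29.85", " ", "108.15"]})

# While the column holds strings, isnull() does not flag the blank entry.
missing_before = int(sample["TotalCharges"].isnull().sum())

# errors="coerce" turns values that cannot be parsed as numbers into NaN.
sample["TotalCharges"] = pd.to_numeric(sample["TotalCharges"], errors="coerce")
missing_after = int(sample["TotalCharges"].isnull().sum())

# Fill the NaN with the mean of the remaining values, as in the preprocessing step.
sample["TotalCharges"] = sample["TotalCharges"].fillna(sample["TotalCharges"].mean())

print(missing_before, missing_after)
print(sample)
```

This is why a missing-value check that runs before the numeric conversion can report zero gaps for a column that still contains blanks.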
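The dataset is now preprocessed: the categorical columns are numerically encoded, ‘TotalCharges’ is numeric with blanks filled by the column mean, the numerical features are standardized, and the customer ID column is dropped.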
Next, we’ll split the data into training and testing sets. The target variable is ‘Churn’, which we want to predict. Let’s proceed with the data split and then build the logistic regression model.
    # Splitting the data into training and testing sets
    X = data.drop('Churn', axis=1)
    y = data['Churn']

    # Split the dataset into a training set and a test set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    X_train.shape, X_test.shape, y_train.shape, y_test.shape
The data has been split into training and testing sets. The training set contains 5634 samples, and the testing set contains 1409 samples.
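As an optional variation that this walkthrough does not use, train_test_split accepts a stratify argument that keeps the class proportions the same in both sets, which can help when churners are a minority. The sketch below uses synthetic data with an assumed 25% positive rate; the arrays and the rate are illustrative only:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption): 1000 rows, 25% positive labels.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 250 + [0] * 750)

# stratify=y_demo preserves the 25/75 label ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print(X_tr.shape, X_te.shape, train_rate, test_rate)
```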
Now, let’s build and train the logistic regression model on the training set using sklearn.linear_model.LogisticRegression. After training, we will evaluate the model’s performance on the test set.
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

    # Building the logistic regression model
    logreg = LogisticRegression(max_iter=1000, random_state=42)
    logreg.fit(X_train, y_train)
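To make concrete what a fitted binary logistic regression computes, the sketch below rebuilds its probabilities by hand: a linear score from the learned coefficients, passed through the sigmoid function. It runs on synthetic data from make_classification (an assumption, not the Telco dataset), and names such as demo_model are hypothetical:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption; not the Telco dataset).
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=42)

demo_model = LogisticRegression(max_iter=1000, random_state=42)
demo_model.fit(X_demo, y_demo)

# Linear score: dot product of features and coefficients, plus the intercept.
scores = X_demo @ demo_model.coef_.ravel() + demo_model.intercept_[0]

# The sigmoid maps each score to a probability of the positive class.
manual_proba = 1.0 / (1.0 + np.exp(-scores))
sklearn_proba = demo_model.predict_proba(X_demo)[:, 1]

# A positive score corresponds to a probability above 0.5, i.e. a predicted 1.
manual_pred = (scores > 0).astype(int)

print(np.allclose(manual_proba, sklearn_proba))
print(np.array_equal(manual_pred, demo_model.predict(X_demo)))
```

Seen this way, predict applies a fixed 0.5 probability cut-off, which is one reason precision and recall can be traded against each other by choosing a different cut-off on predict_proba.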
The logistic regression model for predicting customer churn has been trained. The following Python code will help evaluate the logistic regression model for churn prediction:
    # Predicting the Test set results
    y_pred = logreg.predict(X_test)

    # Evaluating the model
    accuracy = accuracy_score(y_test, y_pred)
    conf_matrix = confusion_matrix(y_test, y_pred)
    class_report = classification_report(y_test, y_pred)

    accuracy, conf_matrix, class_report
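The evaluation code returns the accuracy score, the confusion matrix, and the classification report, which lists precision, recall, and F1-score for each class.
These results indicate that the model is reasonably effective at predicting customer churn, with a reasonable balance of precision and recall. There is still room for improvement, however, especially in reducing false negatives (churners the model misses), which would increase recall.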
Further enhancements could include feature engineering, trying different models, or tuning hyperparameters.
In customer churn prediction, whether to prioritize precision, recall, or both hinges on specific business needs. Precision is the fraction of predicted churners who actually churn; it matters most when false positives (customers erroneously labeled as churners) incur high costs. Recall is the fraction of actual churners the model identifies; it matters most when missing true churners is costlier. A trade-off often exists between these metrics: improving one can reduce the other. The decision largely depends on the comparative business impact of false positives versus false negatives. The F1 score, the harmonic mean of precision and recall, becomes relevant when both error types have significant implications. We will aim to improve the F1-score.
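The arithmetic behind these metrics can be checked on a small worked example. The confusion-matrix counts below are hypothetical, chosen only to make the numbers easy to follow; they are not this model’s results:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical counts: true positives, false negatives, false positives, true negatives.
tp, fn, fp, tn = 6, 4, 2, 8

# Build label arrays that reproduce those counts (1 = churn, 0 = no churn).
y_true = [1] * (tp + fn) + [0] * (fp + tn)
y_pred = [1] * tp + [0] * fn + [1] * fp + [0] * tn

# Precision: of the customers predicted to churn, how many actually churned.
precision = tp / (tp + fp)
# Recall: of the customers who actually churned, how many were caught.
recall = tp / (tp + fn)
# F1: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, round(f1, 4))
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```

With these counts, precision is 0.75 and recall is 0.60, so the F1 score of about 0.667 sits between the two but closer to the lower value, which is the behavior that makes it useful when both error types matter.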
Improving the F1-score of the customer churn prediction model involves a variety of strategies. An F1-score of 0.64 indicates a moderate balance between precision and recall, but there’s room for improvement. Here are some approaches:
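As one illustration, hyperparameter tuning aimed at the F1-score can be sketched with scikit-learn’s GridSearchCV. The data here is synthetic, and the parameter grid (values of C and the class_weight option) is an assumption chosen for demonstration rather than a recommended setting for the Telco dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced stand-in data (an assumption; roughly 25% positives).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.75, 0.25], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Hypothetical grid: regularization strength and optional class re-weighting.
param_grid = {"C": [0.01, 0.1, 1, 10], "class_weight": [None, "balanced"]}

# scoring="f1" makes cross-validation select the setting with the best F1-score.
search = GridSearchCV(
    LogisticRegression(max_iter=1000, random_state=42),
    param_grid,
    scoring="f1",
    cv=5,
)
search.fit(X_tr, y_tr)

best_params = search.best_params_
test_f1 = f1_score(y_te, search.predict(X_te))
print(best_params, round(test_f1, 3))
```

Whether a tuned setting helps on real churn data has to be verified on a held-out test set, as the final line does here.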
Remember, improving a model is an iterative process, and it often requires experimenting with multiple strategies to see which works best for your specific dataset and business context.