Across industries such as telecommunications, finance, and e-commerce, the ability to predict and understand customer churn has become a critical part of strategic business management. Whether it’s a telecom giant grappling with subscriber turnover, a fintech company aiming to retain its user base, or an e-commerce platform trying to reduce shopping cart abandonment, the cost of churn is substantial. This is where logistic regression, a simple and widely used statistical method for binary classification, comes into play. This blog walks through training a logistic regression machine learning model for churn prediction, from data loading and preprocessing to evaluation and improvement. We will use Python and scikit-learn (sklearn.linear_model.LogisticRegression) to train the model.
Customer Churn Dataset
The Telco Customer Churn dataset used for the churn prediction model is obtained from Kaggle. This dataset is particularly valuable for those interested in understanding customer behavior and retention at a telecommunications company. It comprises 7043 rows and 21 columns, offering a comprehensive overview of customer demographics, account information, and service details. Key attributes include customer ID, gender, whether the customer is a senior citizen, partner and dependents status, tenure, phone and internet service details, and monthly charges, among others. The target variable, ‘Churn’, indicates whether the customer left within the last month, making this dataset ideal for binary classification tasks and specifically suited for analyzing factors that contribute to customer churn.
Plan for Training Logistic Regression Model
We want to build a customer churn prediction model using logistic regression. We will start by examining the dataset we got from Kaggle. We’ll take the following steps for building the model:
- Data loading: Load the dataset.
- Exploratory data analysis (EDA): Explore and understand its structure.
- Data preprocessing: Preprocess the data (handle missing values, encode categorical variables, etc.).
- Model training:
  - Split the data into training and testing sets.
  - Build and train a logistic regression model.
- Model evaluation: Evaluate the model’s performance.
Customer Churn Data Loading
As a first step, we load and examine the churn dataset to understand its features and format.
```python
import pandas as pd

# Load the dataset
file_path = '/file-path/WA_Fn-UseC_-Telco-Customer-Churn.csv'
data = pd.read_csv(file_path)

# Display the first few rows of the dataset and its summary
data.head(), data.info(), data.describe()
```
The dataset consists of 7043 entries with 21 columns. Here’s an overview of its structure:
- customerID: Unique identifier for each customer.
- gender: Customer’s gender (Male/Female).
- SeniorCitizen: Whether the customer is a senior citizen (1) or not (0).
- Partner: Whether the customer has a partner (Yes/No).
- Dependents: Whether the customer has dependents (Yes/No).
- tenure: Number of months the customer has been with the company.
- PhoneService: Whether the customer has phone service (Yes/No).
- MultipleLines: Whether the customer has multiple lines (Yes/No/No phone service).
- InternetService: Type of internet service the customer has (DSL, Fiber optic, No).
- OnlineSecurity: Whether the customer has online security (Yes/No/No internet service).
- OnlineBackup: Whether the customer has online backup (Yes/No/No internet service).
- DeviceProtection: Whether the customer has device protection (Yes/No/No internet service).
- TechSupport: Whether the customer has tech support (Yes/No/No internet service).
- StreamingTV: Whether the customer has streaming TV (Yes/No/No internet service).
- StreamingMovies: Whether the customer has streaming movies (Yes/No/No internet service).
- Contract: The contract term of the customer (Month-to-month, One year, Two year).
- PaperlessBilling: Whether the customer has paperless billing (Yes/No).
- PaymentMethod: The customer’s payment method.
- MonthlyCharges: The amount charged to the customer monthly.
- TotalCharges: The total amount charged to the customer.
- Churn: Whether the customer churned (Yes/No).
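Before preprocessing, two quick checks are worth running: the class balance of Churn and whether TotalCharges is really numeric. The sketch below is a minimal illustration on a tiny hand-made frame; the rows in `sample` are made up for the example and are not taken from the Kaggle file.

```python
import pandas as pd

# Hypothetical mini-sample mimicking three of the dataset's columns.
sample = pd.DataFrame({
    "tenure": [1, 34, 0, 45],
    "TotalCharges": ["29.85", "1889.5", " ", "1840.75"],  # note the blank string
    "Churn": ["No", "No", "No", "Yes"],
})

# Share of each class in the target column.
churn_share = sample["Churn"].value_counts(normalize=True)

# TotalCharges holds strings, so pandas does not treat it as numeric.
total_charges_is_numeric = pd.api.types.is_numeric_dtype(sample["TotalCharges"])

# Blank strings are not NaN, so isnull() does not flag them.
n_null = sample["TotalCharges"].isnull().sum()
n_blank = (sample["TotalCharges"].str.strip() == "").sum()

print(churn_share)
print(total_charges_is_numeric, n_null, n_blank)
```

On the real data, the same checks can be run on `data` instead of `sample`.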
Customer Churn – Data Preprocessing Python Code
Before building the model, we need to preprocess the data. The following preprocessing steps will be applied to the churn dataset:
- Convert categorical variables to numeric: Many machine learning models, including logistic regression, require numerical input. We need to encode categorical variables (like gender, InternetService, etc.) into numerical form.
- Handle missing values: Check for any missing values and decide how to handle them.
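- Scale numerical features: Features like tenure, MonthlyCharges, and TotalCharges are on very different scales. Standardizing them puts their coefficients on a comparable footing and helps the solver converge.
- Convert TotalCharges to numeric type: It is currently read as an object (string) column and must be converted before it can be used as a number.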
- Drop irrelevant features: The customerID column can be dropped as it is unlikely to be useful for prediction.
The following Python code implements these preprocessing steps. It relies on the following scikit-learn and pandas utilities:
- LabelEncoder (sklearn.preprocessing.LabelEncoder): Maps each distinct text label to an integer. Applied here to binary columns like ‘gender’, ‘Partner’, etc., so that each becomes a 0/1 column.
- get_dummies (pandas.get_dummies): Converts non-binary categorical variables into dummy variables through one-hot encoding. Each category value becomes a separate binary column. Used for columns like ‘MultipleLines’, ‘InternetService’, etc.; drop_first=True drops one category per column so the dummies are not redundant.
- StandardScaler (sklearn.preprocessing.StandardScaler): Standardizes numerical features to have a mean of 0 and a standard deviation of 1. Applied to ‘tenure’, ‘MonthlyCharges’, and ‘TotalCharges’.
```python
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split

# 1. Convert categorical variables to numeric
# Label encoding for binary categories and get dummies for others
binary_cols = ['gender', 'Partner', 'Dependents', 'PhoneService',
               'PaperlessBilling', 'Churn']
label_encoder = LabelEncoder()
for col in binary_cols:
    data[col] = label_encoder.fit_transform(data[col])

# Get dummies for non-binary categorical variables
categorical_cols = ['MultipleLines', 'InternetService', 'OnlineSecurity',
                    'OnlineBackup', 'DeviceProtection', 'TechSupport',
                    'StreamingTV', 'StreamingMovies', 'Contract',
                    'PaymentMethod']
data = pd.get_dummies(data, columns=categorical_cols, drop_first=True)

# 2. Handle missing values (check for missing values first)
missing_values = data.isnull().sum()

# 3. Convert 'TotalCharges' to numeric type
# 'TotalCharges' might have some blank strings which need to be handled
data['TotalCharges'] = pd.to_numeric(data['TotalCharges'], errors='coerce')
data['TotalCharges'] = data['TotalCharges'].fillna(data['TotalCharges'].mean())

# 4. Scale numerical features
scaler = StandardScaler()
numerical_cols = ['tenure', 'MonthlyCharges', 'TotalCharges']
data[numerical_cols] = scaler.fit_transform(data[numerical_cols])

# 5. Drop irrelevant features
data = data.drop(['customerID'], axis=1)

# Display missing values and the first few rows of the processed dataset
missing_values, data.head()
```
The dataset is now preprocessed:
- Categorical variables have been converted to numeric.
- The isnull() check reported no missing values, because it ran while TotalCharges was still a string column. Converting TotalCharges to numeric turned its non-numeric entries into NaN, and those were then filled with the column mean.
- Numerical features (tenure, MonthlyCharges, and TotalCharges) have been scaled.
- The irrelevant customerID feature has been dropped.
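One caveat about the steps above: the imputation mean and the scaler statistics are computed on the full dataset, before the train/test split, so a little information from the test rows leaks into preprocessing. A common alternative is to wrap preprocessing and the model in a scikit-learn Pipeline so that everything is fitted on the training rows only. The sketch below is an illustration of that idea, not the approach used for the results in this post; the data frame `df` is synthetic and only mimics a few of the real columns.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the churn data (random values, illustration only).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "tenure": rng.integers(0, 72, n),
    "MonthlyCharges": rng.uniform(20, 120, n),
    "Contract": rng.choice(["Month-to-month", "One year", "Two year"], n),
    "Churn": rng.integers(0, 2, n),
})
df["TotalCharges"] = df["tenure"] * df["MonthlyCharges"]
df.loc[:4, "TotalCharges"] = np.nan  # mimic a few missing entries

numeric_cols = ["tenure", "MonthlyCharges", "TotalCharges"]
categorical_cols = ["Contract"]

# Impute and scale numeric columns; one-hot encode categorical ones.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="mean")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("cat", OneHotEncoder(drop="first"), categorical_cols),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("logreg", LogisticRegression(max_iter=1000)),
])

X = df.drop(columns="Churn")
y = df["Churn"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# fit() learns the imputation mean and scaling statistics from X_train only.
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"Test accuracy on synthetic data: {test_accuracy:.2f}")
```

Because the labels here are random, the accuracy itself is meaningless; the point is only the structure, in which the test rows never influence the fitted preprocessing.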
Logistic Regression Model Training for Churn Prediction: Python Code
Next, we’ll split the data into training and testing sets. The target variable is ‘Churn’, which we want to predict. Let’s proceed with the data split and then build the logistic regression model.
```python
# Splitting the data into training and testing sets
X = data.drop('Churn', axis=1)
y = data['Churn']

# Split the dataset into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

X_train.shape, X_test.shape, y_train.shape, y_test.shape
```
The data has been split into training and testing sets. The training set contains 5634 samples, and the testing set contains 1409 samples.
Now, let’s build and train the logistic regression model on the training set using sklearn.linear_model.LogisticRegression. After training, we will evaluate the model’s performance on the test set.
```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Building the logistic regression model
logreg = LogisticRegression(max_iter=1000, random_state=42)
logreg.fit(X_train, y_train)
```
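A fitted logistic regression model can also be inspected directly. Each feature gets one coefficient, and exponentiating a coefficient gives the odds ratio for a one-unit increase in that feature. The sketch below illustrates this on synthetic data in which short tenure is deliberately made to raise the churn probability; the names `X_demo`, `y_demo`, and `demo_model` are placeholders for this example only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic, already-standardized features (illustration only).
rng = np.random.default_rng(42)
n = 500
X_demo = pd.DataFrame({
    "tenure": rng.normal(size=n),
    "MonthlyCharges": rng.normal(size=n),
})

# Simulate churn: lower tenure and higher charges increase the churn odds.
logit = -1.0 - 1.5 * X_demo["tenure"] + 0.5 * X_demo["MonthlyCharges"]
y_demo = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)

demo_model = LogisticRegression(max_iter=1000, random_state=42)
demo_model.fit(X_demo, y_demo)

# One coefficient per feature; exp(coefficient) is the odds ratio.
coef_table = pd.DataFrame({
    "feature": X_demo.columns,
    "coefficient": demo_model.coef_[0],
    "odds_ratio": np.exp(demo_model.coef_[0]),
}).sort_values("coefficient")
print(coef_table)

# predict_proba returns [P(no churn), P(churn)] for each row.
churn_probability = demo_model.predict_proba(X_demo)[:, 1]
```

The same two attributes, `coef_` and `predict_proba`, are available on the `logreg` model trained above, with `X_train.columns` supplying the feature names.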
Logistic Regression Model Evaluation for Churn Prediction
The logistic regression model for predicting customer churn has been trained. The following Python code will help evaluate the logistic regression model for churn prediction:
```python
# Predicting the Test set results
y_pred = logreg.predict(X_test)

# Evaluating the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

accuracy, conf_matrix, class_report
```
Here are the results:
- Accuracy: The model achieved an accuracy of about 82.19% on the test set.
- Confusion Matrix:
  - True negatives (customers predicted not to churn and didn’t): 934
  - False positives (customers predicted to churn but didn’t): 102
  - False negatives (customers predicted not to churn but did): 149
  - True positives (customers predicted to churn and did): 224
- Classification Report:
  - Precision (for customers who churned): 69%
  - Recall (for customers who churned): 60%
  - F1-score: The harmonic mean of precision and recall. For customers who churned, the F1-score is 0.64.
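The accuracy, precision, recall, and F1-score listed here all follow from the four confusion-matrix counts. As a quick check, they can be recomputed by hand:

```python
# Counts taken from the confusion matrix reported in this section.
tn, fp, fn, tp = 934, 102, 149, 224

accuracy = (tp + tn) / (tp + tn + fp + fn)   # correct predictions / all predictions
precision = tp / (tp + fp)                   # predicted churners who really churned
recall = tp / (tp + fn)                      # actual churners the model caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"accuracy={accuracy:.4f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
# prints: accuracy=0.8219 precision=0.69 recall=0.60 f1=0.64
```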
These results indicate that the model is reasonably effective at predicting customer churn, but its precision and recall for the churn class are only moderate: it catches 224 of the 373 actual churners in the test set and misses the other 149. There is still room for improvement, especially in reducing false negatives and increasing recall.
Further enhancements could include feature engineering, trying different models, or tuning hyperparameters.
Improving Logistic Regression Model Performance for Churn Prediction
In customer churn prediction, whether to prioritize precision, recall, or both depends on specific business needs. Precision is the share of predicted churners who actually churn; it matters most when false positives (customers wrongly labeled as churners) incur high costs. Recall is the share of actual churners the model identifies; it matters most when missing a true churner is costlier. A trade-off usually exists between these metrics: improving one tends to reduce the other. The decision largely depends on the comparative business impact of false positives versus false negatives. The F1 score, which balances precision and recall, becomes relevant when both error types have significant implications. In short, the choice should align with the business’s strategic objectives and the financial cost of each kind of prediction error. We will aim to improve the F1-score.
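The precision/recall trade-off can be made concrete by sweeping the decision threshold applied to the predicted churn probability. The sketch below is a minimal illustration on synthetic, imbalanced data generated with make_classification; the class weights and the three thresholds are arbitrary choices for the example, not values derived from the Telco dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the churn set (illustration only).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.73, 0.27], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
churn_proba = clf.predict_proba(X_test)[:, 1]  # P(class 1) for each test row

# predict() uses a 0.5 cut-off; here we apply our own thresholds instead.
results = {}
for threshold in (0.3, 0.5, 0.7):
    y_pred = (churn_proba >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_test, y_pred, zero_division=0),
        recall_score(y_test, y_pred),
    )
    print(f"threshold={threshold:.1f} "
          f"precision={results[threshold][0]:.2f} "
          f"recall={results[threshold][1]:.2f}")
```

Lowering the threshold flags more customers as churners, so recall can only stay the same or rise, usually at the expense of precision.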
Improving the F1-score of the customer churn prediction model involves a variety of strategies. An F1-score of 0.64 indicates a moderate balance between precision and recall, but there’s room for improvement. Here are some approaches:
- Feature Engineering:
  - Create New Features: Derive new features from existing data that might have predictive power.
  - Feature Selection: Use techniques to select the most important features. Redundant or irrelevant features can decrease model performance.
  - Feature Transformation: Apply transformations like log, square, or square root to certain features that might have a non-linear relationship with the target variable.
- Data Quality:
  - Handle Missing Values: If there are any, choose appropriate strategies to handle missing data.
  - Outlier Detection and Treatment: Outliers can skew the results. Detect and manage them appropriately.
- Resampling Techniques:
  - If your data is imbalanced (a significant difference between the number of churn and non-churn instances), use resampling techniques like SMOTE (Synthetic Minority Over-sampling Technique) or random undersampling to balance the training data.
- Model Tuning:
  - Hyperparameter Tuning: Use techniques like grid search or random search to find the optimal hyperparameters for the logistic regression model.
  - Regularization Techniques: Apply L1 or L2 regularization to prevent overfitting and improve model generalization. scikit-learn’s LogisticRegression already applies L2 regularization by default; its strength is controlled by the C parameter (smaller C means stronger regularization).
- Different Algorithms:
  - Try other classification algorithms like Random Forest, Gradient Boosting, Support Vector Machines, etc., and compare their performance.
  - Ensemble Methods: Use ensemble techniques like bagging or boosting to improve prediction accuracy.
- Cross-Validation:
  - Use cross-validation to assess the model’s performance. This gives a more reliable estimate of how the model performs on unseen data than a single train/test split.
- Threshold Adjustment:
  - Adjust the decision threshold. Logistic regression models output probabilities, and you can change the threshold for classifying a customer as churned or not to improve precision or recall, depending on which is more important to your business case.
- Domain Knowledge:
  - Incorporate domain knowledge to better understand and model customer behavior. For example, understanding factors that typically lead to churn in your industry can help in feature engineering.
- Post-Model Analysis:
  - Analyze the errors made by the model. Understanding the type of errors (false positives or false negatives) can provide insights into what aspect of the model needs improvement.
- Regular Feedback Loop:
  - Continuously update the model with new data and feedback to adapt to changing patterns in customer behavior.
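Several of the ideas above (hyperparameter tuning, regularization strength, handling class imbalance, and cross-validation) can be combined in a single grid search scored on F1. The sketch below is one possible setup on synthetic data, not a tuned recipe for the Telco dataset: the grid values are arbitrary, and class_weight="balanced" is used here as a simple stand-in for resampling.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced data standing in for the churn set (illustration only).
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.73, 0.27], random_state=0
)

# C is the inverse regularization strength; "balanced" reweights the classes.
param_grid = {
    "C": [0.01, 0.1, 1.0, 10.0],
    "class_weight": [None, "balanced"],
}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="f1",  # F1 of the positive (churn) class
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

On real data, the search should be fitted on the training split only, with the held-out test set reserved for the final evaluation.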
Remember, improving a model is an iterative process, and it often requires experimenting with multiple strategies to see which works best for your specific dataset and business context.