Logistic Regression Concepts, Python Example

In this blog post, we will discuss the logistic regression machine learning algorithm with a Python example. Logistic regression is a model used to predict the probability of occurrence of an event, and it is widely used in machine learning applications. In this tutorial, we will use Python to implement logistic regression for classification problems.

What is Logistic Regression?

Logistic regression is a supervised machine learning algorithm used for classification problems. That is, it can be used to predict whether an instance belongs to one class or the other. For example, it could be used to predict whether a person is male or female based on their height, weight, and other features, or whether or not a customer will purchase a product. The model learns the relationship between the features and the classes from labeled training data. Unlike linear regression, which maps the input data to a continuous output value, logistic regression maps the input data to a probability. The output of the model is therefore a value between 0 and 1, and it represents the probability that the class of the input data is 1.

The input data is mapped to a probability using the sigmoid function. The sigmoid function, also called the logistic function, is a mathematical function that maps values (the weighted sum of the inputs) ranging from -infinity to +infinity to values between 0 and 1. The sigmoid function that represents the hypothesis is defined as:

$$ \sigma(z) = \frac{1}{1 + e^{-z}} $$    …(eq. 1)

The value of z in the sigmoid function is the weighted sum of the input values and can be written as:

$$z = \theta^{T}x $$    …(eq. 2)

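where θ represents the parameters (the weights, with the bias included as an extra component when x contains a constant 1) and x represents the input features.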

The following plot is created when the sigmoid function, [latex] \sigma(z) [/latex] is plotted against the net input function output, z. Note that the value of sigmoid function ranges between 0 and 1.

Logistic regression - sigmoid function vs net input

In the above plot, [latex] \sigma(z) [/latex] approaches 1 when z approaches infinity. Similarly, [latex] \sigma(z) [/latex] approaches 0 when z approaches negative infinity. Thus, it can be concluded that the value of [latex] \sigma(z) [/latex] ranges from 0 to 1. At z = 0, [latex] \sigma(z) [/latex] takes the value of 0.5.
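
The behaviour described above can be checked numerically. The following is a minimal sketch of eq. 1 using NumPy; the function name sigmoid and the sample values in z_values are chosen here for illustration only.

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function from eq. 1: 1 / (1 + e^(-z))."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative net input values, from strongly negative to strongly positive
z_values = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
sigma_values = sigmoid(z_values)

for z, s in zip(z_values, sigma_values):
    print(f"z = {z:6.1f}  sigma(z) = {s:.5f}")
```

The printed values stay strictly between 0 and 1, increase with z, and equal 0.5 at z = 0.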

The picture below represents different aspects of a logistic regression model:

logistic regression model 3

Based on the above picture, the following represents some of the key concepts related to logistic regression model:

  • A set of input features (xi) and related weights (wi) are combined as a weighted sum and added to the bias element (b). This is depicted as the net input function in the above diagram. It has the same form as the linear regression function, and it is the same as the “z” in equation 2 above, with the weights and the bias playing the role of the parameters θ.
  • The net input is passed to the sigmoid function, and the output of the sigmoid function ranges from 0 to 1.
  • The output of [latex] \sigma(z) [/latex] = P(Y = 1| x; w, b). The output represents the probability that a particular data point or example belongs to class 1, given its features x and the parameters, namely the weights (w) and the bias (b). Taking the example of the IRIS data set, suppose the goal is to predict whether a flower is Iris-Versicolor and the value of [latex] \sigma(z) [/latex] is 0.75. This means that the probability that the data point belongs to the class Versicolor is 0.75. With a threshold of 0.5, we predict that the flower is Versicolor. The conversion of the predicted probability to the binary outcome of class 1 or 0 (Versicolor or otherwise in this example) can be represented via the following threshold function.

    logistic regression threshold function
    The threshold function can also be written as a function of z (the sum of the weighted features plus the bias). Refer to the sigmoid plot above: [latex] \sigma(z) [/latex] is at least 0.5 exactly when z is greater than or equal to 0, so for z greater than or equal to 0 one can predict the outcome to be class 1.
    logistic regression threshold function 2

  • The output from the sigmoid function is passed to a threshold function (as shown above), which outputs 1 or 0, with 1 denoting the positive class.
  • One of the key aspects of using a logistic regression model for binary classification is choosing the decision boundary, that is, the threshold value used to convert the predicted probability into a class label. The threshold does not have to be 0.5; a different value can be chosen as well.
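
The steps listed above (net input, sigmoid, threshold) can be sketched in a few lines of NumPy. The weights w, the bias b, and the rows of X_demo below are hypothetical values picked for illustration, not parameters learned from data, and the helper names are assumptions of this sketch.

```python
import numpy as np

def net_input(X, w, b):
    """Weighted sum of the features plus the bias (the value z)."""
    return np.dot(X, w) + b

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_label(X, w, b, threshold=0.5):
    """Return 1 where the predicted probability reaches the threshold, else 0."""
    return np.where(sigmoid(net_input(X, w, b)) >= threshold, 1, 0)

# Hypothetical parameters and examples (not learned from any dataset)
w = np.array([0.8, -0.4])
b = -0.5
X_demo = np.array([[2.0, 1.0],   # z is about  0.7
                   [0.5, 1.0],   # z is about -0.5
                   [3.0, 0.5]])  # z is about  1.7

z = net_input(X_demo, w, b)
labels_default = predict_label(X_demo, w, b)                # threshold 0.5
labels_strict = predict_label(X_demo, w, b, threshold=0.7)  # stricter threshold

print("z:", z)
print("probabilities:", sigmoid(z))
print("labels at 0.5:", labels_default)
print("labels at 0.7:", labels_strict)
```

With the 0.5 threshold, the labels coincide with the sign test on z (class 1 when z is at least 0). Raising the threshold to 0.7 flips the first example to class 0, which illustrates how the choice of threshold changes the predicted labels.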

The output of the logistic regression model (sigmoid function output) is always between 0 and 1. If the output is close to 0, the event is less likely to occur. If the output is close to 1, the event is more likely to occur. For example, if the output of the logistic regression model is 0.8, the probability that the event will occur is 0.8, given a particular set of parameters learned using cost function optimization. Based on the threshold function, the class label can be said to be 1. For any new input X, the output of the above function is used for making the prediction.

The parameters in logistic regression are learned using maximum likelihood estimation. The cost function for logistic regression is defined as:

Cost function for logistic regression

In the above cost function, h represents the output of the sigmoid function shown earlier, y represents the class label of a training example, and x represents the features of that training example. Note that for binary classification problems, the first term will be zero for examples labeled as 0 and the second term will be zero for examples labeled as 1. The equation below represents this aspect:

The value of cost function - Logistic regression
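
Since the cost function figures are not reproduced here, the sketch below assumes the commonly used log loss form, J = -(1/m) Σ [ y log(h) + (1 - y) log(1 - h) ], which matches the description above (the first term vanishes for y = 0, the second for y = 1). The function name log_loss_cost and the clipping constant eps are assumptions of this sketch.

```python
import numpy as np

def log_loss_cost(y_true, h, eps=1e-15):
    """Average log loss for labels y_true (0 or 1) and predicted probabilities h."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities so that log(0) is never evaluated
    h = np.clip(np.asarray(h, dtype=float), eps, 1 - eps)
    per_example = -(y_true * np.log(h) + (1 - y_true) * np.log(1 - h))
    return per_example.mean()

# Confident and correct predictions give a small cost
good_cost = log_loss_cost([1, 0], [0.9, 0.1])
# Confident but wrong predictions give a large cost
bad_cost = log_loss_cost([1, 0], [0.1, 0.9])

print(f"cost for good predictions: {good_cost:.4f}")
print(f"cost for bad predictions:  {bad_cost:.4f}")
```

For the first pair, only the y log(h) term contributes for the example labeled 1 and only the (1 - y) log(1 - h) term contributes for the example labeled 0, so the cost reduces to -log(0.9).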

When the loss function is plotted against the hypothesis function (sigmoid) output, the following plot results for y = 0 and y = 1.

logistic loss vs sigmoid function output plot

In order to fit the parameters, the objective function J(θ) needs to be minimized. The gradient descent algorithm, or its stochastic variant (stochastic gradient descent, SGD), can be used for optimizing the objective or cost function. The equation below shows how the parameters are updated when executing the gradient descent algorithm. To check that gradient descent is running correctly, the value of J(θ) is calculated after each update and verified to be decreasing on every iteration.

gradient descent algorithm for logistic regression
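
As an illustration, here is a minimal sketch of batch (not stochastic) gradient descent for logistic regression. It assumes the standard update θ := θ - α (1/m) Xᵀ(h - y), since the update figure is not reproduced here. The toy data X_toy and y_toy, the learning rate alpha, and the iteration count are made up for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """Log loss J(theta), assumed form: -mean(y*log(h) + (1-y)*log(1-h))."""
    h = np.clip(sigmoid(X @ theta), 1e-15, 1 - 1e-15)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def gradient_descent(X, y, alpha=0.1, n_iter=200):
    """Batch gradient descent; returns the parameters and the cost after each update."""
    theta = np.zeros(X.shape[1])
    history = []
    for _ in range(n_iter):
        h = sigmoid(X @ theta)
        gradient = X.T @ (h - y) / len(y)
        theta = theta - alpha * gradient
        history.append(cost(theta, X, y))
    return theta, history

# Toy data: the first column of ones acts as the bias term
X_toy = np.array([[1.0, 0.5], [1.0, 1.0], [1.0, 1.5],
                  [1.0, 3.0], [1.0, 3.5], [1.0, 4.0]])
y_toy = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

theta, history = gradient_descent(X_toy, y_toy)
print("theta:", theta)
print(f"cost after first update: {history[0]:.4f}")
print(f"cost after last update:  {history[-1]:.4f}")
```

The history list is what one would inspect to confirm that J(θ) decreases on every iteration. If it increases, the learning rate is usually too large.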

Besides gradient descent, more advanced optimization algorithms such as conjugate gradient, BFGS, and L-BFGS are often recommended. When using scikit-learn for training logistic regression models, the optimization algorithm is selected through the solver parameter, which accepts values such as newton-cg, lbfgs, liblinear, sag, and saga.
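
To illustrate the solver parameter, the sketch below fits the same model with each solver on a binary version of the Iris data (Setosa versus the rest). The binary target, the choice of two features, and max_iter=1000 are assumptions made for this example, not part of the tutorial code that follows.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Binary problem built for illustration: Setosa (1) versus the other two species (0)
iris = datasets.load_iris()
X_demo = StandardScaler().fit_transform(iris.data[:, [0, 2]])
y_demo = (iris.target == 0).astype(int)

solver_accuracy = {}
for solver in ["lbfgs", "newton-cg", "liblinear", "sag", "saga"]:
    model = LogisticRegression(solver=solver, max_iter=1000, random_state=1)
    model.fit(X_demo, y_demo)
    solver_accuracy[solver] = model.score(X_demo, y_demo)
    print(f"{solver:>10}: training accuracy = {solver_accuracy[solver]:.3f}")
```

On a small, well-separated problem like this one the solvers are expected to reach very similar solutions; the differences matter more for large datasets and for the penalties each solver supports.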

Logistic regression is similar to linear regression, but the dependent variable in logistic regression is categorical, while the dependent variable in linear regression is continuous.

Logistic Regression Applications / Examples

The following are a few examples of problems that can be solved using a binary classification model trained with the logistic regression algorithm:

  • Spam email classification: In the context of spam email classification, logistic regression can be used to determine whether an email is spam or not. The algorithm does this by training on a dataset of labeled emails (i.e., emails that have been manually classified as spam or not spam). Once the algorithm has been trained, it can then be applied to new emails to make predictions about their labels. The model considers a variety of features of an email (such as the presence of certain keywords) in order to make its predictions.
  • Fraudulent online transactions: In the context of online fraud detection, logistic regression can be used to classify transactions as either fraudulent or not fraudulent. This is done by inputting a variety of factors into the regression model, such as the amount of money involved in the transaction, the method of payment, the shipping address, etc. By analyzing this data, the model can generate a prediction about whether the transaction is likely to be fraudulent. This information can then be used by businesses to help prevent losses due to fraud. 
  • Loan default: Logistic regression models can be used to predict whether a loan will default. The model assigns a probability to each loan, based on factors such as the borrower’s credit score, employment history, loan amount, etc. These probabilities are then used to classify loans into two groups: those that are likely to default, and those that are not. This helps lenders identify the loans that are more likely to default, and the information can be used to make better decisions about which loans to approve and how to price them. Logistic regression can also be used to develop strategies for managing loan portfolios. For example, the model can be used to identify which loans are more likely to become delinquent and how to allocate resources to minimize losses.
  • Malignant tumor: In the context of tumors, logistic regression can be used to classify tumors as benign or malignant. This is an important task, as it can help doctors to make treatment decisions. There are a number of factors that can be used to classify tumors, including size, shape, and texture.

Loading SkLearn Modules / Classes for Logistic Regression Model

The Python code used in this blog represents fitting a machine learning model using Logistic Regression. First and foremost, we will load the appropriate packages, sklearn modules and classes.

# Importing basic packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Importing Sklearn module and classes
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn import datasets
from sklearn.model_selection import train_test_split

Data Loading – IRIS dataset

The IRIS data set is used for training the logistic regression model. The Iris data set is a classification dataset that contains three classes of 50 instances each, where each class refers to a type of iris plant. The three classes in the Iris dataset are:

  • Setosa
  • Versicolor
  • Virginica 

As a next step, we will load the dataset and do the data preparation. The scikit-learn library will be used to load the Iris dataset.

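# Load the Iris dataset bundled with scikit-learn
iris = datasets.load_iris()
# Use only two features: columns 0 and 2 of the data matrix
X = iris.data[:, [0, 2]]
# Class labels (0, 1, 2) for the three species
Y = iris.target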
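
As a quick sanity check of what was loaded, the following sketch prints which features columns 0 and 2 correspond to and how many samples each class has. The variable name selected is introduced here for illustration.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
selected = [0, 2]  # same column indices as used above
X = iris.data[:, selected]
Y = iris.target

selected_names = [iris.feature_names[i] for i in selected]
class_counts = np.bincount(Y)

print("Selected features:", selected_names)
print("Class names:", list(iris.target_names))
print("X shape:", X.shape)
print("Samples per class:", class_counts)
```

Note that Y contains all three classes, so the model trained later in this post is a three-class classifier rather than a strictly binary one.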

Create Training / Test Data

The next step is to create a train/test split. Note the stratify parameter: it ensures that the class distribution in the training and test sets matches that of the full dataset.

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1, stratify=Y)
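
The effect of stratify can be verified by counting the classes on each side of the split. This is a small self-contained sketch that repeats the loading and splitting steps so that it runs on its own; the names train_counts and test_counts are introduced here.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, Y = iris.data[:, [0, 2]], iris.target

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=1, stratify=Y)

# Count how many examples of each class ended up in each split
train_counts = np.bincount(Y_train)
test_counts = np.bincount(Y_test)

print("Training set class counts:", train_counts)
print("Test set class counts:    ", test_counts)
```

Because the full dataset has 50 examples per class, a stratified 70/30 split keeps the three classes equally represented in both the training and the test set.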

Perform Feature Scaling

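The next step is to perform feature scaling so that the features are on a comparable scale irrespective of their original values and units. StandardScaler standardizes each feature to zero mean and unit variance, using statistics computed from the training data only.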

sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
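
The sketch below checks what the scaler does: after the transform, each training feature has mean close to 0 and standard deviation close to 1, and the result equals (x - mean) / std computed with the statistics stored on the fitted scaler. It repeats the earlier steps so that it is self-contained; the name manual_std is introduced for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X, Y = iris.data[:, [0, 2]], iris.target
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=1, stratify=Y)

sc = StandardScaler()
sc.fit(X_train)                      # statistics come from the training data only
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)    # test data reuses the training statistics

# Manual standardization using the fitted mean and scale
manual_std = (X_train - sc.mean_) / sc.scale_

print("Training means after scaling:", X_train_std.mean(axis=0))
print("Training stds after scaling: ", X_train_std.std(axis=0))
print("Test means after scaling:    ", X_test_std.mean(axis=0))
```

The test set means are generally close to, but not exactly, zero, because the test data is scaled with the training statistics rather than its own.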

Train a Logistic Regression Model

The next step is to train a logistic regression model. The following needs to be noted while using the LogisticRegression implementation from sklearn.linear_model:

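  • The C parameter is the inverse of the regularization strength. Smaller values of C specify stronger regularization.
  • The multi_class parameter is set to ‘ovr’, which means the one-vs-rest scheme is used to handle the three classes. The other option is multinomial. Note that scikit-learn 1.5 and later deprecate this parameter, so the argument may need to be dropped on newer versions.
  • The solver parameter is set to ‘lbfgs’. Other solvers that can be used are newton-cg, sag, saga, and liblinear.
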
# Create an instance of LogisticRegression classifier
lr = LogisticRegression(C=100.0, random_state=1, solver='lbfgs', multi_class='ovr')

# Fit the model on the standardized training data
lr.fit(X_train_std, Y_train)
Fig 1. Logistic Regression Model being Fit

Measure Model Performance

The next step is to measure the performance of the model trained using LogisticRegression as shown above.

# Create the predictions for the standardized test data
Y_predict = lr.predict(X_test_std)

# Use metrics.accuracy_score to measure the score
print("LogisticRegression Accuracy %.3f" %metrics.accuracy_score(Y_test, Y_predict))
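
Accuracy is a single number; the predicted probabilities and the confusion matrix give a more detailed view. The following self-contained sketch repeats the pipeline above and then calls predict_proba and metrics.confusion_matrix. As an assumption of this sketch, the multi_class argument is omitted so that the code also runs on newer scikit-learn versions, which means the numbers may differ slightly from those of the one-vs-rest model trained above.

```python
import numpy as np
from sklearn import datasets, metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X, Y = iris.data[:, [0, 2]], iris.target
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=1, stratify=Y)

sc = StandardScaler().fit(X_train)
X_train_std, X_test_std = sc.transform(X_train), sc.transform(X_test)

# Assumption: multi_class is left at its default instead of 'ovr'
lr = LogisticRegression(C=100.0, random_state=1, solver="lbfgs")
lr.fit(X_train_std, Y_train)

proba = lr.predict_proba(X_test_std)   # one probability per class for each test row
Y_predict = lr.predict(X_test_std)     # class with the highest probability
accuracy = metrics.accuracy_score(Y_test, Y_predict)
conf_matrix = metrics.confusion_matrix(Y_test, Y_predict)

print("Class probabilities for the first 3 test examples:")
print(np.round(proba[:3], 3))
print("Accuracy: %.3f" % accuracy)
print("Confusion matrix (rows: true class, columns: predicted class):")
print(conf_matrix)
```

Each row of proba sums to 1, and the predicted label is the class with the largest probability in that row.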

Logistic regression is a classification algorithm that predicts the probability that an event will occur. It can be implemented in Python, and there are many libraries that make it easy to do so. In this blog post, we demonstrated how to use Python’s scikit-learn library to train and evaluate a logistic regression model on the Iris dataset.

Ajitesh Kumar
