Cross Entropy Loss Explained with Python Examples

In this post, you will learn the concepts related to the cross-entropy loss function along with Python code examples and which machine learning algorithms use the cross-entropy loss function as an objective function for training the models. Cross-entropy loss is used as a loss function for models which predict the probability value as output (probability distribution as output). Logistic regression is one such algorithm whose output is a probability distribution. You may want to check out the details on how cross-entropy loss is related to information theory and entropy concepts – Information theory & machine learning: Concepts

What’s Cross-Entropy Loss?

The cross-entropy loss function is an optimization function that is used for training classification models which classify the data by predicting the probability (value between 0 and 1) of whether the data belong to one class or another. In case, the predicted probability of class is way different than the actual class label (0 or 1), the value of cross-entropy loss is high. In case, the predicted probability of the class is near to the class label (0 or 1), the cross-entropy loss will be less. Cross-entropy loss is commonly used as the loss function for the models which has softmax output. Recall that the softmax function is a generalization of logistic regression to multiple dimensions and is used in multinomial logistic regression. Read greater details in one of my related posts – Softmax regression explained with Python example.

In particular, cross-entropy loss or log loss function is used as a cost function for logistic regression models or models with softmax output (multinomial logistic regression or neural network) in order to estimate the parameters. Here is what the function looks like:

Fig 1. Logistic Regression Cost Function
Fig 1. Logistic Regression Cost Function

The above cost function can be derived from the original likelihood function which is aimed to be maximized when training a logistic regression model. Here is what the likelihood function looks like:

Fig 2. Likelihood function to be maximized for Logistic Regression

In order to maximize the above likelihood function, the approach of taking the negative log of the likelihood function (as shown above) and minimizing the function is adopted for mathematical ease. Thus, the cross-entropy loss is also termed log loss. It makes it easy to minimize the negative log-likelihood function due to the fact that it makes it easy to take the derivative of the resultant summation function after taking the log. Here is what the log of the above likelihood function looks like.

Fig 3. Log-likelihood function for Logistic Regression

In order to apply gradient descent to the above log-likelihood function, the negative of the log-likelihood function as shown in fig 3 is taken. Thus, for y = 0 and y = 1, the cost function becomes the same as the one given in fig 1.

Cross-entropy loss function or log-loss function as shown in fig 1 when plotted against the hypothesis outcome/probability value would look like the following:

Understanding cross-entropy or log loss function for Logistic Regression
Fig 4. Understanding cross-entropy or log loss function for Logistic Regression

Let’s understand the log loss function in light of the above diagram:

  • For the actual label value as 1 (red line), if the hypothesis value is 1, the loss or cost function output will be near zero. However, when the hypothesis value is zero, the cost will be very high (near to infinite).
  • For the actual label value as 0 (green line), if the hypothesis value is 1, the loss or cost function output will be near infinite. However, when the hypothesis value is zero, the cost will be very less (near zero).

Based on the above, the gradient descent algorithm can be applied to learn the parameters of the logistic regression models or models using the softmax function as an activation function such as a neural network.

Cross-entropy Loss Explained with Python Example

In this section, you will learn about cross-entropy loss using Python code examples. This is the function we will need to represent in form of a Python function.

Cross Entropy Loss Function
Fig 5. Cross-Entropy Loss Function

As per the above function, we need to have two functions, one as a cost function (cross-entropy function) representing the equation in Fig 5, and the other is a hypothesis function that outputs the probability. In this section, the hypothesis function is chosen as the sigmoid function. Here is the Python code for these two functions. Pay attention to sigmoid function (hypothesis) and cross-entropy loss function (cross_entropy_loss)

import numpy as np
import matplotlib.pyplot as plt

'''
Hypothesis Function - Sigmoid function
'''
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

'''
yHat represents the predicted value / probability value calculated as output of hypothesis / sigmoid function

y represents the actual label
'''
def cross_entropy_loss(yHat, y):
    if y == 1:
      return -np.log(yHat)
    else:
      return -np.log(1 - yHat)

Once we have these two functions, let’s go and create a sample value of Z (weighted sum as in logistic regression) and create the cross-entropy loss function plot showing plots for cost function output vs hypothesis function output (probability value).

#
# Calculate sample values for Z
#
z = np.arange(-10, 10, 0.1)
#
# Calculate the hypothesis value / probability value
#
h_z = sigmoid(z)
#
# Value of cost function when y = 1
# -log(h(x))
#
cost_1 = cross_entropy_loss(h_z, 1)
#
# Value of cost function when y = 0
# -log(1 - h(x))
#
cost_0 = cross_entropy_loss(h_z, 0)
#
# Plot the cross entropy loss 
#
fig, ax = plt.subplots(figsize=(8,6))
plt.plot(h_z, cost_1, label='J(w) if y=1')
plt.plot(h_z, cost_0, label='J(w) if y=0')
plt.xlabel('$\phi$(z)')
plt.ylabel('J(w)')
plt.legend(loc='best')
plt.tight_layout()
plt.show()

Here is what the cross-entropy loss / log loss plot would look like:

Cross Entropy Loss Function Plot
Fig 6. Cross-Entropy Loss Function Plot

Note some of the following in the above:

  • For y = 1, if the predicted probability is near 1, the loss function out, J(W), is close to 0 otherwise it is close to infinity.
  • For y = 0, if the predicted probability is near 0, the loss function out, J(W), is close to 0 otherwise it is close to infinity.

Conclusions

Here is the summary of what you learned in relation to the cross-entropy loss function:

  • The cross-entropy loss function is used as an optimization function to estimate parameters for logistic regression models or models which has softmax output.
  • The cross-entropy loss function is also termed a log loss function when considering logistic regression. This is because the negative of the log-likelihood function is minimized.
  • The cross-entropy loss is high when the predicted probability is way different than the actual class label (0 or 1).
  • The cross-entropy loss is less when the predicted probability is closer or nearer to the actual class label (0 or 1).
  • A gradient descent algorithm can be used with a cross-entropy loss function to estimate the model parameters.
Ajitesh Kumar
Follow me

Ajitesh Kumar

I have been recently working in the area of Data analytics including Data Science and Machine Learning / Deep Learning. I am also passionate about different technologies including programming languages such as Java/JEE, Javascript, Python, R, Julia, etc, and technologies such as Blockchain, mobile computing, cloud-native technologies, application security, cloud computing platforms, big data, etc. For latest updates and blogs, follow us on Twitter. I would love to connect with you on Linkedin. Check out my latest book titled as First Principles Thinking: Building winning products using first principles thinking
Posted in Data Science, Machine Learning. Tagged with , .

One Response

Leave a Reply

Your email address will not be published.

Time limit is exhausted. Please reload the CAPTCHA.