In this post, you will learn the concepts behind the cross-entropy loss function, see Python code examples, and find out which machine learning algorithms use cross-entropy loss as the objective function for training. Cross-entropy loss is used for models whose output is a probability (or a probability distribution over classes). Logistic regression is one such algorithm. For details on how cross-entropy loss relates to information theory and entropy, see the related post: Information theory & machine learning: Concepts
What’s Cross-Entropy Loss?
Cross-entropy loss, also known as log loss or negative log-likelihood loss, is a commonly used loss function for classification problems. It measures the difference between the predicted probability distribution and the true distribution of the target variable, and it serves as the optimization objective during training to adjust the model’s parameters. It applies to binary classification as well as to supervised problems with multiple classes, such as a neural network with softmax activation in the output layer.
Models trained with cross-entropy loss classify data by predicting the probability (a value between 0 and 1) that an example belongs to a class. When the predicted probability of a class is far from the actual class label (0 or 1), the cross-entropy loss is high. When the predicted probability is close to the actual class label, the loss is low.
Cross-entropy loss is commonly used for models that have a softmax output. Recall that the softmax function generalizes the logistic (sigmoid) function to multiple classes and is used in multinomial logistic regression. For more details, see the related post: Softmax regression explained with Python example.
Cross-entropy loss is commonly used in machine learning algorithms such as:
- Neural networks, specifically in the output layer to calculate the difference between the predicted probability and the true label during training.
- Logistic regression (binary classification)
- Softmax regression, also known as multinomial logistic regression or the maximum entropy classifier
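For the softmax-output models listed above, the loss compares a predicted distribution over classes with a one-hot true label. Here is a minimal sketch, assuming three classes and made-up logits. The helper names softmax and categorical_cross_entropy are illustrative and do not come from any particular library.

```python
import numpy as np


def softmax(logits):
    """Turn raw scores into a probability distribution (illustrative helper)."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / np.sum(exp)


def categorical_cross_entropy(probs, one_hot):
    """Cross-entropy between a one-hot label and predicted probabilities."""
    return -np.sum(one_hot * np.log(probs))


# Hypothetical logits for a 3-class problem; class 0 gets the highest score
logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)

label_class_0 = np.array([1.0, 0.0, 0.0])  # label agrees with the prediction
label_class_2 = np.array([0.0, 0.0, 1.0])  # label disagrees with the prediction

loss_agree = categorical_cross_entropy(probs, label_class_0)
loss_disagree = categorical_cross_entropy(probs, label_class_2)
print(probs, loss_agree, loss_disagree)
```

With a one-hot label, the sum reduces to the negative log of the probability assigned to the true class, so the loss is lower when the model puts more probability on the correct class.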
Cross-entropy loss, or log loss, is used as the cost function for logistic regression models and for models with softmax output (multinomial logistic regression or neural networks) in order to estimate the parameters. Here is what the function looks like:
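In standard notation, the cost for a single training example can be written as below. The symbols are assumptions chosen to match the axis labels in the Python example later in this post: φ(z) is the predicted probability, y is the actual label, and J(w) is the cost.

```latex
J(w) =
\begin{cases}
-\log\big(\phi(z)\big) & \text{if } y = 1 \\
-\log\big(1 - \phi(z)\big) & \text{if } y = 0
\end{cases}
```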
The above cost function can be derived from the likelihood function, which is maximized when training a logistic regression model. Here is what the likelihood function looks like:
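A standard way to write this likelihood for logistic regression is sketched below. The notation is an assumption: n is the number of training examples, y^(i) is the label (0 or 1) of example i, and φ(z^(i)) is the predicted probability that example i belongs to class 1.

```latex
L(w) = \prod_{i=1}^{n} \big(\phi(z^{(i)})\big)^{y^{(i)}} \big(1 - \phi(z^{(i)})\big)^{1 - y^{(i)}}
```

Each factor equals φ(z^(i)) when y^(i) = 1 and 1 − φ(z^(i)) when y^(i) = 0, so the product is the probability the model assigns to the observed labels.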
Maximizing the likelihood function directly is awkward because it is a product over all training examples. Taking the log turns the product into a sum, which is much easier to differentiate, and because the log is monotonically increasing, the parameters that maximize the log-likelihood also maximize the likelihood. Negating the log-likelihood then turns the maximization problem into a minimization problem, which is why the cross-entropy loss is also termed log loss. Here is what the log of the above likelihood function looks like.
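In standard notation, the log-likelihood for logistic regression can be written as below. The symbols are assumptions: n training examples, labels y^(i) in {0, 1}, and predicted probabilities φ(z^(i)).

```latex
\ell(w) = \log L(w) = \sum_{i=1}^{n} \Big[ y^{(i)} \log\big(\phi(z^{(i)})\big) + \big(1 - y^{(i)}\big) \log\big(1 - \phi(z^{(i)})\big) \Big]
```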
Gradient descent minimizes a function, so to apply it we take the negative of the log-likelihood function shown in fig 3. Evaluating this negative log-likelihood for a single example with y = 1 and with y = 0 gives the same cost function as the one in fig 1.
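Written out under the same assumed notation (n training examples, labels y^(i) in {0, 1}, predicted probabilities φ(z^(i))), the negated log-likelihood and its two single-example cases are:

```latex
J(w) = -\ell(w) = \sum_{i=1}^{n} \Big[ -y^{(i)} \log\big(\phi(z^{(i)})\big) - \big(1 - y^{(i)}\big) \log\big(1 - \phi(z^{(i)})\big) \Big]

% Single example: only one of the two terms survives
y = 1: \quad J(w) = -\log\big(\phi(z)\big)
\qquad
y = 0: \quad J(w) = -\log\big(1 - \phi(z)\big)
```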
When plotted against the hypothesis output (the predicted probability), the cross-entropy loss function shown in fig 1 looks like the following:
Let’s understand the log loss function in light of the above diagram:
- When the actual label is 1 (red line), the loss is near zero if the hypothesis value is close to 1. As the hypothesis value approaches 0, the cost grows without bound (towards infinity).
- When the actual label is 0 (green line), the cost grows without bound (towards infinity) as the hypothesis value approaches 1. When the hypothesis value is close to 0, the cost is very low (near zero).
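To make the two cases above concrete, here is a small sketch that evaluates the loss at a few probabilities. The function name binary_cross_entropy and the sample probabilities are chosen for illustration only.

```python
import math


def binary_cross_entropy(p, y):
    """Loss for one example: p is the predicted probability of class 1, y is 0 or 1."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


# Sample probabilities chosen for illustration
for p in (0.99, 0.5, 0.01):
    loss_if_1 = binary_cross_entropy(p, 1)
    loss_if_0 = binary_cross_entropy(p, 0)
    print(f"p={p}: loss if y=1 is {loss_if_1:.4f}, loss if y=0 is {loss_if_0:.4f}")
```

A confident correct prediction (p = 0.99 when y = 1) costs about 0.01, while a confident wrong prediction (p = 0.01 when y = 1) costs about 4.6, and the gap keeps growing as p moves closer to 0.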
Based on the above, the gradient descent algorithm can be applied to learn the parameters of logistic regression models, or of models that use the softmax function as an activation function, such as neural networks.
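As a rough sketch of that idea, the loop below fits a one-feature logistic regression by gradient descent on the mean cross-entropy loss. The toy data, the learning rate, and the number of iterations are all made up for illustration. The gradient expressions are the standard ones for a sigmoid hypothesis.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def mean_cross_entropy(p, y):
    """Average binary cross-entropy over all examples."""
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))


# Made-up, overlapping 1-D toy data: larger x tends to mean class 1
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-1.0, 1.0, 50), rng.normal(1.0, 1.0, 50)])
y = np.concatenate([np.zeros(50), np.ones(50)])

w, b = 0.0, 0.0        # parameters to learn
learning_rate = 0.1    # hypothetical step size

initial_loss = mean_cross_entropy(sigmoid(w * x + b), y)
for _ in range(200):
    p = sigmoid(w * x + b)
    # Gradient of the mean cross-entropy with respect to w and b
    grad_w = np.mean((p - y) * x)
    grad_b = np.mean(p - y)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

final_loss = mean_cross_entropy(sigmoid(w * x + b), y)
print(initial_loss, final_loss, w, b)
```

Starting from w = 0 and b = 0, every prediction is 0.5, so the initial loss equals log(2). The updates then move the parameters in the direction that lowers the loss.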
Cross-entropy Loss Explained with Python Example
In this section, you will learn about cross-entropy loss using Python code examples. This is the function we will need to represent in the form of a Python function.
As per the above function, we need two functions: a cost function (the cross-entropy function) representing the equation in Fig 5, and a hypothesis function that outputs the probability. In this section, the hypothesis function is chosen to be the sigmoid function. Here is the Python code for these two functions. Pay attention to the sigmoid function (sigmoid, the hypothesis) and the cross-entropy loss function (cross_entropy_loss).
import numpy as np
import matplotlib.pyplot as plt


def sigmoid(z):
    """Hypothesis function: the sigmoid function."""
    return 1.0 / (1.0 + np.exp(-z))


def cross_entropy_loss(yHat, y):
    """Cross-entropy loss for a given actual label.

    yHat represents the predicted value / probability value calculated
    as the output of the hypothesis / sigmoid function.
    y represents the actual label.
    """
    if y == 1:
        return -np.log(yHat)
    else:
        return -np.log(1 - yHat)
Once we have these two functions, let’s create sample values of z (the weighted sum, as in logistic regression) and plot the cross-entropy loss against the hypothesis function output (the probability value).
# Calculate sample values for z
z = np.arange(-10, 10, 0.1)

# Calculate the hypothesis value / probability value
h_z = sigmoid(z)

# Value of cost function when y = 1: -log(h(x))
cost_1 = cross_entropy_loss(h_z, 1)

# Value of cost function when y = 0: -log(1 - h(x))
cost_0 = cross_entropy_loss(h_z, 0)

# Plot the cross entropy loss
fig, ax = plt.subplots(figsize=(8, 6))
plt.plot(h_z, cost_1, label='J(w) if y=1')
plt.plot(h_z, cost_0, label='J(w) if y=0')
# Raw string so that \p is not treated as an invalid escape sequence
plt.xlabel(r'$\phi$(z)')
plt.ylabel('J(w)')
plt.legend(loc='best')
plt.tight_layout()
plt.show()
Here is what the cross-entropy loss / log loss plot would look like:
Note the following in the above plot:
- For y = 1, the loss function output, J(w), is close to 0 when the predicted probability is near 1, and it grows towards infinity as the predicted probability approaches 0.
- For y = 0, the loss function output, J(w), is close to 0 when the predicted probability is near 0, and it grows towards infinity as the predicted probability approaches 1.
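Because the loss is unbounded at the extremes, a predicted probability of exactly 0 or 1 can produce an infinite value in code. One common workaround is to clip the probabilities slightly away from 0 and 1 before taking the log. The sketch below assumes a tolerance eps of 1e-12, and the function name safe_cross_entropy is hypothetical.

```python
import numpy as np


def safe_cross_entropy(y_hat, y, eps=1e-12):
    """Binary cross-entropy with probabilities clipped away from 0 and 1.

    eps is a tolerance chosen for this illustration, not a prescribed value.
    """
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))


# A prediction of exactly 0 for a positive example: large but finite loss
loss_at_zero = safe_cross_entropy(np.array([0.0]), np.array([1.0]))

# A prediction of exactly 1 for a positive example: loss very close to 0
loss_at_one = safe_cross_entropy(np.array([1.0]), np.array([1.0]))

print(loss_at_zero, loss_at_one)
```

The clipping changes the loss only for probabilities within eps of 0 or 1, so the shape of the curves in the plot is otherwise unaffected.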
Here is the summary of what you learned in relation to the cross-entropy loss function:
- The cross-entropy loss function is used as the objective function to estimate parameters for logistic regression models or models that have a softmax output.
- The cross-entropy loss function is also termed a log loss function when considering logistic regression. This is because the negative of the log-likelihood function is minimized.
- The cross-entropy loss is high when the predicted probability is far from the actual class label (0 or 1).
- The cross-entropy loss is low when the predicted probability is close to the actual class label (0 or 1).
- A gradient descent algorithm can be used with a cross-entropy loss function to estimate the model parameters.