In this post, you will learn the concepts behind the cross-entropy loss function, with Python examples, and which machine learning algorithms use cross-entropy loss as the function to optimize during training. Cross-entropy loss is used as the loss function for models whose output is a probability (or a probability distribution over classes). Logistic regression is one such algorithm.

In this post, the following topics are covered:

• What’s cross entropy loss?
• Cross entropy loss explained with Python examples

## What’s Cross Entropy Loss?

The cross-entropy loss function is the objective minimized when training machine learning classification models that classify data by predicting the probability (a value between 0 and 1) that the data belongs to one class or another. When the predicted probability of a class is far from the actual class label (0 or 1), the cross-entropy loss is high. When the predicted probability is close to the class label, the cross-entropy loss is low. Cross-entropy loss is commonly used as the loss function for models that have a softmax output. Recall that the softmax function is a generalization of the logistic (sigmoid) function to multiple dimensions and is used in multinomial logistic regression.
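To make the softmax case concrete, here is a minimal sketch, assuming a three-class problem with made-up scores. The function names `softmax` and `categorical_cross_entropy` and the `logits` values are illustrative assumptions, not part of the original example.

```python
import numpy as np

def softmax(logits):
    # Subtracting the max keeps np.exp from overflowing; the result is unchanged.
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

def categorical_cross_entropy(probs, true_class):
    # The loss is the negative log of the probability assigned to the true class.
    return -np.log(probs[true_class])

logits = np.array([2.0, 1.0, 0.1])  # hypothetical raw scores for 3 classes
probs = softmax(logits)

loss_correct = categorical_cross_entropy(probs, 0)  # true class has the highest probability
loss_wrong = categorical_cross_entropy(probs, 2)    # true class has the lowest probability
print(probs, loss_correct, loss_wrong)
```

The loss is smaller when the true class is the one the model considers most likely, and larger when the model assigns it a low probability.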

In particular, the cross-entropy loss (or log loss) function is used as the cost function for logistic regression models and for models with softmax output (multinomial logistic regression or neural networks) in order to estimate the model parameters. Here is what the function looks like:
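Written in LaTeX, and assuming $h(x)$ denotes the hypothesis (the predicted probability that the label is 1), the per-example cost that matches the `cross_entropy_loss` function used later in this post is:

```latex
\mathrm{Cost}\big(h(x), y\big) =
\begin{cases}
-\log\big(h(x)\big)     & \text{if } y = 1 \\
-\log\big(1 - h(x)\big) & \text{if } y = 0
\end{cases}
```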

The above cost function can be derived from the likelihood function, which is what we aim to maximize when training a logistic regression model. Here is what the likelihood function looks like:
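In standard notation, and assuming $n$ independent training examples $(x^{(i)}, y^{(i)})$ with labels in $\{0, 1\}$, hypothesis $h(x)$ and parameters $w$, the likelihood for logistic regression can be written as:

```latex
L(w) = \prod_{i=1}^{n} h\big(x^{(i)}\big)^{y^{(i)}} \, \Big(1 - h\big(x^{(i)}\big)\Big)^{1 - y^{(i)}}
```

Each factor equals $h(x^{(i)})$ when $y^{(i)} = 1$ and $1 - h(x^{(i)})$ when $y^{(i)} = 0$, that is, the probability the model assigns to the observed label.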

To maximize the likelihood function, we take its log and maximize that instead, because it is mathematically easier to work with. This is why cross-entropy loss is also called log loss. Working with the log likelihood reduces the risk of numerical underflow (a product of many probabilities becomes a sum of logs) and makes it easy to take the derivative of the resulting sum. Here is what the log of the above likelihood function looks like:
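Taking the log of a likelihood of the form $\prod_i h(x^{(i)})^{y^{(i)}} (1 - h(x^{(i)}))^{1 - y^{(i)}}$ turns the product into a sum. Under the same assumed notation ($n$ examples, hypothesis $h$, parameters $w$):

```latex
\log L(w) = \sum_{i=1}^{n} \Big[ y^{(i)} \log h\big(x^{(i)}\big) + \big(1 - y^{(i)}\big) \log\Big(1 - h\big(x^{(i)}\big)\Big) \Big]
```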

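In order to apply gradient descent, which minimizes a function, the negative of the log likelihood function as shown in fig 3 is taken. This turns the maximization problem into a minimization problem. For a single example, the negative log likelihood reduces to -log(h(x)) when y = 1 and to -log(1 - h(x)) when y = 0, so the cost function becomes the same as the one given in fig 1.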
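Both the underflow argument and the sign flip can be checked numerically. The sketch below uses made-up data (2000 random labels and predicted probabilities, an assumption for illustration only) and compares the raw likelihood, the log likelihood, and the cost written with the y = 1 and y = 0 cases.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 2000 labels and predicted probabilities h(x) in (0.05, 0.95).
y = rng.integers(0, 2, size=2000)
h = rng.uniform(0.05, 0.95, size=2000)

# Per-example likelihood term: h^y * (1 - h)^(1 - y)
terms = h**y * (1 - h)**(1 - y)

likelihood = np.prod(terms)              # product of many values below 1
log_likelihood = np.sum(np.log(terms))   # sum of the logs of the same values
neg_log_likelihood = -log_likelihood     # the quantity gradient descent minimizes

# The same cost written with the y = 1 and y = 0 cases.
piecewise = np.where(y == 1, -np.log(h), -np.log(1 - h)).sum()

print(likelihood, log_likelihood, neg_log_likelihood, piecewise)
```

With this many examples the product underflows to zero in double precision, while the sum of logs stays finite, and the negative log likelihood agrees with the case-by-case cost.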

The cross-entropy loss (log loss) function shown in fig 1, when plotted against the hypothesis output (probability value), looks like the following:

Let’s understand the log loss function in light of the above diagram:

• For an actual label value of 1 (red line), if the hypothesis value is close to 1, the loss (cost function output) is close to zero. However, as the hypothesis value approaches zero, the cost becomes very high and grows toward infinity.
• For an actual label value of 0 (green line), if the hypothesis value is close to 1, the loss (cost function output) becomes very high and grows toward infinity. However, when the hypothesis value is close to zero, the cost is very low (close to zero).
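The two cases above can be checked with a few numbers. This is a small sketch; the helper name `log_loss_single` and the chosen probability values are assumptions for illustration.

```python
import numpy as np

def log_loss_single(h, y):
    # -log(h) when the label is 1, -log(1 - h) when the label is 0
    return -np.log(h) if y == 1 else -np.log(1 - h)

for h in (0.999, 0.5, 0.001):
    print(f"h={h}: loss(y=1)={log_loss_single(h, 1):.4f}, "
          f"loss(y=0)={log_loss_single(h, 0):.4f}")
```

For a hypothesis value of 0.999 the loss is about 0.001 when the label is 1 and about 6.9 when the label is 0, and the situation is mirrored for a hypothesis value of 0.001. The closer the hypothesis gets to the wrong extreme, the larger the loss becomes.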

Based on the above, the gradient descent algorithm can be applied to learn the parameters of logistic regression models, or of models that use softmax as the output activation function, such as neural networks.
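As a rough illustration of that idea, here is a minimal sketch of batch gradient descent for logistic regression with the mean cross-entropy loss. The toy data, the learning rate, the number of iterations, and the helper name `mean_cross_entropy` are all assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mean_cross_entropy(w, b, X, y):
    # Average of -[y * log(h) + (1 - y) * log(1 - h)] over all examples
    h = sigmoid(X @ w + b)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

# Hypothetical toy data: one feature, label is 1 when the feature is positive.
X = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
y = np.array([0, 0, 0, 1, 1, 1])

w = np.zeros(1)
b = 0.0
learning_rate = 0.1  # assumed step size
losses = []

for _ in range(200):
    h = sigmoid(X @ w + b)
    # Gradient of the mean cross-entropy loss with respect to w and b
    grad_w = X.T @ (h - y) / len(y)
    grad_b = np.mean(h - y)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
    losses.append(mean_cross_entropy(w, b, X, y))

print(losses[0], losses[-1])
```

The gradient has the simple form X.T @ (h - y) / n because the derivative of the sigmoid cancels against the derivative of the log, which is one reason this loss pairs well with sigmoid and softmax outputs.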

## Cross-entropy Loss Explained with Python Example

In this section, you will learn about cross-entropy loss using a Python code example. This is the function we need to represent as a Python function.

As per the above function, we need two functions: a cost function (the cross-entropy function) representing the equation in Fig 5, and a hypothesis function that outputs the probability. In this section, the sigmoid function is chosen as the hypothesis function. Here is the Python code for these two functions. Pay attention to the sigmoid function (hypothesis) and the cross-entropy loss function (cross_entropy_loss).

    import numpy as np
    import matplotlib.pyplot as plt


    def sigmoid(z):
        """Hypothesis function: the sigmoid (logistic) function."""
        return 1.0 / (1.0 + np.exp(-z))


    def cross_entropy_loss(yHat, y):
        """Cross-entropy (log) loss for a single class label.

        yHat: predicted probability, the output of the hypothesis / sigmoid
              function (a scalar or a NumPy array).
        y:    the actual label, a single scalar 0 or 1.
        """
        if y == 1:
            return -np.log(yHat)
        else:
            return -np.log(1 - yHat)
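The `cross_entropy_loss` function above expects a single scalar label. When labels come as an array, one possible variant is sketched below. The name `binary_cross_entropy` and the `eps` clipping parameter are assumptions for this sketch, not part of the original code. Clipping keeps `np.log` from receiving exactly 0.

```python
import numpy as np

def binary_cross_entropy(y_hat, y, eps=1e-12):
    # Clip predictions away from exactly 0 and 1 so np.log never receives 0.
    y_hat = np.clip(np.asarray(y_hat, dtype=float), eps, 1.0 - eps)
    y = np.asarray(y, dtype=float)
    # Element-wise loss: -[y * log(y_hat) + (1 - y) * log(1 - y_hat)]
    return -(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

# Three hypothetical predictions with their labels
per_example = binary_cross_entropy([0.9, 0.2, 1.0], [1, 0, 1])
mean_loss = per_example.mean()
print(per_example, mean_loss)
```

For each element, only one of the two terms is non-zero, so the result matches the y = 1 and y = 0 cases of the scalar version.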


Once we have these two functions, let’s create sample values of z (the weighted sum, as in logistic regression) and plot the cross-entropy loss, showing the cost function output against the hypothesis function output (probability value).

    # Sample values for z (the weighted sum of inputs)
    z = np.arange(-10, 10, 0.1)

    # Hypothesis value / predicted probability
    h_z = sigmoid(z)

    # Cost when y = 1: -log(h(x))
    cost_1 = cross_entropy_loss(h_z, 1)

    # Cost when y = 0: -log(1 - h(x))
    cost_0 = cross_entropy_loss(h_z, 0)

    # Plot the cross-entropy loss against the predicted probability
    fig, ax = plt.subplots(figsize=(8, 6))
    plt.plot(h_z, cost_1, label='J(w) if y=1')
    plt.plot(h_z, cost_0, label='J(w) if y=0')
    plt.xlabel(r'$\phi$(z)')  # raw string so that \p is not read as an escape sequence
    plt.ylabel('J(w)')
    plt.legend(loc='best')
    plt.tight_layout()
    plt.show()


Here is what the cross-entropy loss / log loss plot looks like:

Note the following in the above plot:

• For y = 1, if the predicted probability is near 1, the loss function output, J(w), is close to 0. As the predicted probability approaches 0, J(w) grows toward infinity.
• For y = 0, if the predicted probability is near 0, the loss function output, J(w), is close to 0. As the predicted probability approaches 1, J(w) grows toward infinity.

## Conclusions

Here is a summary of what you learned about the cross-entropy loss function:

• The cross-entropy loss function is used as the objective to estimate the parameters of logistic regression models or of models that have a softmax output.
• The cross-entropy loss function is also called the log loss function in the context of logistic regression, because it is the negative of the log likelihood function, which is what gets minimized.
• Cross-entropy loss is high when the predicted probability is far from the actual class label (0 or 1).
• Cross-entropy loss is low when the predicted probability is close to the actual class label (0 or 1).
• The gradient descent algorithm can be used with the cross-entropy loss function to estimate the model parameters.