In this post, you will learn about the concepts of neural network back propagation algorithm along with Python examples. As a data scientist, it is very important to learn the concepts of back propagation algorithm if you want to get good at deep learning models. This is because back propagation algorithm is key to learning weights at different layers in the deep neural network.

## What’s Back Propagation Algorithm?

Back propagation algorithm represents the propagation of the gradients of outputs from each node (in each layer) on final output, in the backward direction right upto the input layer nodes. All that is achieved using back propagation algorithm is compute the gradients of weights and biases. Remember that the training of neural network is done using optimizers and NOT back propagation. The primary goal of learning in neural network is to determine how would the weights and biases in every layer would change to minimize the objective or cost function for each record in the training data set. Instead of determining the final output as a function of weights and biases of every layer and take the partial derivatives with respect to each weights ad biases to determine the gradients, back propagation makes it simpler to propagate the gradients in backward direction and help determine the gradients of weights and biases in every layer using chain rule.

The main idea behind calculating gradients in case of neural network with respect to cost function C is the following:

How do we change weights and biases in every layer (increases or decreases) such that neural network provides output that minimises the cost function?

This is where back propagation algorithm helps in determining direction in which each of the weights and biases need to change to minimise the cost function.

Let’s understand the back propagation algorithm using the following simplistic neural network with one input layer, one hidden layer and one output layer. Let’s take activation function as an identity function for the sake of understanding. In real world problems, the activation functions most commonly used are sigmoid function, ReLU or variants of ReLU functions and tanh function.

Lets understand the above neural network.

• There are three layers in the network – input, hidden and output layer
• There are two input variables (features) in the input layer, three nodes in the hidden layer and one node in the output layer
• The activation function of the network is applied on the weighted sum of inputs at each node to calculate activation value.
• The output from nodes in the hidden and output layer is derived from applying activation function on the weighted sum of inputs to each of the nodes in these layers.

Mathematically, above neural network can be represented as following:

For training the neural network using the dataset, the ask is to determine the optimal value of all the weights and biases denoted by w and b. And, the manner in which the optimal values are found is to optimize / minimize a loss function using the most optimal values of weights and biases. For regression problems, the most common loss function used is ordinary least square function (squared difference between observed value and network output value). For classification problems, the most common loss function used is cross-entropy loss function. For optimizing / minimizing the loss function, the gradient descent algorithm is applied on the loss function with respect to every weights and biases based on back propagation algorithm. The idea is to change or update the weights and biases for every layer in the manner that the loss function reduces after every iteration.

Back propagation algorithm helps in determining gradients of weights and biases with respect to final output value of the network. Once gradients are found, the weights and biases are updated based on different gradient techniques such as stochastic gradient descent. In stochastic gradient descent technique, weights are biases are updated after processing small batches of training data. It is also called as mini-batch gradient descent technique.

The training of neural network shown in the above diagram would mean learning the most optimal value of the following weights and biases in two different layers:

• $$\Large w^1_{11}, w^1_{12}, w^1_{21}, w^1_{22}, w^1_{31}, w^1_{32}, b_1$$ for the first layer
• $$\Large w^2_{11}, w^2_{12}, w^2_{13}, b_2$$ for the second layer.

The optimal values for the above mentioned weights and biases in different layers are learned based on their gradients (partial derivatives) and optimization technique such as stochastic gradient descent. The gradients of all the weights and biases with respect to final output is found based on the back propagation algorithm. Here is the list of gradients which is required to be determined with respect to the final output value for learning purpose. If the final output is C (representing cost function), then the gradients can be determined as the following:

$$\Large \frac{\partial C}{\partial w^1_{11}}, \frac{\partial C}{\partial w^1_{12}}, \frac{\partial C}{\partial w^1_{21}}, \frac{\partial C}{\partial w^1_{22}}, \frac{\partial C}{\partial w^1_{31}}, \frac{\partial C}{\partial w^1_{32}}, \frac{\partial C}{\partial b_{1}}$$

.

$$\Large \frac{\partial C}{\partial w^2_{11}}, \frac{\partial C}{\partial w^2_{12}}, \frac{\partial C}{\partial w^2_{13}}, \frac{\partial C}{\partial b_{2}}$$

.

Let’s see how back propagation algorithm can be used to determine all of the gradients.

$$\Large \frac{\partial C}{\partial w^2_{11}} = \frac{\partial C}{\partial a^3_1}\frac{\partial a^3_1}{\partial Z^3_1}\frac{\partial Z^3_1}{\partial w^2_{11}}$$

.

$$\Large \frac{\partial C}{\partial w^2_{12}} = \frac{\partial C}{\partial a^3_1}\frac{\partial a^3_1}{\partial Z^3_1}\frac{\partial Z^3_1}{\partial w^2_{12}}$$

.

$$\Large \frac{\partial C}{\partial w^2_{13}} = \frac{\partial C}{\partial a^3_1}\frac{\partial a^3_1}{\partial Z^3_1}\frac{\partial Z^3_1}{\partial w^2_{13}}$$

.

$$\Large \frac{\partial C}{\partial w^1_{11}} = \frac{\partial C}{\partial a^3_1}\frac{\partial a^3_1}{\partial Z^3_1}\frac{\partial Z^3_1}{\partial a^2_1}\frac{\partial a^2_1}{\partial Z^2_1}\frac{\partial Z^2_1}{\partial w^1_{11}}$$

.

$$\Large \frac{\partial C}{\partial w^1_{12}} = \frac{\partial C}{\partial a^3_1}\frac{\partial a^3_1}{\partial Z^3_1}\frac{\partial Z^3_1}{\partial a^2_1}\frac{\partial a^2_1}{\partial Z^2_1}\frac{\partial Z^2_1}{\partial w^1_{12}}$$

.

$$\Large \frac{\partial C}{\partial w^1_{21}} = \frac{\partial C}{\partial a^3_1}\frac{\partial a^3_1}{\partial Z^3_2}\frac{\partial Z^3_2}{\partial a^2_2}\frac{\partial a^2_2}{\partial Z^2_2}\frac{\partial Z^2_2}{\partial w^1_{21}}$$

.

$$\Large \frac{\partial C}{\partial w^1_{22}} = \frac{\partial C}{\partial a^3_1}\frac{\partial a^3_1}{\partial Z^3_2}\frac{\partial Z^3_2}{\partial a^2_2}\frac{\partial a^2_2}{\partial Z^2_2}\frac{\partial Z^2_2}{\partial w^1_{22}}$$

.

$$\Large \frac{\partial C}{\partial w^1_{31}} = \frac{\partial C}{\partial a^3_1}\frac{\partial a^3_1}{\partial Z^3_1}\frac{\partial Z^3_1}{\partial a^2_3}\frac{\partial a^2_3}{\partial Z^2_3}\frac{\partial Z^2_3}{\partial w^1_{31}}$$

.

$$\Large \frac{\partial C}{\partial w^1_{32}} = \frac{\partial C}{\partial a^3_1}\frac{\partial a^3_1}{\partial Z^3_1}\frac{\partial Z^3_1}{\partial a^2_3}\frac{\partial a^2_3}{\partial Z^2_3}\frac{\partial Z^2_3}{\partial w^1_{32}}$$

.

The above equations represents the aspect of how cost function C value will change by changing the respective weights in different layers. In other words, the above equations calculates gradients of weights and biases with respect to cost function value, C. Note how chain rule is applied while calculating gradients using back propagation algorithm.

You may want to check this post to get an access to some real good articles and videos on back propagation algorithmTop Tutorials – Neural Network Back Propagation Algorithm.

### Learning Weights & Biases using Back Propagation Algorithm

The equation below represents how weights & biases in specific layers are updated after the gradients are determined. Letter l is used to represent the weights of different layers

$$\large w^l = w^l – learningRate * \frac{\partial C}{\partial w^l}$$

.

$$\large b^l = b^l – learningRate * \frac{\partial C}{\partial b^l}$$

.

## Conclusions

Here is the summary of what you learned about neural network back propagation algorithm in this post.

• Back propagation algorithm is ONLY about calculating gradients of weights and biases with respect to final output value
• Back propagation algorithm is NOT about learning new weights. It is optimization function such as gradient descent techniques which are applied for learning optimal weights.
• Back propagation algorithm makes use of Chain Rule for calculating gradients.
• Back propagation algorithm is about taking partial derivatives of cost function with respect to each of the weights and biases in order to determine the gradients.