Data scientists know that activation functions are critical to understanding neural networks: without a nonlinear activation function, a multi-layer network collapses into a single linear transformation and cannot learn complex patterns. There are many activation functions to choose from, so it can be difficult to decide which one will work best for a given task. In this blog post, we look at different activation functions and provide examples of when they should be used in different types of neural networks. If you are starting out in deep learning and want to learn about the different types of activation functions, you may want to bookmark this page for quicker access in the future.
Without further ado, let’s take a look at the animation which represents different types of activation functions:
Here is the list of different types of activation functions shown in the above animation:
- Identity function (used in Adaline – Adaptive Linear Neuron): The identity function simply passes its input through unchanged, f(x) = x. It is a linear transformation; no nonlinear activation is performed.
- Sigmoid function: The sigmoid function squashes its input into the range between 0 and 1, which is why it is also called a squashing function. It is one of the most common activation functions used in neural networks. It takes any real number as input and outputs a value strictly between 0 and 1: f(x) = 1 / (1 + e^(-x)). It is usually employed when an output with two possible states is required (e.g. binary classification), and near zero it behaves approximately like a linear activation function. The sigmoid produces a smooth, monotonic activation curve. It models the activation of a logistic unit: a single-layer network (single-layer perceptron) with a sigmoid output is equivalent to a logistic regression model.
- Tanh function: The tanh (hyperbolic tangent) function is similar in shape to the sigmoid, but its range is shifted and scaled so that outputs run from -1 to 1 rather than 0 to 1. Because its outputs are zero-centered, tanh often trains faster than the sigmoid in hidden layers. However, like the sigmoid it saturates for large-magnitude inputs, so in deep networks it is often replaced these days by ReLU, which does not suffer from the same vanishing-gradient problem.
- Softmax activation function: The softmax function generalizes the sigmoid to multiple classes. Instead of squashing a single value into the range (0, 1), it maps a vector of n scores (where n is the number of classes) to n probabilities, each between 0 and 1, that sum to 1.
- ArcTan function (inverse tangent function)
- ReLU (Rectified Linear Unit): The rectified linear unit is one of the most widely used activation functions in deep learning models. It outputs zero for negative inputs and passes positive inputs through unchanged: f(x) = max(0, x). It is used in feed-forward neural networks and in convolutional neural networks, including networks with large output layers containing many neurons. It differs from the sigmoid activation function in that it does not saturate for positive inputs, which makes it easier to train and typically results in faster convergence.
- Leaky ReLU (improved version of ReLU): Leaky ReLU is a variant of ReLU introduced by Maas et al. (2013). Instead of outputting exactly zero for negative inputs, it allows a small, fixed fraction of the negative input to pass through (e.g. f(x) = 0.01x for x < 0). This keeps gradients flowing even for negative inputs, which makes gradient-based training more efficient and prevents the "dying ReLU" problem, where units get stuck at zero and stop updating.
- Randomized ReLU: Randomized ReLU (RReLU) is the same as Leaky ReLU except that the slope applied to negative inputs is sampled randomly from a uniform distribution during training (and fixed to the average of that distribution at test time). The added randomness acts as a regularizer and can reduce overfitting.
- Parametric ReLU: The Parametric Rectified Linear Unit (PReLU), introduced by He et al. (2015), makes the negative slope of Leaky ReLU a learnable parameter, trained jointly with the rest of the network's weights. It has been used in modern deep architectures trained on large-scale datasets, where it can improve accuracy at negligible extra cost. As with other ReLU variants, it is commonly combined with batch normalization layers, which further stabilize training and help mitigate the vanishing gradient problem.
- Binary (Perceptron)
- Exponential Linear Unit (ELU): The exponential linear unit behaves like ReLU for positive inputs but follows a smooth exponential curve, alpha * (e^x - 1), for negative inputs. The small negative outputs push mean activations closer to zero, which can speed up training and avoid some numerical instability problems that can arise during backpropagation.
- Soft Sign: The softsign function, f(x) = x / (1 + |x|), rescales activation values into a range between -1 and 1; it is smooth and symmetric about zero. It is similar in shape to tanh, but it approaches its asymptotes polynomially rather than exponentially, so it saturates more gently. It can map any input x, from -infinity to +infinity, to a continuous output, which makes it usable as a tanh alternative in hidden layers. Like tanh or sigmoid, its outputs can feed a final layer that produces probability estimates (e.g. Pr(y = +1) vs. Pr(y = -1)) in neural networks with binary output units.
- Inverse Square Root Unit (ISRU)
- Inverse Square Root Linear
- Square Non-linearity
- Bipolar ReLU: The bipolar ReLU applies the standard ReLU to half of the units in a layer and a mirrored (negated) ReLU to the other half, so that the mean activation of the layer stays close to zero. It is mostly used in research settings. Keeping activations zero-centered in this way helps gradient flow in deep networks, similar to the effect of tanh, while retaining ReLU's non-saturating behavior, which helps prevent the vanishing gradient problem.
- Soft Plus
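To make the shapes above concrete, here is a minimal NumPy sketch of several of the listed functions. The function names and the stability trick in softmax are our own choices for illustration, not from any particular library:

```python
import numpy as np

def identity(x):
    # Identity: no transformation (used in Adaline)
    return x

def sigmoid(x):
    # Squashes any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Squashes any real input into (-1, 1), zero-centered
    return np.tanh(x)

def softmax(x):
    # Maps a vector of n scores to n probabilities that sum to 1
    e = np.exp(x - np.max(x))  # subtract max for numerical stability
    return e / e.sum()

def relu(x):
    # Zero for negative inputs, identity for positive inputs
    return np.maximum(0.0, x)

def softsign(x):
    # Gentle squashing into (-1, 1): x / (1 + |x|)
    return x / (1.0 + np.abs(x))
```

Evaluating each function over a range of inputs (e.g. `np.linspace(-5, 5, 100)`) and plotting the results reproduces the curves shown in the animation.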
The following represents different variants of ReLU:
- Leaky ReLU
- Randomized ReLU
- Parametric ReLU
- Exponential linear unit
- Bipolar ReLU
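The ReLU variants listed above differ only in how they treat negative inputs. A hedged NumPy sketch (the slope bounds for randomized ReLU follow the commonly cited uniform range; all function names here are illustrative):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Small fixed slope alpha for negative inputs instead of zero
    return np.where(x > 0, x, alpha * x)

def parametric_relu(x, alpha):
    # Same shape as Leaky ReLU, but alpha is a learned parameter
    return np.where(x > 0, x, alpha * x)

def randomized_relu(x, lower=1/8, upper=1/3, training=True, rng=None):
    # Negative slope sampled uniformly per call during training;
    # the average slope is used at test time
    rng = rng if rng is not None else np.random.default_rng()
    alpha = rng.uniform(lower, upper) if training else (lower + upper) / 2
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    # Smooth exponential curve for negative inputs, saturating at -alpha
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))
```

For positive inputs all four are identical to plain ReLU; the choice between them comes down to whether the negative slope is fixed, learned, or randomized, or replaced by a smooth exponential.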
Of the activation functions above, the most commonly used are the following:
- ReLU and its different variants