Activation functions are central to how neural networks work: they introduce the non-linearity that lets a network learn more than a linear mapping, and the choice of function affects how well the network trains. There are many activation functions to choose from, so it can be hard to know which one suits a given problem. In this blog post, we look at different activation functions and describe when each is typically used in different types of neural networks. If you are starting out in deep learning and want to learn about the different types of activation functions, you may want to bookmark this page for quick access in the future.
What are activation functions in neural networks?
In a neural network, an activation function is a mathematical function applied to a neuron's weighted input to produce the neuron's output. In other words, it decides whether, and how strongly, a neuron fires, that is, what signal it sends to the next layer in the network. The activation function is one of the key components of a neural network. The picture below represents an activation function. Note how the weighted sum of the inputs, plus the bias term, is fed into the activation function, which then computes the neuron's output.
There are many different activation functions that can be used, and the choice of activation function can have a significant impact on the performance of the neural network. The most common activation functions used in a neural network are the sigmoid function, the Tanh function, and the ReLU function. Each activation function has its own advantages and disadvantages which will be explained later in this post. In general, activation functions are chosen based on the specific problem that needs to be solved.
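To make the description above concrete, here is a minimal sketch of a single neuron in NumPy: it computes the weighted sum of the inputs plus the bias and passes the result through an activation function. The function names (`sigmoid`, `neuron_output`) and the input, weight, and bias values are illustrative assumptions, not taken from any particular library or dataset.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: maps any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus the bias, fed into the activation."""
    z = np.dot(weights, inputs) + bias
    return activation(z)

# Illustrative values only (assumed for this example).
x = np.array([0.5, -1.0, 2.0])   # inputs
w = np.array([0.4, 0.3, -0.2])   # weights
b = 0.1                          # bias

# Same neuron, two different activation functions.
print(neuron_output(x, w, b, sigmoid))
print(neuron_output(x, w, b, lambda z: z))  # identity activation
```

Swapping the `activation` argument is all it takes to try a different function, which is why the activation is usually treated as a pluggable part of a layer.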
Different types of activation functions in neural networks
Without further ado, let’s take a look at the animation which represents different types of activation functions:
Here is the list of different types of activation functions shown in the above animation:
- Identity function (used in Adaline, the Adaptive Linear Neuron): The identity function is a special case of an activation function where the output signal is equal to the input signal: f(x) = x for all x. In other words, it simply passes the input through unchanged. While this might not seem very exciting, the identity function is useful in certain situations. For example, if you want your neural network to output a continuous, unbounded value instead of a discrete one (as in regression), using the identity function in the output layer achieves this. Note that a network that uses only identity activations collapses into a single linear model, however many layers it has, so the identity function is rarely used in hidden layers.
- Sigmoid function: The Sigmoid function takes any real number as input and transforms it into an output between 0 and 1: f(x) = 1 / (1 + e^(-x)). Because it squashes the activation value into the range (0, 1), it is also called a squashing function. This output can be interpreted as a probability, which makes the Sigmoid function useful for classification tasks, and it is usually employed in the output layer when two possible states are required (e.g. binary classification). The Sigmoid function is smooth, monotonic, and differentiable everywhere, which means that it can be used with gradient-based training. For inputs near zero it behaves almost linearly; for inputs that are large in magnitude the output saturates near 0 or 1 and the gradient becomes very small, which can slow the training of deep networks (the vanishing gradient problem). A network with a single sigmoid unit (a single-layer perceptron with a sigmoid activation) is equivalent to a logistic regression model, which is why the function is also called the logistic function. The Sigmoid function is one of the most common activation functions and is used in many types of neural networks, including feedforward neural networks. The following plot represents the output of the sigmoid function vs input:
- Tanh function: The Tanh function, also referred to as the hyperbolic tangent function, is often used as an activation function in neural networks. Tanh is a nonlinear function that squashes a real-valued number to the range (-1, 1). Tanh is continuous, smooth, and differentiable. Its output range is symmetric about 0, so its outputs are zero-centered, which tends to make training easier than with the sigmoid function. The function outputs values close to -1 or 1 when the input is large in magnitude (positive or negative). At these saturated values the gradient is close to 0, so Tanh can still suffer from vanishing gradients in deep networks; near an input of 0, however, its gradient is close to 1, compared with a maximum of 0.25 for the sigmoid function. Tanh can also be thought of as a rescaled version of the sigmoid function: tanh(x) = 2·sigmoid(2x) - 1. The following plot represents the output of the Tanh function vs input:
- Softmax activation function: The Softmax function is often used in neural networks for multi-class classification. It takes a vector of scores and squashes them so that each output is between 0 and 1 and all the outputs sum to 1: softmax(x_i) = e^(x_i) / Σ_j e^(x_j). This makes it well suited to classification tasks, where each unit corresponds to a class and the outputs can be read as the probability that the input belongs to each class. Softmax functions are therefore typically used in the output layer of neural networks, and the class with the highest probability is output as the prediction. Softmax is rarely used as a general-purpose hidden-layer activation, although it does appear inside some networks, for example to normalize attention weights.
- ArcTan function (inverse tangent function): ArcTan is an inverse trigonometric function that can be used as an activation function in neural networks, although it is much less common than Sigmoid, Tanh, or ReLU. ArcTan takes a real number as an input and returns a real number between -π/2 and π/2. ArcTan is continuous, differentiable, and zero-centered, like Tanh. Its gradient, 1 / (1 + x^2), shrinks more slowly than that of Tanh as the input grows in magnitude, so it saturates more gradually, but it does still saturate for large inputs.
- ReLU (Rectified Linear Unit): ReLU is a piecewise linear function that returns the input if it is positive, and returns zero otherwise. It is defined as: f(x) = max(0, x), and has a range of [0, infinity). ReLU is not differentiable at zero; in practice this is rarely a problem, because implementations simply use a gradient of 0 or 1 at that point. ReLU is the most commonly used activation function because it is computationally efficient and, since its gradient is 1 for all positive inputs, has fewer issues with vanishing gradients than saturating functions such as Sigmoid and Tanh. As a result, networks that use ReLU are often easier to train and converge faster than networks that use the sigmoid activation function. ReLU is generally used in the hidden layers of a neural network, including feed-forward and convolutional neural networks. Its main drawback is the “dying ReLU” problem: a neuron whose input is negative for every training example outputs zero, receives zero gradient, and stops learning. The following plot represents the output of the ReLU function vs input:
- Leaky ReLU (Improved version of ReLU): Leaky ReLU is a variant of ReLU that was introduced by Maas et al. (2013). Unlike the standard ReLU, which sets all negative values to zero, Leaky ReLU multiplies negative inputs by a small fixed slope: f(x) = x if x > 0, αx otherwise, where α is a small constant such as 0.01. Because the gradient for negative inputs is α rather than zero, this reduces the “dying” ReLU problem, where neurons can become permanently inactive if their inputs are consistently negative. Leaky ReLU is therefore often seen as a more robust activation function, and it has been reported to improve the performance of neural networks in some tasks.
- Randomized ReLU: The Randomized ReLU function (RReLU) is a variant of Leaky ReLU in which the slope applied to negative inputs is sampled at random from a uniform distribution during training and fixed to a constant at test time. It can be used in place of the ReLU function in any neural network. The random slope acts as a form of regularization, and Randomized ReLU has been reported to outperform the standard ReLU function on some classification tasks, although the benefit depends on the dataset and architecture.
- Parametric ReLU: The Parametric ReLU function (PReLU) is a rectified linear unit with a parameterized slope α for negative inputs. It is given by: f(x)=max(0,x)+αmin(0,x), whereas the standard ReLU function is given by: f(x)=max(0,x). It has the same form as Leaky ReLU, except that α is learned during training together with the weights instead of being fixed in advance. The Parametric ReLU function was proposed by He et al. in their 2015 paper “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification”. In the paper, the authors found that the Parametric ReLU function improved the performance of deep neural networks on image classification tasks, at the cost of only a small number of extra learnable parameters. Note that well-known architectures such as AlexNet, ResNet, and DenseNet use the standard ReLU rather than the Parametric ReLU.
- Exponential ReLU: The Exponential ReLU function, more commonly known as the Exponential Linear Unit (ELU), is an activation function that has been reported to produce better results than the traditional ReLU function in several deep learning tasks. It is defined as: f(x) = α(exp(x) - 1) if x < 0, x if x >= 0, where α > 0 is a constant that is often set to 1. ELU has two primary advantages over the traditional ReLU function. First, it mitigates the “dying ReLU” problem, which occurs when a ReLU-activated neuron receives only negative inputs, outputs zero, and is then unable to recover because its gradient is also zero; ELU has a non-zero gradient for negative inputs. Second, ELU can output negative values (down to -α), which pushes the mean activation closer to zero and can lead to improved performance on some tasks. Despite these advantages, ELU does have a disadvantage: it is more computationally expensive than the traditional ReLU function because it evaluates an exponential for negative inputs. For this reason, it is worth considering whether ELU is the best choice for a given task before implementing it in a neural network.
- Soft Sign: The Soft Sign function is defined as: Softsign(x) = x / (1 + |x|). This function has a number of useful properties, which make it well suited for use as an activation function in a neural network. Firstly, the Soft Sign function is continuous and differentiable, which is important for the training of a neural network. Secondly, the Soft Sign function has a range of (-1, 1) and is zero-centered, which means that it can be used to model bipolar data. Finally, the Soft Sign function is cheap to compute because it does not involve an exponential. It resembles Tanh but approaches -1 and 1 more slowly, so it saturates more gradually. When a network is trained with maximum likelihood estimation (MLE), we look for the weights, not the activation function, that best fit the training data, and the output unit must produce valid probabilities. Because Soft Sign outputs lie in (-1, 1) rather than (0, 1), they cannot be read directly as probabilities; for a binary output unit that estimates Pr(y = +) and Pr(y = -), the output first needs to be rescaled to (0, 1), for example as (Softsign(x) + 1) / 2, or a Sigmoid output should be used instead.
- Inverse Square Root Unit (ISRU): The Inverse Square Root Unit is built on the inverse square root, that is, the reciprocal of the square root of its argument. It is defined as: ISRU(x) = x / sqrt(1 + αx^2), where α > 0 is a constant. ISRU is a monotonic function with a range of (-1/sqrt(α), 1/sqrt(α)). It is continuous and differentiable. The Inverse Square Root Unit function has several properties that make it suitable for use as an activation function in neural networks. In particular, the function is smooth, which suits gradient-based training. Additionally, the function is bounded, which keeps the activations from growing too large. Its shape is similar to that of Tanh, but it does not require evaluating an exponential.
- Square Non-linearity: The Square Non-linearity function (SQNL) is an activation function that uses the square of the input to build an S-shaped curve similar to Tanh. It is defined piecewise: f(x) = 1 if x > 2; x - x^2/4 if 0 <= x <= 2; x + x^2/4 if -2 <= x < 0; and -1 if x < -2. Its range is [-1, 1]. The function is non-linear, so it lets a network model relationships that a linear model cannot, and it is differentiable, so it works with gradient-based training. Because it needs only a multiplication rather than an exponential, it is cheap to compute.
- Bipolar ReLU: The Bipolar Rectified Linear Unit (BReLU) is a variant of ReLU in which the units of a layer alternate between two functions: half of them apply the standard ReLU, f(x) = max(0, x), and the other half apply its mirror image, f(x) = -max(0, -x) = min(0, x). Whereas the standard ReLU only produces outputs in [0, infinity), a Bipolar ReLU layer therefore produces both positive and negative outputs. This seemingly small change pushes the mean activation of the layer towards zero, which has been reported to help when training deep neural networks and to outperform the standard ReLU in some applications.
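- Soft Plus: Soft Plus is a function that is sometimes used as an activation function in neural networks. The function takes the form f(x) = ln(1 + e^x). It is a smooth approximation of ReLU, and its derivative is the Sigmoid function. The Soft Plus function has several desirable properties: it is differentiable everywhere, monotonic, and bounded below by 0, with a range of (0, infinity). These properties make Soft Plus a reasonable alternative to ReLU when a smooth activation is needed, at the cost of evaluating an exponential and a logarithm.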
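Several of the smooth functions in the list above can be sketched in a few lines of NumPy. The block below is only an illustration under the standard textbook definitions: sigmoid as 1 / (1 + e^(-x)), softmax as normalized exponentials, Softsign as x / (1 + |x|), and Soft Plus as ln(1 + e^x). The helper names are assumptions made for this example. It also evaluates the gradients of Sigmoid and Tanh to show how they behave for inputs that are large in magnitude.

```python
import numpy as np

def sigmoid(x):
    """Squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x)), at most 0.25."""
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2, equal to 1 at x = 0."""
    return 1.0 - np.tanh(x) ** 2

def softmax(scores):
    """Turns a 1-D vector of scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / np.sum(exps)

def softsign(x):
    """x / (1 + |x|), with range (-1, 1)."""
    return x / (1.0 + np.abs(x))

def softplus(x):
    """ln(1 + e^x), computed stably as log(e^0 + e^x)."""
    return np.logaddexp(0.0, x)

xs = np.array([-10.0, 0.0, 10.0])
print("sigmoid:      ", sigmoid(xs))
print("sigmoid grad: ", sigmoid_grad(xs))  # tiny at both ends, largest at 0
print("tanh grad:    ", tanh_grad(xs))     # tiny at both ends, 1 at 0
print("softmax:      ", softmax(np.array([1.0, 2.0, 3.0])))
print("softsign:     ", softsign(xs))
print("softplus:     ", softplus(xs))
```

The gradient rows show the saturation effect directly: both gradients are largest at an input of 0 and shrink towards zero as the input moves away from it, which is the source of the vanishing gradient problem in deep networks that use these functions.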
The following represents different variants of ReLU:
- Leaky ReLU
- Randomized ReLU
- Parametric ReLU
- Exponential linear unit
- Bipolar ReLU
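As a rough sketch of how some of these variants differ, the NumPy block below implements ReLU, Leaky ReLU, Parametric ReLU, Randomized ReLU, and the exponential linear unit. It rests on several assumptions: the function names are made up for this example; the default slope of 0.01 for Leaky ReLU and the sampling bounds of 1/8 and 1/3 for Randomized ReLU are illustrative choices; at test time the Randomized ReLU uses the midpoint of its bounds; and the exponential linear unit uses the form α(exp(x) - 1) for negative inputs. In a real network the Parametric ReLU slope would be learned by the optimizer; here it is simply passed in.

```python
import numpy as np

def relu(x):
    """Standard ReLU: max(0, x)."""
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    """Negative inputs are scaled by a small fixed slope alpha."""
    return np.where(x >= 0, x, alpha * x)

def prelu(x, alpha):
    """Same form as Leaky ReLU, but alpha is a learned parameter."""
    return np.maximum(0.0, x) + alpha * np.minimum(0.0, x)

def rrelu(x, lower=1/8, upper=1/3, training=True, rng=None):
    """Negative slope is sampled uniformly during training, fixed otherwise."""
    if training:
        rng = np.random.default_rng() if rng is None else rng
        slope = rng.uniform(lower, upper, size=np.shape(x))
    else:
        slope = (lower + upper) / 2.0  # assumed: midpoint of the bounds
    return np.where(x >= 0, x, slope * x)

def elu(x, alpha=1.0):
    """x for x >= 0, alpha * (exp(x) - 1) for x < 0."""
    # Clamp to non-positive values before expm1 to avoid overflow warnings.
    return np.where(x >= 0, x, alpha * np.expm1(np.minimum(x, 0.0)))

x = np.array([-2.0, -0.5, 0.0, 1.5])
print("relu:      ", relu(x))
print("leaky relu:", leaky_relu(x))
print("prelu:     ", prelu(x, alpha=0.25))
print("rrelu:     ", rrelu(x, rng=np.random.default_rng(0)))
print("elu:       ", elu(x))
```

All five functions agree for non-negative inputs; they differ only in what they do with negative inputs, which is exactly where the “dying ReLU” problem arises.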
Of the activation functions above, the most commonly used are the following:
- ReLU and its different variants