# The 10 Most Common Activation Functions for Deep Learning (Mathematical Principles, Advantages, and Disadvantages)

An activation function is an important part of a neural network model. In this article, author Sukanya Bag explains the advantages and disadvantages of ten activation functions, starting from their mathematical principles.

Activation functions are functions added to an artificial neural network to help the network learn complex patterns in the data. By analogy with the neuron model of the human brain, the activation function ultimately determines what is fired on to the next neuron. In an artificial neural network, the activation function of a node defines the output of that node for a given input or set of inputs. A standard computer-chip circuit can be thought of as a digital network of activation functions that switch on (1) or off (0) depending on the input. The activation function is therefore the mathematical equation that determines the output of a neural network. This article summarizes ten activation functions commonly used in deep learning, along with their advantages and disadvantages.

First, let's take a look at how an artificial neuron works. Roughly: each input is multiplied by its weight, the weighted inputs are summed together with a bias term, and the result is passed through an activation function to produce the neuron's output.
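The forward pass described above can be sketched in a few lines of NumPy. This is a minimal illustration; the input, weight, and bias values are made up, and the step function stands in for the on/off digital-circuit analogy:

```python
import numpy as np

def neuron(x, w, b, activation):
    # One artificial neuron: weighted sum of inputs plus a bias,
    # passed through an activation function
    z = np.dot(w, x) + b
    return activation(z)

# Step activation: fire (1) or not (0), like the on/off circuit analogy
step = lambda z: 1.0 if z >= 0 else 0.0

x = np.array([0.5, -0.2, 0.1])   # inputs (illustrative values)
w = np.array([0.4, 0.3, 0.9])    # weights
b = -0.1                         # bias

print(neuron(x, w, b, step))     # -> 1.0  (z = 0.13 >= 0)
```

Each activation function discussed below simply takes the place of `step` here.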
## 1. Sigmoid activation function

The graph of the Sigmoid function looks like an S-shaped curve. The function is:

f(x) = 1 / (1 + exp(-x))

Under what circumstances is the Sigmoid activation function appropriate?

- Its output range is 0 to 1. Since the output is limited to this range, it normalizes the output of each neuron.
- It suits models that output a predicted probability: since probabilities also range from 0 to 1, Sigmoid is a natural fit.
- The gradient is smooth, avoiding "jumping" output values.
- The function is differentiable, so the slope of the Sigmoid curve can be found at any point.
- Predictions are clear, i.e. very close to 1 or 0.

What are the disadvantages of the Sigmoid activation function?

- The gradient tends to vanish for inputs far from zero.
- The output is not zero-centered, which reduces the efficiency of weight updates.
- It performs exponential operations, which are relatively slow to compute.

## 2. Tanh / hyperbolic tangent activation function

The graph of the Tanh activation function is also S-shaped. The function is:

f(x) = tanh(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))

The curves of tanh and Sigmoid are quite similar, and they share a drawback: when the input is very large or very small, the output is almost flat and the gradient is small, which is not conducive to weight updates. Tanh nevertheless has some advantages over Sigmoid, and the main difference between the two is the output interval.
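Those output intervals are easy to verify numerically; a quick sketch with NumPy:

```python
import numpy as np

def sigmoid(x):
    # f(x) = 1 / (1 + exp(-x)), output in (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(x))   # squeezed into (0, 1), centered around 0.5
print(np.tanh(x))   # squeezed into (-1, 1), centered around 0
```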
Tanh's output interval is (-1, 1), and the whole function is zero-centered, which is better than Sigmoid: strongly negative inputs are mapped to strongly negative outputs, and zero inputs are mapped near zero.

Note: In typical binary classification problems, tanh is used for the hidden layers and Sigmoid for the output layer, but this is not fixed and should be adjusted for the specific problem.

## 3. ReLU activation function

The ReLU (Rectified Linear Unit) function is:

f(x) = max(0, x)

ReLU is a popular activation function in deep learning. Compared with the Sigmoid and tanh functions, it has the following advantages:

- When the input is positive, there is no gradient saturation problem.
- Computation is much faster. ReLU involves only a simple linear relationship, so it is cheaper to compute than Sigmoid or tanh.

Of course, there are also drawbacks:

- The Dead ReLU problem. When the input is negative, ReLU fails completely. This is tolerable during forward propagation: some regions of the network are simply insensitive. During backpropagation, however, a negative input yields a gradient of exactly zero, so the corresponding weights are never updated.
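This zero-gradient behavior for negative inputs is easy to see; a minimal sketch of ReLU and its derivative in NumPy:

```python
import numpy as np

def relu(x):
    # ReLU: f(x) = max(0, x)
    return np.maximum(0.0, x)

def relu_grad(x):
    # Derivative of ReLU: 1 for positive inputs, 0 for negative inputs
    return np.where(x > 0, 1.0, 0.0)

x = np.array([-3.0, -0.5, 0.5, 3.0])
print(relu(x))       # negative inputs are clamped to 0
print(relu_grad(x))  # gradient is 0 for x < 0: those weights never update
```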
The Sigmoid and tanh functions suffer from the same saturation problem. We also find that the output of the ReLU function is either 0 or positive, which means that ReLU is not a zero-centered function.

## 4. Leaky ReLU

Leaky ReLU is an activation function specifically designed to solve the Dead ReLU problem:

f(x) = max(0.01x, x)

ReLU vs. Leaky ReLU: why is Leaky ReLU better than ReLU?

- Leaky ReLU fixes the zero-gradient problem for negative inputs by giving them a very small linear component of x (0.01x).
- The leak expands the range of the ReLU function; the leak coefficient is usually around 0.01.
- Leaky ReLU's range is (-infinity, +infinity).

Note: Theoretically, Leaky ReLU has all the advantages of ReLU and none of the Dead ReLU problem, but in practice it has not been fully demonstrated that Leaky ReLU is always better than ReLU.

## 5. ELU

ELU (Exponential Linear Unit) also addresses the Dead ReLU issue:

f(x) = x if x > 0; f(x) = a (exp(x) - 1) if x <= 0

Compared with ReLU, ELU takes negative values, which brings the mean of the activations closer to zero. Mean activations close to zero make learning faster, because they bring the gradient closer to the natural gradient. ELU clearly has all the advantages of ReLU, and in addition:

- There is no Dead ReLU problem, and the mean of the output is close to 0, so it is approximately zero-centered.
- By reducing the bias-shift effect, ELU brings the normal gradient closer to the unit natural gradient, accelerating learning toward a zero mean.
- For small inputs, ELU saturates to a negative value, reducing the variation and information propagated forward.

One small problem is that ELU is more computationally intensive because of the exponential. As with Leaky ReLU, although ELU is better than ReLU in theory, there is currently insufficient practical evidence that it is always better.

## 6. PReLU (Parametric ReLU)

PReLU is yet another modified version of ReLU:

f(x_i) = x_i if x_i > 0; f(x_i) = a_i * x_i if x_i <= 0

The parameter a_i is usually a relatively small number between 0 and 1:

- If a_i = 0, f becomes ReLU.
- If a_i > 0 and fixed, f becomes Leaky ReLU.
- If a_i is a learnable parameter, f becomes PReLU.

PReLU has the following advantages:

- In the negative range, the slope of PReLU is small, which also avoids the Dead ReLU problem.
- Compared with ELU, PReLU is a linear operation in the negative range: even though the slope is small, it does not go to 0.

## 7. Softmax

Softmax is the activation function for multi-class classification problems, where class membership must be assigned over more than two class labels. For any real vector of length K, Softmax compresses it into a real vector of length K with values in the range (0, 1) whose elements sum to 1:

softmax(x_i) = exp(x_i) / sum_j exp(x_j)

Softmax differs from the ordinary max function: max outputs only the largest value, while Softmax ensures that smaller values receive a lower probability rather than being discarded outright. We can think of it as a probabilistic or "soft" version of the argmax function. Because the denominator of Softmax combines all factors of the original output values, the probabilities produced by Softmax are related to each other.

The main practical drawbacks of Softmax are that the exponentials are relatively expensive to compute, and that they can overflow numerically for large inputs; implementations therefore typically subtract the maximum input value before exponentiating.

## 8. Swish

y = x * sigmoid(x)

Swish's design was inspired by the use of the sigmoid function for gating in LSTMs and highway networks. Here the same value is used for the gate and for the gated input, which simplifies the gating mechanism and is called self-gating. The advantage of self-gating is that it requires only a simple scalar input, whereas normal gating requires multiple scalar inputs. This makes it easy for self-gated activation functions such as Swish to replace activation functions that take a single scalar as input (such as ReLU) without changing the hidden capacity or the number of parameters.

The main advantages of the Swish activation function are as follows:

- Being unbounded above helps prevent the gradient from gradually approaching zero and causing saturation during slow training. (At the same time, being bounded below is also advantageous, because bounded activation functions can have a strong regularizing effect and handle large negative inputs well.)
- The function is smooth everywhere, and smoothness plays an important role in optimization and generalization.

## 9. Maxout

In the Maxout layer, the activation is the maximum of several affine functions of the input, for example:

f(x) = max(w_1 · x + b_1, w_2 · x + b_2)

so a multilayer perceptron with only two Maxout nodes can fit any convex function. A single Maxout node can be interpreted as a piecewise linear (PWL) approximation of a real-valued convex function (one in which the line segment between any two points on the graph lies above the graph). Maxout can also be implemented for d-dimensional input vectors v. Suppose the two convex functions h_1(x) and h_2(x) are each approximated by a Maxout node; then their difference g(x) = h_1(x) - h_2(x) is a continuous PWL function, and any continuous function can be approximated by such a difference. Thus, a Maxout layer consisting of two Maxout nodes can approximate any continuous function arbitrarily well.

## 10. Softplus

The Softplus function:

f(x) = ln(1 + exp(x))

Its derivative is

f'(x) = exp(x) / (1 + exp(x)) = 1 / (1 + exp(-x)),

which is the logistic (sigmoid) function. Softplus is similar to ReLU but relatively smooth; like ReLU, it is one-sided, suppressing negative inputs toward zero. It has a wide acceptance range: (0, +infinity).

Original link: https://sukanyabag.medium.com/activation-functions-all-you-need-to-know-355a850d025e
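To close, the remaining activations above (Leaky ReLU, ELU, Swish, Softplus, Softmax, Maxout) are equally short to implement. A minimal NumPy sketch; the parameter defaults here (ELU's a = 1.0, Leaky ReLU's 0.01 slope) are common choices, not values prescribed by the article:

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    # Small linear leak for negative inputs instead of a hard zero
    return np.where(x > 0, x, slope * x)

def elu(x, alpha=1.0):
    # Linear for x > 0, saturates smoothly toward -alpha for x < 0
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def swish(x):
    # Self-gated: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def softplus(x):
    # Smooth approximation of ReLU: ln(1 + exp(x))
    return np.log1p(np.exp(x))

def softmax(x):
    # Subtract the max before exponentiating for numerical stability
    e = np.exp(x - np.max(x))
    return e / e.sum()

def maxout(x, w, b):
    # Maxout unit: the max over k affine functions of the input
    # w has shape (k, d), b has shape (k,)
    return np.max(w @ x + b)

x = np.array([-2.0, 0.0, 3.0])
print(leaky_relu(x))
print(elu(x))
print(softmax(x))  # probabilities in (0, 1) that sum to 1
```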