Artificial Neural Network Activation Functions in Exact Analytical Form (Heaviside, ReLU, PReLU, ELU, SELU, ELiSH)

Abstract—Activation functions are fundamental elements of artificial neural networks. The mathematical formulation of some activation functions (e.g., the Heaviside function and the Rectified Linear Unit) is not expressed in an explicit closed form, which makes them numerically unstable and computationally complex to evaluate. This paper introduces a novel explicit analytic form for these activation functions. The proposed mathematical equations match the original definitions of the studied activation functions exactly, and they can be adapted better to the optimization, forward-propagation, and back-propagation algorithms employed in an artificial neural network.


I. INTRODUCTION
Activation functions are bio-inspired mathematical equations that represent the firing action potential in a neuron or node. They perform a nonlinear transformation and mapping of inputs from one layer to the next. Moreover, sets of neurons interconnected in an intricate manner form diverse kinds of neural networks. Remarkably, decisions are made at the nonlinearity of these functions, which most often lies at zero. This property of activation functions enables neural networks to learn from complex, higher-order data and to provide precise predictions and classifications. Their nonlinearity also affects the convergence of neural networks and plays a vital role in determining convergence speed and computational efficiency. Models of neural networks include, but are not limited to, the Feedforward Neural Network [1], the Radial Basis Function Neural Network [2], the Kohonen Self-Organizing Neural Network [3], the Recurrent Neural Network (RNN) with Long Short-Term Memory [4], the Convolutional Neural Network [5], and the Modular Neural Network. Accordingly, activation functions control the activation of neurons, and thus the output of a neural network, in complicated tasks such as handwritten character recognition [6], image processing [7], stock market prediction [8], signature verification [9], cancer detection [10], thunderstorm forecasting and weather prediction [11], predicting driving fatigue [12], self-driving-car steering-angle prediction [13], navigating a self-driving car [14], online advertising [15], improving chatbots in trading systems [16], and many more. Moreover, various activation functions have been employed in generating neural-based controllers to play a video game [17].
Researchers have raised questions about the characteristics required of continuous nonlinear activation functions acting in the hidden layer of a neural network, and have examined whether such functions can belong to a Banach space [18]. Moreover, tuning the parameters of an activation function in a neural network can yield ineffective learning, and training is difficult in the presence of non-differentiable activation functions, as in threshold networks [19].
That is why researchers have introduced various approximations of activation functions. Mostafa et al. approximated the activation function for vector equalization with recurrent neural networks by a sum of shifted hyperbolic tangent functions [20]. Timmons et al. used function-approximation techniques as a replacement in networks using hyperbolic tangent and sigmoid activation functions; their results show an improvement in training on the CPU of 10% to 37% [21]. In other research, two properties are emphasized as good practice in approximating an activation function: good approximation of the binary threshold and of polynomials [22]. In another work, the authors considered complex-valued neural networks in which non-parametric activation functions are designed to be flexible in the complex domain, so that their degrees of freedom can adapt the function's shape to the training data [23]. Ohn et al. derived the width, depth, and sparsity essential for a deep neural network to estimate general activation functions in both regression and classification problems up to a certain approximation error [24]. More specifically, the Heaviside function has been expressed as a summation of two inverse trigonometric functions [25]. Furthermore, an appropriate smoothness class has been introduced for the approximation properties of networks with rectified-linear-unit activation, which are not positive definite [26].
This paper introduces mathematical formulas that match the exact representation of several activation functions, including the Heaviside or step function, the Rectified Linear Unit (ReLU), the Parametric Rectified Linear Unit (PReLU) and thus the Leaky Rectified Linear Unit (Leaky ReLU), the Exponential Linear Unit (ELU), the Scaled Exponential Linear Unit (SELU), and the Exponential Linear Squashing (ELiSH) function.
The remainder of this paper is organized as follows. Section 2 presents the exact, well-defined analytic forms of traditional and widely used activation functions. Section 3 introduces the studied activation functions and their formulated exact mathematical equations; the Matlab code of some plotted functions accompanies the newly formulated equations. Finally, Section 4 concludes the paper.

II. DEFINED ANALYTIC EXPRESSIONS OF ACTIVATION FUNCTIONS
Traditionally, the sigmoid function and the hyperbolic tangent are widely used as differentiable activation functions in feedforward neural networks. The sigmoid (logistic) activation function is defined in (1) [27], and a plot is shown in Fig. 1:

σ(z) = 1 / (1 + e^(−z))    (1)

The sigmoid function is computationally expensive, converges slowly, and its outputs are not zero-centered. The hyperbolic tangent (tanh) activation function [28] is defined in (2):

tanh(z) = (e^z − e^(−z)) / (e^z + e^(−z))    (2)

The hyperbolic tangent is a shifted and scaled version of the sigmoid function that is zero-centered, as shown in Fig. 2. The saturation at the output boundaries of these two nonlinear activation functions prevents the neural network from updating weights during back-propagation learning, and thus the vanishing-gradient problem appears.
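As an illustration of the saturation just described, the following Python sketch (an ad hoc stand-in for the Matlab plots referenced in the text; function names are ours) evaluates the sigmoid (1) and tanh (2) along with their derivatives:

```python
import math

# Sigmoid (1) and hyperbolic tangent (2), with their derivatives.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def d_tanh(z):
    return 1.0 - tanh(z) ** 2

# tanh is zero-centered; the sigmoid is not.
print(sigmoid(0.0), tanh(0.0))  # 0.5 0.0

# Far from zero both gradients collapse toward 0: the saturation
# behind the vanishing-gradient problem.
print(d_sigmoid(10.0), d_tanh(10.0))
```

Far from the origin both derivatives are vanishingly small, which is exactly the saturation that stalls weight updates during back-propagation.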

III. EXACT MATHEMATICAL FORMULATION OF OTHER SPECIAL ACTIVATION FUNCTIONS
In this section, each activation function is first defined as given in the literature. Then, the exact closed-form analytical formulas are expressed for the Heaviside, Rectified Linear Unit, Parametric Rectified Linear Unit, Exponential Linear Unit, and Scaled Exponential Linear Unit activation functions.

A. Heaviside Function
The Heaviside, or step, function is given [29] as in (3):

H(z) = 0 for z < 0, A for z ≥ 0    (3)

As seen from (3), the step function is zero for negative input values and some constant A for positive input values. If the peak-to-peak amplitude is set to one (A = 1), it is called the unit step function.
The step function is heavily used in operational calculus to solve the differential equations that model dynamic and engineering systems. The Heaviside exact closed-form analytical formula is derived as shown in (4):

H(z) = A · 0^((−z + √(z²))/2)    (4)

Again, A is the peak-to-peak amplitude of the step function and z ∈ ]−∞, +∞[. If A = 1, the plot of the step function matches its original definition exactly, as shown in Fig. 3. The derivative of the Heaviside function can then be derived as in (5):

H′(z) = 0    (5)

As can be observed in (5), the derivative of the Heaviside step function does not yield a Dirac delta function at z = 0. This issue will be discussed in a further investigation. The unilateral Laplace transform of the formulated step function remains 1/s, where s is the complex-frequency parameter. The step function can be shifted along the vertical y-axis by subtracting a parameter denoted B, as given in (6):

H(z) = A · 0^((−z + √(z²))/2) − B    (6)

An example is shown in Fig. 4, where the peak-to-peak amplitude is A = 2 and the shift is B = 1. Alternatively, the Heaviside function can be shifted along the horizontal x-axis by a parameter denoted C, as in (7):

H(z) = A · 0^((−(z − C) + √((z − C)²))/2)    (7)

Fig. 5 illustrates a unit step function (A = 1) shifted on the x-axis by C = 2. Note that this shift mechanism is applicable to all the derived formulas of the activation functions that follow.
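As a numerical sanity check, the zero-power closed form reconstructed here for (4), H(z) = A · 0^((−z + √(z²))/2), can be compared against the piecewise definition. The Python sketch below is illustrative (function names are ad hoc); it relies on the conventions 0^0 = 1 and 0^p = 0 for p > 0, and it uses test points that are exactly representable in binary floating point so that √(z²) introduces no rounding into the exponent:

```python
def heaviside_exact(z, A=1.0):
    # Closed form: A * 0 ** ((-z + sqrt(z^2)) / 2).
    # For z >= 0 the exponent is 0 and 0.0 ** 0.0 evaluates to 1;
    # for z < 0 the exponent is positive and 0 ** positive is 0,
    # so the step is reproduced without any branching.
    return A * 0.0 ** ((-z + (z * z) ** 0.5) / 2.0)

def heaviside_piecewise(z, A=1.0):
    return A if z >= 0 else 0.0

# Test points exactly representable in binary floating point.
for z in [-3.0, -0.5, 0.0, 0.5, 3.0]:
    assert heaviside_exact(z) == heaviside_piecewise(z)
print("closed form matches the piecewise step")
```

Note that with arbitrary floating-point inputs, rounding in √(z²) can perturb the exponent away from exactly zero, so care is needed in a numerical implementation.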

B. Rectified Linear Unit (ReLU)
The ReLU is defined mathematically [30] as a combination of two linear parts, as in (8):

ReLU(z) = 0 for z < 0, z for z ≥ 0, i.e., ReLU(z) = max(0, z)    (8)

The linear part of ReLU is easier to optimize with gradient methods. In fact, ReLU overcomes the vanishing-gradient problem because of its linearity on part of its domain. On the other hand, it retains the nonlinearity that allows efficient learning of complex relationships within the data; its unbounded linear part, however, can lead to an exploding gradient. Moreover, ReLU simplifies the model by producing outputs exactly equal to zero rather than merely approaching zero as in the sigmoid function; this reduces computational complexity and gives the model representational sparsity. ReLU is used in many interesting applications, such as switching linear encoding with designed rectified linear autoencoder models [31].
In (9), the ReLU exact closed-form analytical formulation is presented in an elegant formula:

f(z) = A · z · 0^((−z + √(z²))/2)    (9)

where A is the slope of the ramp part of ReLU. With A fixed to one, the plot of ReLU is as in Fig. 6. The derivative of the ReLU function can then be obtained as (10), which matches the unit step function:

f′(z) = A · 0^((−z + √(z²))/2)    (10)

Negative values substituted in (10) yield strictly zero, which strongly affects the vanishing-gradient problem. However, the derivative of ReLU at zero is now defined. The unilateral Laplace transform of the formulated ReLU function in (9) remains 1/s².
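Under the same zero-power conventions, the closed forms reconstructed here for (9) and (10) can be checked numerically. This Python sketch is illustrative only (it is not the paper's Matlab code, and the names are ad hoc):

```python
def relu_exact(z, A=1.0):
    # Closed form (9): the same zero-power gate as the Heaviside
    # formula, multiplied by the ramp A * z.
    return A * z * 0.0 ** ((-z + (z * z) ** 0.5) / 2.0)

def relu_derivative_exact(z, A=1.0):
    # Closed form (10): the derivative collapses to the step function.
    return A * 0.0 ** ((-z + (z * z) ** 0.5) / 2.0)

# Exactly representable test points avoid rounding in sqrt(z^2).
for z in [-4.0, -0.5, 0.0, 0.5, 4.0]:
    assert relu_exact(z) == max(0.0, z)

# Unlike the piecewise max(0, z), the derivative is defined at z = 0.
print(relu_derivative_exact(-2.0), relu_derivative_exact(0.0),
      relu_derivative_exact(2.0))
```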

C. Parametric Rectified Linear Unit (PReLU)
The PReLU [32] is a modified version of ReLU that enables backpropagation for negative input values. Mathematically, it is given as in (11):

PReLU(z) = αz for z < 0, z for z ≥ 0    (11)

where α is the slope of the negative part of the function; α can be optimized during learning through backpropagation. If α is instead fixed to a certain value, say 0.01, the function is called Leaky ReLU [33], which does not always provide accurate predictions for negative input values.
The PReLU exact closed-form analytical formulation is given in (12) and sketched in Fig. 7 with α = 0.2:

f(z) = z · 0^((−z + √(z²))/2) + αz · 0^((z + √(z²))/2)    (12)
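A brief numerical check of the gated closed form used here for (12), with the same α = 0.2 as in Fig. 7 (an illustrative Python sketch with ad hoc names):

```python
ALPHA = 0.2  # negative-part slope used for Fig. 7 in the text

def prelu_exact(z, alpha=ALPHA):
    # Gated closed form: the first gate keeps z for z >= 0, the second
    # gate keeps alpha * z for z <= 0; each vanishes on the opposite
    # side of the origin.
    pos = 0.0 ** ((-z + (z * z) ** 0.5) / 2.0)
    neg = 0.0 ** ((z + (z * z) ** 0.5) / 2.0)
    return z * pos + alpha * z * neg

def prelu_piecewise(z, alpha=ALPHA):
    return z if z >= 0 else alpha * z

for z in [-5.0, -1.0, 0.0, 1.0, 5.0]:
    assert prelu_exact(z) == prelu_piecewise(z)
print("PReLU closed form verified")
```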

D. Exponential Linear Unit (ELU)
The ELU [34] is defined as given in (14):

ELU(z) = α(e^z − 1) for z ≤ 0, z for z > 0    (14)

The advantage of ELU over ReLU is its ability to circumvent overfitting during training. In addition, it can decrease the bias-shift effect observed with ReLU. The ELU exact closed-form analytical formulation is then derived as in (15):

f(z) = z · 0^((−z + √(z²))/2) + α(e^z − 1) · 0^((z + √(z²))/2)    (15)

Fig. 8. ELU activation function
The derivative of ELU is obtained as in (16):

f′(z) = 0^((−z + √(z²))/2) + αe^z · 0^((z + √(z²))/2)    (16)

and the second derivative of ELU is shown in (17):

f″(z) = αe^z · 0^((z + √(z²))/2)    (17)
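The gated closed form used here for ELU, f(z) = z · 0^((−z + √(z²))/2) + α(e^z − 1) · 0^((z + √(z²))/2), can also be checked numerically against the piecewise definition (14). The choice α = 1 in this Python sketch is illustrative, and the names are ad hoc:

```python
import math

ALPHA = 1.0  # illustrative choice of the ELU scale parameter

def elu_exact(z, alpha=ALPHA):
    # Closed form: the positive gate passes z, the negative gate
    # passes alpha * (e^z - 1).
    pos = 0.0 ** ((-z + (z * z) ** 0.5) / 2.0)
    neg = 0.0 ** ((z + (z * z) ** 0.5) / 2.0)
    return z * pos + alpha * (math.exp(z) - 1.0) * neg

def elu_piecewise(z, alpha=ALPHA):
    return z if z > 0 else alpha * (math.exp(z) - 1.0)

for z in [-3.0, -0.5, 0.0, 0.5, 3.0]:
    assert abs(elu_exact(z) - elu_piecewise(z)) < 1e-12
print("ELU closed form verified")
```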

E. Other Activation Functions
At this point, other activation functions can be derived easily from the previous ones. For instance, the Scaled Exponential Linear Unit (SELU) [35] activation function is known for its internal normalization, and it cannot die the way ReLU can. Its mathematical definition, alongside its closed analytical form, is given in (18):

SELU(z) = λ · ELU(z) = λ[z · 0^((−z + √(z²))/2) + α(e^z − 1) · 0^((z + √(z²))/2)]    (18)

where λ ≈ 1.0507 and α ≈ 1.6733.
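A numerical check of the scaled form used here for (18), with the standard SELU constants λ ≈ 1.0507 and α ≈ 1.6733 (an illustrative Python sketch with ad hoc names):

```python
import math

# Standard SELU constants, quoted to four decimals here.
LAMBDA = 1.0507
ALPHA = 1.6733

def selu_exact(z):
    # The ELU gated closed form, scaled by lambda.
    pos = 0.0 ** ((-z + (z * z) ** 0.5) / 2.0)
    neg = 0.0 ** ((z + (z * z) ** 0.5) / 2.0)
    return LAMBDA * (z * pos + ALPHA * (math.exp(z) - 1.0) * neg)

def selu_piecewise(z):
    return LAMBDA * (z if z > 0 else ALPHA * (math.exp(z) - 1.0))

for z in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    assert abs(selu_exact(z) - selu_piecewise(z)) < 1e-12
print("SELU closed form verified")
```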
Another interesting activation function is the Exponential Linear Squashing (ELiSH) [36] activation function. Equation (19) presents the mathematical definition of ELiSH along with its exact form:

ELiSH(z) = z/(1 + e^(−z)) for z ≥ 0, (e^z − 1)/(1 + e^(−z)) for z < 0
         = [z · 0^((−z + √(z²))/2) + (e^z − 1) · 0^((z + √(z²))/2)] / (1 + e^(−z))    (19)
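Finally, the gated form used here for ELiSH can be compared with its piecewise definition (an illustrative Python sketch; names are ad hoc):

```python
import math

def elish_exact(z):
    # Gated numerator (z for z >= 0, e^z - 1 for z < 0) squashed by
    # the sigmoid 1 / (1 + e^-z).
    pos = 0.0 ** ((-z + (z * z) ** 0.5) / 2.0)
    neg = 0.0 ** ((z + (z * z) ** 0.5) / 2.0)
    return (z * pos + (math.exp(z) - 1.0) * neg) / (1.0 + math.exp(-z))

def elish_piecewise(z):
    sig = 1.0 / (1.0 + math.exp(-z))
    return z * sig if z >= 0 else (math.exp(z) - 1.0) * sig

for z in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    assert abs(elish_exact(z) - elish_piecewise(z)) < 1e-12
print("ELiSH closed form verified")
```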

IV. CONCLUSION
In this paper, we introduced a novel explicit analytic form for several activation functions (Heaviside, ReLU, PReLU, ELU, SELU, ELiSH). The derivative of the Heaviside function is shown to be zero over its entire domain of definition. The derivative of ReLU has also been presented, along with the unilateral Laplace transforms of both formulated functions. The proposed explicit formulas would simplify the algorithms within trained neural networks and help achieve higher accuracies. Namely, the derived formulas can speed up the evaluation of a neural network, scale to the diverse central processing units on the market, and support building quicker arithmetic logic units. Besides, they can be used easily and efficiently in analog applications. The results of this paper may find remarkable applications in several areas, such as operational calculus, engineering practice, and science, and more specifically in different types of neural networks, including Convolutional, Recursive, Recurrent, and Modular Neural Networks.
The proposed functions will be studied in detail in future work; their derivations and limitations will be presented along with their applications in various fields of study.