Phish: A Novel Hyper-Optimizable Activation Function

Deep-learning models estimate values using backpropagation. The activation function within hidden layers is a critical component in minimizing loss in deep neural-networks. The rectified linear unit (ReLU) has been the dominant activation function for the past decade. Swish and Mish are newer activation functions that have been shown to yield better results than ReLU under specific circumstances. Phish is a novel activation function proposed here. It is a composite function defined as f(x) = xTanH(GELU(x)), where no discontinuities are apparent in the differentiated graph on the domain observed. Four generalized networks were constructed using Phish, Swish, Sigmoid, and TanH. SoftMax was the output function. Using images from the MNIST and CIFAR-10 databanks, these networks were trained to minimize sparse categorical crossentropy. A large-scale cross-validation was simulated using stochastic Markov chains to account for the law of large numbers for the probability values. Statistical tests support the research hypothesis that Phish could outperform other activation functions in classification. Future experiments would involve testing Phish in unsupervised learning algorithms and comparing it to more activation functions.


Introduction
Deep-learning algorithms are capable of solving complex problems. They use a series of synaptic weights and perceptrons to mimic the human thinking process. The success of training deep neural-networks (DNNs) relies heavily on the activation function used in them. In each perceptron, two phases occur: a summation and a transformation. In the summation, the inputs are multiplied with synaptic weights, which are initially generated at random, with a Hadamard product [12]. The transformation step consists of the summed vector, in addition to an optional bias, being passed through an activation function [11]. Early architectures used TanH and Sigmoid extensively. However, more complex DNNs required better activation functions.
The most commonly used activation function in DNNs is the rectified linear unit (ReLU) [2]. It is a less probability-inspired piecewise function that is itself continuous, but its derivative has a jump discontinuity due to the sharp turn at the origin. Experiments demonstrated that ReLU increased the performance of DNNs, outperforming TanH [1] and Sigmoid [8]. However, ReLU has some faults. One of the biggest is the dying ReLU issue, which leaky ReLU partially solved by augmenting the negative domain of the function [3].
Swish and Mish are newer activation functions that have recently gained traction [9]. They are both composite and comprise at least one existing activation function. Unlike ReLU, these functions are smooth, and their derivatives are void of discontinuities. They both increase without an upper bound for positive inputs and pass through the origin (0, 0). The new activation function created here follows the same design parameters as Swish and Mish [6].
A new activation function was constructed here. It is comedically named Phish. Phish is defined as f(x) = xTanH(GELU(x)). Phish is non-monotonic, unlike other activation functions where the slope is completely positive. On the interval [0, ∞), it is non-negative and increasing, and it passes through the origin (0, 0). Phish is used to estimate update gradients for backpropagating DNN algorithms.
An experimental simulation was conducted to compare Phish to existing functions. The levels of independent variable were Phish, Swish, Sigmoid, and TanH. There was no control. The dependent variable was the minimization of sparse categorical crossentropy (SCC), which is one of the most common loss functions in classification. Several control variables were held constant, such as the DNN layers, optimizer, output function, and learning rate.

Backpropagation and Update Gradients
Activation functions are designed to introduce non-linearity to the inherently linear data transformed from the input layer of a neural-network. Backpropagation is the process where each synaptic weight in a deep-learning algorithm is iteratively fine-tuned to complete a task, using the loss calculated between the expected and actual outcomes. Suppose there is a multilayer perceptron with weights and biases adjusted through an arbitrary activation function A(x). In this multilayer perceptron, as with most, the weights are defined by a simple matrix $W$ that represents the baseline values, which are usually randomly generated within a [-1, 1] interval. The less complex bias vector $b$ can be represented by a one-dimensional version of that matrix. In addition, the weighted sum (i.e., the value passed through the activation function) is $z = Wx + b$ for an input vector $x$. The weighted input can be obtained and passed through A(x) for the intermediate column vector $a = A(z)$, whose elements run up to the nth index. To calculate the update gradient, the rate of change in loss L must be determined. Theoretically, though impractically, this can be determined by calculating the slope between two datapoints separated by an infinitesimal distance. The error can be approximated by finding the instantaneous rate of change in loss (e.g., determining the partial derivative $\partial L / \partial z$ with respect to z). The calculated error can then be propagated to every weight in the neural-network. Using the weighted input, loss derivative, and activation function derivative, the update gradient can be calculated using basic algebra such that $\partial L / \partial W = \left(\partial L / \partial a \odot A'(z)\right) x^{\top}$, which is applied to the weights across many iterations. Due to space constraints, optimization and further analysis of partial derivatives have been omitted. As can be seen, the activation function and its derivative are critical in the training of deep neural-networks (DNNs) in supervised classification, or in unsupervised classification (e.g., discriminators in generative adversarial networks).
Substituting various activation functions can vastly alter the minimization of loss.
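The update-gradient calculation described above can be sketched for a single dense layer. This is a minimal illustration under stated assumptions, not the code used in the experiment: the function names (`layer_gradients`, `sgd_step`), the plain gradient-descent update, the learning rate, and the use of Sigmoid as the example activation are all choices made here for clarity.

```python
import math


def sigmoid(z):
    """Logistic Sigmoid, used here only as an example activation A(z)."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    """Derivative of the Sigmoid, A'(z) = A(z) * (1 - A(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


def layer_gradients(W, b, x, dL_da, act_prime):
    """Return (dL/dW, dL/db) for one dense layer a = A(Wx + b).

    W is a list of rows, b and x are lists, and dL_da is the loss
    derivative with respect to the layer output a.
    """
    # Weighted input: z_i = sum_j W[i][j] * x[j] + b[i]
    z = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
         for row, b_i in zip(W, b)]
    # Error term: delta_i = dL/da_i * A'(z_i)
    delta = [g * act_prime(z_i) for g, z_i in zip(dL_da, z)]
    # dL/dW[i][j] = delta_i * x_j and dL/db_i = delta_i
    grad_W = [[d * x_j for x_j in x] for d in delta]
    grad_b = list(delta)
    return grad_W, grad_b


def sgd_step(W, b, grad_W, grad_b, lr=0.1):
    """Apply one plain gradient-descent update (assumed optimizer)."""
    new_W = [[w - lr * g for w, g in zip(row, g_row)]
             for row, g_row in zip(W, grad_W)]
    new_b = [b_i - lr * g for b_i, g in zip(b, grad_b)]
    return new_W, new_b
```

Passing a different `act_prime` is the only change needed to see how another activation function alters the gradients, which is the substitution the comparison in this paper relies on.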

Derivation and Implementation
Much like Mish, Phish is a composite function. It comprises two existing activation functions, TanH and GELU. The inner function, GELU (footnote 1), is defined so as to approximate ReLU such that no discontinuities occur on the differentiated graph. ReLU is perhaps the most used activation function in DNNs. It has been shown to be effective in large-scale classification problems and is often used in image classification. The outer activation function, TanH (footnote 2), is defined by $\mathrm{TanH}(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$. Since Phish is expressed in terms of other equations and variables, the true form of the equation can be determined. Therefore, through substituting variables and rearranging the terms, the Phish equation in its purest form can be defined as $f(x) = x\,\mathrm{TanH}\left(\frac{x}{2}\left[1 + \mathrm{erf}\left(\frac{x}{\sqrt{2}}\right)\right]\right)$. Using the backpropagation equation derived above, the activation function A(x) can be substituted with any activation function to simulate the calculation of update gradients. Such gradients for Phish require its derivative. Update gradient calculation can be formulated by substituting the Phish derivative.
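The composite definition f(x) = xTanH(GELU(x)) can be sketched directly with the Python standard library. One assumption is made here: GELU is taken in its exact Gaussian-CDF form, written with `math.erf`, rather than the faster tanh-based approximation that some libraries use.

```python
import math


def gelu(x):
    """GELU(x) = x * Phi(x), with Phi the standard normal CDF (assumed exact form)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def phish(x):
    """Phish(x) = x * tanh(GELU(x))."""
    return x * math.tanh(gelu(x))


if __name__ == "__main__":
    # Print a few sample points on both sides of the origin.
    for value in (-3.0, -1.0, 0.0, 1.0, 3.0):
        print(value, phish(value))
```

For large positive inputs the TanH factor saturates toward 1, so Phish approaches the identity. For negative inputs both x and TanH(GELU(x)) are negative, so the output is a small positive value.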
Based on the assumption that $\mathrm{erf}(z) = \frac{2}{\sqrt{\pi}} \int_{0}^{z} e^{-t^{2}}\,dt$, where z is any complex number, the derivative can be calculated by substituting integrals, rearranging the terms, and applying the chain rule onto all sides.
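As a hedged check of the chain-rule step, the sketch below writes the derivative as f'(x) = TanH(GELU(x)) + x (1 − TanH²(GELU(x))) GELU'(x), with GELU'(x) = Φ(x) + xφ(x), and compares it against a central finite difference. The exact erf-based GELU and all helper names are assumptions of this sketch. The closed form printed in the original derivation may be arranged differently.

```python
import math


def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_prime(x):
    """GELU'(x) = Phi(x) + x * phi(x), where phi is the standard normal PDF."""
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return cdf + x * pdf


def phish(x):
    """Phish(x) = x * tanh(GELU(x))."""
    return x * math.tanh(gelu(x))


def phish_prime(x):
    """Product and chain rule: tanh(g) + x * (1 - tanh(g)^2) * g'(x)."""
    t = math.tanh(gelu(x))
    return t + x * (1.0 - t * t) * gelu_prime(x)


def central_difference(f, x, h=1e-5):
    """Numerical derivative used only to sanity-check the analytic form."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


if __name__ == "__main__":
    for value in (-3.0, -0.5, 0.0, 0.5, 3.0):
        print(value, phish_prime(value), central_difference(phish, value))
```

Under these assumptions the derivative is negative just left of the origin and positive to the right of it, which matches the description of Phish as non-monotonic.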

Evaluation
A simulation was conducted to compare activation functions. The levels of independent variable were Phish, Swish, Sigmoid, and TanH. Phish was the control. The minimization of loss was studied using DNNs.
1 GELU is an approximation of the ReLU activation function defined as $\mathrm{GELU}(x) = \frac{x}{2}\left[1 + \mathrm{erf}\left(\frac{x}{\sqrt{2}}\right)\right]$. The main purpose of such a function is to avoid the large jump discontinuity apparent in the derivative of ReLU, which occurs at the origin (0, 0) on the Cartesian coordinate system. The non-linear function seems to outperform ReLU and ELU in certain tasks in language processing and classification.
2 TanH is the hyperbolic analogue of the tangent function used throughout trigonometry. Similar in concept to Sigmoid, it has two horizontal asymptotes. However, these exist at y=±1, which indicates that the range is half negative. Therefore, the positive part of TanH exists only rightward of the origin (0, 0), which it crosses.
One Intel i7 computer was obtained. Python3 was installed onto the machine with machine-learning and linear algebra dependencies. For the procedure, 170,000 training and 50,000 testing images were gathered. The images were preprocessed via normalization and cropping. The preprocessing was kept limited in order to generalize the training process.
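The normalization and cropping step can be illustrated with a small sketch. The details are assumptions: pixel values are taken to be 8-bit integers scaled by 255, the crop is taken from the center, and the helper names are hypothetical. The original preprocessing parameters are not stated in the text.

```python
def normalize(image, max_value=255.0):
    """Scale pixel intensities into the [0, 1] range (assumed 8-bit input)."""
    return [[pixel / max_value for pixel in row] for row in image]


def center_crop(image, size):
    """Return the central size-by-size window of a 2D image (assumed crop style)."""
    height, width = len(image), len(image[0])
    top = (height - size) // 2
    left = (width - size) // 2
    return [row[left:left + size] for row in image[top:top + size]]


if __name__ == "__main__":
    # A tiny 4x4 grayscale example with values 0..15.
    sample = [[r * 4 + c for c in range(4)] for r in range(4)]
    print(normalize(center_crop(sample, 2), max_value=15.0))
```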
A generic neural-network was constructed for testing. It comprised an input layer, four hidden layers, and an output layer. The output layer always used SoftMax. The models were compiled with the Adam optimizer. Binary classification crossentropy loss was substituted with SCC.
In the SCC loss, w represents the arbitrary parameters of a given network, with the y values representing the predicted and true labels. This substitution was made so the network would assume that a correct classification can only be a single prediction. SoftMax was used for the output layer. It is a probability distribution function used in deep-learning for multi-class identification problems. It is $\sigma(z)_i = e^{z_i} / \sum_{j=1}^{K} e^{z_j}$, where z is the input vector to the output layer and K is the number of classes. During testing, K=10 was constant because each of the databanks used had ten possible labels. The levels of independent variable were tested. This was done twenty-five separate times for each activation function. The minimization of SCC was recorded.
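The output function and loss described above can be sketched as follows. This is an illustrative version, not the framework implementation used in the experiment: it assumes SCC is the mean negative log-probability assigned to the integer true label, and the max-subtraction inside `softmax` is a standard numerical-stability choice added here.

```python
import math


def softmax(logits):
    """Map K raw scores to K probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_categorical_crossentropy(true_labels, logit_rows):
    """Mean of -log(p[label]) over a batch; labels are integer class ids."""
    losses = []
    for label, logits in zip(true_labels, logit_rows):
        probabilities = softmax(logits)
        losses.append(-math.log(probabilities[label]))
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # With K = 10 equal scores, every class gets probability 0.1.
    uniform_logits = [0.0] * 10
    print(sparse_categorical_crossentropy([3], [uniform_logits]))
```

Because the labels are integers rather than one-hot vectors, only one probability per sample enters the loss, which matches the statement that a correct classification can only be a single prediction.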
This project was conducted on a laptop without graphics processing units or cloud servers. Therefore, a large-scale cross-validation was not reasonable. A memoryless stochastic model was more favorable for such a purpose. Thus a Markov chain was developed to simulate the process, which can be seen in the appendix. The effect of activation functions on minimizing the loss in classification for DNNs was tested. Various datasets were used to simulate classification backpropagation. Phish (red), Swish (green), Sigmoid (blue), and TanH (orange) can be seen in graph 1.
This particular graph shows the trend when training on MNIST fashion. The graph shows the average loss across epochs, calculated from twenty-five trials. Across the various epochs, it can be seen that Phish and Swish had a similar minimization of SCC. TanH and Sigmoid had a significantly lower reduction of loss compared to Swish and Phish. From the data collected, it could be inferred that similar patterns were apparent when the networks trained on the MNIST numbers and CIFAR-10 image databanks. Phish consistently outperformed TanH and Sigmoid. It was either on par with or slightly superior to Swish. The results of the experiment show that Phish is a promising alternative activation function.
This data table shows the compared levels of independent variable. Six independent parametric t-tests were calculated to determine the significance of the data collected. The significance level was 0.05, with 48 degrees of freedom. A table value of 2.011 was used. A null hypothesis was generated. It stated that there would be no difference between any of the tested activation functions when given the task of minimizing sparse categorical crossentropy.
Five of the six comparisons were significant. The Phish vs. Swish test was not significant, which showed that through the testing, both activation functions delivered similar results by the tenth epoch. This logic can be seen in the similarities in calculated values between Phish, Swish, and the other functions. Phish vs. TanH delivered the greatest difference in performance, with Phish on average having the lowest loss and TanH having the highest. The variance of the Swish and Sigmoid datasets was also noticeably higher than that of the other two functions.
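The testing procedure can be sketched as a pooled two-sample t-test applied to every pair of activation functions. The pooled-variance form is an assumption, chosen because two samples of twenty-five trials give the 48 degrees of freedom reported above. The helper names are hypothetical, and no loss values from the experiment are reproduced here.

```python
import math
from itertools import combinations

FUNCTIONS = ["Phish", "Swish", "Sigmoid", "TanH"]
PAIRS = list(combinations(FUNCTIONS, 2))  # the six pairwise comparisons
CRITICAL_T = 2.011  # table value quoted in the text (alpha = 0.05, df = 48)


def pooled_t_statistic(sample_a, sample_b):
    """Return (t, degrees_of_freedom) for an independent pooled t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a = sum(sample_a) / n_a
    mean_b = sum(sample_b) / n_b
    var_a = sum((v - mean_a) ** 2 for v in sample_a) / (n_a - 1)
    var_b = sum((v - mean_b) ** 2 for v in sample_b) / (n_b - 1)
    dof = n_a + n_b - 2
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / dof
    t = (mean_a - mean_b) / math.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    return t, dof


def is_significant(sample_a, sample_b, critical=CRITICAL_T):
    """Two-tailed decision: reject the null hypothesis when |t| exceeds the table value."""
    t, _ = pooled_t_statistic(sample_a, sample_b)
    return abs(t) > critical
```

Calling `is_significant` on the final-epoch losses of each pair in `PAIRS` would reproduce the structure of the six comparisons, assuming twenty-five losses per function.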

Procedural Flaws
There were many sources of error in the experimentation done to determine the properties of Phish. The first was that Phish was only compared to three other activation functions. Another flaw was that only one architecture was tested for classification, where many could have been tested. Other combinations of optimizers, metrics, losses, and layers may result in different findings.
In addition, a true simulation of the loss was never conducted. Only stochastic replications of limited simulations were analyzed. This was due to the limitations of devices in this research, as only a single laptop was available. While Markov chains can simulate probability, they cannot predict the change in probability.
To remedy these errors in the future, various types of classification algorithms could be tested using the activation functions. More functions could be compared as well. Lastly, better computers and cloud servers could be used to conduct the advanced simulations required to test Phish, which would otherwise be impractical on a laptop.

Future Applications
Future applications of the activation function proposed in this research may vary. The first application would be further testing on other types of datasets. MNIST and CIFAR-10 were used in this research. MNIST is a relatively simple dataset that most deep-learning models could solve [13]. CIFAR-10 consists of RGB images, which requires better models to solve [7]. Still, testing Phish on MNIST and CIFAR-10 only would limit knowledge of its properties. ImageNet is a public databank consisting of RGB images with an average resolution of 469×387 pixels [10]. It is organized according to the WordNet hierarchy, and is often used to test pretrained convolutional neural-networks.
Specialized layers in the networks used for testing were omitted throughout evaluation. Further testing could determine the effect of Phish on such models. Specific examples would include recurrent neural-networks. Gated recurrent unit and long short-term memory algorithms are extensions of recurrent neural-networks [5] that were engineered to mitigate the vanishing gradient problem [14]. When testing time series data, Phish could be implemented in these algorithms by substituting Sigmoid layers.
Another example of a future study would be utilizing Phish in generative adversarial networks. These algorithms comprise two models, often multilayer perceptrons, engaging in a minimax game. The first model is the generator, which captures the distribution of a given dataset. The second one is the discriminator, which differentiates between samples from the dataset and ones generated by the generator. Ideally, the loss of the discriminator would be maximized, with the accuracy yielding 1/2 everywhere. Testing Phish in a model with the purpose of maximizing loss would be an interesting future study [4].

Conclusion
Phish is a novel non-monotonic activation function. It delivered higher performance in MNIST and CIFAR-10 image classification than Sigmoid and TanH. It rivals Swish in loss minimization. For positive inputs, the function increases without an upper bound, and its derivative is positive there. Phish evaluates calculations that increase the speed of loss minimization. Unlike ReLU, Phish is fully differentiable. Future studies could involve training generative adversarial networks with Phish and examining the performance. This project was conducted under adult approval with antivirus software.

Markov Chains
A Markov chain is a stochastic graph model based in probability. Markov chains are favorable for simulating large networks of events because they are memoryless. Each chain yields a stochastic transition matrix (STM).
The Markov model on Ω results in the stochastic process $(X_0, X_1, X_2, \ldots, X_t)$, in which the transition between states x and y complies with the properties $P(X_{t+1} = y \mid X_t = x, X_{t-1}, \ldots, X_0) = P(X_{t+1} = y \mid X_t = x)$ and $P(X_{t+1} = y \mid X_t = x) = P(x, y)$. In addition, the STM exists with non-negativity
$\forall x, y \in \Omega,\; P(x, y) \geq 0 \quad (18)$
and stochasticity
$\sum_{y \in \Omega} P(x, y) = 1,\; \forall x \in \Omega \quad (19)$
where each row sums to 1.
This Markov model runs continuously: it has no termination node, that is, no node with a 100% probability of returning to itself and a 0% chance of transferring to any other stage. In addition, there is technically an appropriate start node in this graph. However, since this Markov model was run for extended periods of time, the law of large numbers states that the probability of the event occurring will be affected minimally by the first event, especially since there are only two possible stages in this model. Therefore, the chance of starting at either stage was always 50%. The Markov chain utilized here is a two-stage graph with four global locations and two local ones for each stage. A minimalistic representation of the adjacency matrix could be constructed accordingly with 4×4 dimensions. The probability values were guaranteed using the STM
$\omega_{ij} = \Omega = \begin{pmatrix} P_1(T) & P_2(T) \\ P_2(F) & P_1(F) \end{pmatrix} \quad (20)$
The Markov simulation was conducted with the values $P_1(T)$, $P_1(F)$, $P_2(T)$, and $P_2(F)$ retrieved from a cross-validation. For each activation function, a DNN was trained across ten epochs. The prediction ratios were implanted into four graphs. Each graph was simulated for 10,000 iterations twenty-five times to follow the ideal experimental design.
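The simulation loop can be sketched with a generic two-state chain. This is an illustration under stated assumptions: the transition probabilities in the `__main__` block are hypothetical placeholders, not values from the cross-validation, and the function names and fixed seed are choices made here. The 50% start probability, 10,000 iterations, and twenty-five repetitions follow the text.

```python
import random


def validate_stm(stm, tol=1e-9):
    """Check non-negativity and that every row of the matrix sums to 1."""
    for row in stm:
        if any(p < 0.0 for p in row):
            return False
        if abs(sum(row) - 1.0) > tol:
            return False
    return True


def simulate_chain(stm, steps, rng):
    """Run one chain and return the fraction of steps spent in each state."""
    n_states = len(stm)
    state = rng.randrange(n_states)  # uniform start: 50% each for two states
    visits = [0] * n_states
    for _ in range(steps):
        visits[state] += 1
        # The next state depends only on the current state (memoryless).
        state = rng.choices(range(n_states), weights=stm[state])[0]
    return [count / steps for count in visits]


def repeated_runs(stm, steps=10_000, trials=25, seed=0):
    """Repeat the simulation, mirroring 10,000 iterations run 25 times."""
    if not validate_stm(stm):
        raise ValueError("rows of the transition matrix must sum to 1")
    rng = random.Random(seed)
    return [simulate_chain(stm, steps, rng) for _ in range(trials)]


if __name__ == "__main__":
    hypothetical_stm = [[0.7, 0.3], [0.6, 0.4]]  # placeholder probabilities
    runs = repeated_runs(hypothetical_stm)
    print(sum(r[0] for r in runs) / len(runs))
```

Over long runs the visit fractions settle near the stationary distribution of the matrix regardless of the start state, which is the law-of-large-numbers argument made above.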

Deep Neural-Networks
A generalized model-building framework is ideal when testing narrow components of deep-learning models such as activation functions. This is to minimize the lurking/confounding variables that may be introduced via convolutional, pooling, and recurrent layers. Forget and memory gates were also omitted from the model for the same reasons.
The testing model comprised an initial flattening layer, six hidden layers, and one output layer. The flattening layer reshaped the image data into a one-dimensional array for the next layer. The six hidden layers used one of the four activation functions tested and contained between 32 and 128 neurons each. The output layer always had ten neurons, because MNIST and CIFAR-10 both have ten classes. It used SoftMax instead of Sigmoid, as the probability of classification was distributed between more than two classes. The models were trained using sparse categorical crossentropy loss and the Adam optimizer, which combines aspects of the previously engineered AdaGrad and RMSProp methods.
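The layer layout described above can be sketched as a forward pass in plain Python. Several details are assumptions made for illustration: the specific hidden widths (only the 32 to 128 range is given), the scaled uniform weight initialization, the exact erf-based GELU inside Phish, and all function names. Training with Adam is not shown.

```python
import math
import random

HIDDEN_WIDTHS = [128, 96, 64, 64, 32, 32]  # assumed; the text gives only the range
NUM_CLASSES = 10


def phish(x):
    """Phish(x) = x * tanh(GELU(x)), with the exact erf-based GELU (assumed)."""
    gelu = 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))
    return x * math.tanh(gelu)


def softmax(logits):
    """Numerically stable SoftMax over the output scores."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def flatten(image):
    """Turn a 2D image into the one-dimensional array fed to the first layer."""
    return [pixel for row in image for pixel in row]


def init_layer(n_in, n_out, rng):
    """Random weights in a scaled uniform range (assumed) and zero biases."""
    limit = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-limit, limit) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, weights, biases, activation=None):
    """Weighted sum plus bias, optionally passed through an activation."""
    z = [sum(w * v for w, v in zip(row, x)) + b
         for row, b in zip(weights, biases)]
    return [activation(v) for v in z] if activation else z


def build_model(n_inputs, rng):
    """Six hidden layers followed by one ten-neuron output layer."""
    sizes = [n_inputs] + HIDDEN_WIDTHS + [NUM_CLASSES]
    return [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]


def predict(model, image):
    """Flatten, apply the hidden layers with Phish, then SoftMax."""
    x = flatten(image)
    for weights, biases in model[:-1]:
        x = dense(x, weights, biases, phish)
    weights, biases = model[-1]
    return softmax(dense(x, weights, biases))
```

Swapping `phish` for another function in `predict` changes only the hidden-layer activation, which keeps the remaining components fixed in the way the comparison requires.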