The Role of Metaheuristic Algorithms in Weight Training and Architecture Evolution of Feedforward Neural Networks

Neural networks and metaheuristic algorithms are two machine-learning techniques, each employed for different purposes. A neural network (NN) is used for tasks such as classification and regression, whereas a metaheuristic algorithm is used to find the optima of a huge search space. Before a neural network can be used, it must be trained. During training, the weight of each connection is adjusted so that the total error (real output minus predicted output) becomes minimal. That is where stochastic search comes in, helping to find the best set of weights. Finding the weights of a neural network can therefore be interpreted as finding the optima of a vast search space. The focus of this paper is the use of metaheuristic algorithms for training and for evolving the structure of feedforward neural networks.

Population-based algorithms can be divided into two main groups. The first is evolutionary algorithms (EAs), which start with a population and evolve it. EAs are derived from the natural selection of Darwinian theory, in which only the fittest members of a population survive and reproduce. The other group is swarm intelligence (SI) based algorithms. SI originates from the individual and social behavior of animals (such as ants, bees, and wolves) acting in a group with a common target, such as finding food.
Use of metaheuristic algorithms for weight training and architecture design of feedforward artificial neural networks
Designing a neural network is usually the work of a human expert using trial and error together with backpropagation training. With the help of a metaheuristic, however, we can train a network with a higher rate of generalization thanks to its global search, and we can find a near-optimal architecture much faster than trial and error allows. Evolutionary algorithms are useful when no exact mathematical formulation of the problem exists; to date there is no mathematical theory that tells us the best structure for a neural network, but with a metaheuristic a near-optimal structure can be found. Therefore, the role of metaheuristics in designing NNs is: 1) weight training, and 2) architecture design. Metaheuristics have other uses for NNs as well, but this paper focuses on weight training and the evolution of architecture.

Evolutionary Algorithm
Evolutionary algorithms are a type of metaheuristic inspired by natural evolution. The main types of EA are evolution strategies (ES) [9], [10], evolutionary programming (EP) [11], [12], and genetic algorithms (GA) [13], [14]. These algorithms are useful for large, complex problems with many local minima. Because of their global search ability, they are less likely to be trapped in local minima, whereas methods based on gradient information can get stuck in them easily.
a. Evolutionary neural networks: Evolution in neural networks takes place along several dimensions, such as weight training, topology design, and learning rule design. Weight training is a process in which connection weights are changed adaptively until they converge to the global optimum. In the evolution of topology, the different possible architectures form the search space, and the role of the EA is to find the best architecture adaptively. The evolution of learning rules is defined as a process that learns the best learning method. An EA can thus be used at every stage, from defining the architecture to training the neural network.

A. The evolution of connection weights
Weight training is formulated as a minimization problem, where the fitness function is the error function and the inputs are the weights of each connection in the neural network. Most training algorithms are based on gradient descent, such as backpropagation (BP) and conjugate gradient [15]-[17], and there are numerous successful ANNs of this type [18], [19]. However, BP has some disadvantages due to its dependence on gradient descent: it cannot work in spaces where the gradient is unknown, and it is easily trapped in local minima because of its local search nature. An EA, on the other hand, needs no gradient information, and its global search makes it far less prone to being trapped in local minima [20], [21].
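This formulation can be made concrete with a small sketch: the weights of a one-hidden-layer network are packed into a flat vector, and the fitness function is simply the mean squared error over the training set. The layer sizes, tanh activation, and XOR data are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def mse_fitness(weights, X, y, n_hidden=3):
    """Mean squared error of a 1-hidden-layer FFNN whose weights
    are unpacked from a flat vector (illustrative sketch)."""
    n_in = X.shape[1]
    # Unpack: input->hidden matrix, then hidden->output vector.
    w1 = weights[:n_in * n_hidden].reshape(n_in, n_hidden)
    w2 = weights[n_in * n_hidden:n_in * n_hidden + n_hidden]
    hidden = np.tanh(X @ w1)
    pred = hidden @ w2
    return np.mean((y - pred) ** 2)

# Any metaheuristic can now minimize mse_fitness over the flat vector.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 1.0, 1.0, 0.0])  # XOR targets
w = np.zeros(2 * 3 + 3)
print(mse_fitness(w, X, y))  # all-zero weights predict 0 everywhere -> 0.5
```

Every training scheme discussed below (binary GA, real-valued EP/ES, PSO) is simply a different way of searching over this flat weight vector.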

Representation of the problem:
Like any other optimization task, solving the weight-training problem consists of two parts: first, the representation of the problem, i.e., whether the representation should be binary or real-valued; second, the search operators, such as mutation, crossover, and the selection algorithm. Different combinations of these two parts result in different algorithms.

Binary representation
This representation is usually used in genetic algorithms. Each weight of the network is encoded as a binary string of a certain length [20], [21]. The accuracy of this representation is directly related to the length of the string; however, increasing the number of bits results in slower computation, so there is a trade-off between accuracy and the convergence speed of the ANN. By concatenating the binary strings of all the weights, we obtain a chromosome of binary bits, and mutation and crossover can be performed on the chromosome population to find the set of weights with the lowest error. The advantages of binary representation are its simplicity and generality, the ease of performing crossover and mutation, and its speed, since weights are coded directly into bits. However, this representation does not behave well under crossover. The difficulty, known as the permutation problem [22], [23], arises when two bit strings with totally different structures but equal fitness are selected for crossover: the offspring can be quite meaningless, and the algorithm may get stuck in local minima.
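As a sketch of this encoding (the 8-bit length and the [-1, 1] weight range are illustrative assumptions), each weight can be quantized onto a fixed-length bit string and the strings concatenated into a chromosome; more bits mean a finer quantization step and slower search:

```python
def encode_weight(w, bits=8, lo=-1.0, hi=1.0):
    """Quantize a real weight into a fixed-length bit string."""
    step = (hi - lo) / (2 ** bits - 1)
    return format(round((w - lo) / step), f'0{bits}b')

def decode_weight(s, lo=-1.0, hi=1.0):
    """Map a bit string back to a real weight in [lo, hi]."""
    step = (hi - lo) / (2 ** len(s) - 1)
    return lo + int(s, 2) * step

# Concatenating the per-weight strings yields the GA chromosome.
chromosome = ''.join(encode_weight(w) for w in [0.5, -0.25, 1.0])
print(chromosome, decode_weight(chromosome[:8]))
```

Doubling the bit count shrinks the quantization error, which is exactly the accuracy/speed trade-off described above.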

Real-number representation
Instead of binary representation, [24]-[26] proposed using real numbers to represent the weights directly. Since each weight is a real number, each individual is a real-valued vector, so traditional mutation and crossover cannot be applied directly. In [24], Montana and Davis introduced genetic operators for mutation and crossover on this kind of representation; the idea behind their design was to preserve useful features produced during the search.
Another way to evolve a real-valued vector is to use an evolutionary algorithm that works directly with real numbers, such as evolution strategies (ES) and evolutionary programming (EP) [25], [27], [28]. These two algorithms rely on mutation alone, so the undesired effect of the permutation problem does not carry over to the next generation. The mutation operator is mostly Gaussian mutation, although Cauchy mutation has been used as well [29].
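A minimal sketch of such mutation-only variation, assuming NumPy and illustrative step sizes (the Cauchy variant corresponds to the heavier-tailed mutation of [29]):

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_mutate(weights, sigma=0.1):
    """ES/EP-style mutation: add Gaussian noise to every weight.
    No crossover is involved, so the permutation problem does not arise."""
    return weights + rng.normal(0.0, sigma, size=weights.shape)

def cauchy_mutate(weights, scale=0.1):
    """Cauchy mutation: heavier tails allow occasional large jumps."""
    return weights + scale * rng.standard_cauchy(size=weights.shape)

parent = np.zeros(5)
child = gaussian_mutate(parent)
```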

Comparison between BP and evolutionary algorithms
As mentioned before, evolutionary algorithms are attractive because they can search the space globally even when it has many dimensions, is multi-modal, or is non-differentiable. Moreover, they need no gradient information, which makes them ideal for search spaces where gradients are unavailable. Another main advantage of using an EA for ANNs is that it can be applied to different types of network with only minor changes; for example, an algorithm used to train a feedforward neural network can also train a recurrent neural network [30].
Some studies state that EAs are slower than BP [31], [32], although they do not clarify which type of EA was compared with BP. The authors of [26], [27], [33], [34] state that evolutionary training is much faster than backpropagation-based algorithms. One thing that makes the comparison hard is accuracy: a genetic algorithm with 6-bit chromosomes might be faster than BP, yet slower with an 8-bit representation. Despite the numerous papers stating that evolutionary training is much faster than BP training, Kitano [35] argues that GA-BP, a technique combining GA with BP, is at best equal to BP alone, and that GA training is therefore unnecessary. On the other hand, many papers report excellent results from hybrid evolutionary and gradient-descent algorithms [36]-[39]. The contradiction between Kitano's work and the others may come from the differing structures of the compared methods, i.e., did Kitano compare a slow GA training method with a fast BP method? This question has no clear answer: as the no-free-lunch theorem [40] states, the best method always depends on the problem.

Hybrid training
EAs are good at global search but not as good at local search: once they get near the optimum, they take more time than algorithms that search locally. Combining the global search ability of an EA with the local search ability of another algorithm can therefore give faster and more accurate results. The local search can be BP [35], [41] or another local algorithm such as simulated annealing [42]. In this method, the GA starts with a random set of weights and trains them to a near-optimal condition; the pre-trained weights are then handed to a local algorithm to find the optimum. Hybrid training has been used successfully in different applications [22], [36], [37]. Using global search first lets us skip local minima, and the subsequent local search finds the optimum faster and more accurately.
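A toy sketch of the hybrid idea on a one-dimensional error surface: plain random sampling stands in for the GA's global phase, and finite-difference gradient descent stands in for BP (both are simplifications of the methods the paper cites, and the test function is invented for illustration).

```python
import numpy as np

def f(w):
    """Toy error surface: many local minima plus a global trend."""
    return np.sin(5 * w) + 0.1 * (w - 2.0) ** 2

rng = np.random.default_rng(1)

# Phase 1: crude global search (stand-in for a GA) picks a promising start.
pop = rng.uniform(-5, 5, size=200)
best = pop[np.argmin([f(w) for w in pop])]

# Phase 2: local refinement via finite-difference gradient descent
# (stand-in for BP or another local search algorithm).
w, eps, lr = best, 1e-6, 0.01
for _ in range(500):
    grad = (f(w + eps) - f(w - eps)) / (2 * eps)
    w -= lr * grad

print(w, f(w))
```

The global phase lands in a good basin; the local phase then polishes the solution far faster than the global search could on its own.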

II. The evolution of the architecture
In the previous section it was assumed that the structure of the ANN is fixed and only the connection weights evolve. However, even if the best set of weights for a given architecture is found, that does not guarantee the best-trained network, since we have no information about the architecture itself. Architecture design is one of the most important parts of finding the best solution, yet to date there is no rigorous mathematical analysis that can identify the best structure. At present, structure design is done by a human expert through the excruciating method of trial and error.
A too-simple structure and overfitting are the two ends of the structure-design spectrum: a small, limited structure is restricted in tasks such as classification or prediction, while a big, complex structure may easily overfit. Overfitting occurs when the network is too large: an overfitted ANN has a very small error on the training data, but its error on the test data falls outside acceptable limits. Constructive and destructive algorithms [43]-[45] have tried to automate structure design. A constructive approach starts with a minimal network and adds layers, nodes, and connections whenever necessary, while a destructive approach starts with a complex structure and removes layers, nodes, and connections as needed. However, these methods are prone to local minima and can represent only a limited number of structures [46]. As in any other optimization problem, the possible network architectures form the search space, and each point of this discrete space represents a specific structure. To solve this optimization problem, some fundamental constraints on the architecture should be considered, such as the minimum number of layers and nodes, a required training error, and so on. This space has several properties noted in [47]: 1) the surface is infinite, since the number of layers and nodes is unbounded; 2) the surface is discrete and therefore non-differentiable; 3) the surface is deceptive because of the permutation problem; and 4) the surface is multi-modal, since different structures may have similar performance.
Evolving a topology has two dimensions: first, the genotype representation; second, the EA used to evolve the topology.
There are two different views on encoding an architecture. One encodes every parameter of the architecture and is called direct encoding: every connection, node, and layer is coded into the genotype. This representation can produce the most accurate answer, but evolution can be very slow due to the large number of parameters, so it is mostly useful for small networks. Alternatively, the representation can carry only general information about the network, such as the number of layers and the number of nodes in each layer; this is called indirect encoding and is suited to large neural networks. The typical pseudocode for the evolution of architectures is:
1. Decode each genotype into an architecture; if the encoding is indirect, further rules and constraints must be applied.
2. Train each architecture with the training data.
3. Compute the fitness (error function) of each topology.
4. Select parents according to their fitness.
5. Apply search operators such as mutation and crossover to produce offspring.
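These steps can be sketched as a generic generational loop. The toy genotype (a single hidden-layer size) and the stand-in fitness are illustrative assumptions; in practice, fitness would be the validation error of each trained network.

```python
import random

random.seed(0)

def evolve(pop, fitness, mutate, generations=30, keep=5):
    """Generational loop (sketch): evaluate, select the fittest,
    produce mutated offspring, repeat."""
    for _ in range(generations):
        pop.sort(key=fitness)                      # evaluate + rank
        parents = pop[:keep]                       # truncation selection
        pop = [mutate(random.choice(parents))      # offspring via mutation
               for _ in range(len(pop))]
    return min(pop, key=fitness)

# Toy genotype: just the hidden-layer size. The quadratic "fitness"
# stands in for post-training validation error, assumed lowest at 8 nodes.
fitness = lambda h: (h - 8) ** 2
mutate = lambda h: max(1, h + random.choice([-1, 0, 1]))
best = evolve(list(range(1, 21)), fitness, mutate)
print(best)
```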

A. The direct encoding
Direct encoding can be divided into two methods: one separates architecture design from weight training [48], [49], while the other evolves architecture and weights simultaneously [50], [51]. In the first approach, each connection is encoded as a bit: 1 means a connection exists and 0 means it does not. Since each connection joins two nodes, an N×N binary matrix can represent a network with N nodes. For instance, if entry (3, 5) is one, there is a connection between nodes 3 and 5, and if it is zero, there is no connection between them. If the entries are real values instead, the matrix represents both the structure and the weights, and the two can be evolved together, as in the second approach. To build a binary (or real-valued) chromosome representing a structure, the rows or columns of the matrix are concatenated; mutation and crossover operators can then be applied to the resulting chromosome. Design constraints are easy to impose on the matrix. For example, a feedforward neural network has nonzero values only in the upper triangle of the matrix, since no node feeds back to an earlier one, whereas a recurrent network may have values anywhere in the matrix.
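A small sketch of direct encoding for a hypothetical 5-node network: the connectivity matrix is kept strictly upper-triangular to enforce the feedforward constraint, then flattened into a chromosome for mutation and crossover.

```python
import numpy as np

# Direct encoding sketch: an N x N binary matrix, where entry (i, j) = 1
# iff there is a connection from node i to node j.
N = 5
conn = np.zeros((N, N), dtype=int)
conn[0, 2] = conn[1, 2] = conn[2, 4] = 1   # hypothetical network

# Feedforward constraint: only the strict upper triangle may be nonzero
# (no connection from a node back to an earlier one).
assert np.array_equal(conn, np.triu(conn, k=1))

# Flatten the rows into a chromosome for mutation/crossover.
chromosome = conn.flatten()
print(chromosome.tolist())
```

A recurrent network would simply drop the upper-triangular assertion, allowing entries anywhere in the matrix.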
Since each connection can be added to or removed from the architecture, the result can be very accurate. In [21], Schaffer et al. showed that an ANN designed by an EA had better generalization ability than one designed by a human and trained with BP. However, if a network is large, finding the best architecture with this method is very time-consuming. Instead, we can embed prior knowledge into the genotype to limit the possible architectures; this is the idea behind indirect encoding. The permutation problem still exists in direct encoding: [52] suggested avoiding crossover to sidestep it, whereas Hancock [23] argues that the permutation problem may not be so critical if the population size is increased.

B. The indirect encoding
To limit the possible topologies, predefined rules can be used to restrict the representation. In [53], only some characteristics of the architecture were encoded into the chromosome. The details can either be specified by prior knowledge or be generated through developmental rules; accordingly, there are two types of indirect encoding.

Parametric representation
In this representation, the genotype contains only information such as the number of layers, the number of nodes in each layer, and the number of connections between layers [53]. This representation is useful when the ANN is very large or when we have enough prior information about its structure.

Developmental rule representation
In this scheme, the rules are written as 2×2 matrices. There are two types: nonterminal rules, in which the four elements of the 2×2 matrix are themselves symbols, and terminal rules, in which the four elements are binary. Terminal rules are applied only in the last step. To build the matrix that represents the structure of the ANN, a number of nonterminal rules are applied first, and a terminal rule is applied in the final step. For example, if three nonterminal steps and one terminal step are used, the resulting matrix has 2×2×2×2 = 16 rows and columns, representing a neural network with 16 nodes. By combining different nonterminal matrices, different ANN architectures are defined. Developmental rules were used in [54] to construct the chromosome representation. The destructive effect of crossover is reduced in this representation, since it can preserve structures that have already been built; however, because of its limited connectivity patterns, it cannot search the space fully. [55] and [56] proposed fractal representations with real-valued encoding, and [56] used simulated annealing for the evolution step.
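A sketch of this developmental scheme with a hypothetical rule set (the symbols 'S', 'A', 'B' and their expansions are invented for illustration): two nonterminal steps followed by one terminal step yield an 8×8 connectivity matrix, i.e., an 8-node network.

```python
import numpy as np

# Hypothetical rules: each nonterminal expands a symbol into a 2x2 block
# of symbols; terminal rules map each symbol to a 2x2 binary block.
nonterminal = {
    'S': [['A', 'B'], ['B', 'A']],
    'A': [['A', 'A'], ['A', 'A']],
    'B': [['B', 'B'], ['B', 'B']],
}
terminal = {
    'A': np.array([[1, 1], [1, 1]]),
    'B': np.array([[0, 0], [0, 0]]),
}

def develop(grid, steps):
    """Apply nonterminal rules `steps` times, then the terminal rules once."""
    for _ in range(steps):
        # Replace every symbol with its 2x2 expansion.
        grid = [[c for sym in row for c in nonterminal[sym][r]]
                for row in grid for r in (0, 1)]
    # Terminal step: assemble the binary connectivity matrix.
    return np.block([[terminal[sym] for sym in row] for row in grid])

matrix = develop([['S']], steps=2)
print(matrix.shape)  # -> (8, 8)
```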

III. Swarm intelligence (SI) based algorithms
Another major group of metaheuristic algorithms is swarm intelligence based. One of the most important algorithms in this group is particle swarm optimization (PSO). Since its introduction, it has found many uses and has been widely applied to training neural networks; however, its ability to find network structures has not yet been fully investigated. PSO was first introduced by James Kennedy and Russell C. Eberhart [57] in 1995. The algorithm is inspired by the individual and social behavior of fish schools or bird flocks searching for food (the optimum).
PSO is a relatively recent heuristic search method whose mechanics are inspired by the swarming, collaborative behavior of biological populations. PSO is similar to the genetic algorithm in that both are population-based search methods: in each iteration they move from one set of points (a population) to another, with likely improvement, using a combination of deterministic and probabilistic rules. The GA and its many variants have been popular in academia and industry mainly because of their intuitiveness, ease of implementation, and ability to solve the highly nonlinear, mixed-integer optimization problems typical of complex engineering systems; the drawback of the GA is its expensive computational cost. The authors of [58] examined the claim that PSO is as effective as the GA (finding the true global optimum) but significantly more computationally efficient (fewer function evaluations), using statistical analysis and formal hypothesis testing. Their comparison covered a set of benchmark test problems as well as two space-systems design optimization problems, namely telescope array configuration and spacecraft reliability-based design.
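The classic "gbest" PSO described above can be sketched as follows; the parameter values (inertia 0.7, both acceleration coefficients 1.5) are conventional choices rather than values from [58], and a sphere function stands in for a network's error surface.

```python
import numpy as np

rng = np.random.default_rng(42)

def pso_minimize(f, dim, n_particles=30, iters=200,
                 w=0.7, c1=1.5, c2=1.5, lo=-5.0, hi=5.0):
    """Minimal 'gbest' PSO: each particle is pulled toward its own best
    position (pbest) and the swarm's best position (gbest)."""
    x = rng.uniform(lo, hi, (n_particles, dim))
    v = np.zeros_like(x)
    pbest = x.copy()
    pbest_val = np.array([f(p) for p in x])
    gbest = pbest[np.argmin(pbest_val)].copy()
    for _ in range(iters):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = x + v
        vals = np.array([f(p) for p in x])
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = x[improved], vals[improved]
        gbest = pbest[np.argmin(pbest_val)].copy()
    return gbest, pbest_val.min()

# Sphere function as a stand-in for a network's error surface.
best, val = pso_minimize(lambda p: float(np.sum(p ** 2)), dim=3)
print(val)
```

To train a network instead of the sphere function, `f` would be the weight-vector error function described in the weight-training section.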
Cooperative PSO was proposed in [59]. In this algorithm, the input vector is divided into smaller vectors, and each part forms a separate swarm that searches for the optimum of its own subspace. The idea behind this configuration is to simulate the crossover operator of genetic algorithms, in which different individuals share information with each other.
[60] proposed a PSO-based algorithm and applied it to several benchmark problems. The results were good, but not as good as GA, since the PSO used was its simplest form.
In [61], opposition-based PSO is used. In the opposition technique, the algorithm evaluates, alongside each original particle, its opposite particle as well; of the two, the one with higher fitness is kept for the rest of the iteration. "Opposite" here means the point mirrored across the search interval: for example, in one dimension with a span from 0 to 10, the opposite of 2 is 8. The idea behind opposition is to diversify the particles further.
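The opposition rule in this example is a one-liner; the quadratic test function below is an illustrative assumption.

```python
def opposite(x, lo, hi):
    """Opposition point of x within the interval [lo, hi]."""
    return lo + hi - x

def keep_fitter(x, lo, hi, f):
    """Evaluate a particle and its opposite; keep the one with lower error."""
    xo = opposite(x, lo, hi)
    return x if f(x) < f(xo) else xo

print(opposite(2.0, 0.0, 10.0))  # -> 8.0, matching the example above
```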
[62] used a different form of opposition: instead of applying it to all particles in each iteration, it applied it to only one particle at a time. [63] used a hybrid of PSO and GA to train an ANN: a random population is first trained with PSO until it gets near the optimum, and the pre-trained population is then refined by GA to find the global optimum. [64] uses a continuous version of the ant colony optimization (ACO) algorithm, [65] proposes a hybrid of PSO and the artificial fish swarm algorithm (AFSA-PSO), and [66] uses the artificial bee colony (ABC) algorithm to train the neural network.
[67] is one of the first works to train an FFNN with PSO. Eberhart et al. trained the network with two different methods: GBEST, which is classic PSO, and LBEST, which divides the swarm into neighborhood subgroups, each searching for the optimum individually. The chance of being trapped in local minima is lower for the LBEST approach, especially when the neighborhood size is minimal, i.e., L = 2; however, LBEST is usually slower than GBEST.
PSO approaches the optimum quickly but becomes very slow near it because of its velocity dynamics. [68] proposed a hybrid PSO-BP method that starts training with PSO and switches to BP when it gets near the optimum. The authors argue that it beats adaptive PSO (APSO) and plain BP in both speed and convergence accuracy.
[69] proposes optimized particle swarm optimization (OPSO). Some parameters of PSO are usually constant or change linearly over time; the idea of OPSO is to use another PSO to optimize the free parameters of the PSO that trains the FNN's weights, i.e., a swarm on top of a swarm on top of the FNN.
[70] uses a hybrid of PSO and simulated annealing (SA). Unlike hybrids that finish PSO training before running a local search, this approach uses PSO and SA in every step. First, a new position for each particle is calculated according to the PSO update; then the SA search operator is applied. If the new position is better, it replaces the old one; if it is worse, it may still replace it, with a lower probability. This lets the algorithm jump out of local minima, since SA allows it to risk leaving a currently good position in search of a better one, whereas classic PSO is more likely to remain stuck in this situation. One well-known problem of PSO is its slow convergence near the optimum.
To address this, [71] proposed a time-varying inertia weight: the inertia is large at the start of the run and decreases toward the end. The idea is to reduce the swarm's momentum near the optimum, which otherwise causes the particles to oscillate around it. [72] proposed time-varying acceleration coefficients: the cognitive coefficient (the personal-behavior parameter) decreases while the social component increases over time. A large cognitive parameter at the start lets the swarm search the space more freely, not being forced to converge toward GBEST; smaller cognitive and larger social parameters later let the swarm converge to GBEST faster.
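Both schedules are simple linear interpolations; the ranges below (inertia 0.9 to 0.4, coefficients 2.5 and 0.5) are common choices in the PSO literature rather than values taken from [71], [72].

```python
def inertia(t, t_max, w_start=0.9, w_end=0.4):
    """Linearly decreasing inertia weight: high momentum early for
    exploration, low momentum late to damp oscillation near the optimum."""
    return w_start - (w_start - w_end) * t / t_max

def cognitive_social(t, t_max, c_start=2.5, c_end=0.5):
    """Time-varying acceleration coefficients: the cognitive term decays
    while the social term grows over the run."""
    c1 = c_start - (c_start - c_end) * t / t_max   # cognitive, decreasing
    c2 = c_end + (c_start - c_end) * t / t_max     # social, increasing
    return c1, c2

print(inertia(0, 100), inertia(100, 100))  # -> 0.9 0.4
```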
To improve the search ability of PSO, several algorithms took inspiration from EAs and incorporated evolutionary operators into PSO. [73] incorporated a GA-style blending operator: it selects two particles (parents) and, with r a uniformly random number, forms the offspring as r times the first parent plus (1 - r) times the second. [74] proposed adding Gaussian mutation to PSO, adding a random number to every dimension of every particle. [75] incorporated the algorithms proposed in [17]-[20] into PSO together to evolve and train an FNN.
In ESPNet [76], Yu et al. proposed a scheme to evolve structure and weights simultaneously: the architecture of the FNN is optimized using discrete PSO (DPSO), and once the best structure is found, the weights are trained with PSO. When the algorithm gets near the optimum and no further improvement is seen, an evolution strategy is applied to the trained network to find the global optimum. In [77], the authors used an artificial bee colony (ABC) algorithm to evolve the design of an ANN with two different fitness functions.

IV. Conclusion
In this paper, the role of metaheuristics in evolving and training neural networks has been reviewed. These algorithms have proved quite useful for training neural networks: they are faster and find the global optimum better than the famous backpropagation. In topology design, they can help humans find the best architecture much faster. Most of the weight-training work with genetic algorithms was done in the 1990s, and only a few novel works have appeared in the last decade.
The use of SI algorithms is still being investigated in the literature. Although many works study PSO for weight training, only a few address evolving the architecture of a neural network with particle swarm optimization.