An Overview of Sequential Learning Algorithms for Single Hidden Layer Networks: Current Issues & Future Trends

—In this paper, a brief survey of the commonly used sequential learning algorithms for single hidden layer feed-forward neural networks is presented. We summarize the different kinds of algorithms available in the literature to date, how they have developed over the years, and their relative performance. The most important considerations during the design phase of a neural network are its complexity, computational efficiency, maximum training time, and ability to generalize the problem under study. We compare the different sequential learning algorithms for single hidden layer neural networks with respect to these merits.


I. INTRODUCTION
THE internal layout of multi-layer feed-forward neural networks (MLFNs) allows them to generate an approximate representation of how well input data are correlated, which has made these networks the most widely used neural networks (NNs) for pattern classification [1], [2], [3]. NNs with multiple hidden layers are considered general approximators, but because of their complex nonlinear behavior they are not preferred in industrial applications, where most tasks require much faster and more generalized models [4]. Due to their simple network structure and excellent approximation performance, single hidden layer feed-forward neural networks (SHLFNs) have been widely used in pattern classification and function approximation applications [5]. Under normal circumstances with high-dimensional, large training data, the existing SHLFN algorithms have a slow learning speed, which limits their practical application [6]. Another disadvantage of SHLFN learning algorithms is their poor generalization ability [7]. At present, the learning algorithms for SHLFNs can be broadly classified as:
• Batch learning algorithms
• Online sequential learning algorithms

*This work has been submitted to the IEEE Transactions on Artificial Intelligence for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.
For the NN batch-mode learning algorithm, the offline learning method is generally adopted [8]. That is, the network training process is divided into several independent stages in time. One typical feature of batch learning based neural networks is that all sample information must be known beforehand. This feature makes the batch learning algorithm unable to process objects with time-varying characteristics, i.e. real-time data. In addition, if the number of samples available before training is not sufficient, the performance of the network is seriously affected. Similarly, when new samples are added, all the samples must be used to retrain the whole neural network [4]. To compensate for the shortcomings of the SHLFN batch learning algorithms, researchers are committed to exploring new online sequential (OS) learning algorithms that provide performance comparable to that of the existing batch learning based SHLFNs [9], [10]. With an OS learning algorithm, the sample data enter the neural network one by one, and at any time only one training sample is visible and used for learning. The training samples entering the network are discarded after the training is over. The most promising feature of this type of learning algorithm is that, before the start of the learning process, no prior knowledge of the training dataset size is needed. Therefore, these algorithms are more suitable for real-time problems in industrial environments.
In the history of SHLFN OS learning algorithms, the earliest was the Stochastic Gradient Descent Back Propagation (SGBP) algorithm, formulated in [11]. It was proposed as an improvement of the standard back propagation (BP) algorithm, with input data samples processed in chunks. The Resource Allocating Network (RAN) learning algorithm was proposed by Platt in [12]. RAN fundamentally suffers from a problem of unlearning over time; this problem was later addressed in [13], where the author proposed RAN-LTM. The shortcomings of the RAN learning algorithm were further studied by Kadirkamanathan et al., who proposed an extended Kalman filter (EKF) iterative strategy to replace the previously used LMS method for training the network parameters, resulting in the RANEKF learning algorithm [14]. Although this method improves the convergence speed of the algorithm, the network complexity is also increased. To minimize the computational cost of the RANEKF learning algorithm, the MRAN algorithm added a deletion mechanism for the neurons in the hidden layer [15]. An improved MRAN learning algorithm (Hyper-MRAN) was proposed in 2006; it uses a new weight adjustment strategy that reduces the high memory requirements of the MRAN algorithm while achieving higher accuracy [16].
In 2004, a new SHLFN OS learning algorithm called the GAP-RBF algorithm was proposed, which further improved the performance of MRAN [17]. It was later generalized in [18] and called GGAP-RBF. In an attempt to improve the computational complexity of GGAP-RBF, GAP-DRBF was proposed in [19]; it replaced the extended Kalman filter with a decomposed extended Kalman filter and changed the underlying activation function to the direct-link RBF (DRBF). This learning algorithm has many adjustable parameters that need manual tuning. When these parameters are not properly selected, the performance of the network is seriously degraded. In view of this shortcoming of the GGAP-RBF learning algorithm, an improved GGAP-RBF neural network learning algorithm (GIRAN) was formulated [20]. Although the improved GGAP-RBF can solve many practical problems related to manual tuning of parameters, the activation function of the algorithm is still of RBF form, which is not very effective computationally and limits its use in real-time applications. The most famous RAN variant (SRAN), introduced in [21], uses a self-adaptive mechanism to update the weights and biases of the hidden layer neurons, thereby drastically increasing the learning speed of the algorithm.
The concept of the batch learning based extreme learning machine (ELM) was introduced in [22], which Liang extended to an OS learning based ELM that significantly improved the computational efficiency of the algorithm and can select among multiple activation functions [23]. To further improve the performance of Liang's algorithm, the recursive least squares (RLS) method was replaced with an orthogonal least squares iterative strategy that further improves the generalization ability of the network [24]. An additional improvement in setting the initial random weights for the hidden neurons in OS-ELM was made in [25] by using a set of regularized linear equations. For OS-ELM, the condition of having a greater number of hidden neurons than the dataset size, together with the singularity problem encountered with noisy data, was studied in [26], and a new algorithm called the regularized online sequential extreme learning machine (ReOS-ELM) was proposed. ReOS-ELM makes use of Tikhonov regularization [27] with a bi-optimization function to counter the singularity and ill-posed matrix inversion problems of standard OS-ELM. The ensemble technique [28] was studied with OS-ELM in [29], [30] to increase the generalization performance. To classify timeliness based data, a forgetting factor mechanism was introduced with OS-ELM in [31]. Class imbalance learning (CIL) is one of the most researched areas in NNs, and in [32] the first ELM based framework was proposed for CIL problems with explicit feature mapping. For the case of implicit feature mapping in the hidden layer neurons, a kernel based solution, KOS-ELM, was introduced [33]. Various other improved OS-ELMs for multi-class data streams were studied in [34], [35], [36].
The SHLFN OS learning algorithms can be divided into two categories:
• SGBP, RAN, RAN-LTM, RANEKF, MRAN, EMRAN, HMRAN, GGAP-RBF, GAP-DRBF, GIRAN and SRAN. In these OS learning algorithms, the network parameters are updated iteratively as new sample information is received. Also, before training, there are many initialization parameters that need to be manually set for reliable performance.
• OS-ELM and its ELM based variants, including CW-OS-ELM. These OS learning algorithms eliminate the use of initialization parameters to improve the learning speed of the NNs and reduce the artificially adjusted initialization network parameters. The activation function of the hidden layer neurons is also no longer limited to the RBF form, which makes these algorithms computationally efficient.
To compare the performance of various OS learning algorithms for SHLFNs, most of the researchers used training time, number of hidden layer neurons, approximation accuracy, generalization ability and algorithm stability as performance indices.
II. ONLINE SEQUENTIAL LEARNING ALGORITHM BASED SHLFNS

The structure of the SHLFN consists of three layers, as shown in Fig. 1: the input layer, a single hidden layer, and the output layer. The major function of the input layer is to pass signals to the hidden layer. The hidden layer neuron activation can be a radial basis function or any other infinitely differentiable or arbitrary nonlinear bounded piecewise continuous function, denoted here by φ(·). The output layer node generally has a linear activation function, and the hidden-to-output connection weight is represented by 'w'. In RAN, RAN-LTM, RANEKF, MRAN, GGAP-RBF, GAP-DRBF and GIRAN, the hidden-to-output connection weight is a constant value of '1', while in SGBP and the sequential learning based ELM algorithms, this parameter is set randomly.
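As a minimal sketch of this mapping, the forward pass of an RBF-based SHLFN can be written as below; the Gaussian activation, the variable names, and the parameter values are illustrative assumptions, not taken from any particular algorithm in this survey.

```python
import numpy as np

def shlfn_forward(x, centers, widths, w):
    """Forward pass of an RBF SHLFN: y = sum_k w_k * phi_k(x), where
    phi_k is a Gaussian centred at centers[k] with width widths[k]."""
    phi = np.exp(-np.sum((centers - x) ** 2, axis=1) / (2.0 * widths ** 2))
    return phi @ w   # linear output neuron

# Tiny illustration: two RBF hidden neurons, two-dimensional input
centers = np.array([[0.0, 0.0], [1.0, 1.0]])
widths = np.array([1.0, 1.0])
w = np.array([0.5, -0.5])
y = shlfn_forward(np.array([0.0, 0.0]), centers, widths, w)
```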
A. Stochastic Gradient Descent Back Propagation (SGBP) Algorithm

SGBP, proposed by LeCun et al., is an improvement of the standard BP learning algorithm. In SGBP, the training dataset samples are introduced one by one to the hidden layer neurons, and the parameters are then updated by the standard BP algorithm based on an estimate of the error gradient [11]. The algorithm was named stochastic gradient back propagation because of this use of an estimated error gradient. The purpose of using the estimate was to reduce the speed of convergence, which is advantageous in the case of noisy data. Because the input data is processed one by one, SGBP results in faster learning with repeating data sequences and can track changes more effectively thanks to its slow convergence with noisy data, which is usually the case in most practical applications.
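A minimal sketch of one SGBP step is given below, assuming a sigmoid hidden layer, a linear output neuron, and a squared-error loss; these choices and all names are illustrative, and [11] covers the general case.

```python
import numpy as np

def sgbp_step(x, t, W_in, b, w_out, lr=0.1):
    """One SGBP update: process a single sample (x, t), estimate the
    error gradient from that sample alone, and adjust the parameters."""
    h = 1.0 / (1.0 + np.exp(-(W_in @ x + b)))   # sigmoid hidden layer
    y = w_out @ h                               # linear output neuron
    e = y - t                                   # per-sample output error
    # Gradient descent on 0.5 * e**2 for this sample only
    grad_h = e * w_out * h * (1.0 - h)          # signal backpropagated to hidden layer
    w_out_new = w_out - lr * e * h
    W_in_new = W_in - lr * np.outer(grad_h, x)
    b_new = b - lr * grad_h
    return W_in_new, b_new, w_out_new
```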

B. Resource Allocating Network (RAN) Algorithm
The RAN, proposed by Platt in [12], is a dynamic single hidden layer radial basis function (RBF) neural network. The corresponding training algorithm is called the RAN learning algorithm. In RAN, the "novelty" of a training dataset sample is exploited for introducing hidden layer neurons, and the parameters are then updated by the LMS algorithm [37]. The RAN learning algorithm starts with no hidden layer neurons, the hidden layer activation function is an RBF, and the first two input samples are used for network initialization. Subsequently, if for an input sample there is an unnecessarily large error between the network output and the desired one, the sample is considered novel, and a neuron is added to the hidden layer of the network. If the input sample does not meet the novelty requirements, no hidden layer neuron is added, but the LMS algorithm is used to update the current network parameters, e.g. the centers of the neurons and the hidden-to-output connection weights. The novelty criterion of the RAN online sequential learning algorithm is:

‖e_n‖ = ‖y_n − f(x_n)‖ > e_min  and  ‖x_n − c_nr‖ > ε_n

where e_n is the error in the output of the network, e_min is the required approximation accuracy (a manually adjusted parameter), ‖x_n − c_nr‖ is the 2-norm between the current input sample x_n and the center c_nr of the hidden layer neuron closest to it, and

ε_n = max(ε_max · γ^n, ε_min)

where ε_max and ε_min are the maximum and minimum distances between all the input data samples, and 0 < γ < 1 is an attenuation coefficient. As the number of input samples increases, ε_n decreases at an exponential rate until it reaches ε_min. When an input sample does not meet the novelty requirements, the center points and the weights are updated using the LMS algorithm.
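The novelty check described above can be sketched as follows; the parameter names mirror the description and are otherwise illustrative.

```python
def is_novel(error_norm, dist_to_nearest, n, e_min, eps_max, eps_min, gamma):
    """RAN novelty check for the n-th sample: both the output error and
    the distance to the nearest centre must exceed their thresholds."""
    # The distance threshold decays exponentially from eps_max to eps_min
    eps_n = max(eps_max * gamma ** n, eps_min)
    return error_norm > e_min and dist_to_nearest > eps_n
```

A sample passing this test triggers the allocation of a new hidden neuron; otherwise the LMS update is applied.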

C. Resource Allocating Network with Long-Term Memory (RAN-LTM) Algorithm
The major drawback of RAN is its inherent unexpected forgetting, which causes the network to unlearn over time [38]. This problem was addressed in 2003 by Okamoto, who proposed the idea of an external memory for RAN [13]. The new algorithm was called the resource allocating network with long-term memory. The learning phase of this algorithm is divided into two parts.
• First, a hidden neuron is allocated and its weight connection with the output is determined.
• Then, in the later stage, the output error is calculated; if the error shoots over a predetermined value, another neuron is added as in the case of RAN, but together with it a new memory unit is also allocated in which the previous input-output pair is stored.

The architecture of the RAN-LTM is given in Fig. 2 [38].

D. Resource Allocating Network using Extended Kalman Filter (RANEKF) Algorithm
The RANEKF algorithm is an improvement of RAN. The addition of neurons to the hidden layer is the same as in the RAN learning method. The only difference is that the adjustment of the network parameters (centers and weights) is done by the extended Kalman filter (EKF) method as a substitute for the more traditional LMS method [14]. The EKF method has a faster convergence speed than the LMS method, but it requires more computing resources. With the development of computer hardware technology, the EKF method is more advantageous than the LMS iterative method when the problem size is small.

E. Minimal Resource Allocation Network (MRAN) Algorithm
This algorithm combines the hidden layer growth criterion of the RAN learning algorithm with a deletion strategy for the hidden layer neurons. In 1994, Cheng proposed a method for deleting hidden layer nodes in a batch learning algorithm [5]. In this method, each time a data sample enters the network, the weight of each hidden layer node is checked, and the node is deleted if its weight does not reach a certain wanted value. Inspired by Cheng's method, the MRAN algorithm with a hidden layer node deletion strategy was proposed [15]. The proposed MRAN algorithm was compared with both the RAN and RANEKF algorithms. Since the learning process of the MRAN algorithm entails the introduction of new hidden layer neurons along with iterative tuning of the network parameters and the deletion strategy for hidden layer neurons, the resulting SHLFN can be more streamlined while achieving better performance. Its common applications are listed in [39]. An extended MRAN (EMRAN) was proposed to reduce the computational cost associated with MRAN in the case of larger input dimensions [40]. A winner-neuron approach is used to select a neuron and then update only the parameters related to this neuron using the EKF method, rather than updating the parameters of all the neurons in the hidden layer.
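The deletion strategy can be sketched as a check on each neuron's normalised contribution; the threshold name delta and the normalisation by the largest contribution are illustrative assumptions, and MRAN additionally requires the condition to hold over several consecutive samples.

```python
def neurons_to_prune(contributions, delta):
    """MRAN style deletion check: a hidden neuron whose output
    contribution, normalised by the largest contribution, falls below
    the threshold delta is marked for deletion."""
    largest = max(abs(c) for c in contributions)
    return [i for i, c in enumerate(contributions) if abs(c) / largest < delta]
```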

F. Hyper Minimal Resource Allocation Network (HMRAN) Algorithm
The computational complexity of MRAN increases drastically as the input sample dimension increases, since the algorithm is required to learn unwanted information from the input data. The underlying activation function in MRAN is the RBF, and with high-dimensional input data the computational cost increases because the growing number of neurons proportionally increases the size of the covariance matrix of the extended Kalman filter. The time complexity of the algorithm therefore limits its use in real-time industrial applications. To overcome this problem, the extended minimal resource allocating network (EMRAN) was proposed in [40], but this algorithm lacks reasonable accuracy. In [16], Hyper MRAN (HMRAN) was formulated with suitable input dimension selection and the hyper radial basis function (Hyper RBF) as the activation function for SHLFNs. It reduces the time complexity of MRAN by using a localized extended Kalman filter approach, while the Hyper RBF activation function ensures high accuracy.

G. Generalized Growing and Pruning Radial Basis Function (GGAP-RBF) Algorithm.
In 2005, Nanyang Technological University in Singapore proposed a new OS learning algorithm addressing the existing shortcomings of the various SHLFN OS learning algorithms, called the GGAP-RBF algorithm [18], which makes use of the traditional RBF. The performance of this SHLFN online sequential learning algorithm improved over the MRAN algorithm. The parameters in MRAN are adjusted at each iteration using the extended Kalman filter method, which leads to an increment of the hidden layer neurons during the parameter update process [16]. Also, the size of the covariance matrix used in the EKF is usually very large, which increases the computational complexity of the network structure. This results in an excessive computational burden and a large consumption of computing resources, which limits the real-time application of the MRAN algorithm. Although the GGAP-RBF learning algorithm also uses the extended Kalman filter in the parameter update process, it only updates the parameters (center and width) of the hidden layer neuron closest to the current input and the weight of the corresponding connection, which greatly reduces the computational cost of the algorithm. In addition, to address the shortcomings of the initial parameter selection of the various heuristic algorithms mentioned above, for uniformly distributed input data the GGAP-RBF learning algorithm also estimates the importance of the hidden layer neurons, thereby reducing the number of initial parameters. Even though the GGAP-RBF OS learning algorithm reduces the number of initialization parameters of MRAN, the requirement of uniformly distributed input data drastically reduces the performance of the algorithm in other cases.

H. Growing and Pruning Direct Link Radial Basis Function (GAP-DRBF) Algorithm
In [19], the authors improved the performance of GGAP-RBF by replacing the RBF with the direct-link RBF (DRBF), which is essentially an augmented version of the conventional RBF with a linear mapping between input and output to improve training accuracy. The EKF was also replaced with the decomposed EKF proposed in [41], which reduces the number of initialization parameters and ensures that the computational complexity of GGAP-RBF grows more slowly as the training data size increases. On the flip side, this algorithm has a significant disadvantage: it requires the data to be uniformly distributed, which is usually not the case in practical applications; otherwise more hidden neurons are required, reducing the performance of the algorithm. GGAP-RBF utilized the idea of dynamically estimating the importance of the hidden neurons, but the formula given in [18] was corrected in [20], where the author improved the GGAP-RBF algorithm in a bid to reduce its initialization parameters and introduced a new definition and estimation formula for measuring the importance of the hidden layer neurons. The corresponding algorithm is called the GIRAN (improved GGAP-RBF) learning algorithm. A dynamic adjustment formula was given for K, the number of hidden neurons. Based on the GGAP-RBF algorithm, combined with the parameter update formula and the adaptive adjustment method of the radial basis function width, the improved algorithm proceeds as follows.
Given the estimated error, the output for a single sample is expected to be accurate. For each input sample:
1) Calculate the network output.
2) Calculate the novelty criterion.
3) Apply the novelty criterion to determine whether to add a hidden layer neuron; otherwise, use the EKF method to update the part of the network closest to the current input.
4) Check whether the deletion criteria for the hidden layer neurons are met.
5) If yes, delete the hidden layer neurons and correspondingly reduce the dimension of the EKF; otherwise, go to Step 1.
I. Self-adaptive Resource Allocation Network (SRAN) Algorithm

Usually, in RBF based sequential learning algorithms, the control parameters are manually set for reliable performance, which results in poor generalization efficiency with redundant data. In [21], a solution to this problem was proposed by using self-regulated initial control parameters. These control parameters change automatically based on the difference between the new information in the incoming sample and the information already learnt by the network. The higher the difference, the greater the probability of the incoming sample contributing to the sequential learning process. If the difference is negligible, the sample is considered redundant and is simply discarded. It is to be noted that SRAN does not include a pruning strategy and relies on the EKF for training, which makes it computationally more expensive.
III. EXTREME LEARNING MACHINE BASED ONLINE SEQUENTIAL LEARNING ALGORITHM

To counter the learning speed problems of the commonly used SHLFN sequential learning algorithms, Huang proposed the extreme learning machine (ELM) batch learning algorithm, which sets the weight connections between the input and hidden layers and the biases of the hidden layer neurons randomly, and then analytically determines the weight connections between the hidden layer and the output layer of the SHLFN by calculating the pseudo-inverse of the hidden layer output matrix [25]. In theory, the input weights and neuron biases of SHLFNs do not have to be adjusted during training and can be arbitrarily assigned [22], which greatly improves the learning speed of SHLFNs.
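A minimal batch-ELM sketch under these assumptions is shown below; the sigmoid activation, Gaussian-random input weights, and all names are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_train(X, T, n_hidden, seed=0):
    """Batch ELM: input weights and biases are set randomly, then the
    output weights beta are obtained analytically from the pseudo-inverse
    of the hidden layer output matrix H."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n_hidden, X.shape[1]))   # random input weights
    b = rng.standard_normal(n_hidden)                 # random hidden biases
    H = sigmoid(X @ W.T + b)                          # hidden layer output matrix
    beta = np.linalg.pinv(H) @ T                      # Moore-Penrose solution
    return W, b, beta

def elm_predict(X, W, b, beta):
    return sigmoid(X @ W.T + b) @ beta
```

Training involves no iteration at all: a single matrix pseudo-inverse fixes the output weights, which is what gives ELM its speed advantage.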
To solve the problem of a large number of hidden layer neurons in batch learning ELM, the regularized least-squares extreme learning machine (RLS-ELM) was proposed in [25]. In RLS-ELM, instead of random initialization of the hidden layer weights and biases, they are determined by solving a system of regularized linear equations. The output weights are obtained in the same manner as in the original ELM. This single change in how the input layer weights are obtained gives the system very good performance relative to the original ELM and ensures that the overall network is relatively smaller, with a higher training speed.

A. Online Sequential Extreme Learning Machines (OS-ELM) Algorithm
Using the core idea of the ELM algorithm, in [23] an online sequential extreme learning machine based on the recursive least squares (RLS) algorithm, called OS-ELM (RLS), was proposed. This algorithm was further improved in [24], where the author replaced the computationally bulky RLS method with the orthogonal least squares method to update the network parameters. The improved OS-ELM was termed OS-ELM (OLS). The major drawback of OS-ELM is the ill-posed and singular matrix obtained in the case of noisy data. Also, the dependence of the number of hidden layer neurons on the dataset size greatly affects the efficiency of this learning algorithm.
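One RLS-style OS-ELM update for a single-output network can be sketched as below, assuming the hidden layer response h for the new sample has already been computed; the rank-one form is standard RLS and the names are illustrative, not copied from [23].

```python
import numpy as np

def os_elm_update(P, beta, h, t):
    """One OS-ELM step for a single new sample: recursive least squares
    update of the output weights beta, where h is the hidden layer
    response to the sample and t its target. P is the running inverse
    correlation matrix (initialised large, e.g. 1e6 * I)."""
    h = h.reshape(-1, 1)
    Ph = P @ h
    k = Ph / (1.0 + (h.T @ Ph).item())            # RLS gain vector
    beta_new = beta + k.flatten() * (t - h.flatten() @ beta)
    P_new = P - k @ Ph.T                          # rank-one downdate, no inversion
    return P_new, beta_new
```

Each sample is processed once and then discarded, matching the OS learning setting described earlier.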

B. Ensemble based Online Sequential Extreme Learning Machine (EOS-ELM) Algorithm
To further enhance the generalization ability and speed of OS-ELM, in 2009 Lu proposed the ensemble based OS-ELM, termed EOS-ELM [29]. It uses several OS-ELMs with different numbers of hidden neurons, and the same data sample is passed to each of them. Because of the different parametric settings of each OS-ELM, the outputs of the OS-ELMs differ, and the final output is calculated as the average of the outputs of the individual OS-ELMs. This ensemble of different OS-ELMs effectively produces better results because each ELM possesses a distinct capacity to adapt to new streams of data, and because the mean of a population is generally closer to the expected value than any individual value in the population.

C. Regularized Online Sequential Extreme Learning Machines (ReOS-ELM) Algorithm
The OS-ELM algorithm produced exemplary results for non-noisy training data. It can work with data arriving one by one or in small data streams. The issue of singular matrices and the ill-posed optimization problem associated with OS-ELM in the case of noisy data was studied by H.T. Huynh [26]. The author replaced the standard optimization problem of OS-ELM with a bi-optimization problem composed of the 2-norm of the current weight vector and the training error, which improved the generalization ability of the network for real-time noisy data, as studied in [42]. Also, the singularity problem was solved by using Tikhonov regularization [27] instead of the least squares method used in [25]. The ReOS-ELM learning algorithm not only learns faster but also significantly helps in overcoming the usual problem of setting the initial domain of the impact factor and bias in the case of noisy data.
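The Tikhonov-regularised solve that replaces the plain pseudo-inverse can be sketched as follows; the regularisation parameter lam is an assumed user-chosen constant, and ReOS-ELM applies this idea recursively rather than in one batch.

```python
import numpy as np

def regularized_output_weights(H, T, lam):
    """Tikhonov-regularised least squares for the output weights:
    (H^T H + lam*I) beta = H^T T stays solvable even when H^T H is
    singular or badly conditioned, e.g. with noisy data."""
    n = H.shape[1]
    return np.linalg.solve(H.T @ H + lam * np.eye(n), H.T @ T)
```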

D. Online Sequential Extreme Learning Machine with Forgetting Mechanism (FOS-ELM) Algorithm
The generalization ability of EOS-ELM is hindered by the timeliness property associated with certain types of data, e.g. weather and stock forecasting data. Because such input data has a limited period of validity, training EOS-ELM with this type of data results in wrong predictions and instability. To overcome this issue and to reflect the timeliness property of the incoming data, J. Zhao et al. proposed a forgetting mechanism for EOS-ELM that ousts out-of-date data from the sequential learning process [31]. Zhao concluded that rather than re-training the network as new data enters the hidden layer, training should be done using only the newly available data and the already known network parameters, and only if the newly arriving data is valid. Similar work was reported in [43], where a timeliness online sequential extreme learning machine (TOS-ELM) was proposed based on the distribution and central tendency characteristics of the incoming data. In [44], a variable forgetting factor based on the directional forgetting factor [45], combined with EOS-ELM and called DFF-OS-ELM, was used to better capture the timeliness property of the incoming data.
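A forgetting-factor RLS update of the kind described can be sketched as below; the exponential-forgetting form is a common choice and is an assumption here, not necessarily the exact mechanism of [31] or [44].

```python
import numpy as np

def ff_rls_update(P, beta, h, t, ff=0.98):
    """RLS update with a forgetting factor 0 < ff <= 1: past samples are
    exponentially down-weighted, so out-of-date data gradually stops
    influencing the output weights beta."""
    h = h.reshape(-1, 1)
    Ph = P @ h
    k = Ph / (ff + (h.T @ Ph).item())
    beta_new = beta + k.flatten() * (t - h.flatten() @ beta)
    P_new = (P - k @ Ph.T) / ff           # inflate P so old data fades
    return P_new, beta_new
```

With ff = 1 this reduces to the plain RLS update; smaller values shorten the effective memory of the model.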

E. Voting based Online Sequential Extreme Learning Machine (VOS-ELM) Algorithm
VOS-ELM was proposed in [46]. It uses several independent OS-ELMs in parallel with the same number of hidden nodes and the same input data chunk, but with different randomly generated initialization parameters for each extreme learning machine. A weight vector for each of the OS-ELMs is calculated in parallel with the RLS algorithm. The final decision is made by voting among the OS-ELMs. The classification accuracy of VOS-ELM was found to be higher than that of the original OS-ELM, and the learning speed was greatly improved in contrast to V-ELM [30].
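The voting step can be sketched as a simple majority over the class labels predicted by the individual OS-ELMs; the function name is illustrative.

```python
from collections import Counter

def majority_vote(labels):
    """VOS-ELM style decision: each independently trained OS-ELM votes a
    class label and the most frequent label is the final prediction."""
    return Counter(labels).most_common(1)[0][0]
```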

F. Weighting Online Sequential Extreme Learning Machine (WOS-ELM) Algorithm
The most widely encountered problem in machine learning during the last decade is the class imbalance learning (CIL) problem. Many researchers have proposed solutions based on batch learning to tackle CIL problems, especially in the fields of medicine and online fraud detection. It was not until 2013 that the first online sequential extreme learning machine based algorithm was proposed to counter the bi-class CIL problem [32]. Mirza B. & Lin Z. proposed their weighted online sequential extreme learning machine algorithm, which efficiently assigns higher weights to the minority class and lower weights to the majority class in order to alleviate the CIL problem. This weight tuning was based on the optimization of a popular statistical evaluation criterion for CIL problems [47]. Because there is no need for external storage of past learned data and the weight tuning method is simple, the convergence rate of WOS-ELM is very fast, but it requires setting the optimal number of neurons in the hidden layer. Also, the proposed algorithm works only for bi-class problems with known feature mappings.
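The class weighting can be sketched as a weighted least-squares solve for the output weights; the per-sample weights here are assumptions for illustration, and the exact weighting scheme of [32] may differ.

```python
import numpy as np

def weighted_output_weights(H, T, sample_weights):
    """Weighted least squares for the output weights: samples from the
    minority class receive larger weights so the fit is not dominated
    by the majority class."""
    s = np.sqrt(np.asarray(sample_weights, dtype=float))
    # Solve min_beta || diag(sqrt(w)) (H beta - T) ||_2
    return np.linalg.pinv(s[:, None] * H) @ (s * T)
```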

G. Online Sequential Extreme Learning Machine with Kernels (KOS-ELM)
To overcome the limiting condition of known feature mappings in WOS-ELM, a kernel based OS-ELM called KOS-ELM was presented in [33]. This was the first attempt to combine OS-ELM with a nonlinear adaptive filtering technique. Although the resulting algorithm has better classification capability for the bi-class CIL problem, it works only when the data enters the hidden layer one by one, since a new center is calculated for each input sample, which hinders its use for large datasets. Two variants of KOS-ELM, the approximate linear dependency kernel based OS-ELM (ALD-KOS-ELM) and the fixed budget kernel based OS-ELM (FB-KOS-ELM), were also proposed in [33] in order to apply the KOS-ELM algorithm to large datasets that are sparse in nature.

H. An Incremental Extreme Learning Machine for Online Sequential Learning (I-ELM) Algorithm
Based on the idea of the ELM algorithm, in [48] three variants of the incremental extreme learning machine (I-ELM) were proposed to solve online sequential learning problems.

I. Robust Online Sequential Extreme Learning Machine (ROS-ELM) Algorithm
Even though the generalization ability of OS-ELM improves by using the ensemble learning technique, as studied in [29], Zhou et al. deduced in 2002 that the selective ensemble learning technique is even better than the standard ensemble learning technique for NNs [28]. Based on this idea of selective ensemble learning, the ROS-ELM algorithm was presented in [49]. An adaptive selective ensemble method based on particle swarm optimization (PSO) was used with OS-ELM. The output error in the root mean square sense is compared to a threshold value to decide whether to perform selective ensembling using PSO or to just proceed with the standard EOS-ELM algorithm. This adaptive selective mechanism improves the inherent instability of EOS-ELM.

J. Ensemble of Subset Online Sequential Extreme Learning Machine (ESOS-ELM) Algorithm
To solve the CIL problem for data having the timeliness property with ensemble based OS-ELM, Mirza B. & Lin Z. proposed the ensemble of subsets of OS-ELM algorithm, termed ESOS-ELM in the literature [50]. ESOS-ELM consists of an ordinary EOS-ELM with an external memory to store previously learned information. A control is implemented to detect the validity of the incoming data, which enters the main ensemble of OS-ELMs in balanced subsets. The main architecture of ESOS-ELM is shown in Fig. 3. Though this framework achieves better classification efficiency with imbalanced data having the timeliness property, its major disadvantage is its restriction to bi-class classification problems.

K. Voting based Weighted Online Sequential Extreme Learning Machine (VWOS-ELM) Algorithm
The bi-class classification shortcoming of the original WOS-ELM for CIL problems, present even in ESOS-ELM with timeliness data, was further addressed in [34] by the same authors, who proposed an improved framework called VWOS-ELM. Mirza B. et al. extended the idea presented in [50] and replaced OS-ELM with the WOS-ELM initially proposed for stationary bi-class CIL problems. The resulting architecture is shown in Fig. 4. The control for the detection of implicit time information in the incoming data was replaced with a majority voting system that selects the trained WOS-ELM. Each WOS-ELM in VWOS-ELM has the same number of hidden neurons but different initial parametric settings. The enhanced sequential learning algorithm thus produced has better classification performance than WOS-ELM for bi-class CIL problems and can also handle multi-class CIL problems. This technique, however, cannot work with timeliness data, as there is no detection mechanism for the time validity of the input data stream. Another drawback of this framework is the need to specify the optimal number of neurons in the hidden layer.

L. Meta-cognitive Online Sequential Extreme Learning Machine (MOS-ELM) Algorithm
The more recently proposed MOS-ELM [35] for CIL problems resolves the major issues of WOS-ELM and VWOS-ELM, namely classifying timeliness data and working only with bi-class datasets. MOS-ELM can solve both timeliness and class imbalance classification problems. Effective cost weighting with data sampling was used to extend bi-class CIL to multi-class CIL, and a windowing method was used to detect the validity of the change required for timeliness classification problems. There was no need to specify the optimal number of hidden layer neurons during the initialization process, which resulted in faster convergence with better generalization efficiency [32].

M. M-estimator based Online Sequential Extreme Learning Machine (M-OS-ELM) Algorithm
Recently, a variation of OS-ELM was proposed to cater for the classification of noisy data, especially chaotic time series [51]. By replacing the least-squares term with an M-estimator in the cost function of the OS-ELM optimization problem, an iterative solution that solves the M-estimator based model is devised. To find the optimal threshold value for the M-estimator, a sequential parametric estimation was also proposed. The resulting M-OS-ELM learning algorithm was found to be much more robust than OS-ELM and ReOS-ELM on noisy data.
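The iterative M-estimator solution can be illustrated with iteratively reweighted least squares using the Huber weight function, a representative M-estimator. This is a batch sketch under stated assumptions (a fixed threshold k rather than the sequentially estimated one of [51], and no sequential update); all names are illustrative:

```python
import numpy as np

def huber_weights(residuals, k):
    """Huber M-estimator weights: 1 inside the threshold k, k/|r| outside,
    so large (noisy) residuals are down-weighted instead of squared."""
    r = np.abs(residuals)
    w = np.ones_like(r)
    mask = r > k
    w[mask] = k / r[mask]
    return w

def irls_m_estimator(H, t, k=1.345, n_iter=50):
    """Iteratively reweighted least squares: each pass refits a weighted
    least-squares problem with weights from the previous residuals."""
    beta = np.linalg.lstsq(H, t, rcond=None)[0]   # ordinary LS start
    for _ in range(n_iter):
        W = np.diag(huber_weights(t - H @ beta, k))
        beta = np.linalg.solve(H.T @ W @ H, H.T @ W @ t)
    return beta
```

On data following y = 2x + 1 with one gross outlier, the M-estimator fit stays far closer to the true slope than the ordinary least-squares fit, which is the robustness property M-OS-ELM exploits.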

N. Online Sequential Regularized Extreme Learning Machine (OS-RELM) Algorithm
Owing to the ill-posedness and stability issues of OS-ELM when the number of hidden layer neurons is smaller than the datum size, an improved version of OS-ELM, similar to ReOS-ELM [26], was presented in [52]. OS-RELM makes use of regularization to deal with the inherent ill-posedness. Moreover, it uses the leave-one-out cross-validation method [53] to learn from newly incoming data. A novel update scheme was formulated to eliminate the initialization phase otherwise required for stability and optimal performance during sequential learning. OS-RELM outperforms the basic OS-ELM in terms of computational cost, with superior generalization ability.
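The recursive least-squares update shared by OS-ELM and its regularized variants can be sketched on the hidden-layer output matrix H. This minimal version uses a regularized initialization so the inverted matrix stays well-posed even with few initial samples; the LOO-based tuning of OS-RELM is not reproduced, and the class name is illustrative:

```python
import numpy as np

class SequentialRidge:
    """Core OS-(R)ELM-style recursive update for the output weights beta,
    with regularized initialization P0 = (H0'H0 + I/C)^-1."""

    def __init__(self, H0, T0, C=100.0):
        n = H0.shape[1]
        self.P = np.linalg.inv(H0.T @ H0 + np.eye(n) / C)
        self.beta = self.P @ H0.T @ T0

    def partial_fit(self, H, T):
        # Woodbury-style update: nothing larger than the chunk size is inverted.
        K = np.linalg.inv(np.eye(H.shape[0]) + H @ self.P @ H.T)
        self.P = self.P - self.P @ H.T @ K @ H @ self.P
        self.beta = self.beta + self.P @ H.T @ (T - H @ self.beta)
```

Fed the data in chunks, this recursion reproduces exactly the batch ridge solution over all samples seen so far, which is why sequential learning loses no accuracy relative to batch training.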

O. Online Sequential Reduced Kernel Extreme Learning Machine (OS-RKELM) Algorithm
Although OS-RELM somewhat addressed the ill-posedness and stability issues of OS-ELM, it can still get trapped in a singularity problem when the incoming datum size is larger than the number of hidden layer neurons. Deng et al. proposed a solution by incorporating different kernels for the hidden neurons. The OS-RKELM framework takes a subset of training samples in the initialization phase to train the kernel-based hidden neurons; afterwards, the algorithm can handle incoming data streams arriving either one by one or in chunks [54]. Through implicit feature mapping, OS-RKELM overcomes the drawback of OS-ELM of having to specify optimal random initial weights for good generalization ability.
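The reduced-kernel idea can be sketched in batch form: the hidden layer becomes the kernel map onto a fixed support subset, and the output weights solve a ridge problem. The recursive OS-RKELM update is omitted, and the function names, kernel choice (Gaussian), and parameters are illustrative assumptions:

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """Gaussian kernel matrix between the rows of X and the rows of Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def reduced_kelm_fit(X_support, X_train, T_train, C=100.0, gamma=1.0):
    # Implicit feature map: kernel similarities to the support subset
    # replace random hidden neurons; solve the regularized least squares.
    H = rbf_kernel(X_train, X_support, gamma)
    n = X_support.shape[0]
    return np.linalg.solve(H.T @ H + np.eye(n) / C, H.T @ T_train)

def reduced_kelm_predict(X_support, alpha, X, gamma=1.0):
    return rbf_kernel(X, X_support, gamma) @ alpha
```

No random input weights appear anywhere: the only initialization choices are the support subset and the kernel parameter.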

P. Weighted Online Sequential Extreme Learning Machine with Kernels (WOS-ELMK) Algorithm
ESOS-ELM [50] and WOS-ELM [32] work only for bi-class CIL problems. Both algorithms also require specifying the number of hidden layer neurons for optimal performance. VWOS-ELM [34], presented for multi-class CIL problems, resolves the major stability problem inherent in a single WOS-ELM when the hidden layer neurons are fewer than the incoming data size, via a majority-voting-based ensemble of WOS-ELMs. Lately, another online sequential learning algorithm, MOS-ELM [35], was proposed for multi-class CIL classification. All these variants of online sequential ELM frameworks for CIL problems require explicit feature mapping in the hidden layer. The idea of using kernel-based hidden layer neurons for training the data with implicit feature mapping was exploited in [36], where the authors presented a new framework called WOS-ELMK. The proposed algorithm works extremely well on multi-class CIL problems by utilizing implicit kernel mapping rather than explicit random feature mapping. To cater for larger class-imbalanced data streams, a window approach for transferring the data to the hidden layer was used, with an external memory control to improve convergence.
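The class-imbalance weighting common to the WOS-ELM family can be sketched as a weighted ridge solve, shown here on an explicit hidden-layer matrix H for brevity (the kernel mapping and windowed update of WOS-ELMK are omitted; names are illustrative). Each sample is weighted by the inverse of its class frequency, so minority classes contribute as much to the cost as majority ones:

```python
import numpy as np

def class_weighted_solve(H, T, y, C=100.0):
    """Solve (H'WH + I/C) beta = H'WT with W = diag(1 / class count)."""
    classes, counts = np.unique(y, return_counts=True)
    count_of = dict(zip(classes.tolist(), counts.tolist()))
    w = np.array([1.0 / count_of[c] for c in y])
    Hw = H * w[:, None]                 # row-scaling: W @ H without forming W
    n = H.shape[1]
    return np.linalg.solve(H.T @ Hw + np.eye(n) / C, H.T @ (w[:, None] * T))
```

The row-scaling trick avoids materializing the N-by-N weight matrix, which matters for larger data streams.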

Q. Adaptive Forgetting Factor with Generalized Regularization Online Sequential Extreme Learning Machine (AFGR-OS-ELM) Algorithm
DFF-OS-ELM, proposed in [44], uses a variable exponential forgetting factor with regularization, which causes the regularization effect to slowly fade away. Eventually, DFF-OS-ELM is thus trapped in the same problematic ill-posed matrix inversion. To overcome the instability and limited applicability of DFF-OS-ELM, W. Guo et al. presented a more sophisticated framework called AFGR-OS-ELM in [55]. The authors employed a novel adaptive forgetting mechanism with a more generalized regularization term [56], [57] in the cost function, instead of the exponential forgetting regularization term of [44]. The improved AFGR-OS-ELM framework thereby ensures a constant regularization effect, without fading, throughout the entire learning process, which ultimately removes the ill-posed problem afflicting most OS-ELM algorithms.
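A single recursive step with an exponential forgetting factor can be sketched as follows, to show why the regularization fades: any penalty baked into the initial P is discounted by the same factor at every step. This is a generic forgetting-RLS sketch, not the adaptive mechanism of AFGR-OS-ELM, and the names are illustrative:

```python
import numpy as np

def forgetting_rls_step(P, beta, H, T, rho=0.98):
    """One recursive update with forgetting factor rho (0 < rho <= 1):
    past samples are discounted by rho per step so the model can track
    drifting data, but a fixed regularizer in P decays at the same
    rate, the fading-regularization problem AFGR-OS-ELM addresses with
    an adaptive factor and a persistent penalty term (not shown)."""
    K = np.linalg.inv(rho * np.eye(H.shape[0]) + H @ P @ H.T)
    P = (P - P @ H.T @ K @ H @ P) / rho
    beta = beta + P @ H.T @ (T - H @ beta)
    return P, beta
```

With rho = 1 the step reduces to the standard OS-ELM recursion and matches the batch regularized solution exactly.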

R. Improved Online Sequential Extreme Learning Machine Algorithm
To avoid the ill-posed matrix inversion problem associated with OS-ELM, the number of hidden neurons is usually kept larger than the input datum size. The genetic algorithm (GA) [58], a renowned global optimization tool, was recently used to optimize the randomly initialized weights and biases of the simple OS-ELM [59]. The improved OS-ELM achieves better generalization efficiency and opened the door to a new phase of research into sequential learning algorithms that combine OS-ELM with various evolutionary algorithms.

S. Combination Weight based Ensemble of Online Sequential Extreme Learning Machine (CW-EOS-ELM) Algorithm
In January 2019, the CW-EOS-ELM algorithm was proposed, which selects OS-ELMs from the ensemble based on the individual ELMs' correlation and running error [60]. The AdaBoost algorithm was used to set the initial weights and biases of every OS-ELM in the ensemble, which are later updated analytically during the update phase in accordance with aggregate game theory. The proposed framework works only with bi-class classification problems, but it does not require any beforehand setting of the weights and biases of the individual OS-ELMs, making the re-learning process dynamic.

IV. SUMMARY & DISCUSSION
Usually, the performance of the various SHLFN online sequential learning algorithms is compared in terms of the type of problem and the input data stream type, i.e., imbalanced or timeliness-based data. Very few algorithms can solve both bi-class and multi-class classification problems. To handle imbalanced data streams, the proposed frameworks make use of computationally efficient weighting methods for the different data classes, or of intelligent sampling techniques to extract balanced subsets of the data. The windowing method is widely used to capture the timeliness property of incoming data during retraining of the network. Even though major improvements in sequential learning algorithms for SHLFNs have been suggested over the years, all have limitations of their own. Table I summarizes all the above methods to provide theoretical guidance for practical engineering applications when using any one of these learning algorithms.
The most common performance indices for any training algorithm are the approximation accuracy (training error) and the generalization ability (test error), which are calculated using the average root mean square error produced by the algorithm. Stability is measured using the standard deviation of the averaged results. The complexity of the network trained by the algorithm can be gauged from the total number of hidden layer neurons, and the learning speed is approximated by the amount of training time required. The activation functions of RAN and its derivatives are fixed, in RBF form, except in GAP-DRBF, where a DRBF was used; in OS-ELM and its different variants, two activation functions, of RBF and sigmoid (SIG) form, are mostly used. The RAN, RAN-LTM, RANEKF, MRAN, EMRAN, HM-RAN, GGAP-RBF, GAP-DRBF, and GIRAN algorithms require the expected approximation accuracy as a parameter, and the performance of the network is controlled by adjusting this parameter, usually through trial and error. Similarly, in the ELM variants, the number of hidden layer neurons is the setting used to control the performance of the network.
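The two error indices above reduce to a few lines of arithmetic; as a concrete reference, the accuracy index is the root mean square error of one run, and the stability index is the standard deviation of that error across repeated runs:

```python
import math

def rmse(pred, target):
    """Root mean square error: training error on the training set,
    generalization (test) error on the held-out set."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))

def stability(run_errors):
    """Stability index: population standard deviation of the error over
    repeated runs; zero means identical results on every run, as for
    the RAN family, whose parameters are not randomly initialized."""
    mean = sum(run_errors) / len(run_errors)
    return math.sqrt(sum((e - mean) ** 2 for e in run_errors) / len(run_errors))
```

A randomly initialized OS-ELM typically yields a nonzero stability value, while a deterministic RAN-type algorithm yields exactly zero.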
The sequential learning algorithms RAN, RAN-LTM, RANEKF, MRAN, EMRAN, HM-RAN, GGAP-RBF, GIRAN, and SRAN show good stability when the activation function is an RBF; the standard deviation in this case is zero, which confirms their robustness. For the learning algorithms based on OS-ELM, the standard deviation is usually large, which makes them less robust. The main reason for this poor stability is that the weight connections between the inputs and the hidden layer neurons are set randomly before training and later updated by the iterative algorithm. Since the initial weights differ between runs, the performance of the network differs slightly each time, which reduces the stability of the algorithm.
In addition, most variants of the OS-ELM learning algorithm must use training samples to initialize the network parameters. When the hidden layer is large, more training samples are needed to initialize the parameters, and since these samples have not been learned by the network, they reduce its approximation ability. Conversely, if the total number of data samples is small, the algorithm suffers a large training error or even instability. To obtain good approximation accuracy and generalization ability, different activation functions have been used for the hidden layer neurons [61].
For any OS learning algorithm, the process of learning from the previous data sample must end before the new sample enters the network, so the algorithm must be fast enough to deal with real-time problems. The training time of the RAN learning algorithm and its variants is longer than that of the sequential learning ELM variants. Table II shows the performance of some of the studied sequential learning algorithms in terms of learning time, generalization ability (testing error), accuracy (training error), and number of hidden layer neurons, on the Mackey chaotic time series dataset. A visualization of the learning times of these algorithms is given in Fig. 5. The Mackey chaotic time series dataset used includes 4000 data samples, with 300 testing and validation samples. Finally, Table V summarizes all these sequential learning techniques in terms of their merits and demerits. In summary, when a more stable algorithm is needed, RAN or one of its derivatives should be chosen, whereas when higher processing speed is needed, the OS-ELM alternatives are preferable.

V. FUTURE WORK
The main focus of this review article is to summarize the sequential learning algorithms used with single hidden layer feed-forward neural networks. The authors have tried to list the advantages as well as the various limitations of the existing algorithms, to provide prior guidance to anyone using one of the above-mentioned algorithms. The following problems were identified as open research questions and are worth exploring by prospective researchers.
1) Even after more than a decade of research on ELM, the problem of setting the optimal number of hidden layer neurons remains: it must still be selected randomly beforehand. This implies that the improved algorithms have not solved the fundamental stability problem associated with ELM.
2) At present, the practical applications of sequential learning variants are very limited, and plenty of work needs to be done, especially on using SHLFN sequential learning algorithms for video and text stream classification.
3) Although sequential learning algorithms for multi-layer neural networks have attracted many researchers, their applications to practical problems are very rare. Exploring the application of sequential learning to MLFNs is therefore much needed.
4) Another contribution to improving the generalization ability of sequential learning algorithms can be made by combining them with conventional learning algorithms, e.g., support vector machines, decision trees, random forests, and nearest neighbours.
5) Timeliness-based data streams, over a period of time, completely reshape the correlation between the data stream features, opening a new research objective of using correlation to predict the data stream type.

VI. CONCLUSION
Compared with the batch learning algorithms for SHLFNs, the OS learning algorithms can be applied to real-time problems, making them more suitable for industrial environments. Building on a brief summary of the existing SHLFN online sequential learning algorithms, this paper compares the performance of the various algorithms in terms of training time, approximation accuracy, generalization ability, and algorithm stability. The merits and demerits of the various online learning algorithms, and the scope of their applications, are pointed out, providing theoretical guidance for the practical application of the SHLFN online learning algorithms.

VII. ACKNOWLEDGMENT
The authors acknowledge the King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia, for supporting this research work.