Step size self-adaptation for SGD

Convergence and generalization are two crucial aspects of performance in neural networks. When analyzed separately, these properties may lead to contradictory conclusions. Optimizing a convergence rate yields fast training, but does not guarantee the best generalization error. To avoid the conflict, recent studies suggest adopting a moderately large step size for optimizers, but its added value for performance remains unclear. We propose the LIGHT function with four configurations that explicitly regulate an improvement in convergence and generalization on testing. This contribution allows us to: 1) improve both convergence and generalization of neural networks with no need to guarantee step size stability; 2) build more reliable and explainable network architectures with no need for overparameterization. We refer to this mechanism as step size self-adaptation.


I. INTRODUCTION
Neural networks imitate signal transmission within neurons in the brain with units which are interconnected through weighted links and assembled in layers [1], [2]. Training a neural network implies updating the model weights to best map inputs to outputs. This process is framed as an optimization problem that involves minimizing the model errors on a training dataset. When training a network with gradient-based methods, accelerating convergence to the solution is a high priority [3], [4], but it is not the only performance variable to optimize. Minimizing the difference between the model errors on a training and a testing dataset, which is called the generalization error, plays a fundamental role [5]-[11].
Iterative optimization schemes with an adaptive step size schedule converge faster [12]-[15], but generalize poorly [16]-[20]. They are often outperformed by non-adaptive stochastic gradient descent (SGD) [21] for overparameterized neural networks, where the number of trainable parameters is much higher than the number of samples they are trained on. Several studies have explained this phenomenon: overparameterization ensures faster convergence [22]-[28] while inducing implicit regularization of the original problem, which can potentially ease the minimization of the generalization error [29]-[35]. However, overparameterized models require an enormous number of units and layers to represent, process, and store data. This heavily reduces the transparency of neural networks, making them difficult to interpret.
I. Kulikovskikh is with the Department of Information Systems and Technologies, Samara University, Samara 443086, Russia (e-mail: kulikoskikh.im@ssau.ru).
T. Legović
What makes neural networks generalize well? Relying on extensive empirical studies of SGD, it became evident that the step size maximizing the test accuracy is usually larger than the step size which minimizes the training loss [36], [37], [65]. The occurrence of an implicit regularizer demystifies this matter as well. For a small step size, SGD behaves similarly to GD on the full batch loss function. When the step size increases, the regularizer starts penalizing the mean Euclidean norm of the mini-batch gradients [33], [34], [37], which makes the training loss non-monotonic. Another explanation is that faster gradient descent methods naturally generate chaotic dynamical systems [38], which bring the optimizer to the edge of stability [36], [37] and, thus, yield a non-monotone decrease pattern in the loss function. This finding shares some similarities with the edge of chaos concept [39]. According to this concept, deep networks may be trained only sufficiently close to criticality, avoiding the regions of vanishing and exploding gradients, which correspond to the ordered and chaotic phases, respectively.
To examine the hypothesis, we simulate non-monotonicity in neural networks with the LIGHT (LogIstic Growth with HarvesTing) activation function, which originates from population dynamics [57], [58] and behaves as follows. It starts growing with the rate r by the logistic law. At the time t = T, it starts declining with the rate E. The y-intercepts at the moments 0 and T are specified. The default step size of SGD is modified with regard to r and E. For a diagnostic purpose, we suggest four configurations of the function to explicitly regulate an improvement in convergence and generalization on testing (see Fig. 1): -default-: no improvement; -r-: an improvement in convergence; -E-: an improvement in generalization; -Er-: both convergence and generalization are improved. Increasing the rate r allows us to push a learning system towards the edge of stability. By increasing the rate E, we fix this edge by pushing the system towards the equilibrium point. The presence of sliding modes along discontinuity surfaces, which are modelled with T, establishes control over the system behavior.
We show that the formulated hypothesis is valid and that the LIGHT function contributes to: 1) improving convergence and generalization by training neural networks with a moderately large step size with no need to ensure its stability; 2) building more reliable and explainable network architectures with no need for overparameterization. We refer to it as step size self-adaptation.

II. RELATED WORK
A. Adaptive optimization
SGD is one of the most dominant first-order optimization algorithms for training neural networks [21]. Despite its popularity and simplicity, SGD scales the gradient equally in all directions, which results in worse convergence than the adaptive methods, for example, Adam [14] and Adagrad [12], which scale the gradient using information from past gradients [16]. Adopting optimization methods with variable step sizes leads to faster convergence but worse generalization compared to non-adaptive methods [15], [59], [60]: they train faster, but their performance plateaus on testing due to less stable and predictable behavior. Consequently, further development of optimization techniques is directed towards a better trade-off between convergence and generalization.
To address the above problem, the first group of studies is directed towards different SGD modifications, such as SGD with extended differentiators [61], random reshuffling [62], and local changes in gradients [63], [64].
The analysis of SGD-based optimization for overparameterized models has recently become another active area of research interest [4], [20], [22], [24], [34], [65]. Recent studies indicated that large step sizes can preserve good generalization and accelerate SGD convergence without any additional gradient scaling. While analyzing the effect of overparameterization, Wu et al. [34] pointed to the difference in directional biases for SGD and GD with a moderate and an annealing step size. Vaswani et al. [66] explored line-search techniques and provided heuristics to automatically set larger step sizes. Li and Arora [65] carried out an analysis of an exponential step size schedule. They showed that using SGD with momentum [26] and an exponentially increasing step size, coupled with batch normalization, maintains a good balance between convergence and generalization across all standard architectures. Nitanda and Suzuki [67] provided an analysis of a convergence rate for averaged SGD in the Neural Tangent Kernel regime. The authors disclosed the conditions under which the method can achieve the minimax optimal convergence rate, with a global convergence guarantee.
In parallel with more successful SGD adoption for overparameterized models, substantial progress has been achieved in optimization methods with adaptive step sizes. SGDP and AdamP use effective step sizes without changing the update directions [19], which preserves the original convergence properties of GD optimizers. RAdam adopts the learning rate warmup heuristic to rectify the variance of adaptive step sizes [26] and, by that, stabilize training, accelerate convergence, and improve generalization. In an attempt to balance generalization and convergence under unstable and extreme step sizes, Luo et al. [16] put forward AdaBound and AMSBound, which adopt dynamic bounds on step sizes to eliminate the generalization gap between adaptive methods and SGD and maintain a higher learning rate early in training. These methods were further developed with regard to a dynamic decay rate in [17]. Xie et al. [18] proved that the normalized Adagrad ensures robustness to the choice of hyper-parameters and achieves a linear convergence rate for a subset of either strongly convex functions or non-convex functions that satisfy the Polyak-Lojasiewicz (PL) inequality. Zhou et al. [20] proposed the SAGD method, which leverages differential privacy to boost the generalization performance of adaptive gradient methods.

B. Adaptive activation
Using adaptive activation functions in neural networks is one more way to balance convergence and generalization. The first activation function represented the "all-or-none" character of nervous activity with a step function [68], [69] to solve a binary classification problem. Wilson and Cowan [70] derived coupled nonlinear differential equations from the dynamics of spatially localized populations containing both excitatory and inhibitory model neurons. They investigated population responses to various types of stimuli and introduced an s-shaped monotonic function of stimulus intensity, the sigmoid function. Yamada and Yabuta [71] suggested an approach to optimally tune the shape of the sigmoid function in control systems. While comparing the approximation capabilities of activation functions, DasGupta and Schnitger [72] pinpointed that the standard sigmoid is more powerful than the binary threshold even when computing Boolean functions [73]. Piazza et al. [74] proposed the adaptive polynomial activation function to address the issue of complexity in neural networks. Xu and Zhang [75] proposed another adaptive activation function to reduce a network's size. Goh and Mandic [76] suggested adapting the amplitude of activation functions while reconsidering recurrent neural networks in terms of nonlinear adaptive filters. Bai et al. [77] showed that varying the slope of an activation function with different step sizes is more beneficial than using momentum and an adaptive step size in the backpropagation algorithm. Flennerhag [78] suggested simple drop-in replacements that learn to adapt their parameterization with regard to the network inputs. PPolyNets [79] are accurate and efficient parametric polynomial activations specifically developed for encryption schemes which support only polynomial operations. Goyal et al. [80] suggested normalizing polynomial activations to increase the stability of neural networks.
Kunc and Klěma [81] proposed a novel transformative adaptive activation function that improves the gene expression inference by generalizing existing adaptive activation functions.
De Felice et al. [40] drew attention to the fact that the biological activation function has a more complicated behavior, which reduces to the usual (step or sigmoid) function for some hyperparameters describing its shape, and stated that the non-monotonicity of the function increases the capacity of neural networks. Baldi and Atiya [45] extended previously known results regarding the effects of delays on stability and convergence properties. Forti and Nistri [53] introduced a general class of neural networks where the neuron activations are modeled by discontinuous functions. The authors discovered that the presence of sliding modes ensures global convergence in neural networks in finite time. Duan et al. [49] established the existence and global exponential stability of almost periodic solutions for delayed high-order Hopfield neural networks. The study [46] discussed the dynamics of a class of delayed neural networks with discontinuous activation functions. The authors concluded that the solution of delayed neural networks with discontinuous activation functions can be regarded as a limit of the solutions of delayed neural networks with high-slope continuous activation functions. According to [82], [83], rectified linear units (ReLU) and their different modifications [84], [85] in the hidden layers of neural networks demonstrate better convergence and generalization in comparison with continuous activations. Exploring monostability and multistability of almost-periodic solutions in fractional-order neural networks, Wan et al. [86] indicated that the dynamics in neural networks with unsaturating piecewise linear activation functions is more complex. Nie and Zheng [41] looked into the problem of coexistence and dynamical behaviors of multiple equilibrium points for neural networks with discontinuous non-monotonic piecewise linear activation functions and time-varying delays.
The study revealed that discontinuous neural networks can have greater storage capacity than the continuous ones. Hayou et al. [87], however, mentioned that only a specific choice of hyperparameters such as initialization and activation with regard to the concept of chaos [39] improves convergence and generalization in neural networks.

III. PRELIMINARIES
Given a training dataset {(x_i, y_i)}, i = 1, . . . , m, we minimize an empirical loss function for each mini-batch dataset B(t) ⊆ {1, . . . , m} with a weight vector θ ∈ R^n:

L(θ) = (1/|B(t)|) Σ_{i ∈ B(t)} ℓ(y_i, ŷ_i(θ)), (1)

where ℓ measures the discrepancy between the output y and the model prediction ŷ. The SGD optimizer finds the weight vector with a fixed step size η:

θ_{t+1} = θ_t − η ∇_θ L(θ_t), (2)

where θ collects the layer weights θ_l, l ∈ {1, . . . , L}, L is the number of layers, and d_l is the number of nodes in the layer l.
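The update rule (2) can be sketched in a few lines of NumPy; the quadratic loss, data shapes, and hyperparameters below are illustrative assumptions, not the paper's setup:

```python
import numpy as np

def sgd_step(theta, grad_fn, batch, eta):
    """One SGD update with a fixed step size eta."""
    return theta - eta * grad_fn(theta, batch)

# Illustrative mini-batch gradient of a squared loss (an assumption for
# demonstration): L(theta) = mean((X @ theta - y)^2) over the batch B(t).
def grad_fn(theta, batch):
    X, y = batch
    return 2.0 * X.T @ (X @ theta - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
true_theta = np.array([1.0, -2.0])
y = X @ true_theta                                 # noiseless linear targets
theta = np.zeros(2)
for _ in range(500):
    idx = rng.choice(100, size=10, replace=False)  # draw a mini-batch B(t)
    theta = sgd_step(theta, grad_fn, (X[idx], y[idx]), eta=0.1)
```

Since the targets are noiseless, the mini-batch gradient vanishes at the optimum and the iterates converge to the true weights.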

IV. LIGHT
We built a diagnostic function σ_{r,E}(t) on the standard sigmoid by simulating different types of non-monotonicity with the growth rate r and the decline rate E. The function grows with a constant rate r according to the logistic law. After time T, it declines with a constant rate E. We call the function LIGHT (LogIstic Growth with HarvesTing) as its behavior inherits the principles of population dynamics [57], [58]. Let us present the LIGHT function.
Definition (LIGHT). For any time t ∈ R, time instant T > 0, growth rate r > 0, and decline rate E ≥ 0, LIGHT is a non-monotonic function σ_{r,E}(t), where ε is the extent to which r is impacted by E, with the derivative σ′_{r,E}(t), where q is the rate with which the function grows when it is small.
By introducing q, we move from the infinitesimal calculus to quantum calculus [88]-[91] to avoid the concept of limits and, thus, simplify the definition. Another justification of this parameter comes from the point of explainability: it generalizes the Verhulst (q = 1, light-v) [57], [92], [93] and the Gompertz (q → 0, light-g) growth models. To introduce non-monotonicity in the LIGHT diagnostic function, we propose four different configurations (see Fig. 2). We augmented the -default- configuration, which reduces to the sigmoid if q = 1 and N_0 = 0.5, with three more configurations which regulate an improvement in convergence and generalization: -default-: r = 1, E = 0, no improvement; -r-: r ↑, E = 0, an improvement in convergence; -E-: r = 1, E ↑, an improvement in generalization; -Er-: r ↑, E ↑, an improvement in both convergence and generalization.
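For context, a standard q-parameterized growth law from population dynamics that recovers both named special cases is the generalized (Richards-type) logistic equation; this is a common textbook form given here as an assumption, not necessarily the paper's exact parameterization:

```latex
% Generalized logistic (Richards) growth with deformation parameter q:
\frac{\mathrm{d}N}{\mathrm{d}t} \;=\; \frac{r}{q}\, N
\left(1 - \left(\frac{N}{K}\right)^{q}\right),
\qquad K > 0 \ \text{(carrying capacity)}.
% q = 1 recovers the Verhulst logistic law:
%   dN/dt = r N (1 - N/K),
% while q -> 0, using (1 - x^q)/q -> -ln x, recovers the Gompertz law:
%   dN/dt = -r N ln(N/K).
```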
The sign ↑ means that the value of the parameter is significantly increased. The decline in growth, delayed by T, induces a discontinuity in the function. A simultaneous increase in r and E makes the discontinuity even more noticeable by scaling the function magnitude, which results in a greater impact on the convergence/generalization trade-off. To diagnose a neural network's capability with the LIGHT function, we modified the step size η in the SGD optimizer (2) by replacing σ(t) with σ_{r,E}(t) in the loss (1).
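As a numerical illustration of the described behavior (logistic growth at rate r up to T, then decline at rate E), the curve can be sketched as follows; the closed form below is an assumed textbook logistic-with-harvesting shape with q = 1, not the paper's exact Definition:

```python
import numpy as np

def light(t, r=1.0, E=0.0, T=1.5, N0=0.5, K=1.0):
    """Sketch of a LIGHT-like curve: Verhulst logistic growth (rate r) from
    the y-intercept N0 towards the capacity K, followed by an exponential
    decline (harvesting rate E) after time T. Assumed form, q = 1."""
    t = np.asarray(t, dtype=float)
    grow = K * N0 * np.exp(r * t) / (K + N0 * (np.exp(r * t) - 1.0))
    if E == 0.0:                       # -default- and -r-: no harvesting
        return grow
    # Value reached at the switching time T, then exponential decline.
    NT = K * N0 * np.exp(r * T) / (K + N0 * (np.exp(r * T) - 1.0))
    return np.where(t <= T, grow, NT * np.exp(-E * (t - T)))

# The four configurations from the text, with illustrative "increased" values:
configs = {"default": (1.0, 0.0), "r": (8.0, 0.0),
           "E": (1.0, 8.0), "Er": (8.0, 8.0)}
```

Increasing r steepens the growth, while E > 0 introduces the delayed decline that creates the discontinuity controlled by T.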

V. EXPERIMENTS

A. Experimental Setup
As noted in the surveyed literature on adaptive optimization and activation, the success in balancing convergence and generalization is often attributed to the complexity and capacity of neural networks. To clearly observe the distinctive contribution of the LIGHT function to the step size adaptation, we focused primarily on the simplest network architecture, a single neuron (L = 0) [97]-[103], the capacity of which has recently driven renewed interest in neural networks [104]-[107]. To investigate how a small increase in the network complexity, without overparameterization, may impact the proposed instrument, we also complemented the model with a hidden layer (L = 1) of ReLU neurons (d_l = 5), where the LIGHT function is applied only to the output.
We compared four non-adaptive methods, SGD with the -default- configuration (sigmoid-sgd) and SGD with the -r-, -E-, and -Er- configurations (light-v-sgd and light-g-sgd), to two popular adaptive methods with the -default- configuration, Adam (sigmoid-adam) and AdaGrad (sigmoid-adagrad). For all the optimizers, we used the default parameters, a batch size |B(t)| = 75, and n_epoch = 1500. The number of runs was equal to 10.
The light-v and light-g hyperparameters were optimized with a random search [108] with a 2.5% random pick of all possible combinations from the full grid space within the following ranges: r ∈ [0.1, 20] with the number of points n_r = 5; E ∈ [0, 20], n_E = 5; T ∈ [0, 3], n_T = 3; N_T ∈ [0.2, 0.8], n_{N_T} = 5. The number of epochs for the hyperparameter search was equal to 1. As we optimized the hyperparameters with a small percentage of random combinations and one epoch, we added redundancy to the experimental setup by implementing both variations of the LIGHT function (light-v and light-g). This allowed us to validate the consistency of the chosen hyperparameters and to distinguish the examples where the rates r and E were properly balanced.
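The described search can be sketched as follows; the scoring function is a hypothetical stand-in (in the paper each combination is scored by accuracy after a single training epoch), and the concrete grid points within each range are assumptions:

```python
import itertools
import random

def score(r, E, T, NT):
    """Hypothetical stand-in objective; the paper scores each combination
    by accuracy after one training epoch."""
    return -(r - 4.0) ** 2 - (E - 6.0) ** 2

# Full grid: n_r = 5, n_E = 5, n_T = 3, n_NT = 5 points in the stated ranges.
grid = {
    "r":  [0.1, 5.0, 10.0, 15.0, 20.0],
    "E":  [0.0, 5.0, 10.0, 15.0, 20.0],
    "T":  [0.0, 1.5, 3.0],
    "NT": [0.2, 0.35, 0.5, 0.65, 0.8],
}
combos = list(itertools.product(*grid.values()))   # 5 * 5 * 3 * 5 = 375
random.seed(0)
n_pick = max(1, round(0.025 * len(combos)))        # 2.5% random pick
sample = random.sample(combos, n_pick)
best = max(sample, key=lambda c: score(*c))        # best sampled combination
```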
The LIGHT function in two variations was implemented as a custom output activation layer with the Keras class LIGHT(Layer). The layer controls the convergence and generalization trade-off with the LIGHT configurations. The code is available at the repository: https://github.com/yukinoi/light-diagnostic-function.

B. Synthetic Data
We generated a set of synthetic linearly separable and non-separable datasets (m = 1000, n = 2) with lower and higher levels of variance (see Fig. 3). The datasets were randomly split into training (80%) and testing (20%) subsets. For conciseness, all the plots on synthetic datasets for different combinations of network architectures are deferred to Appendix A. Figures A1-A16 (a), (c), (e) depict the test accuracy curves for the -r-, -E-, and -Er- configurations (light-v and light-g) on the synthetic datasets with lower and higher variance. For the sake of comparison, we added the -default- configuration of non-adaptive (sigmoid-sgd) and adaptive (sigmoid-adam and sigmoid-adagrad) optimizers to each plot. The presented results comply with the expected behavior of the curves given in Figure 1. The curves on CIRCLES with one neuron (Figure A10) seem different as they reach the maximum accuracy in a few epochs in comparison with the other datasets. However, they behave similarly within these epochs.
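A minimal generator for such two-dimensional datasets might look like the sketch below; it is a hypothetical stand-in, as the exact generators and variance levels are not specified in this section:

```python
import numpy as np

def make_two_class_2d(m=1000, sep=4.0, noise=1.0, seed=0):
    """Two Gaussian classes in R^2. Larger sep makes the classes linearly
    separable; larger noise raises the variance level. Returns an 80/20
    train/test split, matching the setup described in the text."""
    rng = np.random.default_rng(seed)
    half = m // 2
    X0 = rng.normal(loc=(-sep / 2.0, 0.0), scale=noise, size=(half, 2))
    X1 = rng.normal(loc=(+sep / 2.0, 0.0), scale=noise, size=(m - half, 2))
    X = np.vstack([X0, X1])
    y = np.concatenate([np.zeros(half), np.ones(m - half)])
    perm = rng.permutation(m)          # shuffle before splitting
    X, y = X[perm], y[perm]
    cut = int(0.8 * m)                 # 80% training / 20% testing
    return (X[:cut], y[:cut]), (X[cut:], y[cut:])

(train_X, train_y), (test_X, test_y) = make_two_class_2d()
```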
Tables I-IV summarize the maximum test accuracy and the number of epochs needed to reach it for each dataset. The best balance between the maximum accuracy and the number of epochs is highlighted in bold. We can observe that both light-v-sgd and light-g-sgd outperform the other optimizers with the -default- configuration. The differences between the light variations point to an apparent discrepancy in the balance between r and E, which occurred due to the setting of the hyperparameter optimization. When r and E are not fully balanced, we can also observe that the -r- or -E- configuration becomes slightly superior to the -Er- configuration (see Table I, light-g-sgd; Table II, light-v-sgd; Table III, light-g-sgd).
The optimal light-v and light-g hyperparameters for different configurations are demonstrated in Figures A1-A16 (b), (d), (f). By analyzing the boxplots, we see that a non-zero decline rate in the -Er- configuration increases the growth rate r compared to the value of r in the -r- configuration. It also brings more stable results as the standard deviation of the accuracy curves is substantially reduced. When optimizing r on training (validation), the result does not deliver good generalization. This means that the system is pushed towards the edge of stability, which is not clearly defined. By increasing E, we fixed the edge of stability, improving generalization. In addition, it allowed us to shift the region for picking r to the right, allowing for more extreme values and, thus, accelerating convergence.
To underline the LIGHT benefits in trading off convergence and generalization, we also analyzed the number of epochs at a test accuracy threshold (see Tables V-VIII). The lowest number of epochs needed to reach the accuracy threshold is highlighted in bold. The hyphen indicates that the accuracy threshold is not reached in 1500 iterations. We can see that the light-based SGD greatly outperforms the other optimization methods.

C. Application
We validated the proposed step size adaptation approach with the LIGHT diagnostic function on the MNIST, Fashion MNIST, and CIFAR10 datasets. The labels of the image classification datasets were binarized with the target class {5}. The samples were randomly extracted (m = 1000) from each of them and split into training (80%) and testing (20%) subsets. To classify the images, we used the one-layer network architecture and the light-g variation with the pre-defined growth and decline rates: r = 4.08, E = 6.4. Figure 4 shows the accuracy curves on testing for the -Er- configuration. The number of epochs needed to reach the maximum accuracy and the test accuracy threshold for each image dataset are summarized in Tables IX, X, and XI. As before, bold font indicates the lowest number of epochs needed to reach the accuracy threshold, and the hyphen shows that the threshold is not reached. As we can see, the provided results disclose the benefits of the proposed step size adaptation in managing the trade-off between convergence and generalization with reliable and explainable network architectures.
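The one-vs-rest binarization against the target class {5} amounts to a single comparison; the label array below is illustrative:

```python
import numpy as np

# Hypothetical original class labels (e.g. digit classes for MNIST).
labels = np.array([3, 5, 1, 5, 9, 0])

# Binarize against the target class {5}: 1 for class 5, 0 otherwise.
y_bin = (labels == 5).astype(int)
```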

VI. CONCLUSION
We contributed to the direction of SGD-based optimization with step size self-adaptation. This technique allows us to increase a step size by some fixed growth rate r, which is self-stabilized with some fixed decline rate E, in order to ensure the best balance between convergence and generalization. It equips the optimizer with a simple instrument for explicit control over the convergence/generalization trade-off, which is the key to building reliable network architectures. Rather than suggesting another adaptive and non-monotone activation function, we put forward this instrument as a simple diagnostic function. The function adopts sliding modes in line with some fixed growth rate r and decline rate E to simulate discontinuities and explicitly regulate their influence on convergence and generalization. In addition, the LIGHT function relies on the laws of population dynamics as the original s-shaped monotonic function does [70] but exhibits more complex behavior, as noted in [40]. This means that the proposed self-adaptation mechanism may open up new opportunities for building not only reliable but also explainable neural network architectures [109] with greater capacity.