A Smoothed LASSO Based DNN Sparsification Technique

Abstract—Deep Neural Networks (DNNs) are increasingly being used in a variety of applications. However, DNNs have huge computational and memory requirements. One way to reduce these requirements is to sparsify DNNs by using smoothed LASSO (Least Absolute Shrinkage and Selection Operator) functions. In this paper, we show that for the same maximum error with respect to the LASSO function, the sparsity values obtained using various smoothed LASSO functions are similar. We also propose a layer-wise DNN pruning algorithm, where the layers are pruned based on their individually allocated accuracy loss budgets, determined by estimates of the reduction in the number of multiply-accumulate operations (in convolutional layers) and weights (in fully connected layers). Further, structured LASSO variants in both convolutional and fully connected layers are explored within the smoothed LASSO framework and the trade-offs involved are discussed. The efficacy of the proposed algorithm in enhancing sparsity within the allowed degradation in DNN accuracy, and the results obtained on structured LASSO variants, are shown on the MNIST, SVHN, CIFAR-10, and Imagenette datasets.


I. INTRODUCTION
Deep Neural Networks (DNNs) are used in a large number of classification and recognition tasks in commercial, medical and military applications. Because they model complex patterns and predictions for real-world problems, DNNs are computationally and memory intensive, owing to their large number of connections and computational nodes. This creates a bottleneck in deploying DNNs on present-day devices owing to the limited on-board resources available on these devices. This situation necessitates strategies that minimize the computational and memory requirements of DNNs without adversely affecting their performance.
Research efforts in this direction can be broadly classified into two clusters: 1) approximating certain portions of DNNs by means of approximate computing techniques; 2) pruning the redundant portions of DNNs using sparsifying algorithms. Both approaches leverage the inherent error tolerance of DNN applications to minimize the resources required.
Regularization is a widely used technique to help DNNs generalize beyond the examples learnt during training. The main objective during training is to minimize the cost function C(W), where W is the set of all the DNN weights. Typically, a cost function comprises an error function (such as cross entropy) and a penalty term for regularization.
The benefit of using LASSO (Least Absolute Shrinkage and Selection Operator) regularization while training DNNs is two-fold: it enhances the generalization capabilities of the DNNs for the examples learnt during training, and it produces sparse DNNs. However, the LASSO function is non-differentiable at the origin. Therefore, it does not fit directly into the standard DNN training framework of gradient descent based algorithms that backpropagate gradient information. One widely used approach to overcome this difficulty is to use smoothed LASSO functions ([1]-[3], [7], [29], [30], [35]). Using a non-convex approximation [49] is another way to overcome the non-differentiability of the LASSO function at the origin. The non-differentiability of LASSO can also be dealt with by using the subgradient method. However, the subgradient method has a slower rate of convergence than smoothed LASSO based methods [13].
In this work, we focus on sparsification of DNNs using smoothed LASSO based regularization techniques.
A. Related works

DNN approximation involves approximating certain DNN portions by various methods such as precision scaling, approximate multipliers and approximate adders. Pruning involves sparsifying a DNN by removing the insignificant (or redundant) connections (or groups of connections) from the DNN.
The approximation/pruning algorithms employed by various works in the literature typically comprise identifying the DNN portions to be approximated (or pruned) and modifying them accordingly. In many cases, the DNN is then healed by means of re-training to compensate for the loss in DNN performance due to the approximations or pruning (sparsifying).
1) DNN approximations: [5] and [11] rank neurons based on their sensitivity. The sensitivity of a neuron in these works is determined by calculating its contribution to the DNN output degradation, using the gradient information of the neuron output obtained during backpropagation. Neurons causing significant degradation in DNN output quality are deemed sensitive. The less sensitive neurons are approximated (by using precision scaling and approximate multipliers) and the DNNs are retrained to possibly recover the loss in accuracy. These steps are repeated iteratively as long as the DNN performance loss is within the tolerable limit. [5] calculates the energy consumption of MAC (Multiply And Accumulate) operations in every layer using the Synopsys power compiler. It then chooses the most energy-intensive DNN layer in every iteration and identifies the sensitive neurons within that layer. [11] ranks the DNN neurons based on their sensitivity values, and those with sensitivity values below a certain threshold are approximated. [8] approximates a DNN using permuted diagonal matrices, where a pre-trained dense DNN is approximated by a permuted diagonal model and only the non-zero elements lying along the permuted diagonals are re-trained.
2) DNN Sparsification: The majority of DNN sparsification works use the LASSO function. Works such as [4], [9], [31]-[34], [36]-[43], [45] use deep learning libraries such as TensorFlow and PyTorch for training the DNNs with the LASSO function. These libraries use subgradient methods to deal with the non-differentiability of the LASSO function at the origin. Subgradient methods are known to have slower convergence [13] due to the step change in the derivative near the origin. [14] incorporates a convex optimization based algorithm to sparsify the DNN. However, algorithms based on convex optimization techniques incur higher computational costs.
[1]-[3], [7] cover a wide range of LASSO smoothing functions, from polynomial approximations [1] to higher order approximations (logarithmic [2], hyperbolic [3]). [29], [30], [35] use a quartic smoothing function in the context of L1/2 regularization based pruning. [1] and [7] employ polynomial based smoothing functions and the quadratic smoothing function, respectively, in the context of group LASSO, for pruning neurons in the hidden layers and the input layer (features) of feedforward fully connected neural networks. As shown in [3], the poorer convergence of the logarithmic smoothing function proposed in [2] limits its usage in LASSO based training of DNNs. The remaining smoothing functions (the polynomial class of functions [1] and the hyperbolic function [3]) have demonstrated significant generalization and pruning capabilities for different neural nets with different activation and error functions.
In [6], [12], [46], the DNNs are sparsified by retaining the important connections by means of L2 regularization and dropout techniques, and pruning the unimportant connections. The remnant DNN is retrained (with L2 regularization and dropout) iteratively as long as the degradation in DNN accuracy remains within the acceptable limits.
3) Pruning algorithms: [37], [38], [43], [44] use the L1 norm as a sensitivity metric in the context of pruning filters in convolutional layers. The filters with lower L1 norm values are deemed less sensitive and are pruned first. [45] prunes one layer at a time for various compression ratios (fraction of connections pruned). After obtaining the DNN accuracy loss characteristics for various pruning ratios for all layers, the set of best individual compression ratios for all layers that achieves the targeted overall DNN compression ratio for the minimum possible loss in accuracy is determined using a binary search algorithm. [42] uses the layer-wise relevance propagation method [50], where the contribution of layers to the intended output of the DNN is ascertained by using backpropagation information. [46] uses Monte Carlo analysis for identifying the best subset of weights in the pruned DNN. It also proposes another heuristic where filters with the least sum of output values across the testing data are pruned.

B. Contributions of this paper
The main contributions of this paper are as follows:
1) As opposed to previous studies ([1], [3]), we find that all smoothing functions exhibit similar pruning capabilities if the maximum error with respect to the LASSO function is set to the same value across all the smoothing functions.
2) We propose a novel layer-wise pruning algorithm to enhance the sparsity in DNNs. Based on estimates of the reduction in the number of multiply-accumulate operations (in convolutional layers) and weights (in fully connected layers), accuracy loss budgets are assigned to the layers, which are then pruned accordingly.
3) Within the smoothed LASSO framework, we explore structured LASSO variants for the fully connected and convolutional layers separately and discuss the trade-offs involved.
The rest of this paper is organized as follows: Section II discusses the limitations of the existing algorithms using smoothed LASSO functions. Section III contains the study of smoothing functions based on a fixed maximum error. Section IV presents the proposed novel layer-wise pruning algorithm and Section V contains a detailed discussion of structured LASSO variants. Section VI presents the results of experiments on the proposed pruning algorithm and structured LASSO variants. Section VII concludes the paper. Architecture details of the neural networks used in this paper are given in the appendix.

II. LIMITATIONS OF EXISTING ALGORITHMS USING SMOOTHED LASSO FUNCTIONS

In this section, we discuss the limitations of the LASSO based algorithms in [1] and [3].
We use the term "sparsity" to denote the proportion of weights pruned and "neuron sparsity" for the proportion of neurons pruned.
In the presence of a penalty term, the cost function C(W) is written as

    C(W) = C_o(W) + λ P(W),

where W is the set of DNN weights, C_o(W) is the error term, P(W) is the penalty term and λ is the penalty coefficient. When a smoothed LASSO function h(w) is used as the penalty term, the cost function can be re-written as

    C(W) = C_o(W) + λ Σ_j Σ_{i=1..N_j} h(w_i^j),

where w_i^j is the i-th weight in the j-th layer and N_j is the number of weights in the j-th layer. The existing smoothing functions are listed in Table I (for brevity, w_i^j is written as w).
For example, the hyperbolic approximation [3] in Table I is h(w) = ln(cosh(αw))/α. For the sqrt, quadratic, quartic and sextic approximations, the smoothing parameter (β) lies between 0 and 1, whereas in the case of the logarithmic and hyperbolic approximations, γ and α (both > 0) are typically of the order of 100. As defined in [3], the error with respect to LASSO is

    E_w = | |w| - h(w) |,

and the maximum error is e = max_w E_w. Given the poorer convergence of the logarithmic smoothing function compared to the other smoothing functions [3], and the higher computational expense of the square root smoothing function coupled with its lower pruning capability [1], these two smoothing functions are omitted from further discussion in this paper. The hyperparameter values, namely the learning rate (η) and penalty coefficient (λ), for the various experiments have been systematically chosen based on the heuristics discussed in [28].
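To illustrate how such a penalty enters training, the following NumPy sketch evaluates a smoothed LASSO penalty over all layers and estimates its maximum error with respect to |w|. It assumes a Huber-style quadratic smoothing h(w) = w^2/(2β) + β/2 for |w| ≤ β and |w| otherwise; the exact polynomial forms used in this paper are those listed in Table I, so this is a sketch rather than a reproduction of them.

```python
import numpy as np

def quadratic_smooth(w, beta=0.003):
    # Assumed Huber-style quadratic smoothing of |w|: quadratic for |w| <= beta,
    # equal to |w| beyond the transition point beta.
    w = np.asarray(w, dtype=np.float64)
    return np.where(np.abs(w) <= beta, w**2 / (2.0 * beta) + beta / 2.0, np.abs(w))

def smoothed_lasso_penalty(weights_per_layer, h=quadratic_smooth):
    # P(W) = sum over layers j and weights i of h(w_i^j)
    return sum(h(w).sum() for w in weights_per_layer)

def max_error_wrt_lasso(h, grid=np.linspace(-1.0, 1.0, 200001)):
    # e = max_w | |w| - h(w) |, estimated on a dense grid around the origin
    return np.max(np.abs(np.abs(grid) - h(grid)))

layers = [np.random.randn(784, 100) * 0.1, np.random.randn(100, 10) * 0.1]
print("penalty P(W):", smoothed_lasso_penalty(layers))
print("max error e :", max_error_wrt_lasso(quadratic_smooth))  # beta/2 = 0.0015 here
```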
In order to evaluate the performance of the various smoothing functions, we carried out a series of experiments using MNIST architecture I and MNIST architecture II. The pixel values in the MNIST dataset have been normalized to [-1,1] for all the experiments in this paper [1]. To compare with the results published in [1], we chose the smoothing parameter (β) for the polynomial class of functions to be 0.003 and the smoothing parameter (α) for the hyperbolic function to be 634, as discussed in [3]. The pruning threshold for each of these experiments was chosen so as to ensure a maximum dip of 0.5% in the testing accuracy of the networks. Table II shows the neuron sparsity values obtained by different smoothing functions using Group LASSO based pruning [1] on MNIST architecture II. The experiments were carried out in two different ways: 1) neurons were pruned during training (as done in [1]); 2) neurons were pruned after training. The neuron sparsity values in Table II support the conclusion in [1] that employing higher order smoothing functions leads to larger sparsity in neural networks. LASSO based pruning [3] was carried out on MNIST architecture I using the different smoothing functions, and Table III shows the sparsity values obtained. From these sparsity values, it can be observed that higher order smoothing functions exhibit superior pruning capabilities with conventional LASSO based penalty functions as well.
The limitations of the algorithms proposed in [1] and [3] are as follows:
1) The pruning algorithms discussed above were implemented with the assumption that the transition from the smoothing approximation to the LASSO term occurs at the same value (β) across all the smoothing functions (Figure 1a). As seen from Figure 1a, this assumption causes significant differences in the maximum error (e = max E_w) of each smoothing function. This, in turn, affects the sparsity capabilities of the different smoothing functions, as will be seen in the next section.
2) From the neuron sparsity values in Table II, it is observed that pruning MNIST architecture II after training results in larger neuron sparsity values in hidden layer II compared to pruning during training [1], for similar accuracy values. This can be attributed to the fact that training a DNN fully until convergence with a LASSO based function is more likely to yield a larger number of insignificant weights than altering the trajectory of some weights by pruning during the training process. Therefore, for all further experiments in this paper, we prune the DNNs only after they have converged.
3) The algorithms in [1], [3] train the neural networks using the smoothed Group LASSO/LASSO penalty functions from scratch. However, training the present state-of-the-art DNNs from scratch is quite time consuming. Also, the present state-of-the-art DNNs are trained robustly using different penalty terms such as L2. Therefore, it is advantageous to take these pre-trained DNNs and train them again with smoothed LASSO based functions to achieve sparsity. [4], [6] also used pre-trained DNNs, for pruning channels and weights respectively, by means of LASSO and dropout methods. We use this approach of starting with pre-trained DNNs in our proposed smoothed LASSO based pruning algorithm, as discussed in Section IV.
In the following sections, we carry out systematic studies on various smoothed LASSO approximations and LASSO variants to acquire deeper insights into DNN pruning.

III. STUDY OF SMOOTHING FUNCTIONS BASED ON A FIXED MAXIMUM ERROR
In this section, we look at the effect of fixing the maximum error (e = max E_w) to the same value for all smoothing functions. The smoothing functions are re-written in terms of e in Table IV. Figure 2a shows the different smoothing functions when the maximum error value (e) is set to the same value for all of them. From Figure 2b, it can be seen that the derivatives of the different smoothing functions now match more closely than those in Figure 1b (where the smoothing functions have the same transition points). Since the weight updates are based on the derivatives of the smoothing functions, setting the maximum error (e) to the same value in all cases is expected to result in relatively similar weight updates across all the cases.
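To make the rewrite in terms of e concrete, the sketch below maps a target maximum error e to smoothing parameters for two of the functions: the hyperbolic approximation ln(cosh(αw))/α, whose error with respect to |w| approaches ln(2)/α, and a Huber-style quadratic approximation whose peak error β/2 occurs at the origin. The quadratic form is an assumption standing in for the polynomial entries of Table IV, so the β = 2e relation is illustrative only.

```python
import numpy as np

def params_for_max_error(e):
    # Hyperbolic ln(cosh(alpha*w))/alpha: the error |w| - h(w) approaches
    # ln(2)/alpha for large |w|, so alpha = ln(2)/e.
    # Quadratic (assumed Huber form): peak error beta/2 at w = 0, so beta = 2e.
    return {"alpha_hyperbolic": np.log(2.0) / e, "beta_quadratic": 2.0 * e}

def check(e=1e-3, grid=np.linspace(-0.5, 0.5, 100001)):
    p = params_for_max_error(e)
    beta, alpha = p["beta_quadratic"], p["alpha_hyperbolic"]
    quad = np.where(np.abs(grid) <= beta, grid**2 / (2 * beta) + beta / 2, np.abs(grid))
    hyp = np.log(np.cosh(alpha * grid)) / alpha
    print("max error, quadratic :", np.max(np.abs(np.abs(grid) - quad)))   # ~= e
    print("max error, hyperbolic:", np.max(np.abs(np.abs(grid) - hyp)))    # -> e

check()
```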
For proper performance (pruning capability) of the smoothing functions, the point of transition from the smoothing function to the | · | function can be at most of the order of 10^-2 [7]. As this transition point depends on the maximum error (e) value, it is necessary to arrive at a proper choice of e based on systematic experiments.
To this end, we performed a series of experiments on various neural networks using various smoothing functions and e values. Based on the results obtained, we determine that the quadratic smoothing function (with e = 10^-3) is the best choice for sparsifying DNNs. The details of these experiments are discussed in the results section (Section VI) of this paper.

IV. PROPOSED LAYER-WISE PRUNING ALGORITHM

A. On pruning threshold
The earlier works on pruning [1], [4], [6] pruned the DNNs with a fixed threshold. For a given pruning threshold, there is a trade-off between the sparsity levels and the performance (accuracy) of the DNNs, and the maximum permissible pruning threshold cannot be determined a priori. Therefore, it is necessary to increase the pruning threshold and test the DNN performance iteratively in a systematic manner (dynamic pruning) as long as the degradation in DNN output quality is within the acceptable limit. A pruning algorithm along this direction was proposed in [3] and was tested on different fully connected networks, where it was shown to increase sparsity. To check the performance of this algorithm on convolutional layers, we used it to sparsify the convolutional layers of SVHN CNN I. Table V compares the smoothed LASSO based sparsity results of SVHN CNN I obtained with a fixed threshold against those obtained with a dynamic threshold on the convolutional layers. It can be observed that using a dynamic pruning threshold leads to an increase in sparsity values for similar accuracy levels.
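A minimal sketch of such dynamic thresholding is given below: the threshold grows in fixed steps and the largest value whose accuracy drop stays within the allowed limit is kept. The evaluate_accuracy callback, the step size and the stopping rule are illustrative assumptions, not the exact procedure of [3].

```python
import numpy as np

def dynamic_threshold_prune(weights, evaluate_accuracy, baseline_acc,
                            max_acc_loss=0.005, step=1e-3):
    # Grow the pruning threshold in steps of `step` and keep the largest
    # threshold whose accuracy drop stays within `max_acc_loss`.
    # `weights` is a list of weight arrays; `evaluate_accuracy(weights)` is a
    # user-supplied test-set evaluation (hypothetical interface).
    threshold = 0.0
    best = [w.copy() for w in weights]
    while True:
        threshold += step
        pruned = [np.where(np.abs(w) < threshold, 0.0, w) for w in weights]
        if baseline_acc - evaluate_accuracy(pruned) > max_acc_loss:
            return best, threshold - step   # last threshold that met the budget
        best = pruned
```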

B. Layer-wise pruning
In state-of-the-art DNNs, the layers differ significantly from one another. For example, CNNs consist of a large number of convolutional layers, each differing significantly in filter dimensions, number of filters and channel widths. In such a scenario, pruning the weights of all layers with the same pruning threshold (Algorithm 1) results in ineffective pruning. Therefore, it is necessary to come up with layer-wise heuristics for effective pruning.
As discussed in Section I-A, [37], [38], [42]-[46] carried out DNN pruning using various layer-wise heuristics. [37], [38], [43], [44] use the L1 norm to determine the DNN parts to prune. [42] uses a layer-wise relevance propagation method [50] to determine the layers to prune. [46] uses Monte Carlo analysis for pruning the connections and identifies the insignificant filters based on their output sum values over the testing data. The compression ratio procedure used in [45] incurs a significant overhead, as every layer needs to be pruned completely to identify the suitable pruning threshold. The heuristics in these works are good enough to identify the least sensitive DNN parts for pruning. However, they do not identify the energy-intensive layers in the DNNs.
Based on [51], the energy used in convolutional and fully connected layers primarily depends on the number of MAC operations and the number of weights, respectively. Therefore, we include these two parameters in the heuristics used for layer-wise pruning in our proposed algorithm. To this end, we introduce a metric termed the "accuracy loss budget" for each layer. The procedure to determine the accuracy loss budget of each layer is as follows. Layers are pruned one at a time, while keeping all other layers intact, until the maximum allowed DNN accuracy loss (∆A_max) is reached. The number of connections pruned in every layer (Sp_j) is recorded; this count is the maximum sparsity that can be obtained in that layer. Unlike in [45], every layer need not be pruned completely, thereby reducing the overhead involved. Since energy depends on the number of MAC operations in convolutional layers and the number of weights in fully connected layers, we define the Sensitivity ratio (S_j) for each of these layers as follows:
1) Convolutional layers: S_j for a convolutional layer j is calculated in terms of the maximum decrease in the number of MAC operations (M_Sp_j) that can be obtained in that layer:

    S_j = M_Sp_j,   with   M_Sp_j = Sp_j × h_oj × w_oj,
where h_oj and w_oj are the height and width, respectively, of the output of layer j. Given the use of M_Sp_j for calculating S_j in convolutional layers, the sparsity metric for these layers should also reflect the reduction in MAC operations. Therefore, we introduce the term MAC sparsity, defined as the proportion of MAC (Multiply And Accumulate) operations that can be skipped due to the pruning of weights. We use MAC sparsity to gauge the pruning results of all convolutional layer based experiments.
2) Fully Connected Layers: S_j for a fully connected layer is obtained directly from the Sp_j value, i.e., S_j = Sp_j.
For fully connected layers, we retain the "sparsity" measure defined earlier (in Section II) to gauge the results of all related experiments.
The accuracy loss budget (∆A_j) of a layer j is then calculated by distributing ∆A_max in proportion to the sensitivity ratios:

    ∆A_j = ∆A_max × S_j / Σ_k S_k,

where the sum runs over the layers of the same type as layer j. Once ∆A_j is obtained for all layers, each layer is ranked based on its ∆A_j value. Then, each layer is sequentially pruned by varying the pruning threshold dynamically until the loss in DNN accuracy equals the ∆A_j value of the layer. The layers with higher ∆A_j values are pruned first. In the case of multiple layers with equal ∆A_j, those layers closer to the input layer are pruned first [38]. (Note: in our pruning algorithm, we prune convolutional layers and fully connected layers separately.)
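The following sketch illustrates this proportional allocation. The dictionary interface, the toy layer sizes and the proportional split are illustrative assumptions; in the actual algorithm Sp_j comes from pruning each layer in isolation up to ∆A_max.

```python
def allocate_accuracy_budgets(layers, delta_A_max):
    # Allocate per-layer accuracy loss budgets in proportion to the estimated
    # reduction in MACs (conv layers) or weights (FC layers). Call once per
    # layer type, since conv and FC layers are pruned separately.
    # Each layer dict carries 'Sp' (weights prunable within delta_A_max when
    # the layer is pruned alone) and, for conv layers, 'h_out'/'w_out'.
    scores = [l['Sp'] * l.get('h_out', 1) * l.get('w_out', 1) for l in layers]
    total = float(sum(scores))
    return [delta_A_max * s / total for s in scores]

# Example: three convolutional layers with a 1% total accuracy loss budget
conv_budgets = allocate_accuracy_budgets(
    [{'Sp': 4000, 'h_out': 16, 'w_out': 16},
     {'Sp': 9000, 'h_out': 8, 'w_out': 8},
     {'Sp': 2000, 'h_out': 4, 'w_out': 4}],
    delta_A_max=0.01)
print(conv_budgets)   # layers with larger MAC reduction get larger budgets
```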
C. Healing after pruning

[5], [11] exploit the natural error-healing ability of the DNN training process to curtail the degradation in DNN output quality after approximation. We use a similar approach, where the sparsified DNNs are trained again to improve the DNN testing accuracy.

D. Algorithm description
Based on the subsections discussed above, the proposed algorithm (Algorithm 2) systematically develops sparse DNNs in three stages: retraining, iterative pruning-testing, and healing.
During retraining (line 1), a pre-trained DNN is re-trained with a cost function comprising both the error function and the smoothed LASSO based penalty function. This is followed by the iterative pruning-testing stage (lines 2-26), where the weights falling below the pruning threshold are pruned. For each layer, the pruning threshold is increased until the DNN accuracy loss (∆A) equals the maximum allowed degradation in DNN accuracy (∆A_max), and Sp_j is ascertained for every layer (lines 2-16). Then the accuracy loss budget of every layer (∆A_j) is calculated (lines 17-20) and the layers are ranked based on their ∆A_j values (line 21). Layers with higher ∆A_j values are pruned first (lines 22-26). The pruning threshold for each layer is incremented dynamically with a step size of ∆p until the DNN accuracy loss (∆A) equals the ∆A_j of that layer.
These steps enhance the sparsity in the DNNs by giving higher accuracy degradation budgets and higher sequential priority to the layers that are likely to contribute more to the DNN sparsity.
Finally, in the healing stage (line 27), the DNN (only unpruned connections) is re-trained for M epochs. The cost function in this stage does not contain the LASSO penalty term.
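A minimal TensorFlow sketch of this healing stage is given below, assuming the pruned positions are recorded as binary masks keyed by variable name; re-applying the masks after every optimizer step keeps the pruned weights at zero while the remaining weights are fine-tuned without the LASSO penalty. The optimizer, learning rate and loss are placeholders, not the settings used in our experiments.

```python
import tensorflow as tf

def heal(model, masks, dataset, epochs, learning_rate=1e-3):
    # Healing stage: re-train only the unpruned connections. `masks` maps each
    # kernel variable name to a 0/1 tensor of the same shape (1 = kept weight);
    # re-applying the masks after every step keeps pruned weights at zero.
    # No LASSO penalty term is used in this stage.
    opt = tf.keras.optimizers.SGD(learning_rate=learning_rate)
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    for _ in range(epochs):
        for x, y in dataset:
            with tf.GradientTape() as tape:
                loss = loss_fn(y, model(x, training=True))
            grads = tape.gradient(loss, model.trainable_variables)
            opt.apply_gradients(zip(grads, model.trainable_variables))
            for var in model.trainable_variables:
                if var.name in masks:
                    var.assign(var * masks[var.name])
```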
The stages of retraining and healing will definitely cause a rise in the overall training time. However, as this entire process is only a one-time affair, the overhead incurred here is a favourable trade-off for the significant benefits of sparse DNNs during inference.
The experiments performed on the proposed algorithm and the results obtained are discussed in Section VI.

V. STRUCTURED LASSO VARIANTS

Consider an m × n matrix, with b_v being the number of bits required to store each entry. The memory requirement to store the entire matrix is b_v·m·n bits.
Assume that this matrix (the weights of a layer) has been sparsified by the conventional LASSO technique with a sparsity of f (the fraction of entries which are zeros). A sparse matrix in TensorFlow requires a 1D array to store the values (with b_v bits each) and a 2D array to store the indices in INT64 format. Therefore, each non-zero entry in the sparse matrix requires b_v bits for its value and 2 × 64 bits for its indices.
For conventional LASSO to be an efficient technique for reducing the memory requirement, the sparse matrix memory requirement must be less than that of the normal matrix, leading to the following inequality:

    (1 - f) · m · n · (b_v + 128) < b_v · m · n,   i.e.,   f > 128 / (b_v + 128).

If the matrix entries are stored in FLOAT64 format (b_v = 64), a minimum of 67% sparsity is required for conventional LASSO to be useful. In the case of FLOAT32 format (b_v = 32), it is 80%. Clearly, very high sparsity values are required for conventional LASSO based pruning to be useful in terms of achieving memory savings, on account of the irregularity and indexing overhead it incurs.
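The break-even point above is easy to reproduce; the snippet below evaluates the minimum sparsity for a given value width under the 2 × INT64 indexing scheme described above.

```python
def min_sparsity_for_savings(bits_per_value, index_bits=2 * 64):
    # Sparsity f above which storing (1 - f)*m*n non-zeros, each with b_v value
    # bits plus two INT64 indices, beats the dense b_v*m*n layout:
    # (1 - f) * (b_v + 128) < b_v  =>  f > 128 / (b_v + 128)
    return index_bits / (bits_per_value + index_bits)

print(min_sparsity_for_savings(64))  # FLOAT64: ~0.667
print(min_sparsity_for_savings(32))  # FLOAT32: 0.8
```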
This necessitates the use of a smoothed LASSO based structured LASSO penalty term to prune groups of weights in DNNs, so as to achieve structuredness in the DNN sparsity. However, the structured LASSO penalty function is not a purely L1 type function; it is an intermediate between L1 and L2. Therefore, in comparison to the LASSO function, the structured LASSO function is expected to generate less sparse DNNs. There is thus a trade-off between sparsity levels and structuredness in the sparse DNNs as we switch from the LASSO to the structured LASSO penalty function. The associated cost function in the context of structured LASSO is

    C(W) = C_o(W) + λ_fc P_fc(W) + λ_conv P_conv(W),

where λ_fc and λ_conv are the penalty coefficients associated with the Fully Connected (FC) layers and the convolutional (conv.) layers respectively, and P_fc and P_conv are the quadratic smoothing function based structured LASSO penalty terms for the fully connected layers and convolutional layers respectively. The advantage of using our formulation with smoothed LASSO is that it brings the convolutional and fully connected layers within a common framework. When we want to prune only one type of layer (either FC or conv.), the penalty coefficient corresponding to the other layer type can be set to zero. These two types of layers differ significantly in their structures and associated bottlenecks. The following subsections carry out the structured sparsity study on these two types of layers independently.

Fig. 3. A neuron (in a hidden FC layer) with input and output weights grouped into m and n clusters respectively.
A. In fully connected layers

In this paper, we carry out a detailed study of using structured LASSO within the smoothed LASSO framework for fully connected layers and discuss the trade-offs involved. Structured LASSO with different cluster sizes (from pruning individual weights using conventional LASSO to pruning whole neurons using Group LASSO) is explored. The input and output connections of the neurons in a hidden layer are grouped into different clusters. Figure 3 depicts the clustering of the input and output weights of a neuron in a hidden layer. Based on the pruning threshold, these clusters are pruned as a whole. When all the input (and/or output) connections of a neuron are grouped into a single cluster, the entire neuron can be pruned as a whole when the pruning criterion is satisfied. This is identical to the Group LASSO technique proposed in [1].
The associated smoothing function based structured LASSO penalty term (P_fc) is

    P_fc(W) = Σ_{W_j ∈ W} Σ_k [ Σ_{i=1..m} h(||w_i,k||_2) + Σ_{i=1..n} h(||u^T_i,k||_2) ],

where W_j is the set of weights in the j-th hidden layer, and w_i,k and u^T_i,k are the i-th input weights cluster and output weights cluster, respectively, of the k-th neuron in hidden layer j.
Note that, in the case of clustered LASSO, the L2 norm is used instead of the absolute value | · | when calculating h(·), and groups of weights are pruned as a whole based on the type of clustered LASSO being used. The (m, n) cluster configuration notation used in this section refers to the number of clusters used for grouping the input weights (m) and output weights (n) respectively (see Figure 3). Note that (1,1) corresponds to Group LASSO.
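A minimal NumPy sketch of this clustered penalty for one hidden layer is given below. The split of each neuron's input weights into m clusters and output weights into n clusters via np.array_split, and the Huber-style quadratic h, are illustrative assumptions; the per-cluster L2 norm inside h follows the formulation above.

```python
import numpy as np

def h_quadratic(x, beta=2e-3):
    # assumed Huber-style quadratic smoothing (max error e = beta/2 = 1e-3)
    x = np.abs(np.asarray(x, dtype=np.float64))
    return np.where(x <= beta, x**2 / (2 * beta) + beta / 2, x)

def fc_structured_penalty(W_in, W_out, m, n, h=h_quadratic):
    # Clustered penalty for one hidden FC layer (sketch).
    #   W_in : (fan_in, units)  - column k holds the input weights of neuron k
    #   W_out: (units, fan_out) - row k holds the output weights of neuron k
    # Each neuron's input weights are split into m clusters and its output
    # weights into n clusters; h is applied to every cluster's L2 norm.
    # (m, n) = (1, 1) collapses to Group LASSO over whole neurons.
    penalty = 0.0
    for k in range(W_in.shape[1]):
        for cluster in np.array_split(W_in[:, k], m):
            penalty += float(h(np.linalg.norm(cluster)))
        for cluster in np.array_split(W_out[k, :], n):
            penalty += float(h(np.linalg.norm(cluster)))
    return penalty

print(fc_structured_penalty(np.random.randn(784, 100), np.random.randn(100, 10), m=4, n=2))
```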

B. In convolutional layers
A typical convolutional layer comprises filters, channels (across which the convolution operation is carried out), pooling layers and activation functions. In a convolutional layer, a weight is denoted as W(n_f, n_c, h, w), where n_f, n_c, h, w are the dimensions along the axes of filter, channel, filter height and filter width respectively. Figure 4 depicts an example of a typical convolutional layer.

Fig. 4. A typical convolutional layer with n_f, n_c, h, w being the dimensions along the axes of filter, channel, filter height and filter width respectively.

[9], [47] highlight the importance of structured sparsity in convolutional layers for quicker inference. Several works such as [37]-[41], [43] carried out structured sparsification, such as filter or channel pruning, using the LASSO function. To the best of our knowledge, there have been no works to date on the implementation of structured LASSO in convolutional layers using smoothed LASSO functions. In this paper, we compare various structured LASSO types in the smoothed LASSO framework and discuss the trade-offs involved.
The associated smoothing function based structured LASSO penalty term (P_conv) is

    P_conv(W) = Σ_{W_j ∈ W} Σ_i h(||w_i||_2),        (5)

where W_j is the set of weights in the j-th layer and w_i is the i-th cluster. As in the case of FC layers, the L2 norm is used instead of the absolute value | · | when calculating h(·), and clusters of weights are pruned instead of single weights.
The various types of structured LASSO clusters in the convolutional layers are as follows (see the sketch after this list):
1) Channel wise: The weights belonging to a channel across all the filters are grouped together. In Equation (5), w_i = w(:, i, :, :), i ∈ 1 to N_c (number of channels).
2) Filter-3D wise: The weights belonging to a filter across all the channels are grouped together. In Equation (5), w_i = w(i, :, :, :), i ∈ 1 to N_f (number of filters).
3) Filter-2D wise: The weights belonging to a filter of a channel are grouped together. In Equation (5), w_i = w(j, k, :, :), j ∈ 1 to N_f, k ∈ 1 to N_c and i ∈ 1 to N_f × N_c.
4) Row wise: The weights belonging to a row (along the width) of a filter in a channel are grouped together. In Equation (5), w_i = w(j, k, m, :), j ∈ 1 to N_f, k ∈ 1 to N_c, m ∈ 1 to H (number of rows (filter height)) and i ∈ 1 to N_f × N_c × H.
5) Column wise: The weights belonging to a column (along the height) of a filter in a channel are grouped together. In Equation (5), w_i = w(j, k, :, m), j ∈ 1 to N_f, k ∈ 1 to N_c, m ∈ 1 to W (number of columns (filter width)) and i ∈ 1 to N_f × N_c × W.
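The sketch below computes the cluster L2 norms corresponding to each of the five groupings for a weight tensor laid out as (N_f, N_c, H, W), and sums h over them as in Equation (5). The Huber-style quadratic h and the random toy tensor are illustrative assumptions.

```python
import numpy as np

def conv_group_norms(w, mode):
    # L2 norms of the weight clusters from Equation (5) for a kernel tensor
    # laid out as (N_f, N_c, H, W), following Figure 4.
    axes = {"channel":  (0, 2, 3),   # w(:, i, :, :)
            "filter3d": (1, 2, 3),   # w(i, :, :, :)
            "filter2d": (2, 3),      # w(j, k, :, :)
            "row":      (3,),        # w(j, k, m, :)
            "column":   (2,)}[mode]  # w(j, k, :, m)
    return np.sqrt((w ** 2).sum(axis=axes))

def structured_conv_penalty(w, mode, h):
    # One layer's contribution to P_conv: sum of h over the cluster norms.
    return float(np.sum(h(conv_group_norms(w, mode))))

w = np.random.randn(16, 3, 5, 5)   # 16 filters, 3 channels, 5x5 kernels (toy example)
h = lambda x, b=2e-3: np.where(x <= b, x**2 / (2 * b) + b / 2, x)  # assumed quadratic h
for mode in ("channel", "filter3d", "filter2d", "row", "column"):
    print(mode, conv_group_norms(w, mode).shape, structured_conv_penalty(w, mode, h))
```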

VI. RESULTS
In this section, we discuss the various experiments and their results pertaining to the study of smoothing functions, the proposed algorithm and structured LASSO. The simulations have been implemented in TensorFlow 2.2 using Python 3.5.2 on Intel i7 CPUs.
We report our results on the MNIST, SVHN and CIFAR-10 datasets on standard architectures, and on Imagenette [16] on VGG-16 [15]. We use the Imagenette dataset (10 classes) instead of the full Imagenet dataset (1000 classes) [17] in order to reduce the training time required. Since we start with a VGG-16 network that was pre-trained on the full Imagenet dataset and then re-train it on the reduced set of Imagenette classes, the resulting (baseline) accuracy is around 97%.

A. Choice of smoothing functions
We performed a series of experiments on the MNIST architecture II and SVHN CNN I architectures using the various smoothing functions with various maximum error (e) values. The (η, λ) values for the MNIST and SVHN networks are (0.03, 10^-3) and (3 × 10^-5, 0.3) respectively. Due to convergence issues, the maximum error (e) was not decreased below 10^-4. Each of the benchmarks was subjected to conventional LASSO based pruning for two different pruning threshold values. Figures 5-7 show the sparsity values obtained. From these results, we conclude that, for a fixed e value, the choice of smoothing function has a relatively low impact on sparsity. Hence, the quadratic smoothing function, which is the simplest in terms of computation, is the best choice. Figure 8 shows the variation of testing accuracy with epochs while training MNIST architecture II for e = 10^-3 and 10^-4. It can be observed that the network converges faster when e is set at 10^-3. This is expected, as pushing e to smaller values results in the smoothed LASSO function becoming sharper at the origin, resembling the (unsmoothed) LASSO function, thereby slowing down convergence. Given this, we empirically find 10^-3 to be a suitable value for e and use it for all further experiments.

B. Sparsity results based on accuracy loss budget (∆A_j) for each layer

Table VI compares the MAC sparsity values obtained by the quadratic smoothing function with those obtained by the hyperbolic smoothing function on the CIFAR-10 CNN and the Imagenette VGG-16 CNN, using the proposed algorithm (Algorithm 2). It can be seen that both smoothing functions exhibit similar pruning capabilities within the framework of Algorithm 2 as well, further validating the choice of the quadratic smoothing function for DNN pruning. The quadratic function is used in all further experiments. Table VII compares the sparsity results obtained on the convolutional layers of the CIFAR-10 CNN by using Algorithm 1 with those of the proposed algorithm (Algorithm 2). Algorithm 2 has been implemented using both Sp_j and M_Sp_j for calculating the accuracy loss budget (∆A_j) values. Algorithm 2 has also been implemented using the L1 norm ([37], [38], [43], [44]), where the L1 norm of the weights of a layer is normalized by dividing by the number of weights in that layer (similar to [37]); the accuracy loss budget (∆A_j) values are then assigned proportional to the normalized L1 norm value. It is observed that the L1 norm based allocation gives lower sparsity than the other two variants. Between the other two, using Sp_j in Algorithm 2 gives lower MAC sparsity than using M_Sp_j. This is because using Sp_j targets those layers with a larger number of weights (layers 4, 5, 6, 7), whereas using M_Sp_j targets those layers with a larger number of MAC operations (layers 2, 4, 5) during pruning. Therefore, given the energy intensive nature of MAC operations in convolutional layers, it is necessary to use M_Sp_j while allocating the ∆A_j values to individual convolutional layers. Table VIII shows similar results on the convolutional layers of the Imagenette VGG-16 architecture. It can be observed that using Algorithm 2 with the M_Sp_j value results in a considerable increase in the total MAC sparsity by targeting those layers with a larger number of MAC operations.

C. Structured LASSO
The proposed algorithm (Algorithm 2) has been used for all experiments on structured LASSO.
1) In fully connected layers: Table IX shows the sparsity levels obtained by using structured LASSO with various cluster sizes on MNIST architecture II. It is seen that a decrease in cluster size (i.e., moving from Group LASSO towards conventional LASSO) results in an increase of sparsity in the bulkier layers (Hid. layers I and II). This is expected, as using the L1 norm (conventional LASSO) yields sparser solutions than a mixture of L1 and L2 (structured LASSO variants). However, this trend is not observed in the O/P layer. This is due to the insignificant accuracy budget allocated to the O/P layer, as it accounts for only 1% of the DNN weights.
2) In convolutional layers: Table XI shows the MAC sparsity values obtained by using structured LASSO on the convolutional layers of SVHN CNN II [10]. It is observed that the MAC sparsity of the conv. II layer is larger for smaller groupings. An opposite trend is observed in the conv. I layer, which is of little concern as this layer accounts for only 1.18% of the MAC operations in the convolutional layers. It is also observed that filter-2D wise grouping yields larger MAC sparsity than the row and column wise groupings. Table XII shows similar results on the convolutional layers of the Imagenette VGG-16 CNN. It can be clearly observed that the total sparsity is larger for smaller groupings of structured LASSO. Also, as observed in the previous example on the SVHN CNN, filter-2D wise grouping gives larger MAC sparsity than the row and column wise groupings.

VII. CONCLUSION
Based on a detailed study of the smoothing functions with a fixed maximum error e across all the functions, validated on various benchmarks, we determine that the quadratic smoothing function is the most suitable for sparsifying DNNs. The results on the CIFAR-10 and Imagenette datasets demonstrate the efficacy of the proposed layer-wise pruning algorithm in enhancing DNN sparsity by targeting the appropriate layers during pruning. The various structured LASSO types for fully connected and convolutional layers have been implemented using the quadratic smoothing function. Results obtained on the MNIST, SVHN and Imagenette datasets show that using structured LASSO variants with smaller group sizes results in higher sparsity values than using those with larger group sizes.

APPENDIX
The architecture details of the various neural networks used in this paper are as follows: 1) MNIST architecture I: SF-L2 with ReLU from [3].