Cascaded Compressed Sensing Networks: A Reversible Architecture for Layerwise Learning

Recently, methods that learn networks layer by layer have attracted increasing interest for their ease of analysis. For such methods, the main challenge lies in deriving an optimization target for each layer by inversely propagating the global target of the network. The propagation problem is ill-posed, because it involves inverting nonlinear activations from low-dimensional to high-dimensional spaces. The existing solution is to learn an auxiliary network dedicated to propagating the target. However, the auxiliary network lacks stability and increases the complexity of network learning. In this letter, we show that target propagation can be achieved by modeling each layer of the network with compressed sensing, without the need for auxiliary networks. Experiments show that the proposed method achieves better performance than the auxiliary network-based method.


I. INTRODUCTION
The error backpropagation (BP) method has achieved great success in the supervised learning of deep neural networks [1]. Because it optimizes the entire network as a whole, BP can hardly disclose the contribution of each layer, which makes network analysis difficult. As an alternative to BP, the emerging layerwise learning method [2], [3] seems more suitable for network analysis. It learns the network layer by layer and thus reduces network analysis to the layer level.
For layerwise learning, the key is to derive an optimization target for each layer, simply called the local target hereafter. Currently, the simplest method is to directly employ the global target [4], namely the optimization target of the entire network, as the target of each layer. However, the global target is not guaranteed to be optimal for each layer in minimizing the global loss. To achieve BP-level performance, the methods in [5], [6] append a classifier to the layer of interest and then optimize them together with the global target. This forms a shallow network with at least one hidden layer, in which the layer of interest is still hidden and thus hard to analyze [7], [8], [9]. To directly optimize and analyze each layer, we resort to another emerging layerwise learning method known as target propagation (TP) [2], [3], [10], which derives the local target of each layer by inversely propagating the global target, and optimizes the layer by matching its output to the target. In this method, local targets are required to have two properties. First, their feedforward outputs should be identical or close to the global target. This ensures that the optimization of each layer reduces the global loss to the maximum extent. Second, the local target of each layer should be close to the forward output of the layer. This helps to reduce the variation of each layer over optimization, and low variations tend to improve the network's generalization [11], [12], [13]. However, the two properties are hard to achieve because the inverse propagation of the global target involves inverting nonlinear activations from low-dimensional to high-dimensional spaces, which is an ill-posed problem. The existing solution is to learn an auxiliary network dedicated to propagating the target.
Obviously, this solution increases the complexity of network learning, and moreover, the auxiliary network offers no guarantee of stable and accurate target propagation. In this letter, we show that auxiliary networks are not necessary for layerwise learning, and that target propagation can be achieved with guaranteed accuracy by modeling each layer of the network with compressed sensing.
In compressed sensing [14], [15], it has been proved that a high-dimensional signal $x \in \mathbb{R}^n$ with a sparse transform $s = Dx$ over an orthonormal dictionary $D \in \mathbb{R}^{n \times n}$ can be recovered from its low-dimensional projection $y = WDx = Ws$, where $W \in \mathbb{R}^{m \times n}$, $m \ll n$. The recovery is implemented by first recovering the sparse vector $s$ from $y$ by solving an $\ell_0$- or $\ell_1$-regularized least squares problem [16], and then simply deriving $x = D^\top s$. Inspired by this reverse process, we propose to model each network layer as a compressed sensing process. Specifically, we formulate each layer as a cascade of $W$ and $D$, such that the input $s$ of $W$ can be recovered from its output $y$ via sparse recovery, and the input $x$ of $D$ (namely the output $y$ of $W$) can be simply recovered from its output $s$ by multiplying with $D^\top$. Repeating this process layer by layer accomplishes target propagation. To obtain better performance, we further deploy a nonlinear function $f(\cdot)$ after each dictionary $D$ to remove activations of small magnitude. This operation brings two benefits. First, it further improves the sparseness of the output activations of $D$, which is beneficial to the sparse recovery on the following $W$. Second, it tends to reduce within-class scattering and thus improve the classification accuracy [17]. Overall, by combining $W$, $D$ and $f(\cdot)$, we establish a compressed sensing framework for each layer, and then obtain a cascaded compressed sensing architecture with all layers. To the best of our knowledge, this is the first time that compressed sensing is introduced for layerwise learning to handle the target propagation problem. Compared to auxiliary network-based TP methods, the proposed compressed sensing-based TP method has two major advantages. First, it avoids the use of auxiliary networks and thus reduces the complexity of network learning. Second, it can achieve target propagation with guaranteed accuracy.
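The recovery chain described above (first $s$ from $y = Ws$, then $x = D^\top s$) can be illustrated with a minimal NumPy sketch. It uses a plain greedy orthogonal matching pursuit written for this example; the function name `omp`, the toy sizes, and the random orthonormal $D$ are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def omp(W, y, k):
    """Greedy orthogonal matching pursuit: find an (at most) k-sparse s with y ≈ W s."""
    n = W.shape[1]
    residual = y.copy()
    support = []
    coef = np.zeros(0)
    for _ in range(k):
        # Pick the column most correlated with the current residual.
        corr = np.abs(W.T @ residual)
        if support:
            corr[support] = 0.0  # never reselect a chosen column
        support.append(int(np.argmax(corr)))
        # Least-squares refit on all selected columns.
        coef, *_ = np.linalg.lstsq(W[:, support], y, rcond=None)
        residual = y - W[:, support] @ coef
        if np.linalg.norm(residual) < 1e-12:
            break
    s = np.zeros(n)
    s[support] = coef
    return s

rng = np.random.default_rng(0)
n, m, k = 64, 32, 4  # hypothetical toy sizes

# Random orthonormal dictionary D (QR factor of a Gaussian matrix).
D, _ = np.linalg.qr(rng.standard_normal((n, n)))

# Build a signal x whose transform s = D x is exactly k-sparse.
s_true = np.zeros(n)
s_true[rng.choice(n, size=k, replace=False)] = rng.standard_normal(k)
x_true = D.T @ s_true

# Scaled Gaussian projection and its low-dimensional measurement.
W = rng.standard_normal((m, n)) / np.sqrt(m)
y = W @ (D @ x_true)

s_hat = omp(W, y, k)   # sparse recovery of s from y
x_hat = D.T @ s_hat    # x follows by multiplying with D^T
print("recovery error:", np.linalg.norm(x_hat - x_true))
```

Because $D$ is orthonormal, the second step is exact; any recovery error comes from the sparse recovery of $s$.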
In layerwise learning, the number of local targets we need to compute and store is equal to the number of activations in the layer. Because convolutional networks contain huge numbers of activations and pose great computational challenges to layerwise learning, we follow [3], [10] and test the proposed layerwise learning method only on fully connected networks for tractable computation. Our performance advantage is verified by classification experiments on MNIST and CIFAR10.

II. METHOD
Let us consider a deep fully connected network with $L$ layers, as shown in Fig. 1. Each layer consists of two sublayers $\{W_l, D_l\}$, where $W_l \in \mathbb{R}^{m_l \times n_l}$ with $m_l \ll n_l$ denotes a projection matrix and $D_l \in \mathbb{R}^{m_l \times m_l}$ denotes an orthonormal dictionary, $D_l^\top D_l = I$. Following $D_l$ is a pointwise thresholding function $f_l(\cdot)$ which zeroes out all but the $k_l$ ($\ll m_l$) largest positive elements, acting similarly to the ReLU function. Note that the output layer is a classifier which comprises $W_L$ but no $D_L$. Suppose $h^l_W \in \mathbb{R}^{m_l}$ and $h^l_D \in \mathbb{R}^{m_l}$ are the output activations of $W_l$ and $D_l$, respectively. The feedforward network structure is formulated by (1) and (2), where $g(\cdot)$ is a softmax function, and the notations $h^0_D$ and $h^L_W$ denote the network's input and output, respectively.

As sketched in Algorithm 1, the proposed layerwise learning is implemented in three phases. First, we initialize the network by feeding forward the network input $h^0_D$. During this process, the projection matrices $W_l$ are randomly generated, and the orthonormal dictionaries $D_l$ are learned from their input activations $h^l_W$. Second, as shown in the lower part of Fig. 1, the global targets $\tilde{h}^L_W$ are inversely propagated layer by layer by compressed sensing, producing local targets $\tilde{h}^l_W$ and $\tilde{h}^l_D$ for $W_l$ and $D_l$, respectively, $1 \le l < L$. Finally, we feed forward the input $h^0_D$ again, in order to successively minimize the local losses $\mathcal{L}_l(h^l_W, \tilde{h}^l_W)$ and $\mathcal{L}_l(h^l_D, \tilde{h}^l_D)$ of each layer by updating $W_l$ and $D_l$, $1 \le l < L$. This finally minimizes the global loss $\mathcal{L}(h^L_W, \tilde{h}^L_W)$. In this letter, we simply define the local losses with Euclidean distances and the global loss with cross entropy. In the following, we detail the three phases.

Algorithm 1
...
3: Generate $h^l_W$ and $h^l_D$ by (1) and (2).
4: Initialize $W_l$ with Gaussian distributions.
5: Learn $D_l$ with its input $h^l_W$ by Algorithm 2.
6: end for
7: Further update the final $W_L$ by (3).
8: ...
15: Update $W_l$ by (7) and $D_l$ by (9) and (10).
16: end for
17: Update the final $W_L$ by (12).
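The feedforward structure described in this section can be sketched in NumPy as follows. This is a sketch under stated assumptions: each hidden layer is taken to compute $h^l_W = W_l h^{l-1}_D$ followed by $h^l_D = f_l(D_l h^l_W)$, and the softmax $g(\cdot)$ is assumed to act on $W_L h^{L-1}_D$ at the output. The function names and the toy layer widths are hypothetical.

```python
import numpy as np

def keep_top_k_positive(h, k):
    """f_l: zero out everything except the k largest positive entries of h."""
    out = np.zeros_like(h)
    idx = np.argsort(h)[-k:]   # indices of the k largest entries
    idx = idx[h[idx] > 0]      # keep only those that are positive
    out[idx] = h[idx]
    return out

def softmax(z):
    e = np.exp(z - z.max())    # shift for numerical stability
    return e / e.sum()

def forward(h0, Ws, Ds, ks):
    """Ws holds L matrices; Ds and ks hold L-1 entries (no dictionary at the output)."""
    h_D = h0
    acts = []
    for W, D, k in zip(Ws[:-1], Ds, ks):
        h_W = W @ h_D                          # projection sublayer
        h_D = keep_top_k_positive(D @ h_W, k)  # dictionary sublayer + thresholding
        acts.append((h_W, h_D))
    return acts, softmax(Ws[-1] @ h_D)         # assumed placement of g(.)

rng = np.random.default_rng(0)
sizes = [20, 12, 8, 3]  # hypothetical widths: input, two hidden layers, 3 classes
Ws = [rng.standard_normal((m, n)) / np.sqrt(m) for n, m in zip(sizes[:-1], sizes[1:])]
Ds = [np.linalg.qr(rng.standard_normal((m, m)))[0] for m in sizes[1:-1]]
ks = [m // 2 for m in sizes[1:-1]]
acts, probs = forward(rng.standard_normal(sizes[0]), Ws, Ds, ks)
```

Storing every pair $(h^l_W, h^l_D)$ in `acts` mirrors the fact that layerwise learning later needs both activations of each layer to compare against their local targets.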

A. Activation forward propagation
By (1) and (2), the network takes $h^0_D$ as input and $h^L_W$ as output. During the feedforward propagation, we initialize the projection matrix $W_l$ by drawing its elements from the standard normal distribution, which has been proved nearly optimal for compressed sensing [18]. To keep the input activation $h^{l-1}_D$ unchanged in $\ell_2$ norm after projection, $W_l$ needs to be scaled by a factor $1/\sqrt{m_l}$ [19]. For the output layer $W_L$, besides random initialization, we further update it by (3), in order to reduce the difference between the final output $h^L_W$ and the global target $\tilde{h}^L_W$. As detailed later, the reduced difference is beneficial to the final layerwise optimization. Empirically, the parameter $\alpha$ in (3) should be small to avoid overfitting.
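The effect of the $1/\sqrt{m_l}$ scaling can be checked numerically: for a standard Gaussian matrix scaled this way, the squared norm of the projection equals the squared norm of the input in expectation. The sizes and trial count below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, trials = 500, 200, 100  # hypothetical input width, output width, repetitions

h = rng.standard_normal(n)
ratios = []
for _ in range(trials):
    # Standard normal entries, scaled by 1/sqrt(m).
    W = rng.standard_normal((m, n)) / np.sqrt(m)
    ratios.append(np.linalg.norm(W @ h) ** 2 / np.linalg.norm(h) ** 2)

# Averaged over random draws of W, the ratio concentrates around 1.
mean_ratio = float(np.mean(ratios))
print("mean squared-norm ratio:", mean_ratio)
```

A single draw of $W$ preserves the norm only approximately; the fluctuation shrinks as $m$ grows.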
Next, we move to learning the orthonormal dictionary $D_l$, which has dense input $h^l_W$ and sparse output $h^l_D$. Such dictionaries can be learned with a Procrustes-based approach [20], which is given in Algorithm 2 for completeness. The algorithm has a crucial parameter $d$, which counts the number of largest elements retained by the thresholding function $T(\cdot)$. To obtain sparse transforms, we should adopt a relatively small $d$. Note that Algorithm 2 supports online learning, a property that is necessary for batch learning. From the viewpoint of classification, the sparse transform matrix $D_l$ is mainly used for feature selection, while the projection matrix $W_l$ serves not only for feature selection but also for dimensionality reduction. Moreover, it is noteworthy that the order of $D_l$ and $W_l$ could be reversed in the cascaded compressed sensing framework. In this letter, we put $W_l$ first in order to decrease the dimension of the network input $h^0_D$ and thereby reduce the optimization complexity. Finally, let us consider the initialization of the nonlinear function $f_l(\cdot)$, which has one thresholding parameter to retain the $k_l$ largest positive elements in the output of $D_l$. Note that $f_l(\cdot)$ only keeps a portion of the positive elements and discards the negative ones, because empirically both tend to carry similar information for classification. The choice of $k_l$ depends on both the sparsity of the output of $D_l$ and the compression ratio of the following $W_{l+1}$. Empirically, as discussed later, a large $k_l$ that retains most of the positive elements tends to provide good classification performance when the dictionary output is not very sparse.
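The Procrustes-style learning of an orthonormal sparsifying transform can be sketched as an alternation between thresholding and an SVD-based update. The sketch below assumes the objective $\min_{D,S} \|S - DX\|_F$ with $D^\top D = I$ and $d$-sparse columns of $S$; for the SVD $XS^\top = P\Sigma Q^\top$ the minimizing orthonormal matrix is $D = QP^\top$. Function names, the random initialization, and the toy data are assumptions for illustration.

```python
import numpy as np

def keep_d_largest(M, d):
    """T(.): per column, zero out all but the d largest-magnitude entries."""
    out = np.zeros_like(M)
    idx = np.argsort(np.abs(M), axis=0)[-d:, :]  # row indices kept in each column
    cols = np.arange(M.shape[1])
    out[idx, cols] = M[idx, cols]
    return out

def learn_orthonormal_dictionary(X, d, n_iter=20, seed=0):
    """Alternate sparse coding S = T(D X) and the orthogonal Procrustes update of D."""
    m = X.shape[0]
    rng = np.random.default_rng(seed)
    D, _ = np.linalg.qr(rng.standard_normal((m, m)))  # assumed random orthonormal start
    for _ in range(n_iter):
        S = keep_d_largest(D @ X, d)
        # np.linalg.svd returns P, singular values, and Q^T with X S^T = P diag(.) Q^T.
        P, _, Qt = np.linalg.svd(X @ S.T)
        D = Qt.T @ P.T                                # D = Q P^T
    return D

rng = np.random.default_rng(1)
X = rng.standard_normal((8, 50))  # hypothetical data: 50 samples of length 8
D = learn_orthonormal_dictionary(X, d=3)
print("orthonormality error:", np.linalg.norm(D @ D.T - np.eye(8)))
```

Each of the two steps cannot increase $\|S - DX\|_F$, so the sparsification error is non-increasing over iterations.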

B. Target backward propagation
The global target $\tilde{h}^L_W$ is propagated backward by repeating the compressed sensing recovery process (4) and (5), where the layer index $l$ decreases from $L$ to 2, and $\tilde{h}^{l-1}_D$ and $\tilde{h}^{l-1}_W$ denote the local targets derived for $D_{l-1}$ and $W_{l-1}$, respectively. The derivations of the two targets are analyzed as follows. Let us first consider the derivation of $\tilde{h}^{l-1}_D$ by (4). By compressed sensing theory, the row size $m_l$ of $W_l$ in (4) should be more than twice its input sparsity $k_{l-1}$, namely $m_l > 2k_{l-1}$, in order to approximately recover $\tilde{h}^{l-1}_D$ from $\tilde{h}^l_W$. Empirically, the condition need not be strictly met, and good classification is usually obtained even when $m_l < 2k_{l-1}$. This implies that for layerwise learning, feeding forward a sufficient number of features may be more important than the accuracy of target propagation. Problem (4) can be solved with many algorithms [21]. In our experiments, we employ the well-known OMP algorithm [22], which has a parallel implementation in the SPAMS software [23]. As for the local target $\tilde{h}^{l-1}_W$, it is seen in (5) that the target can be simply derived from $\tilde{h}^{l-1}_D$ by multiplication, benefiting from the orthonormality of $D_{l-1}$.
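Written out under the definitions above, the two recovery steps plausibly take the following form. This is a reconstruction from the surrounding description (sparse recovery with sparsity $k_{l-1}$ solved by OMP, then multiplication by the transpose of the orthonormal $D_{l-1}$); the $\ell_0$-constrained formulation is an assumption, and the authors' exact statement of (4) and (5) may differ.

```latex
\begin{align}
\tilde{h}^{\,l-1}_D &= \arg\min_{h} \big\| \tilde{h}^{\,l}_W - W_l\, h \big\|_2^2
  \quad \text{s.t.} \quad \|h\|_0 \le k_{l-1}, \tag{4} \\
\tilde{h}^{\,l-1}_W &= D_{l-1}^{\top}\, \tilde{h}^{\,l-1}_D. \tag{5}
\end{align}
```

The second line uses only $D_{l-1}^\top D_{l-1} = I$, which is why the target for $W_{l-1}$ costs a single matrix multiplication once the target for $D_{l-1}$ is known.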
Usually, we prefer to simply label each class with a one-hot vector $\tilde{h}^L_W$, such that the instances of the same class share the same local target at each layer. Note that this label sharing may cause difficulty for layerwise optimization when the instances of the same class have high variations, as is often encountered at early layers [3]. To avoid the problem, an effective solution is to adopt diverse labels/targets to represent the instances of the same class [3]. To achieve this, instead of (4), we propose to derive the local target $\tilde{h}^{L-1}_D$ of the penultimate layer by (6), and then derive the targets for the remaining layers by (4) and (5).
Considering that the global loss $\mathcal{L}(h^L_W, \tilde{h}^L_W)$ has already been decreased in the initialization phase by updating $W_L$ in (3), we use a small $\beta$ in (6) to further reduce the global loss and achieve good classification performance. A small $\beta$ results in a small difference $h^{L-1}_D - \tilde{h}^{L-1}_D$, and then in a sequence of small differences $h^l_D - \tilde{h}^l_D$ (with $l$ decreasing from $L-2$ to 1) by the stable recovery of compressed sensing [25]. The small differences between forward activations and backward targets reduce the variations of $D_l$ and $W_l$ in optimization, and low variations over optimization tend to improve the network's generalization [11], [12], [13]. To achieve this property, the TP method in [3] constrains the difference between targets and activations when learning the auxiliary network. However, the network-based method lacks stability and is thus inferior to our compressed sensing-based method.

Algorithm 2 Online orthonormal dictionary learning [24]
1: Input: Dataset $X \in \mathbb{R}^{m \times n}$ of $n$ samples of length $m$, and initialized dictionary $D^{(0)}$.
2: for $j = 0$ to $J-1$ do
3: $S^{(j)} = T(D^{(j)} X)$, where $T(\cdot)$ is a thresholding function that zeroes out all but the $d$ largest elements (in magnitude) in each column of its matrix input.
4: Run the SVD $X S^{(j)\top} = P \Sigma Q^\top$.
5: $D^{(j+1)} = Q P^\top$.
6: end for
7: Output: Learned dictionary $D = D^{(J)}$.

C. Forward layerwise learning
Feeding forward the network input $h^0_D$, we successively update the parameters $\{W_l, D_l\}$ of each layer by matching their outputs $h^l_W$ and $h^l_D$ to their local targets $\tilde{h}^l_W$ and $\tilde{h}^l_D$. Formally, this process is realized by iterating (7)-(11), where the index $l$ increases from 1 to $L-1$. As a classifier, the final output layer is updated by (12). The resulting $\{W_l, D_l\}_{l=1}^{L}$ then constitute the final network. Note that the dictionary updating rules (9) and (10) are adapted from Algorithm 2. For tractable computation, we simply update $W_l$ with the ridge regression model (7), although more sophisticated models, such as the Lasso [26] and the elastic net [27], would probably yield better updates.
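A ridge regression update of a projection matrix has a closed form, sketched below. The sketch assumes the standard objective $\min_W \|T - WH\|_F^2 + \lambda \|W\|_F^2$, where the columns of $H$ are layer inputs and the columns of $T$ are the corresponding local targets; the symbols `H_in`, `T_out`, and `lam` are names chosen for this example, and the exact form of (7) may differ.

```python
import numpy as np

def ridge_update(H_in, T_out, lam=1e-2):
    """Closed-form minimizer of ||T_out - W H_in||_F^2 + lam * ||W||_F^2.

    H_in:  (n, N) layer inputs, one column per sample.
    T_out: (m, N) local targets for the layer output.
    Returns W of shape (m, n).
    """
    n = H_in.shape[0]
    G = H_in @ H_in.T + lam * np.eye(n)  # regularized Gram matrix (symmetric)
    # W G = T H^T  is equivalent to  G W^T = H T^T; solve instead of inverting.
    return np.linalg.solve(G, H_in @ T_out.T).T

rng = np.random.default_rng(0)
H = rng.standard_normal((5, 40))   # hypothetical inputs
W0 = rng.standard_normal((3, 5))   # hypothetical ground-truth projection
W_hat = ridge_update(H, W0 @ H, lam=1e-10)
print("max deviation from W0:", np.abs(W_hat - W0).max())
```

Solving the linear system rather than forming an explicit inverse is a numerical design choice; with noiseless targets and a negligible `lam`, the update returns the generating matrix up to rounding.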
For large-scale data, such as ImageNet [28], we suggest processing the data in batches and performing layerwise learning iteratively. For small or medium-scale data, such as MNIST [29] and CIFAR10 [30], the learning can be accomplished in a single pass. In this case, compressed sensing-based target propagation needs to be performed only once, implying that the dictionaries are not required to be orthonormal in the final forward optimization and can be simply updated by (13), instead of the SVD method used in (9) and (10).

III. EXPERIMENTS
In this section, we aim to show that the proposed compressed sensing-based target propagation (CSTP) method achieves better performance than the auxiliary network-based target propagation (ANTP) method [3] on the layerwise learning of deep networks. Moreover, we analyze the proposed CSTP method by an ablation study. For comparison, as in [3], we test the proposed CSTP method on two typical datasets, MNIST [29] and CIFAR10 [30]. The MNIST dataset consists of 28 × 28 gray-scale images of handwritten digits. It has a training set of 60,000 samples and a test set of 10,000 samples. The CIFAR10 dataset consists of 32 × 32 natural color images in 10 classes. There are 50,000 samples in the training set and 10,000 samples in the test set.

A. Settings
In the following, we briefly introduce the parameter settings of the proposed CSTP-based layerwise learning algorithm. As shown in Algorithm 1, the algorithm consists of three phases. In the initial feedforward phase, the dictionaries are learned with Algorithm 2, for which we set the iteration number to 20 and the sparsity parameter to $d \approx m_l/3$. The thresholding function $f_l(\cdot)$ keeps about $k_l \approx m_l/2$ positive elements. By (3), we initialize the output layer $W_L$, namely the classifier. To obtain better performance, we suggest solving (3) with stochastic gradient descent (SGD), for which we set the momentum to 0.9, the weight decay to 0.0005, the batch size to 64, and the number of epochs to 30. The learning rate is initialized as 0.01 and decayed by a factor of 0.9 every 4 epochs. In the target backpropagation phase, we first diversify the targets by (6), with $\beta = 0.2$. Then the targets are backpropagated by (4), which is implemented with OMP [22], with $k_l \approx n_l/3$. In the final layerwise learning phase, we update $W_l$ and $D_l$ by directly solving (7) and (13). The output layer $W_L$ is updated in (12) by SGD, with the same parameters as set for (3). The above parameter settings are adopted for both MNIST and CIFAR10.
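The learning-rate schedule stated above (initial rate 0.01, multiplied by 0.9 every 4 epochs, 30 epochs in total) can be written as a small helper. The function name and the choice of 0-based epochs, with the first decay applied at epoch 4, are assumptions for illustration.

```python
def learning_rate(epoch, base_lr=0.01, decay=0.9, step=4):
    """Step decay: multiply by `decay` once every `step` epochs (epoch is 0-based)."""
    return base_lr * decay ** (epoch // step)

# Learning rate for each of the 30 training epochs.
schedule = [learning_rate(e) for e in range(30)]
print(schedule[0], schedule[4], schedule[-1])
```

Over 30 epochs the rate is decayed seven times, so the last epochs run at $0.01 \times 0.9^7$.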

1) Performance comparison on CIFAR10: For layerwise learning with CSTP, we obtain a test error of 47.76% on a three-layer network of size 3072-2500-1500-10. In contrast, ANTP reports in [3] an error of 49.29% on a four-layer network of size 3072-1000-1000-1000-10. This means that CSTP can achieve higher accuracy than ANTP when given appropriate network sizes. Recall that the network size of CSTP is constrained by compressed sensing. Similarly to ANTP [3], the CSTP-based layerwise learning performs worse than the prevailing BP method, with an accuracy gap of about 4% on the network we test. As discussed in [31], the inferior performance is due to the mismatch between the poorly separable features and the strong supervision constraint imposed on each layer.
2) Performance comparison on MNIST: In [3], ANTP achieves a test error of 1.94% on a network with 7 hidden layers, each consisting of 240 units. The error is reduced by about 0.15% when we test CSTP on a wide but shallow network of size 784-3500-2500-10. From Table 1, it is seen that wider layers tend to yield better performance for CSTP. The reason is as follows. The performance of CSTP depends on the dictionary learning-based feature selection and the compressed sensing-based target propagation. For dictionary learning, higher dimensions (namely wider layers) usually lead to sparser transforms, which are beneficial not only to feature selection but also to compressed sensing.
3) Ablation study of $D_l$ and $f_l(\cdot)$: To see the importance of the dictionary $D_l$ and the nonlinear function $f_l(\cdot)$ in CSTP, we remove them from Algorithm 1. Even after fine-tuning, we still observe an accuracy reduction of about 1.2% for CIFAR10 and 0.7% for MNIST. If only $f_l(\cdot)$ is removed, the accuracy decreases by about 0.3% for CIFAR10 and 0.1% for MNIST.

IV. CONCLUSION
In this letter, we have shown that a cascade of compressed sensing processes can be established as a reversible network, which allows for layerwise learning and analysis. The optimization target of each layer is derived by inversely propagating the target of the network via compressed sensing. The method avoids learning an auxiliary network dedicated to propagating the target, reducing the complexity of network learning. The method also achieves favorable performance owing to the stability of compressed sensing. The proposed target propagation involves two major computations, dictionary learning and sparse recovery, both of which have polynomial complexity. To reduce the complexity, it would be interesting to sparsify the structures of the dictionaries and projection matrices, such that combinatorial algorithms with linear complexity could be adopted [32], [33]. This is left for future work. Note that in this letter we focus on the layerwise learning method that optimizes exactly one layer at a time and thus reduces network analysis to the layer level. Currently, this method has difficulty achieving BP-level performance, especially on large-scale data [31]. The reason is that the finite number of parameters in one layer can hardly handle the matching between poorly separable features and strong supervision constraints [31]. The problem could be addressed if more than one layer is learned at a time [5], [6].