Learning by Passing Tests, with Application to Neural Architecture Search

Learning through tests is a broadly used methodology in human learning and has shown great effectiveness in improving learning outcomes: a sequence of tests with increasing levels of difficulty is created; the learner takes these tests to identify his/her weak points in learning and continuously addresses these weak points so as to pass the tests. We are interested in investigating whether this powerful learning technique can be borrowed from humans to improve the learning abilities of machines. We propose a novel learning approach called learning by passing tests (LPT). In our approach, a tester model creates increasingly more-difficult tests to evaluate a learner model. The learner tries to continuously improve its learning ability so that it can pass however difficult the tests created by the tester are. We propose a multi-level optimization framework to formulate LPT, in which the tester learns to create difficult and meaningful tests and the learner learns to pass these tests. We develop an efficient algorithm to solve the LPT problem. Our method is applied to neural architecture search and achieves significant improvement over state-of-the-art baselines on CIFAR-100, CIFAR-10, and ImageNet.


Introduction
In human learning, an effective and widely used methodology for improving learning outcome is to let the learner take increasingly more-difficult tests. To successfully pass a more challenging test, the learner needs to gain better learning ability. By progressively passing tests that have increasing levels of difficulty, the learner strengthens his/her learning capability gradually.
Inspired by this test-driven learning technique of humans, we are interested in investigating whether this methodology is helpful for improving machine learning as well. We propose a novel learning framework called learning by passing tests (LPT). In this framework, there is a "learner" model and a "tester" model. The tester creates a sequence of "tests" with growing levels of difficulty. The learner tries to learn better so that it can pass these increasingly more-challenging tests. Given a large collection of data examples called a "test bank", the tester creates a test T by selecting a subset of examples from the test bank. The learner applies its intermediately-trained model M to make predictions on the examples in T. The prediction error rate R reflects how difficult this test is. If the learner can make correct predictions on T, it means that T is not difficult enough, and the tester will create a more challenging test next time.


Related Work

Neural Architecture Search
Existing NAS methods include reinforcement learning (RL) approaches (Zoph and Le, 2017; Pham et al., 2018), evolutionary learning approaches (Liu et al., 2018b; Real et al., 2019), and differentiable approaches (Cai et al., 2019). In RL-based approaches, a policy is learned to iteratively generate new architectures by maximizing a reward, which is the accuracy on the validation set. Evolutionary learning approaches represent architectures as individuals in a population: individuals with high fitness scores (validation accuracy) have the privilege to generate offspring, which replace individuals with low fitness scores. Differentiable approaches adopt a network pruning strategy: on top of an over-parameterized network, the weights of connections between nodes are learned using gradient descent, and weights close to zero are pruned later on. Many efforts have been devoted to improving differentiable NAS methods. In P-DARTS, the depth of searched architectures is allowed to grow progressively during the training process.
Search space approximation and regularization approaches have been developed to reduce computational overhead and improve search stability. PC-DARTS (Xu et al., 2020) reduces the redundancy in exploring the search space by sampling a small portion of a super-network: operation search is performed in a subset of channels, with the held-out channels bypassed through a shortcut. Our proposed LPT framework can be applied to any differentiable NAS method.

Adversarial Learning
Our formulation involves a min-max optimization problem, which is analogous to that in adversarial learning. Adversarial learning (Goodfellow et al., 2014a) has been widely applied to 1) data generation (Goodfellow et al., 2014a; Yu et al., 2017), where a discriminator tries to distinguish between generated images and real images and a generator is trained to generate realistic data by making such discrimination difficult to achieve; 2) domain adaptation (Ganin and Lempitsky, 2015), where a discriminator tries to differentiate between source images and target images while the feature learner learns representations that make such discrimination unachievable; and 3) adversarial attack and defense (Goodfellow et al., 2014b), where an attacker adds small perturbations to the input data to alter the prediction outcome and the defender trains the model so that the prediction outcome remains the same given perturbed inputs. Different from these existing works, in our work a tester aims to create harder tests to "fail" the learner, while the learner learns to "pass" however hard the tests created by the tester are. Shu et al. (2020) proposed to use an adversarial examiner to identify the weaknesses of a trained model. Our work differs from theirs in that we progressively re-train a learner model based on how it performs on the tests dynamically created by a tester model, whereas the learner model in (Shu et al., 2020) is fixed and not affected by the examination results.

Methods
In this section, we propose a framework for learning by passing tests (LPT) and develop an optimization algorithm for solving the LPT problem. In our framework, both the learner and the tester perform learning. The learner studies how to best fulfill its target task J1; the tester studies how to create tests that are difficult and meaningful. In the learner's model, there are two sets of learnable parameters: the model architecture and the network weights. Both are used to make predictions in J1. The tester's model performs two tasks simultaneously: creating tests and performing its own target task J2. The model has three learnable modules: a data encoder, a test creator, and a target-task executor, where the test creator performs the task of generating tests and the target-task executor conducts J2. The test creator and the target-task executor share the same data encoder. The data encoder takes a data example d as input and generates a latent representation of this example. The representation is fed into the test creator, which determines whether d should be selected into the test, and into the target-task executor, which makes a prediction on d when performing the target task J2.
In our framework, the learning of the learner and the tester is organized into three stages. In the first stage, the learner learns its network weights W by minimizing the training loss L(A, W, D^(tr)_ln) defined on the training data D^(tr)_ln of the task J1. The architecture A is used to define the training loss, but it is not learned in this stage: if A were learned by minimizing this training loss, a trivial solution would be yielded where A is so large and complex that it perfectly overfits the training data but generalizes poorly on unseen data. Let W*(A) denote the optimally learned W in this stage. Note that W* is a function of A, because W* is a function of the training loss and the training loss is a function of A. In the second stage, the tester learns its data encoder E and target-task executor X by minimizing the training loss L(E, X, D^(tr)_tt) + γ L(E, X, σ(C, E, D_b)) of the task J2. This training loss consists of two parts. The first part, L(E, X, D^(tr)_tt), is defined on the training dataset D^(tr)_tt of J2. The second part, L(E, X, σ(C, E, D_b)), is defined on the test σ(C, E, D_b) created by the test creator: each example d in the test bank D_b is first fed into the encoder E and then into the creator C, which outputs a binary value indicating whether d should be selected into the test; σ(C, E, D_b) is the collection of examples whose binary value equals 1. γ is a tradeoff parameter between these two parts of the loss. The creator C is used to define the second part of the loss, but it is not learned in this stage; otherwise, a trivial solution would be yielded where C always sets the binary value to 0 for each test-bank example, so that the second part of the loss becomes 0. Let E*(C) and X*(C) denote the optimally trained E and X in this stage. Note that they are both functions of C, since they are functions of the training loss and the training loss is a function of C.
In the third stage, the learner learns its architecture by trying to pass the test σ(C, E*(C), D_b) created by the tester. Specifically, the learner aims to minimize the predictive loss of its model (A, W*(A)) on the test, L(A, W*(A), σ(C, E*(C), D_b)); a small loss indicates that the learner performs well on this test. Meanwhile, the tester learns its test creator C so that C creates tests with more difficulty and meaningfulness. Difficulty is measured by the learner's predictive loss L(A, W*(A), σ(C, E*(C), D_b)) on the test: given a model (A, W*(A)) of the learner and two tests of the same size (same number of examples), the test on which the learner incurs a larger predictive loss is considered more difficult. Therefore, the tester can learn to create a more challenging test by maximizing L(A, W*(A), σ(C, E*(C), D_b)). However, a trivial way to increase this loss is to enlarge the size of the test, and a larger size does not imply more difficulty. To discourage this degenerate solution, we normalize the loss by the size of the test: L(A, W*(A), σ(C, E*(C), D_b)) / |σ(C, E*(C), D_b)|, where |σ(C, E*(C), D_b)| is the cardinality of the set σ(C, E*(C), D_b). To measure the meaningfulness of a test, we check how well the optimally trained target-task executor X*(C) and data encoder E*(C) of the tester perform on the validation data D^(val)_tt of the target task J2, measured by the validation loss L(E*(C), X*(C), D^(val)_tt). E*(C) and X*(C) are trained using the test generated by C in the second stage. If the validation loss is small, it means that the created test is helpful in training the task executor and is therefore considered meaningful. To create a meaningful test, the tester learns C by minimizing L(E*(C), X*(C), D^(val)_tt). In sum, C is learned by maximizing L(A, W*(A), σ(C, E*(C), D_b)) / |σ(C, E*(C), D_b)| − λ L(E*(C), X*(C), D^(val)_tt), where λ is a tradeoff parameter between these two objectives.
The three stages are mutually dependent: W * (A) learned in the first stage and E * (C) and X * (C) learned in the second stage are used to define the objective function in the third stage; the updated C and A in the third stage in turn change the objective functions in the first and second stage, which subsequently render W * (A), E * (C), and X * (C) to be changed. Putting these pieces together, we formulate LPT as the following multi-level optimization problem.
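The display that should follow here (Eq. (3)) did not survive extraction; based on the three stages above, it can be reconstructed as the tri-level problem below (a reconstruction from the surrounding text, not a verbatim copy of the original equation):

```latex
\begin{aligned}
\max_{C}\,\min_{A}\quad
& \frac{L\big(A, W^{*}(A), \sigma(C, E^{*}(C), D_b)\big)}
       {\big|\sigma(C, E^{*}(C), D_b)\big|}
  \;-\; \lambda\, L\big(E^{*}(C), X^{*}(C), D^{(val)}_{tt}\big) \\
\text{s.t.}\quad
& E^{*}(C), X^{*}(C) = \operatorname*{arg\,min}_{E,\,X}\;
  L\big(E, X, D^{(tr)}_{tt}\big) + \gamma\, L\big(E, X, \sigma(C, E, D_b)\big) \\
& W^{*}(A) = \operatorname*{arg\,min}_{W}\; L\big(A, W, D^{(tr)}_{ln}\big)
\end{aligned}
```

The two constraints encode the first and second learning stages, and the outer max-min objective encodes the third.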
This formulation nests three optimization problems. In the constraints of the outer optimization problem are two inner optimization problems, corresponding to the first and second learning stages respectively; the objective function of the outer optimization problem corresponds to the third learning stage.
As of now, the test σ(C, E, D_b) is represented as a subset, which is highly discrete and therefore difficult to optimize. To address this problem, we perform a continuous relaxation of σ(C, E, D_b): for each example d in the test bank, the original binary value indicating whether d should be selected is relaxed to a continuous probability f(d, C, E) representing how likely d should be selected. Under this relaxation, L(E, X, σ(C, E, D_b)) can be computed by calculating the loss ℓ(E, X, d) on each test-bank example and weighing this loss by f(d, C, E): if f(d, C, E) is small, it means that d is less likely to be selected into the test, and its corresponding loss is down-weighted. Similarly, the test size |σ(C, E, D_b)| is relaxed to the sum of the probabilities f(d, C, E) over the test bank. Following prior differentiable NAS work, we represent the architecture A of the learner in a differentiable way. The search space of A is composed of a large number of building blocks, and the output of each block is associated with a variable a indicating how important this block is. After learning, the blocks whose a values are among the largest are retained to form the final architecture. To this end, architecture search amounts to optimizing the set of architecture variables A = {a}.
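The relaxed test loss and the size-normalized difficulty described above can be sketched as follows. This is a minimal illustration assuming the per-example losses and selection probabilities have already been computed; the function names are invented for exposition.

```python
import numpy as np

def relaxed_test_loss(per_example_losses, select_probs):
    """L(E, X, sigma(C, E, D_b)) under the relaxation: sum_d f(d) * loss(d)."""
    return float(np.sum(select_probs * per_example_losses))

def normalized_difficulty(per_example_losses, select_probs):
    """Test difficulty normalized by the soft test size |sigma| = sum_d f(d)."""
    soft_size = np.sum(select_probs)  # relaxed cardinality of the test
    return float(np.sum(select_probs * per_example_losses) / soft_size)

# Toy numbers: three test-bank examples with losses 1, 2, 3 and
# selection probabilities 0, 0.5, 1.
losses = np.array([1.0, 2.0, 3.0])
probs = np.array([0.0, 0.5, 1.0])
```

Note that the normalization keeps the tester from inflating difficulty by simply selecting more examples: adding an easy example raises the numerator less than the denominator.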

Optimization Algorithm
In this section, we derive an optimization algorithm to solve the LPT problem defined in Eq.
(3). Following the approximation strategy of differentiable NAS, we approximate E*(C) and X*(C) using one-step gradient descent updates of E and X with respect to L(E, X, D^(tr)_tt) + γ L(E, X, σ(C, E, D_b)), and approximate W*(A) using a one-step gradient descent update of W with respect to L(A, W, D^(tr)_ln). We then plug these approximations into the objective and perform gradient-descent updates of C and A with respect to the approximated objective. In the sequel, we use

W' = W − ξ_ln ∇_W L(A, W, D^(tr)_ln),

where ξ_ln is a learning rate. Simplifying the notation of σ(C, E*(C), D_b) as σ, we can calculate the approximated gradient of L(A, W*(A), σ) w.r.t. A as:

∇_A L(A, W*(A), σ) ≈ ∇_A L(A, W', σ) − ξ_ln ∇²_{A,W} L(A, W, D^(tr)_ln) ∇_{W'} L(A, W', σ).

The second term involves an expensive matrix-vector product, whose computational complexity can be reduced by a finite-difference approximation:

∇²_{A,W} L(A, W, D^(tr)_ln) ∇_{W'} L(A, W', σ) ≈ (∇_A L(A, W^+, D^(tr)_ln) − ∇_A L(A, W^−, D^(tr)_ln)) / (2 α_ln),

where W^± = W ± α_ln ∇_{W'} L(A, W', σ) and α_ln is a small scalar that equals 0.01 / ‖∇_{W'} L(A, W', σ)‖₂. We approximate E*(C) and X*(C) using the following one-step gradient descent updates of E and X respectively:

E' = E − ξ_E ∇_E [L(E, X, D^(tr)_tt) + γ L(E, X, σ(C, E, D_b))],
X' = X − ξ_X ∇_X [L(E, X, D^(tr)_tt) + γ L(E, X, σ(C, E, D_b))],

where ξ_E and ξ_X are learning rates. Plugging these approximations into the objective function, we can learn C by maximizing the following objective using gradient methods:

L(A, W', σ(C, E', D_b)) / |σ(C, E', D_b)| − λ L(E', X', D^(val)_tt).

The derivative of the second term in this objective with respect to C can be calculated as:

∇_C L(E', X', D^(val)_tt) = (∂E'/∂C) ∇_{E'} L(E', X', D^(val)_tt) + (∂X'/∂C) ∇_{X'} L(E', X', D^(val)_tt),

where ∂E'/∂C = −ξ_E γ ∇²_{C,E} L(E, X, σ(C, E, D_b)) and ∂X'/∂C = −ξ_X γ ∇²_{C,X} L(E, X, σ(C, E, D_b)). As above, finite differences can be used to approximate ∇²_{C,E} L(E, X, σ(C, E, D_b)) ∇_{E'} L(E', X', D^(val)_tt) and ∇²_{C,X} L(E, X, σ(C, E, D_b)) ∇_{X'} L(E', X', D^(val)_tt); for example,

∇²_{C,E} L(E, X, σ(C, E, D_b)) ∇_{E'} L(E', X', D^(val)_tt) ≈ (∇_C L(E^+, X, σ(C, E^+, D_b)) − ∇_C L(E^−, X, σ(C, E^−, D_b))) / (2 α_E),

where E^± = E ± α_E ∇_{E'} L(E', X', D^(val)_tt), and analogously for X. For the first term L(A, W', σ(C, E', D_b)) / |σ(C, E', D_b)| in the objective, we use the chain rule to calculate its derivative w.r.t. C, which involves calculating the derivatives of L(A, W', σ(C, E', D_b)) and |σ(C, E', D_b)| w.r.t. C. The derivative of L(A, W', σ(C, E', D_b)) w.r.t. C can be calculated as:

∇_C L(A, W', σ(C, E', D_b)) + (∂E'/∂C) ∇_{E'} L(A, W', σ(C, E', D_b)),

where ∂E'/∂C is given above, and ∇²_{C,E} L(E, X, σ(C, E, D_b)) ∇_{E'} L(A, W', σ(C, E', D_b)) can be approximated with (∇_C L(E^+, X, σ(C, E^+, D_b)) − ∇_C L(E^−, X, σ(C, E^−, D_b))) / (2 α_E), where now E^± = E ± α_E ∇_{E'} L(A, W', σ(C, E', D_b)).
The algorithm for solving LPT is summarized in Algorithm 1.
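To make the one-step approximation and the finite-difference trick concrete on something checkable, here is a toy example with scalar quadratic losses. The losses, learning rates, and function names are invented for illustration; they are not the paper's actual objectives.

```python
# Toy bilevel problem (assumption for illustration):
#   inner:  L_tr(a, w)   = (w - a)^2, minimized over w
#   outer:  L_test(w)    = w^2, differentiated w.r.t. a through w*(a)
def grad_w_train(a, w):
    """d/dw of L_tr(a, w) = (w - a)^2."""
    return 2.0 * (w - a)

def grad_a_train(a, w):
    """d/da of L_tr(a, w) = (w - a)^2."""
    return -2.0 * (w - a)

def hypergrad_a(a, w, xi=0.1, alpha=1e-2):
    """Approximate d L_test(w*(a)) / da with one inner gradient step.

    Mirrors the paper's scheme: replace w*(a) by the one-step update w1,
    then estimate the mixed second derivative times a vector by a
    symmetric finite difference.
    """
    w1 = w - xi * grad_w_train(a, w)   # one-step approximation of w*(a)
    g = 2.0 * w1                       # grad of L_test at w1
    # finite-difference estimate of (d^2 L_tr / da dw) @ g
    wp, wm = w + alpha * g, w - alpha * g
    hvp = (grad_a_train(a, wp) - grad_a_train(a, wm)) / (2.0 * alpha)
    return -xi * hvp                   # the chain-rule correction term
```

For this toy problem L_test has no direct dependence on a, so the whole hypergradient is the correction term 2·ξ·g, and the finite-difference estimate is exact because the inner loss is quadratic: hypergrad_a(0.3, 1.0) returns 0.344.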

Experiments
We apply LPT to neural architecture search in image classification tasks. Following standard practice in differentiable NAS, we first perform an architecture search that finds an optimal cell, then perform architecture evaluation: multiple copies of the searched cell are composed into a large network, which is trained from scratch and evaluated on the test set. We let the target task of the learner and that of the tester be the same.

Datasets
We used three datasets in the experiments: CIFAR-10, CIFAR-100, and ImageNet (Deng et al., 2009). The CIFAR-10 dataset contains 50K training images and 10K testing images from 10 classes (with an equal number of images in each class). We split the original 50K training set into a new 25K training set and a 25K validation set; in the sequel, "training set" always refers to the new 25K training set. During architecture search, the training set is used as the training data of the tester. During architecture evaluation, the combination of the training data and validation data is used to train the large network stacking multiple copies of the searched cell. The CIFAR-100 dataset contains 50K training images and 10K testing images from 100 classes (with an equal number of images in each class). Similar to CIFAR-10, the 50K training images are split into a 25K training set and a 25K validation set; the usage of the new training set and validation set is the same as for CIFAR-10. The ImageNet dataset contains a training set of 1.2M images and a validation set of 50K images from 1000 object classes; the validation set is used as a test set for architecture evaluation. We evaluate the architectures searched using CIFAR-10 and CIFAR-100 on ImageNet: given a cell searched using CIFAR-10 or CIFAR-100, multiple copies of it compose a large network, which is then trained on the 1.2M training images of ImageNet and evaluated on the 50K test images.

Experimental Settings
Our framework is a general one that can be used together with any differentiable search method. Specifically, we apply our framework to the following NAS methods: 1) DARTS, 2) P-DARTS, 3) DARTS+ (Liang et al., 2019b), and 4) DARTS− (Chu et al., 2020a). The search spaces in these methods are similar. The candidate operations include: 3×3 and 5×5 separable convolutions, 3×3 and 5×5 dilated separable convolutions, 3×3 max pooling, 3×3 average pooling, identity, and zero. In LPT, the network of the learner is a stack of multiple cells, each consisting of 7 nodes. For the data encoder of the tester, we tried ResNet-18 and ResNet-50 (He et al., 2016b). The test creator and the target-task executor are each set to one feed-forward layer. λ and γ are both set to 1.
For CIFAR-10 and CIFAR-100, during architecture search, the learner's network is a stack of 8 cells, with the initial channel number set to 16. The search is performed for 50 epochs, with a batch size of 64. The hyperparameters for the learner's architecture and weights are set in the same way as in DARTS, P-DARTS, DARTS+, and DARTS−. The data encoder and target-task executor of the tester are optimized using SGD with a momentum of 0.9 and a weight decay of 3e-4. The initial learning rate is set to 0.025 with a cosine decay scheduler. The test creator is optimized with the Adam (Kingma and Ba, 2014) optimizer with a learning rate of 3e-4 and a weight decay of 1e-3. During architecture evaluation, 20 copies of the searched cell are stacked to form the learner's network, with the initial channel number set to 36. The network is trained for 600 epochs with a batch size of 96 (for both CIFAR-10 and CIFAR-100). The experiments are performed on a single Tesla V100. For ImageNet, we take the architecture searched on CIFAR-10 and evaluate it on ImageNet: we stack 14 cells (searched on CIFAR-10) to form a large network and set the initial channel number to 48. The network is trained for 250 epochs with a batch size of 1024 on 8 Tesla V100s. Each experiment on LPT is repeated ten times with random seeds 1 through 10, and we report the mean and standard deviation of the results obtained from the 10 runs. Table 2 shows the classification error (%), number of weight parameters (millions), and search cost (GPU days) of different NAS methods on CIFAR-100. From this table, we make the following observations. First, when our method LPT is applied to different NAS baselines, including DARTS-1st (first-order approximation), DARTS-2nd (second-order approximation), DARTS−, DARTS+, and P-DARTS, the classification errors of these baselines can be significantly reduced. For example, applying our method to P-DARTS, the error reduces from 17.49% to 16.28%.
Applying our method to DARTS-2nd, the error reduces from 20.58% to 18.40%. This demonstrates the effectiveness of our method in searching for a better architecture. In our method, the learner continuously improves its architecture by passing the tests created by the tester with increasing levels of difficulty. These tests can help the learner to identify the weaknesses of its architecture and provide guidance on how to improve it. Our method creates a new test on the fly based on how the learner performs in the previous round. From the test bank, the tester selects a subset of difficult examples to evaluate the learner. This new test poses a greater challenge to the learner.

Table 2 (recovered excerpt):
Method | Error (%) | Param (M) | Cost (GPU days)
*ResNet (He et al., 2016a) | 22.10 | 1.7 | -
*DenseNet (Huang et al., 2017) | 17.18 | 25.6 | -
*PNAS (Liu et al., 2018a) | 19.53 | 3.2 | 150
*ENAS (Pham et al., 2018) | 19.43 | 4.6 | 0.5
*AmoebaNet (Real et al., 2019) | 18.93 | 3.1 | 3150
*GDAS (Dong and Yang, 2019) | 18.38 | 3.4 | 0.2
*R-DARTS (Zela et al., 2020) | 18.01±0.26 | - | 1.6
*DropNAS (Hong et al., 2020) | 16.39 | 4.4 | 0.7
†DARTS-1st | 20. | |

LPT-R18-DARTS-1st denotes that our method LPT is applied to the search space of DARTS; similar meanings hold for other notations in this format. R18 and R50 denote that the data encoder of the tester in LPT is set to ResNet-18 and ResNet-50 respectively. DARTS-1st and DARTS-2nd denote that first-order and second-order approximation is used in DARTS. * means the results are taken from DARTS− (Chu et al., 2020a). † means we re-ran this method 10 times. ∆ means the algorithm ran for 600 epochs instead of 2000 in the architecture evaluation stage, to ensure a fair comparison with other methods (where the epoch number is 600). The search cost is measured in GPU days on a Tesla V100.
Second, the numbers of weight parameters and the search costs corresponding to our methods are on par with those of the differentiable NAS baselines. This shows that LPT is able to search for better-performing architectures without significantly increasing network size or search cost. A few additional remarks: 1) on CIFAR-100, DARTS-2nd, which uses second-order approximation in the optimization algorithm, is not advantageous compared with DARTS-1st, which uses first-order approximation; 2) in our run of DARTS−, the performance reported in (Chu et al., 2020a) could not be achieved; 3) in our run of DARTS+, in the architecture evaluation stage, we set the number of epochs to 600 instead of the 2000 used in (Liang et al., 2019a), to ensure a fair comparison with other methods (where the epoch number is 600). Table 3 shows the classification error (%), number of weight parameters (millions), and search cost (GPU days) of different NAS methods on CIFAR-10. As can be seen, applying our proposed LPT to DARTS-1st, DARTS-2nd, DARTS−, and DARTS+ significantly reduces the errors of these baselines. For example, with LPT, the error of DARTS-2nd is reduced from 2.76% to 2.68%. This further demonstrates the efficacy of our method in searching for better-performing architectures by creating tests with increasing levels of difficulty and improving the learner through taking these tests. Table 4 shows the results on ImageNet, including top-1 and top-5 classification errors on the test set, number of weight parameters (millions), and search cost (GPU days). We take the architecture searched by LPT-R18-DARTS-2nd on CIFAR-10 and evaluate it on ImageNet. As can be seen, applying our LPT method to DARTS-2nd reduces the top-1 error from 26.7% to 25.3% and the top-5 error from 8.7% to 7.9%, without increasing the search cost or the number of parameters. This further demonstrates the effectiveness of our method.

Ablation Studies
In order to evaluate the effectiveness of individual modules in LPT, we compare the full LPT framework with the following ablation settings.
• Ablation setting 1. In this setting, the tester creates tests solely by maximizing their level of difficulty, without considering their meaningfulness. Accordingly, the second stage of LPT, where the tester learns to perform the target task by leveraging the created tests, is removed. The tester directly learns a selection scalar s(d) ∈ [0, 1] for each example d in the test bank, without going through a data encoder or a test creator. The corresponding formulation is as follows, where S = {s(d) | d ∈ D_b}. In this study, λ and γ are both set to 1. The data encoder of the tester is ResNet-18. For CIFAR-100, to avoid performance collapse caused by skip connections, LPT is applied to P-DARTS. For CIFAR-10, LPT is applied to DARTS-2nd.
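The display for this formulation did not survive extraction. Given the description (difficulty only, normalized by the soft test size, with only the learner's inner problem remaining), a plausible reconstruction is:

```latex
\begin{aligned}
\max_{S}\,\min_{A}\quad
& \frac{\sum_{d \in D_b} s(d)\, \ell\big(A, W^{*}(A), d\big)}
       {\sum_{d \in D_b} s(d)} \\
\text{s.t.}\quad
& W^{*}(A) = \operatorname*{arg\,min}_{W}\; L\big(A, W, D^{(tr)}_{ln}\big)
\end{aligned}
```

Here the λ-weighted meaningfulness term and the tester's second-stage constraint from the full LPT formulation are dropped, matching the ablation's design.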
• Ablation setting 2. In this setting, in the second stage of LPT, the tester is trained solely on the created test, without using the training data of the target task. Table 6 shows the results for this setting, where "Test only" denotes that the tester is trained to perform the target task using only the created test, and "Test + Training data" denotes that the tester is trained using both the test and the training data of the target task.
In this study, λ and γ are both set to 1. The data encoder of the tester is ResNet-18. For CIFAR-100, to avoid performance collapse because of skip connections, LPT is applied to P-DARTS. For CIFAR-10, LPT is applied to DARTS-2nd.
• Ablation study on λ. We are interested in how the learner's performance varies as the tradeoff parameter λ in Eq.
(3) increases. In this study, the other tradeoff parameter γ in Eq. (3) is set to 1. For both CIFAR-100 and CIFAR-10, we randomly sample 5K examples from the 25K training and 25K validation data and use them as a test set to report performance in this ablation study. The remaining 45K examples (22.5K training data and 22.5K validation data) are used for architecture search and evaluation. The tester's data encoder is ResNet-18. LPT is applied to P-DARTS.
• Ablation study on γ. We investigate how the learner's performance varies as γ increases.
In this study, the other tradeoff parameter λ is set to 1. Similar to the ablation study on λ, we report the performance of architectures searched and evaluated on the 45K data, measured on the 5K randomly-sampled test data. The tester's data encoder is ResNet-18. LPT is applied to P-DARTS.
The results for ablation setting 1 show that creating tests solely by maximizing difficulty is not sufficient; to avoid this problem, it is necessary to make the created tests meaningful. LPT achieves meaningfulness of the tests by making the tester leverage the created tests to perform the target task, and the results demonstrate that this is an effective way of improving meaningfulness. Table 6 shows the results for ablation setting 2. As can be seen, for both CIFAR-100 and CIFAR-10, using both the created test and the training data of the target task to train the tester performs better than using the test only. By leveraging the training data, the data encoder can be better trained, and a better encoder helps to create higher-quality tests. Figure 2 shows how classification errors change as λ increases. On both CIFAR-100 and CIFAR-10, when λ increases from 0.1 to 0.5, the error decreases; however, further increasing λ causes the error to increase. From the tester's perspective, λ explores a tradeoff between the difficulty and the meaningfulness of the tests. Increasing λ encourages the tester to create tests that are more meaningful, and more meaningful tests can more reliably evaluate the learner. However, if λ is too large, the tests are biased to be more meaningful and less difficult; lacking enough difficulty, they may not be compelling enough to drive the learner to improve. Such a tradeoff effect is observed in the results on CIFAR-10 as well. Figure 3 shows how classification errors change as γ increases. On both CIFAR-100 and CIFAR-10, when γ increases from 0.1 to 0.5, the error decreases; however, further increasing γ causes the error to increase. Under a larger γ, the created test plays a larger role in training the tester to perform the target task.
This implicitly encourages the test creator to generate tests that are more meaningful. However, if γ is too large, the training is dominated by the created test, which incurs the following risk: if the test is not meaningful, it results in a poor-quality data encoder, which further degrades the quality of test creation.

Conclusions
In this paper, we propose a new machine learning approach, learning by passing tests (LPT), inspired by the test-driven learning technique of humans. In LPT, a tester model creates a sequence of tests with growing levels of difficulty, and a learner model continuously improves its learning ability by striving to pass these increasingly more-challenging tests. We propose a multi-level optimization framework to formalize LPT, in which the tester learns to select hard examples that cause the learner to make large prediction errors, and the learner refines its model to rectify these errors. Our framework is applied to neural architecture search and achieves significant improvement on CIFAR-100, CIFAR-10, and ImageNet.