Skillearn: Machine Learning Inspired by Humans' Learning Skills

Humans, as the most powerful learners on the planet, have accumulated many learning skills, such as learning through tests, interleaving learning, self-explanation, and active recall, to name a few. These learning skills and methodologies enable humans to learn new topics more effectively and efficiently. We are interested in investigating whether humans' learning skills can be borrowed to help machines learn better. Specifically, we aim to formalize these skills and leverage them to train better machine learning (ML) models. To achieve this goal, we develop a general framework, Skillearn, which provides a principled way to represent humans' learning skills mathematically and use the formally-represented skills to improve the training of ML models. In two case studies, we apply Skillearn to formalize two learning skills of humans, learning by passing tests and interleaving learning, and use the formalized skills to improve neural architecture search. Experiments on various datasets show that ML models trained using the skills formalized by Skillearn achieve significantly better performance.


Introduction
Given a group of human students, assuming they work equally hard, there are three major factors determining which students learn better than others, including intelligence, learning skills, and learning materials. People with higher intelligence quotient (IQ) are stronger learners. Learning materials, such as textbooks, video lectures, practice questions, etc. are also crucial in determining the quality of learning. Another vital factor impacting learning outcomes is learning skills. Oftentimes, students in the same class have similar IQ and have access to the same learning materials, but their final grades (which measure learning quality) have a large variance. The major differentiating factor is that different students have different levels of mastery of learning skills. Some students have better learning methodologies, which enable them to learn faster and better. In the long history of learning, humans have accumulated a lot of effective learning skills, such as learning through tests, interleaving learning, self-explanation, active recalling, etc.
Similar to human learning, the performance of machine learning (ML) models is also determined by several factors. In the current practice of ML, two dominant factors determining ML performance are the capacity of models and the abundance of data. ML model capacity is analogous to the intelligence of humans. From linear models such as support vector machines to nonlinear models such as deep neural networks, ML researchers have been continuously building more powerful ML models to deal with more complicated tasks. It is like the evolution of humans' brains, which become increasingly intelligent. Data for ML is analogous to learning materials for humans. ML models trained with more labeled data in general perform better. For intelligence and learning materials in human learning (HL), we identify their counterparts in machine learning as model capacity and data. We are interested in asking: do learning skills in HL have counterparts in ML as well? Can machines be equipped with effective learning skills as humans are? In this paper, we aim to address these questions. We propose a general framework, Skillearn, which draws inspiration from humans' learning skills and formulates them into machines' learning skills (MLS). These MLS are leveraged to train better ML models. In Skillearn, there are one or multiple learner models, each with one or multiple sets of learnable parameters such as weight parameters, architectures, hyperparameters, etc. Different learners interact with each other through interaction functions. The learning of all learners is organized into multiple stages, each involving a subset of learners. The stages have an order, but they are performed end-to-end in a multi-level optimization framework where later stages influence earlier stages and vice versa. We develop a unified optimization algorithm for solving the multi-level optimization problem in Skillearn.
In two case studies, we apply Skillearn to formalize two learning skills of humans, learning by passing tests (LPT) and interleaving learning (IL), into machines' learning skills (MLS) and leverage these MLS for neural architecture search (Zoph and Le, 2017; Real et al., 2019). In LPT, a tester model dynamically creates tests with increasing levels of difficulty to evaluate a testee model; the testee continuously improves its architecture by passing the increasingly difficult tests created by the tester. In IL, a set of models collaboratively learn a data encoder in an interleaving fashion: the encoder is trained by model 1 for a while, then passed to model 2 for further training, then model 3, and so on; after being trained by all models, the encoder returns to model 1 and is trained again, then moves on to models 2, 3, etc. This process repeats for multiple rounds. Experiments on various datasets demonstrate that ML models trained by these two learning skills achieve significantly better performance.
The major contributions of this work are as follows.
• We propose to leverage the broadly-used and effective learning skills in human learning to develop better machine learning methods.
• We propose Skillearn, a general framework for formulating humans' learning skills into machines' learning skills that can be leveraged by ML models for achieving better learning outcomes.
• We apply Skillearn to formalize two skills in human learning, learning by passing tests (LPT) and interleaving learning (IL), and apply them to improve neural architecture search.
• On various datasets, we demonstrate the effectiveness of the two skills formalized by Skillearn, LPT and IL, in learning better neural architectures.
The rest of the paper is organized as follows. Section 2 presents the general Skillearn framework. In Sections 3 and 4, we present two case studies, where Skillearn is applied to formalize two skills in human learning: learning by passing tests and interleaving learning. Section 5 reviews related work and Section 6 concludes the paper.

Skillearn: Machine Learning Inspired by Humans' Learning Skills
In this section, we present a general framework called Skillearn, which draws inspiration from humans' learning skills, formalizes these skills, and leverages them to improve machine learning. We begin with a brief overview of humans' learning skills and summarize their properties. Then we present the Skillearn framework and the optimization algorithm for this framework.

Humans' Learning Skills
Humans, as the most powerful learners on the planet, have accumulated a lot of skills and techniques in learning faster and better. Here are some examples.
• Learning through testing. After learning a topic, a student can solve some test problems (created or selected by a teacher) about this topic to identify the strong and weak points in his/her understanding of this topic, and re-learn the topic based on the identified strong and weak points. In re-learning, the identified strong and weak points help the student to know what to focus on. The quality of test problems plays a crucial role in effectively evaluating the student. How to create or select high-quality test problems is an important skill that the teacher needs to learn.
• Interleaving learning is a learning technique where a learner interleaves the studies of multiple topics: study topic A for a while, then switch to B, subsequently to C; then switch back to A, and so on, forming a pattern of ABCABCABC · · · . Interleaving learning is in contrast to blocked learning, which studies one topic very thoroughly before moving to another topic. Compared with blocked learning, interleaving learning increases long-term retention and improves ability to transfer learned knowledge.
• Learning by ignoring. In course learning, given a large collection of practice problems provided in the textbook, the teacher selects a subset of problems as homework for the students to practice instead of using all problems in the textbook. Some practice problems are ignored because 1) they are too difficult and might confuse the students; 2) they are too simple and are not effective in helping the students practice the knowledge learned during lectures; 3) they are repetitive.

Properties of Humans' Learning Skills
From the above examples of humans' learning skills, we observe the following properties of them.
• A learning event involves multiple learners. For example, in learning through testing, there are two learners: a student and a teacher. The teacher learns how to create test problems and the student learns how to solve these test problems.
• In a learning task, a learner has multiple aspects to learn about this task. For example, in learning by ignoring, to create effective homework problems, the teacher needs to learn: 1) how to solve these problems; 2) which problems are more valuable to use as homework.
• Different learners interact with each other during learning. For example, in learning through testing, the teacher creates test problems and uses them to evaluate the student.
• In a learning task, the learning process is divided into multiple stages. These stages have a certain order. Each stage involves a subset of learners. For example, in learning through testing, there are three stages: 1) the teacher learns a topic; 2) the teacher creates test problems about this topic and uses them to evaluate the student; 3) based on the strong and weak points identified during solving the test problems, the student re-learns this topic. The three stages have a sequential order and cannot be switched. The first stage involves the teacher only; the second stage involves both the teacher and the student; the third stage involves the student only.
• Testing and validation are widely used to evaluate the outcome of learning and provide feedback for improving learning. For example, in learning through testing, the student takes a test to identify the strong and weak points in his/her learning of a topic.
• Learning is performed on various learning materials, including textbooks used for initial learning, homework used for enhancing the understanding of knowledge learned from textbooks, tests used for evaluating the outcome of learning, etc.
[...] "teach" a student model (e.g., a decision tree) where the teacher predicts pseudo labels on unlabeled data, then these pseudo-labeled data examples are used to train the student model. A human learning event consists of multiple stages of learning. For example, in classroom learning, there could be three learning stages: 1) a teacher learns the course materials; 2) the teacher teaches these materials to students; 3) the students take tests to evaluate how well they learn. Analogously, in Skillearn, the learning involves multiple stages. For example, in knowledge distillation, there could be three learning stages: 1) a teacher model is trained; 2) the teacher performs knowledge distillation to "teach" a student model as described above; 3) the performance of the student model is evaluated. In human learning, tests are widely used to evaluate the learners and provide feedback for improving the learners. Analogously, ML models are validated for further improvement. In human learning, the learners learn from learning materials such as textbooks, lecture notes, homework, etc. Likewise, ML models are learned on various datasets, such as training data, validation data, and other auxiliary data.

General Framework of Skillearn
Based on the properties of humans' learning skills, we propose a framework called Skillearn to formalize the learning skills of humans and incorporate them into machine learning. In Skillearn, we have the following elements.
• Learners. There could be one or multiple learners. Each learner is an ML model, such as a deep convolutional network, a deep generative model, a nonparametric kernel density estimator, etc. This is analogous to human learning which involves one or multiple human learners.
• Learnable parameters. Each learner has one or more sets of learnable parameters, which could be weight parameters of a network, architecture of a network, weights of training examples, hyperparameters, etc. This is analogous to human learning where each human learner learns multiple aspects in a learning task.
• Interaction function, which describes how two or more learners interact with each other. Some examples of interaction include: 1) in knowledge distillation, given an unlabeled image dataset, model A predicts the pseudo labels of these images; then model B is trained using these images and the pseudo labels generated by model A; 2) given a set of texts, two text encoders A and B extract embeddings of the texts; A and B are tied together via distributional matching: the distribution of embeddings extracted by A is encouraged to have a small total variation distance from the distribution of embeddings extracted by B. This is analogous to human learning where multiple human learners interact with each other.
• Learning stages. The learning of all learners is not conducted at one shot simultaneously. The learning is performed at multiple stages with an order. At each stage, a subset of learners participate in the learning. For example, in knowledge distillation, there are two stages: 1) a teacher model is trained; 2) the teacher model predicts pseudo labels on an unlabeled dataset and the pseudo-labeled dataset is used to train the student model. The first stage involves a single learner, which is the teacher. The second stage involves two learners: the teacher and the student. This is analogous to human learning where the learning process is divided into multiple stages. Mathematically, we formulate the learning at each stage as an optimization problem. The outcome of one learning stage is passed to another learning stage via the interaction function.
• Validation stage. This stage evaluates the outcome of learning and provides feedback to improve the learning at the learning stages. This is analogous to the testing and validation in human learning. The validation stage is formulated as an optimization problem as well. The learning outcomes produced in the learning stages are passed to the validation stage.
• Datasets. Datasets in ML are analogous to learning materials in human learning. Each learner has a training dataset and a validation dataset. The training dataset is used in the learning stages and the validation dataset is used in the validation stage. Besides, there are auxiliary datasets (labeled or unlabeled) on which the learners interact with each other.
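The knowledge-distillation interaction listed above can be sketched in a few lines. This is only an illustration under simplifying assumptions: the teacher is a fixed threshold rule standing in for a trained model A, the student is a one-dimensional nearest-centroid classifier standing in for model B, and the names `teacher_predict` and `train_student` are hypothetical.

```python
from statistics import mean


def teacher_predict(x: float) -> int:
    """Stand-in for a trained teacher (model A): a fixed threshold rule."""
    return 1 if x > 0.5 else 0


def train_student(xs, pseudo_labels):
    """Fit a nearest-centroid student (model B) on pseudo-labeled data."""
    centroids = {}
    for label in set(pseudo_labels):
        members = [x for x, y in zip(xs, pseudo_labels) if y == label]
        centroids[label] = mean(members)

    def predict(x: float) -> int:
        # Return the label whose centroid is closest to x.
        return min(centroids, key=lambda label: abs(x - centroids[label]))

    return predict


# Interaction: the teacher labels an unlabeled (auxiliary) dataset,
# and the student is trained on the resulting pseudo-labeled examples.
unlabeled = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
pseudo_labels = [teacher_predict(x) for x in unlabeled]
student = train_student(unlabeled, pseudo_labels)
```

Here the unlabeled list plays the role of an auxiliary dataset on which the two learners interact; the student never sees ground-truth labels.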
Next, we define the learning stages. Each learning stage performs a focused learning activity which is defined as an optimization problem. The optimization problem involves a training loss and (optionally) an interaction function which describes how the learners involved in this stage interact with each other. A learning stage consists of the following elements:
• Active learners. A subset of learners (one or more) are involved at this learning stage. These learners are called active learners.
• Active learnable parameters. For each active learner, a sub-collection of its learnable parameter sets are trained in this stage.
• Supporting learnable parameters. For each active learner, a sub-collection of its learnable parameter sets are used to define the loss function and interaction function, but they are not updated at this stage.
• Active training datasets, which include the training dataset of every active learner.
• Active auxiliary datasets, which include the auxiliary datasets on which the interaction function in this learning stage is defined.
• Training loss, which is defined on the active training data collection, active learnable parameters, and supporting learnable parameters.
• Interaction function, which depicts the interaction between two or more active learners. It is defined on the active auxiliary datasets, active learnable parameters, and supporting learnable parameters.
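The elements of a learning stage listed above map naturally onto a small record type. The following is a minimal sketch, not part of the framework itself: the class name `LearningStage`, its field names, and the dataset label `J1_train` are all hypothetical, and the example instance encodes the first stage of the learning-by-passing-tests case study (the testee trains its weights with its architecture held fixed).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class LearningStage:
    """Hypothetical container for the elements of one Skillearn learning stage."""
    active_learners: List[str]
    active_params: Dict[str, List[str]]      # learner -> parameter sets updated in this stage
    supporting_params: Dict[str, List[str]]  # learner -> parameter sets used but kept fixed
    training_datasets: Dict[str, str]        # learner -> name of its training dataset
    auxiliary_datasets: List[str] = field(default_factory=list)
    training_loss: Optional[Callable[..., float]] = None
    interaction: Optional[Callable[..., float]] = None  # optional, may be absent

    def check(self) -> None:
        """Raise ValueError if a parameter set or dataset belongs to a non-active learner."""
        known = set(self.active_learners)
        for mapping in (self.active_params, self.supporting_params, self.training_datasets):
            for learner in mapping:
                if learner not in known:
                    raise ValueError(f"{learner!r} is not an active learner of this stage")


# First learning stage of learning by passing tests: no auxiliary data, no interaction.
lpt_stage_1 = LearningStage(
    active_learners=["testee"],
    active_params={"testee": ["weights"]},
    supporting_params={"testee": ["architecture"]},
    training_datasets={"testee": "J1_train"},
)
lpt_stage_1.check()
```

Keeping active and supporting parameter sets in separate fields makes explicit which variables a stage optimizes and which it only reads.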
In Skillearn, there is a single validation stage where an optimization problem is defined. The optimization problem involves one or more validation losses and (optionally) an interaction function which describes how the learners in the validation stage interact with each other. The validation stage consists of the following elements.
• Active learners, which are the learners to validate.
• Remaining learnable parameters. At each learning stage, a subset of parameters are learned. After all learning stages, the parameters that have not been learned are called remaining parameters. The remaining parameters are updated in the validation stage.
• Validation datasets: validation datasets of all active learners.
• Active auxiliary datasets, which include the auxiliary datasets on which the interaction function in the validation stage is defined.
• Validation losses, which are defined on remaining learnable parameters, validation datasets, and (optionally) active auxiliary datasets.
• Interaction function, which depicts the interaction between two or more active learners. It is defined on remaining learnable parameters and active auxiliary datasets.
Table 1: Notations in Skillearn
Notation | Description
$A_k$ | The set of active learners in the k-th learning stage
$a^{(k)}_i$ | The i-th active learner in the k-th learning stage
$O_{ki}$ | The number of active parameter sets of the i-th active learner in the k-th learning stage
$W_{kij}$ | The j-th active parameter set of the i-th learner in the k-th learning stage
$W_{ki}$ | The collection of active parameter sets of the i-th active learner in the k-th learning stage
$W_k$ | All active parameter sets in the k-th learning stage
$P_{ki}$ | The number of supporting parameter sets of the i-th active learner in the k-th learning stage
$U_{kij}$ | The j-th supporting parameter set of the i-th active learner in the k-th learning stage
$U_{ki}$ | The collection of supporting parameter sets of the i-th active learner in the k-th learning stage
$U_k$ | All supporting parameter sets in the k-th learning stage
$D^{(tr)}_{ki}$ | Training dataset of the i-th active learner in the k-th learning stage
$D^{(tr)}_k$ | Active training datasets in the k-th learning stage
$F_k$ | Active auxiliary datasets in the k-th learning stage
$L_k$ | Training loss in the k-th learning stage
$I_k$ | Interaction function in the k-th learning stage
[...] Meanwhile, all learners share a common collection of auxiliary datasets $F$, which could be unlabeled datasets used for self-supervised pretraining (He et al., 2019), additional labeled datasets used for validation, and so on. The learner $m$ has one or more sets of learnable parameters [...]. The learnable parameters could be network weights, architectures, hyperparameters, weights of training examples, etc.
We assume there are $K$ learning stages. At each stage $k$, a subset $A_k$ of $M_k$ learners are involved in the learning, which are called active learners. For each active learner $a^{(k)}_i$, a sub-collection of its learnable parameter sets $W_{ki} = \{W_{kij}\}_{j=1}^{O_{ki}}$ are trained at this stage, which are called active learnable parameters. Let $W_k = \{W_{ki} \mid i = 1, \cdots, M_k\}$ denote the active learnable parameters for all active learners. Meanwhile, another sub-collection of its learnable parameter sets $U_{ki} = \{U_{kij}\}_{j=1}^{P_{ki}}$ are used to define the training loss function and interaction function, but $U_{ki}$ are not updated at this stage. They are called supporting learnable parameters. Let $U_k = \{U_{ki} \mid i = 1, \cdots, M_k\}$ denote the supporting learnable parameters for all active learners. Let $D^{(tr)}_k$ denote the active training datasets, consisting of the training dataset of each active learner in $A_k$. Let $F_k$ denote the active auxiliary datasets used in this stage to define the interaction function. The learning activity at stage $k$ is formulated as an optimization problem where the optimization variables are the active learnable parameters and the objective involves 1) a training loss $L_k$ defined on the active training datasets, active learnable parameters, and supporting learnable parameters; 2) (optionally) an interaction function $I_k$ that depicts the interaction between learners in $A_k$. The notations are summarized in Table 1.

The Mathematical Framework for Skillearn
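The formulation of Skillearn is shown in Eq.(1), where $L_{val}$ and $I_{val}$ denote the validation loss and the interaction function in the validation stage, and $D^{(val)}$ and $F_{val}$ denote the validation datasets and active auxiliary datasets in the validation stage.
$$
\begin{array}{ll}
\min_{\{U_i\}_{i=1}^{K}} & L_{val}\big(\{W^*_j(\{U_i\}_{i=1}^{j})\}_{j=1}^{K}, \{U_i\}_{i=1}^{K}, D^{(val)}, F_{val}\big) + \gamma_{val} I_{val}\big(\{W^*_j(\{U_i\}_{i=1}^{j})\}_{j=1}^{K}, \{U_i\}_{i=1}^{K}, F_{val}\big) \\
s.t. & \textrm{Learning stage } K: \; W^*_K(\{U_j\}_{j=1}^{K}) = \arg\min_{W_K} L_K\big(W_K, U_K, \{W^*_j(\{U_i\}_{i=1}^{j})\}_{j=1}^{K-1}, D^{(tr)}_K, F_K\big) + \gamma_K I_K\big(W_K, U_K, \{W^*_j(\{U_i\}_{i=1}^{j})\}_{j=1}^{K-1}, F_K\big) \\
& \quad \vdots \\
& \textrm{Learning stage } 1: \; W^*_1(U_1) = \arg\min_{W_1} L_1(W_1, U_1, D^{(tr)}_1, F_1) + \gamma_1 I_1(W_1, U_1, F_1)
\end{array}
\quad (1)
$$
It is a multi-level optimization framework, which involves $K + 1$ optimization problems. In the constraints are $K$ optimization problems, each corresponding to a learning stage. The $K$ learning stages are ordered. From bottom to top, the optimization problems correspond to learning stages $1, 2, \cdots, K$ respectively. In the optimization problem of learning stage $k$, the optimization variables are the active learnable parameters $W_k$ of all active learners in this stage. The objective function consists of a training loss $L_k(W_k, U_k, \{W^*_j(\{U_i\}_{i=1}^{j})\}_{j=1}^{k-1}, D^{(tr)}_k, F_k)$ defined on the active learnable parameters $W_k$, supporting learnable parameters $U_k$, optimal solutions $\{W^*_j(\{U_i\}_{i=1}^{j})\}_{j=1}^{k-1}$ obtained in previous learning stages, active training datasets $D^{(tr)}_k$, and active auxiliary datasets $F_k$. Typically, $L_k$ can be decomposed into a summation of active learners' individual training losses:
$$L_k\big(W_k, U_k, \{W^*_j(\{U_i\}_{i=1}^{j})\}_{j=1}^{k-1}, D^{(tr)}_k, F_k\big) = \sum_{i=1}^{M_k} L_{ki}\big(W_{ki}, U_{ki}, \{W^*_j(\{U_i\}_{i=1}^{j})\}_{j=1}^{k-1}, D^{(tr)}_{ki}, F_k\big),$$
where $L_{ki}$ is the training loss of the active learner $i$ defined on its active parameters $W_{ki}$, supporting parameters $U_{ki}$, and training dataset $D^{(tr)}_{ki}$. The objective also contains an interaction function $I_k(W_k, U_k, \{W^*_j(\{U_i\}_{i=1}^{j})\}_{j=1}^{k-1}, F_k)$, which depicts how the $M_k$ active learners interact with each other in this learning stage. It is defined on the active learnable parameters $W_k$, supporting learnable parameters $U_k$, optimal solutions in previous stages, and active auxiliary datasets $F_k$. $\gamma_k$ is a tradeoff parameter between the training loss and the interaction function. $U_k$ is needed to define the objective, but it is not updated at this stage. After completing the learning at stage $k$, we obtain the optimal solution $W^*_k(\{U_j\}_{j=1}^{k})$. Note that $W^*_k$ is a function of $\{U_j\}_{j=1}^{k}$ since $W^*_k$ is a function of the objective and the objective is a function of $\{U_j\}_{j=1}^{k}$. $W^*_k(\{U_j\}_{j=1}^{k})$ is used to define the objectives in later stages.
At the very top of Eq.(1), the optimization problem (outside the constraint block) corresponds to the validation stage, which validates the optimal solutions $\{W^*_j(\{U_i\}_{i=1}^{j})\}_{j=1}^{K}$ obtained in the $K$ learning stages. The optimization variables are the remaining learnable parameters $\{U_i\}_{i=1}^{K}$ that have not been learned in the $K$ learning stages. The objective function consists of a validation loss and an interaction function. $\gamma_{val}$ is a tradeoff parameter.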
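As a sanity check on the notation, consider the special case of a single learning stage ($K = 1$) with no interaction functions ($\gamma_1 = \gamma_{val} = 0$) and no auxiliary datasets. Writing the validation loss as $L_{val}$ and the validation data as $D^{(val)}$ (symbols assumed here for illustration), the framework would reduce to a familiar bi-level problem:

```latex
\begin{aligned}
\min_{U_1} \quad & L_{val}\big(W_1^*(U_1),\, U_1,\, D^{(val)}\big) \\
\text{s.t.} \quad & W_1^*(U_1) = \arg\min_{W_1} \; L_1\big(W_1,\, U_1,\, D_1^{(tr)}\big)
\end{aligned}
```

With $W_1$ read as network weights and $U_1$ as an architecture, this has the same shape as a weight-training stage followed by architecture validation; the general framework adds more stages, more learners, and interaction terms on top of this pattern.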

Remarks:
• For simplicity, we assume the optimization problem at each stage is a minimization problem. In general, it can be a more complicated problem, such as a min-max problem.
• At a certain stage, a learnable parameter cannot be simultaneously an active parameter and a supporting parameter. For active parameters in stage k, once learned, they cannot be active parameters or supporting parameters in later stages. For supporting parameters in stage k, they can be active parameters or supporting parameters in later stages.
• The supporting parameters are not learned in previous stages.
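The constraints in these remarks can be checked mechanically. Below is a minimal sketch, assuming each stage is described only by two sets of parameter names (active and supporting); the function name `check_roles` and the string labels are hypothetical.

```python
from typing import List, Set, Tuple


def check_roles(stages: List[Tuple[Set[str], Set[str]]]) -> List[str]:
    """Return violations of the parameter-role rules, given (active, supporting) per stage."""
    violations = []
    learned: Set[str] = set()  # parameters that were active in an earlier stage
    for k, (active, supporting) in enumerate(stages, start=1):
        # A parameter cannot be active and supporting in the same stage.
        for p in sorted(active & supporting):
            violations.append(f"stage {k}: {p} is both active and supporting")
        # Once learned, a parameter cannot be active or supporting again.
        for p in sorted((active | supporting) & learned):
            violations.append(f"stage {k}: {p} was already learned in an earlier stage")
        learned |= active
    return violations


# Roles in the first two learning stages of learning by passing tests:
# weights W then (encoder E, executor X) are learned; A and C stay supporting.
lpt_roles = [({"W"}, {"A"}), ({"E", "X"}, {"C"})]
ok = check_roles(lpt_roles)

# Re-training W in a later stage breaks the second rule.
bad = check_roles([({"W"}, {"A"}), ({"W"}, set())])
```

Supporting parameters are deliberately not added to `learned`, since the remarks allow them to become active or supporting in later stages.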

Optimization Algorithm for Skillearn
In this section, we develop an algorithm to solve the Skillearn problem in Eq.(1), inspired by the algorithm in [...]. For each learning stage $k$ with an optimization problem:
$$W^*_k(\{U_j\}_{j=1}^{k}) = \arg\min_{W_k} L_k\big(W_k, U_k, \{W^*_j(\{U_i\}_{i=1}^{j})\}_{j=1}^{k-1}, D^{(tr)}_k, F_k\big) + \gamma_k I_k\big(W_k, U_k, \{W^*_j(\{U_i\}_{i=1}^{j})\}_{j=1}^{k-1}, F_k\big),$$
[...] and get an approximated objective. When approximating $W^*_l(\{U_j\}_{j=1}^{l})$, we use the gradient of the approximated objective:
[...] (4)
For the objective in the validation stage, it can be approximated as:
[...] (5)
We update the remaining learnable parameters $\{U_i\}_{i=1}^{K}$ by minimizing this approximated objective. The optimization algorithm for Skillearn is summarized in Algorithm 1.

Algorithm 1 Optimization algorithm for Skillearn
[...]
Update the remaining learnable parameters $\{U_i\}_{i=1}^{K}$ by minimizing the approximated objective in Eq.(5)
end
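To make the idea of optimizing through an approximated solution concrete, here is a toy sketch with one learning stage and one validation stage. It assumes that the optimal solution of the learning stage is approximated by a single gradient step, a common choice in gradient-based multi-level optimization and an assumption here rather than a statement of the exact update rule. The quadratic losses, the target value 3, the step sizes `eta` and `beta`, and the function name are all hypothetical.

```python
def solve_toy_bilevel(steps: int = 300, eta: float = 0.1, beta: float = 0.5):
    """Toy two-level problem solved with a one-step approximation.

    Learning stage:   W*(U) = argmin_W (W - U)^2
    Validation stage: min_U (W*(U) - 3)^2
    """
    w, u = 0.0, 0.0
    for _ in range(steps):
        # Approximate W*(U) by one gradient step on the training loss.
        w_approx = w - eta * 2.0 * (w - u)
        # Gradient of the approximated validation loss with respect to U,
        # using the chain rule with d(w_approx)/dU = 2 * eta.
        grad_u = 2.0 * (w_approx - 3.0) * (2.0 * eta)
        # Update the remaining parameter U, then keep the updated W.
        u = u - beta * grad_u
        w = w_approx
    return w, u


w_final, u_final = solve_toy_bilevel()
```

In this toy problem the exact solution is $W = U = 3$, and the alternating updates approach it. In a real model the chain-rule term would be obtained by automatic differentiation rather than by hand.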

Case Study I: Learning by Passing Tests
In this section, we apply our general Skillearn framework to formalize a human learning technique, learning by passing tests, and apply it to improve machine learning. In human learning, an effective and widely used methodology for improving learning outcomes is to let the learner take increasingly difficult tests. To successfully pass a more challenging test, the learner needs to gain better learning ability. By progressively passing tests that have increasing levels of difficulty, the learner strengthens his/her learning capability gradually.
Inspired by this test-driven learning technique of humans, we are interested in investigating whether this methodology is helpful for improving machine learning as well. We use the Skillearn framework to formalize this human learning technique, which results in a novel machine learning framework called learning by passing tests (LPT). In this framework, there are two learners: a "testee" model and a "tester" model. The tester creates a sequence of "tests" with growing levels of difficulty. The testee tries to learn better so that it can pass these increasingly challenging tests. Given a large collection of data examples called "test bank", the tester creates a test $T$ by selecting a subset of examples from the test bank. The testee applies its intermediately-trained model $M$ to make predictions on the examples in $T$. The prediction error rate $R$ reflects how difficult this test is. If the testee can make correct predictions on $T$, it means that $T$ is not difficult enough. The tester will create a more challenging test $T'$ by selecting a new set of examples from the test bank in a way that the new error rate $R'$ achieved by $M$ is larger than $R$. Given this more demanding test $T'$, the testee re-learns its model to pass $T'$, in a way that the newly-learned model $M'$ achieves a new error rate $R''$ on $T'$ where $R''$ is smaller than $R'$. This process iterates until convergence.
In our framework, both the testee and tester perform learning. The testee learns how to best conduct a target task $J_1$ and the tester learns how to create difficult and meaningful tests. To encourage a created test $T$ to be meaningful, the tester trains a model using $T$ to perform a target task $J_2$. If the model performs well on $J_2$, it indicates that $T$ is meaningful. The testee has two sets of learnable parameters: neural architecture and network weights. The tester has three learnable modules: data encoder, test creator, and target-task executor. The learning is organized into three stages. In the first stage, the testee trains its network weights on the training set of task $J_1$ with the architecture fixed. In the second stage, the tester trains its data encoder and target-task executor on a created test to perform the target task $J_2$, with the test creator fixed. In the third stage, the testee updates its model architecture by minimizing the predictive loss $L$ on the test created by the tester; the tester updates its test creator by maximizing $L$ and minimizing the loss on the validation set of $J_2$. The testee and tester interact on the loss function $L$ in an adversarial manner, where the testee minimizes this loss while the tester maximizes it. The three stages are performed jointly end-to-end in a multi-level optimization framework, where a later stage influences an earlier stage and vice versa. We apply our method for neural architecture search (Zoph and Le, 2017; Real et al., 2019) in image classification tasks on CIFAR-100, CIFAR-10, and ImageNet (Deng et al., 2009). Our method achieves significant improvement over state-of-the-art baselines.

Method
In this section, we describe how to instantiate the general Skillearn framework to the LPT framework, and how to instantiate the general optimization procedure of Skillearn to a specialized optimization algorithm for LPT.

Learning by Passing Tests
In the learning by passing tests (LPT) framework, there are two learners: a testee model and a tester model, where the testee studies how to perform a target task $J_1$ such as classification, regression, etc. The eventual goal is to make the testee achieve a better learning outcome with the help of the tester. There is a collection of data examples called "test bank". The tester creates a test by selecting a subset of examples from the test bank. Given a test $T$, the testee applies its intermediately-trained model $M$ to make predictions on $T$ and measures the prediction error rate $R$. From the perspective of the tester, $R$ indicates how difficult the test $T$ is. If $R$ is small, it means that the testee can easily pass this test. Under such circumstances, the tester will create a more difficult test $T'$ such that the new error rate $R'$ achieved by $M$ on $T'$ is larger than $R$. From the testee's perspective, $R'$ indicates how well the testee performs on the test. Given this more difficult test $T'$, the testee refines its model to pass this new test. It aims to learn a new model $M'$ in a way that the error rate $R''$ achieved by $M'$ on $T'$ is smaller than $R'$.
Table 3: Key elements of the first learning stage in LPT under the Skillearn terminology
Active learners | Testee
Active learnable parameters | Network weights of the testee
Supporting learnable parameters | Architecture of the testee
Active training datasets | Training dataset of target-task $J_1$ performed by the testee
Active auxiliary datasets | -
Training loss | Training loss of target-task $J_1$: $L(A, W, D^{(tr)}_{ee})$
Interaction function | -
In our framework, both the testee and the tester perform learning. The testee studies how to best fulfill the target task $J_1$. The tester studies how to create tests that are difficult and meaningful. In the testee's model, there are two sets of learnable parameters: model architecture and network weights. The architecture and weights are both used to make predictions in $J_1$. The tester's model performs two tasks simultaneously: creating tests and performing target-task $J_2$. The model has three modules with learnable parameters: data encoder, test creator, and target-task executor, where the test creator performs the task of generating tests and the target-task executor conducts $J_2$.
The test creator and target-task executor share the same data encoder. The data encoder takes a data example $d$ as input and generates a latent representation for this example. Then the representation is fed into the test creator, which determines whether $d$ should be selected into the test. The representation is also fed into the target-task executor, which makes a prediction on $d$ when performing the target task $J_2$.
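The data flow through the tester's three modules can be sketched as follows. This is a toy illustration under stated assumptions: the encoder is a linear map, both heads are logistic units, the creator's decision is a thresholded sigmoid, and every name (`encode`, `creator_selects`, `executor_predicts`, `create_test`) is hypothetical. A real tester would use neural networks for all three modules.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def sigmoid(t: float) -> float:
    return 1.0 / (1.0 + math.exp(-t))


def encode(E: Sequence[Vector], d: Vector) -> List[float]:
    """Shared data encoder: a linear map from an example to its latent representation."""
    return [dot(row, d) for row in E]


def creator_selects(C: Vector, z: Vector) -> int:
    """Test creator head: 1 if the example should go into the test, else 0."""
    return 1 if sigmoid(dot(C, z)) > 0.5 else 0


def executor_predicts(X: Vector, z: Vector) -> float:
    """Target-task executor head: a probability for a binary target task."""
    return sigmoid(dot(X, z))


def create_test(C: Vector, E: Sequence[Vector], bank: Sequence[Vector]) -> List[Vector]:
    """Collect the test-bank examples that the creator marks with 1."""
    return [d for d in bank if creator_selects(C, encode(E, d)) == 1]


E = [[1.0, 0.0], [0.0, 1.0]]   # identity encoder, for readability
C = [1.0, -1.0]                # creator weights
X = [0.5, 0.5]                 # executor weights
bank = [[2.0, 1.0], [0.0, 1.0], [3.0, 0.0]]
test = create_test(C, E, bank)  # keeps examples whose first feature exceeds the second
```

Both heads read the same latent representation, so any update to the encoder changes which examples are selected as well as how the target task is performed.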
In our framework, the learning of the testee and the tester is organized into three stages. In the first stage, the testee learns its network weights $W$ by minimizing the training loss $L(A, W, D^{(tr)}_{ee})$ defined on the training data $D^{(tr)}_{ee}$ in the task $J_1$. The architecture $A$ is used to define the training loss, but it is not learned in this stage. If $A$ were learned by minimizing this training loss, a trivial solution would be yielded where $A$ is so large and complex that it can perfectly overfit the training data but will generalize poorly on unseen data. Let $W^*(A)$ denote the optimally learned $W$ in this stage. Note that $W^*$ is a function of $A$ because $W^*$ is a function of the training loss and the training loss is a function of $A$. Table 3 shows the key elements of this learning stage under the Skillearn terminology. The testee is the active learner, which performs learning in this stage. Network weights of the testee are the active learnable parameters, which are updated at this stage. The architecture variables of the testee are the supporting learnable parameters, which are used to define the loss function, but are not updated at this stage. Active training datasets include the training data of the task $J_1$ performed by the testee. There are no active auxiliary datasets. The training loss is $L(A, W, D^{(tr)}_{ee})$. There is no interaction function at this stage. The optimization problem is:
$$W^*(A) = \arg\min_{W} L(A, W, D^{(tr)}_{ee}).$$
In the second stage, the tester learns its data encoder $E$ and target-task executor $X$ by minimizing the training loss $L(E, X, D^{(tr)}_{er}) + \gamma L(E, X, \sigma(C, E, D_b))$ in the task $J_2$. The training loss consists of two parts. The first part $L(E, X, D^{(tr)}_{er})$ is defined on the training dataset $D^{(tr)}_{er}$ in $J_2$. The second part $L(E, X, \sigma(C, E, D_b))$ is defined on the test $\sigma(C, E, D_b)$ created by the test creator. Each example $d$ in the test bank $D_b$ is first fed into the encoder $E$, then the creator $C$, which outputs a binary value indicating whether $d$ should be selected into the test. $\sigma(C, E, D_b)$ is the collection of examples whose binary value is equal to 1.
$\gamma$ is a tradeoff parameter between these two parts of the loss. The creator $C$ is used to define the second part of the loss, but it is not learned in this stage. Otherwise, a trivial solution would result in which $C$ always sets the binary value to 0 for every test-bank example so that the second part of the loss becomes 0. Let $E^*(C)$ and $X^*(C)$ denote the optimally trained $E$ and $X$ in this stage. Note that both are functions of $C$ since they depend on the training loss and the training loss is a function of $C$. Table 4 shows the key elements of this learning stage under the Skillearn terminology. The tester is the active learner. The active learnable parameters include the data encoder and target-task executor of the tester. The supporting learnable parameters include the test creator. The active training datasets include the training data of the target task $J_2$ performed by the tester. The active auxiliary datasets include the test bank. The training loss is $L(E, X, D^{(tr)}_{er}) + \gamma L(E, X, \sigma(C, E, D_b))$. There is no interaction function in this stage. The optimization problem is:

$$E^*(C), X^*(C) = \arg\min_{E, X} \; L(E, X, D^{(tr)}_{er}) + \gamma L(E, X, \sigma(C, E, D_b)).$$

In the third stage, the testee learns its architecture by trying to pass the test $\sigma(C, E^*(C), D_b)$ created by the tester. Specifically, the testee aims to minimize the predictive loss of its model on the test:

$$L(A, W^*(A), \sigma(C, E^*(C), D_b)) = \sum_{d \in \sigma(C, E^*(C), D_b)} \ell(A, W^*(A), d),$$

where $d$ is an example in the test and $\ell(A, W^*(A), d)$ is the loss defined on this example. A smaller $L(A, W^*(A), \sigma(C, E^*(C), D_b))$ indicates that the testee performs well on this test.

Meanwhile, the tester learns its test creator $C$ so that $C$ creates tests that are both difficult and meaningful. Difficulty is measured by the testee's predictive loss $L(A, W^*(A), \sigma(C, E^*(C), D_b))$ on the test. Therefore, the tester can learn to create a more challenging test by maximizing this loss. A trivial way to increase the loss is to enlarge the size of the test. But a larger size does not imply more difficulty. To discourage this degenerate solution, we normalize the loss by the size of the test:

$$\frac{1}{|\sigma(C, E^*(C), D_b)|} L(A, W^*(A), \sigma(C, E^*(C), D_b)), \qquad (9)$$

where $|\sigma(C, E^*(C), D_b)|$ is the cardinality of the set $\sigma(C, E^*(C), D_b)$. Under the Skillearn terminology, the loss in Eq. (9) is the interaction function through which the testee and the tester interact. The testee aims to minimize this loss to "pass" the test and the tester aims to maximize this loss to "fail" the testee.

To measure the meaningfulness of a test, we check how well the optimally trained data encoder $E^*(C)$ and task executor $X^*(C)$ of the tester perform on the validation data $D^{(val)}_{er}$ of the target task $J_2$. The performance is measured by the validation loss $L(E^*(C), X^*(C), D^{(val)}_{er})$. $E^*(C)$ and $X^*(C)$ are trained using the test generated by $C$ in the second stage. If the validation loss is small, the created test is helpful in training the task executor and is therefore considered meaningful. To create a meaningful test, the tester learns $C$ by minimizing $L(E^*(C), X^*(C), D^{(val)}_{er})$. In sum, $C$ is learned by maximizing

$$\frac{1}{|\sigma(C, E^*(C), D_b)|} L(A, W^*(A), \sigma(C, E^*(C), D_b)) - \lambda L(E^*(C), X^*(C), D^{(val)}_{er}), \qquad (10)$$

where $\lambda$ is a tradeoff parameter between these two objectives. Under the Skillearn terminology, this stage is a validation stage. Table 5 summarizes the key elements of this stage. The active learners include both the testee and the tester. The remaining learnable parameters include the architecture of the testee and the test creator of the tester. The validation datasets include the validation data of the target task $J_2$ performed by the tester. The active auxiliary datasets include the test bank.

Table 5 (validation stage of LPT)
Active learners: Testee, tester
Remaining learnable parameters: 1) Architecture of the testee; 2) Test creator of the tester
Validation datasets: Validation dataset of the tester
Active auxiliary datasets: Test bank
Interaction function: Testee's prediction loss defined on the test created by the tester

Table 6 (instantiation of Skillearn to LPT)
Interaction function: Testee's prediction loss defined on the test created by the tester
Learning stage I: The testee learns its network weights on its training data: $W^*(A) = \arg\min_{W} L(A, W, D^{(tr)}_{ee})$
Learning stage II: The tester uses its test creator to select a subset of examples from the test bank, then learns its data encoder and target-task executor on its training data and on the selected examples: $E^*(C), X^*(C) = \arg\min_{E, X} L(E, X, D^{(tr)}_{er}) + \gamma L(E, X, \sigma(C, E, D_b))$
Validation stage: 1) The testee updates its architecture to minimize the prediction loss on the test created by the tester; 2) The tester updates its test creator to maximize the testee's prediction loss and minimize its own validation loss
Datasets: 1) Training data of the testee; 2) Training data of the tester; 3) Validation data of the tester; 4) Test bank
The three stages are mutually dependent: $W^*(A)$ learned in the first stage and $E^*(C)$ and $X^*(C)$ learned in the second stage are used to define the objective function in the third stage; the updated $C$ and $A$ in the third stage in turn change the objective functions in the first and second stages, which subsequently causes $W^*(A)$, $E^*(C)$, and $X^*(C)$ to change. Putting these pieces together, we instantiate the Skillearn framework into the following LPT formulation:

$$\begin{aligned}
\max_{C} \min_{A} \quad & \frac{1}{|\sigma(C, E^*(C), D_b)|} L(A, W^*(A), \sigma(C, E^*(C), D_b)) - \lambda L(E^*(C), X^*(C), D^{(val)}_{er}) \\
\text{s.t.} \quad & E^*(C), X^*(C) = \arg\min_{E, X} \; L(E, X, D^{(tr)}_{er}) + \gamma L(E, X, \sigma(C, E, D_b)) \\
& W^*(A) = \arg\min_{W} L(A, W, D^{(tr)}_{ee})
\end{aligned} \qquad (11)$$

This formulation nests three optimization problems. The constraints of the outer optimization problem are two inner optimization problems, corresponding to the first and second learning stages respectively. The objective function of the outer optimization problem corresponds to the validation stage. Table 6 summarizes the instantiation of Skillearn to LPT.
As of now, the test $\sigma(C, E, D_b)$ is represented as a subset, which is highly discrete and therefore difficult to optimize. To address this problem, we perform a continuous relaxation of $\sigma(C, E, D_b)$:

$$\sigma(C, E, D_b) = \{(d, f(d, C, E)) \mid d \in D_b\},$$

where for each example $d$ in the test bank, the original binary value indicating whether $d$ should be selected is relaxed to a continuous probability $f(d, C, E)$ representing how likely $d$ is to be selected. Under this relaxation, $L(E, X, \sigma(C, E, D_b))$ can be computed as follows:

$$L(E, X, \sigma(C, E, D_b)) = \sum_{d \in D_b} f(d, C, E) \, \ell(E, X, d),$$

where we calculate the loss $\ell(E, X, d)$ on each test-bank example and weight it by $f(d, C, E)$. If $f(d, C, E)$ is small, $d$ is less likely to be selected into the test and its loss is down-weighted. Similarly, $L(A, W^*(A), \sigma(C, E^*(C), D_b))$ is computed by weighting the testee's loss $\ell(A, W^*(A), d)$ on each test-bank example by $f(d, C, E^*(C))$.

Similar to prior work, we represent the architecture $A$ of the testee in a differentiable way. The search space of $A$ is composed of a large number of building blocks. The output of each block is associated with a variable $a$ indicating how important this block is. After learning, the blocks with the largest values of $a$ are retained to form the final architecture. In the end, architecture search amounts to optimizing the set of architecture variables $A = \{a\}$.
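The relaxed test loss and its size normalization can be illustrated with a few lines of Python. This is a minimal sketch rather than the authors' implementation: `select_prob` stands in for the selection probabilities $f(d, C, E)$, `per_example_loss` for the per-example losses, and treating the relaxed test size as the sum of the selection probabilities is an assumption made only for this example.

```python
def relaxed_test_loss(select_prob, per_example_loss):
    """Weighted loss over the test bank: sum_d f(d) * loss(d)."""
    return sum(f * l for f, l in zip(select_prob, per_example_loss))


def normalized_test_loss(select_prob, per_example_loss):
    """Relaxed loss divided by the (assumed) relaxed test size sum_d f(d)."""
    size = sum(select_prob)
    if size == 0:
        raise ValueError("selection probabilities sum to zero")
    return relaxed_test_loss(select_prob, per_example_loss) / size


# Hypothetical values for a test bank of four examples.
probs = [0.9, 0.1, 0.5, 0.5]
losses = [2.0, 0.2, 1.0, 3.0]
print(relaxed_test_loss(probs, losses))
print(normalized_test_loss(probs, losses))
```

Under this assumed normalization, scaling all selection probabilities by the same factor leaves the normalized loss unchanged, which mirrors the argument that a larger test is not by itself a harder one.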

Optimization Algorithm
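In this section, we instantiate the general optimization framework in Section 2.3 to derive an optimization algorithm for LPT. We approximate $E^*(C)$ and $X^*(C)$ using a one-step gradient descent update of $E$ and $X$ with respect to $L(E, X, D^{(tr)}_{er}) + \gamma L(E, X, \sigma(C, E, D_b))$, and approximate $W^*(A)$ using a one-step gradient descent update of $W$ with respect to $L(A, W, D^{(tr)}_{ee})$. We then plug these approximations into the LPT objective and perform gradient-based updates of $C$ and $A$ with respect to this approximated objective. In the sequel, we use $\nabla^2_{Y,X} f(X, Y)$ to denote $\partial^2 f(X, Y) / \partial X \partial Y$.

Approximating $W^*(A)$ using $W' = W - \xi_{ee} \nabla_W L(A, W, D^{(tr)}_{ee})$, where $\xi_{ee}$ is a learning rate, and simplifying the notation of $\sigma(C, E^*(C), D_b)$ as $\sigma$, we can calculate the approximated gradient of $L(A, W^*(A), \sigma)$ w.r.t. $A$ as:

$$\begin{aligned}
& \nabla_A L(A, W^*(A), \sigma) \\
\approx \; & \nabla_A L(A, W - \xi_{ee} \nabla_W L(A, W, D^{(tr)}_{ee}), \sigma) \\
= \; & \nabla_A L(A, W', \sigma) - \xi_{ee} \nabla^2_{A,W} L(A, W, D^{(tr)}_{ee}) \, \nabla_{W'} L(A, W', \sigma).
\end{aligned}$$

The second term in the third line involves an expensive matrix-vector product, whose computational cost can be reduced by a finite difference approximation:

$$\nabla^2_{A,W} L(A, W, D^{(tr)}_{ee}) \, \nabla_{W'} L(A, W', \sigma) \approx \frac{1}{2\alpha_{ee}} \left( \nabla_A L(A, W^+, D^{(tr)}_{ee}) - \nabla_A L(A, W^-, D^{(tr)}_{ee}) \right), \qquad (17)$$

where $W^{\pm} = W \pm \alpha_{ee} \nabla_{W'} L(A, W', \sigma)$ and $\alpha_{ee}$ is a small scalar that equals $0.01 / \|\nabla_{W'} L(A, W', \sigma)\|_2$.

We approximate $E^*(C)$ and $X^*(C)$ using the following one-step gradient descent updates of $E$ and $X$ respectively:

$$\begin{aligned}
E' &= E - \xi_E \nabla_E \left[ L(E, X, D^{(tr)}_{er}) + \gamma L(E, X, \sigma(C, E, D_b)) \right], \\
X' &= X - \xi_X \nabla_X \left[ L(E, X, D^{(tr)}_{er}) + \gamma L(E, X, \sigma(C, E, D_b)) \right],
\end{aligned}$$

where $\xi_E$ and $\xi_X$ are learning rates. Plugging these approximations into the objective function in Eq.(15), we can learn $C$ by maximizing the following objective using gradient methods:

$$\frac{L(A, W', \sigma(C, E', D_b))}{|\sigma(C, E', D_b)|} - \lambda L(E', X', D^{(val)}_{er}).$$

The derivative of the second term in this objective with respect to $C$ can be calculated as:

$$\nabla_C L(E', X', D^{(val)}_{er}) = \frac{\partial E'}{\partial C} \nabla_{E'} L(E', X', D^{(val)}_{er}) + \frac{\partial X'}{\partial C} \nabla_{X'} L(E', X', D^{(val)}_{er}),$$

where

$$\frac{\partial E'}{\partial C} = -\xi_E \gamma \nabla^2_{C,E} L(E, X, \sigma(C, E, D_b)), \qquad (21)$$

$$\frac{\partial X'}{\partial C} = -\xi_X \gamma \nabla^2_{C,X} L(E, X, \sigma(C, E, D_b)).$$

Similar to Eq.(17), using finite difference approximations to calculate $\nabla^2_{C,E} L(E, X, \sigma(C, E, D_b)) \nabla_{E'} L(E', X', D^{(val)}_{er})$ and $\nabla^2_{C,X} L(E, X, \sigma(C, E, D_b)) \nabla_{X'} L(E', X', D^{(val)}_{er})$, we have:

$$\begin{aligned}
\nabla_C L(E', X', D^{(val)}_{er}) \approx \; & -\gamma \xi_E \frac{\nabla_C L(E^+, X, \sigma(C, E^+, D_b)) - \nabla_C L(E^-, X, \sigma(C, E^-, D_b))}{2\alpha_E} \\
& -\gamma \xi_X \frac{\nabla_C L(E, X^+, \sigma(C, E, D_b)) - \nabla_C L(E, X^-, \sigma(C, E, D_b))}{2\alpha_X},
\end{aligned}$$

where $E^{\pm} = E \pm \alpha_E \nabla_{E'} L(E', X', D^{(val)}_{er})$, $X^{\pm} = X \pm \alpha_X \nabla_{X'} L(E', X', D^{(val)}_{er})$, and $\alpha_E$ and $\alpha_X$ are small scalars.

For the first term $L(A, W', \sigma(C, E', D_b)) / |\sigma(C, E', D_b)|$ in the objective, we can use the chain rule to calculate its derivative w.r.t. $C$, which involves calculating the derivatives of $L(A, W', \sigma(C, E', D_b))$ and $|\sigma(C, E', D_b)|$ w.r.t. $C$. The derivative of $L(A, W', \sigma(C, E', D_b))$ w.r.t. $C$ can be calculated as:

$$\frac{d}{dC} L(A, W', \sigma(C, E', D_b)) = \nabla_C L(A, W', \sigma(C, E', D_b)) + \frac{\partial E'}{\partial C} \nabla_{E'} L(A, W', \sigma(C, E', D_b)),$$

where the first term on the right-hand side treats $E'$ as fixed, $\frac{\partial E'}{\partial C}$ is given in Eq.(21), and $\nabla^2_{C,E} L(E, X, \sigma(C, E, D_b)) \times \nabla_{E'} L(A, W', \sigma(C, E', D_b))$ can be approximated with

$$\frac{1}{2\alpha_E} \left( \nabla_C L(E^+, X, \sigma(C, E^+, D_b)) - \nabla_C L(E^-, X, \sigma(C, E^-, D_b)) \right),$$

where here $E^{\pm} = E \pm \alpha_E \nabla_{E'} L(A, W', \sigma(C, E', D_b))$. The derivative of $|\sigma(C, E', D_b)|$ w.r.t. $C$ follows from the same chain rule, where $\frac{\partial E'}{\partial C}$ is given in Eq.(21). The algorithm for solving LPT is summarized in Algorithm 2.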
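The finite-difference trick used in this section can be checked numerically on a toy problem. The sketch below assumes a hypothetical training loss $L(a, w) = \sum_i a_i w_i^2$, for which the product of the mixed second derivative with a vector is known in closed form. The function names are illustrative and are not part of LPT.

```python
import math


def grad_a(a, w):
    """Gradient of the toy loss L(a, w) = sum_i a_i * w_i**2 w.r.t. a."""
    return [wi ** 2 for wi in w]


def mixed_hvp_exact(w, v):
    """Exact product of the mixed second derivative with v: 2 * w_i * v_i."""
    return [2.0 * wi * vi for wi, vi in zip(w, v)]


def mixed_hvp_finite_diff(a, w, v, scale=0.01):
    """Central-difference approximation using two gradient evaluations."""
    norm = math.sqrt(sum(vi ** 2 for vi in v))
    alpha = scale / norm  # small step, 0.01 / ||v||_2 as in the text
    w_plus = [wi + alpha * vi for wi, vi in zip(w, v)]
    w_minus = [wi - alpha * vi for wi, vi in zip(w, v)]
    g_plus, g_minus = grad_a(a, w_plus), grad_a(a, w_minus)
    return [(gp - gm) / (2.0 * alpha) for gp, gm in zip(g_plus, g_minus)]


a = [0.3, 0.7]
w = [1.0, -2.0]
v = [0.5, 0.25]  # plays the role of the gradient w.r.t. W'
approx = mixed_hvp_finite_diff(a, w, v)
exact = mixed_hvp_exact(w, v)
print(approx, exact)
```

The approximation needs only two extra gradient evaluations instead of forming the second-derivative matrix, which is the reason the text prefers it.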

Experiments
We apply LPT to neural architecture search in image classification tasks. Following prior work, we first perform architecture search, which finds an optimal cell, and then perform architecture evaluation, which composes multiple copies of the searched cell into a large network, trains it from scratch, and evaluates the trained model on the test set. We let the target task of the learner and that of the tester be the same.

Datasets
We used three datasets in the experiments: CIFAR-10, CIFAR-100, and ImageNet (Deng et al., 2009). The CIFAR-10 dataset contains 50K training images and 10K testing images from 10 classes (the number of images in each class is equal). Following prior work, we split the original 50K training set into a new 25K training set and a 25K validation set. In the sequel, "training set" always refers to the new 25K training set. During architecture search, the training set is used as the training data $D^{(tr)}_{er}$ of the tester. During architecture evaluation, the combination of the training data and validation data is used to train the large network that stacks multiple copies of the searched cell. The CIFAR-100 dataset contains 50K training images and 10K testing images from 100 classes (the number of images in each class is equal). As with CIFAR-10, the 50K training images are split into a 25K training set and a 25K validation set. The new training set and validation set are used in the same way as for CIFAR-10. The ImageNet dataset contains a training set of 1.2M images and a validation set of 50K images from 1000 object classes. The validation set is used as a test set for architecture evaluation. Following prior work, we evaluate the architectures searched on CIFAR-10 and CIFAR-100 on ImageNet: given a cell searched on CIFAR-10 or CIFAR-100, multiple copies of it compose a large network, which is then trained on the 1.2M training images of ImageNet and evaluated on the 50K test images.
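The 25K/25K split of the original CIFAR training set can be sketched as an index split. The source does not state how the split is drawn, so the uniform random shuffle and the fixed seed below are assumptions, and the function name is hypothetical.

```python
import random


def split_train_val(num_examples=50_000, num_train=25_000, seed=0):
    """Shuffle example indices and split them into training and validation."""
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)  # assumed: uniform random split
    return indices[:num_train], indices[num_train:]


train_idx, val_idx = split_train_val()
print(len(train_idx), len(val_idx))
```

The two index lists are disjoint and together cover the original training set, so recombining them for architecture evaluation recovers all 50K images.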

Experimental Settings
Our framework is a general one that can be used together with any differentiable search method. Specifically, we apply our framework to the following NAS methods: 1) DARTS, 2) P-DARTS, 3) DARTS+ (Liang et al., 2019b), 4) DARTS− (Chu et al., 2020a). The search spaces in these methods are similar. The candidate operations include: 3 × 3 and 5 × 5 separable convolutions, 3 × 3 and 5 × 5 dilated separable convolutions, 3 × 3 max pooling, 3 × 3 average pooling, identity, and zero. In LPT, the network of the learner is a stack of multiple cells, each consisting of 7 nodes. For the data encoder of the tester, we tried ResNet-18 and ResNet-50 (He et al., 2016b). The test creator and the target-task executor are each a single feed-forward layer. λ and γ are both set to 1.
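To make the search space concrete, the sketch below lists the eight candidate operations and picks one operation for an edge from its architecture variables. The operation labels and the function name are hypothetical, and skipping the zero operation when discretizing is a convention assumed here from common differentiable NAS practice, not something the text specifies.

```python
# Hypothetical labels for the eight candidate operations listed above.
CANDIDATE_OPS = [
    "sep_conv_3x3", "sep_conv_5x5", "dil_conv_3x3", "dil_conv_5x5",
    "max_pool_3x3", "avg_pool_3x3", "identity", "zero",
]


def pick_operation(arch_weights):
    """Return the candidate with the largest weight, skipping 'zero'."""
    scored = [
        (w, op) for w, op in zip(arch_weights, CANDIDATE_OPS) if op != "zero"
    ]
    return max(scored)[1]


# Example architecture variables for one edge (made-up numbers).
weights = [0.1, 0.2, 0.05, 0.05, 0.1, 0.1, 0.15, 0.9]
print(pick_operation(weights))
```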
For CIFAR-10 and CIFAR-100, during architecture search, the learner's network is a stack of 8 cells, with the initial channel number set to 16. The search is performed for 50 epochs with a batch size of 64. The hyperparameters for the learner's architecture and weights are set in the same way as in DARTS, P-DARTS, DARTS+, and DARTS−. The data encoder and target-task executor of the tester are optimized using SGD with a momentum of 0.9 and a weight decay of 3e-4. The initial learning rate is set to 0.025 with a cosine decay scheduler. The test creator is optimized with the Adam (Kingma and Ba, 2014) optimizer with a learning rate of 3e-4 and a weight decay of 1e-3. During architecture evaluation, 20 copies of the searched cell are stacked to form the learner's network, with the initial channel number set to 36. The network is trained for 600 epochs with a batch size of 96 (for both CIFAR-10 and CIFAR-100). The experiments are performed on a single Tesla V100. For ImageNet, following prior work, we take the architecture searched on CIFAR-10 and evaluate it on ImageNet. We stack 14 cells (searched on CIFAR-10) to form a large network and set the initial channel number to 48. The network is trained for 250 epochs with a batch size of 1024 on 8 Tesla V100s. Each LPT experiment is repeated ten times with random seeds from 1 to 10. We report the mean and standard deviation of the results obtained from the 10 runs.

Table 7 shows the classification error (%), number of weight parameters (millions), and search cost (GPU days) of different NAS methods on CIFAR-100. From this table, we make the following observations. First, when our method LPT is applied to different NAS baselines including DARTS-1st (first-order approximation), DARTS-2nd (second-order approximation), DARTS−, DARTS+, and P-DARTS, the classification errors of these baselines are significantly reduced. For example, applying our method to P-DARTS reduces the error from 17.49% to 16.28%.
Applying our method to DARTS-2nd reduces the error from 20.58% to 18.40%. This demonstrates the effectiveness of our method in searching for a better architecture. In our method, the learner continuously improves its architecture by passing the tests created by the tester with increasing levels of difficulty. These tests help the learner identify the weaknesses of its architecture and provide guidance on how to improve it. Our method creates a new test on the fly based on how the learner performed in the previous round. From the test bank, the tester selects a subset of difficult examples to evaluate the learner. This new test poses a greater challenge to the learner and encourages the learner to improve its architecture so that it can overcome the new challenge. In contrast, in baseline NAS approaches, a single fixed validation set is used to evaluate the learner. The learner can achieve good performance via "cheating": focusing on performing well on the majority of easy examples and ignoring the minority of difficult examples. As a result, the learner's architecture does not have the ability to deal with such difficult examples.

Notes to the results tables: 1) [...] first-order approximation; 2) In our run of DARTS−, the performance reported in (Chu et al., 2020a) cannot be achieved; 3) In our run of DARTS+, in the architecture evaluation stage, we set the number of epochs to 600 instead of the 2000 used in (Liang et al., 2019a), to ensure a fair comparison with other methods (where the epoch number is 600).

Table 8 shows the classification error (%), number of weight parameters (millions), and search cost (GPU days) of different NAS methods on CIFAR-10. As can be seen, applying our proposed LPT to DARTS-1st, DARTS-2nd, DARTS−, and DARTS+ significantly reduces the errors of these baselines. For example, with LPT, the error of DARTS-2nd is reduced from 2.76% to 2.68%. This further demonstrates the efficacy of our method in searching for better-performing architectures, by creating tests with increasing levels of difficulty and improving the learner through taking these tests.

Table 9 shows the results on ImageNet, including top-1 and top-5 classification errors on the test set, number of weight parameters (millions), and search costs (GPU days). Following prior work, we take the architecture searched by LPT-R18-DARTS-2nd on CIFAR-10 and evaluate it on ImageNet. As can be seen, applying our LPT method to DARTS-2nd reduces the top-1 error from 26.7% to 25.3% and the top-5 error from 8.7% to 7.9%, without increasing the search cost or the number of parameters. This further demonstrates the effectiveness of our method.

Ablation Studies
In order to evaluate the effectiveness of individual modules in LPT, we compare the full LPT framework with the following ablation settings.
• Ablation setting 1. In this setting, the tester creates tests solely by maximizing their level of difficulty, without considering their meaningfulness. Accordingly, the second stage in LPT, where the tester learns to perform a target task by leveraging the created tests, is removed. The tester directly learns a selection scalar $s(d) \in [0, 1]$ for each example $d$ in the test bank without going through a data encoder or a test creator. The corresponding formulation is: $\max_{S} \min_{A} \frac{1}{|\sigma(S, D_b)|} L(A, W^*(A), \sigma(S, D_b))$ s.t. $W^*(A) = \arg\min_{W} L(A, W, D^{(tr)}_{ee})$, where $S = \{s(d) \mid d \in D_b\}$. In this study, λ and γ are both set to 1. The data encoder of the tester is ResNet-18. For CIFAR-100, to avoid performance collapse because of skip connections, LPT is applied to P-DARTS. For CIFAR-10, LPT is applied to DARTS-2nd.
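• Ablation setting 2. In this setting, in the second stage of LPT, the tester is trained solely on the created test, without using the training data of the target task. The corresponding formulation is: $\max_{C} \min_{A} \frac{1}{|\sigma(C, E^*(C), D_b)|} L(A, W^*(A), \sigma(C, E^*(C), D_b)) - \lambda L(E^*(C), X^*(C), D^{(val)}_{er})$ s.t. $E^*(C), X^*(C) = \arg\min_{E, X} L(E, X, \sigma(C, E, D_b))$ and $W^*(A) = \arg\min_{W} L(A, W, D^{(tr)}_{ee})$. (26) In this study, λ and γ are both set to 1. The data encoder of the tester is ResNet-18. For CIFAR-100, to avoid performance collapse because of skip connections, LPT is applied to P-DARTS. For CIFAR-10, LPT is applied to DARTS-2nd.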
• Ablation study on λ. We are interested in how the learner's performance varies as the tradeoff parameter λ in Eq.(11) increases. In this study, the other tradeoff parameter γ in Eq.(11) is set to 1. For both CIFAR-100 and CIFAR-10, we randomly sample 5K examples from the 25K training and 25K validation data, and use them as a test set to report performance in this ablation study. The remaining 45K examples (22.5K training data and 22.5K validation data) are used for architecture search and evaluation. The tester's data encoder is ResNet-18. LPT is applied to P-DARTS.
• Ablation study on γ. We investigate how the learner's performance varies as γ increases. In this study, the other tradeoff parameter λ is set to 1. As in the ablation study on λ, we report performance on 5K randomly sampled test examples for architectures searched and evaluated on the remaining 45K examples. The tester's data encoder is ResNet-18. LPT is applied to P-DARTS.

Method                                      Error (%)
Difficulty only (CIFAR-100)                 18.12±0.11
Difficulty + meaningfulness (CIFAR-100)     17.18±0.12
Difficulty only (CIFAR-10)                  2.79±0.06
Difficulty + meaningfulness (CIFAR-10)      2.72±0.07
Table 10: Results for ablation setting 1. "Difficulty only" denotes that the tester creates tests solely by maximizing their level of difficulty, without considering their meaningfulness, i.e., the tester does not use the tests to learn to perform the target task. "Difficulty + meaningfulness" denotes the full LPT framework where the tester creates tests by maximizing both difficulty and meaningfulness.

Method                                      Error (%)
Test only (CIFAR-100)                       17.54±0.07
Test + Training data (CIFAR-100)            17.18±0.12
Test only (CIFAR-10)                        2.75±0.03
Test + Training data (CIFAR-10)             2.72±0.07
Table 11: Results for ablation setting 2. "Test only" denotes that the tester is trained only on the created test to perform the target task. "Test + Training data" denotes that the tester is trained using both the test and the training data of the target task.

Table 10 shows the results for ablation setting 1. As can be seen, on both CIFAR-10 and CIFAR-100, creating tests that are both difficult and meaningful is better than creating tests solely by maximizing difficulty. The reason is that a difficult test could be composed of bad-quality examples such as outliers and incorrectly labeled examples. Even a highly accurate learner model cannot achieve good performance on such erratic examples. To address this problem, it is necessary to make the created tests meaningful. LPT achieves meaningfulness of the tests by making the tester leverage the created tests to perform the target task. The results demonstrate that this is an effective way of improving meaningfulness.

Table 11 shows the results for ablation setting 2. As can be seen, for both CIFAR-100 and CIFAR-10, using both the created test and the training data of the target task to train the tester performs better than using the test only. By leveraging the training data, the data encoder can be better trained, and a better encoder can help to create higher-quality tests.

Figure 4 shows how classification errors change as λ increases. As can be seen, on both CIFAR-100 and CIFAR-10, when λ increases from 0.1 to 0.5, the error decreases. However, further increasing λ causes the error to increase. From the tester's perspective, λ controls a tradeoff between difficulty and meaningfulness of the tests. Increasing λ encourages the tester to create tests that are more meaningful. More meaningful tests can evaluate the learner more reliably. However, if λ is too large, the tests are biased to be more meaningful and less difficult.
Lacking enough difficulty, the tests may not be compelling enough to drive the learner to improve. Such a tradeoff effect is observed in the results on CIFAR-10 as well. Figure 5 shows how classification errors change as γ increases. As can be seen, on both CIFAR-100 and CIFAR-10, when γ increases from 0.1 to 0.5, the error decreases. However, further increasing γ causes the error to increase. Under a larger γ, the created test plays a larger role in training the tester to perform the target task. This implicitly encourages the test creator to generate tests that are more meaningful. However, if γ is too large, the training is dominated by the created test, which incurs the following risk: if the test is not meaningful, it will result in a poor-quality data encoder, which further degrades the quality of test creation.

Summary
In this section, we apply Skillearn to formalize a skill in human learning, learning by passing tests (LPT), and use it for neural architecture search. In LPT, a tester model creates a sequence of tests with growing levels of difficulty. A learner model continuously improves its learning ability by striving to pass these increasingly challenging tests. The tester learns to select hard validation examples that cause the learner to make large prediction errors, and the learner refines its model to rectify these prediction errors. Our framework achieves significant improvement in neural architecture search on CIFAR-100, CIFAR-10, and ImageNet.

Case Study II: Interleaving Learning
In this section, we instantiate our general Skillearn framework to formalize another human learning technique, interleaving learning, and apply it to improve machine learning. Interleaving learning is a learning technique where a learner interleaves the studies of multiple topics: study topic A for a while, then switch to B, subsequently to C; then switch back to A, and so on, forming a pattern of ABCABCABC · · · . Interleaving learning is in contrast to blocked learning, which studies one topic very thoroughly before moving to another topic. Compared with blocked learning, interleaving learning increases long-term retention and improves the ability to transfer learned knowledge.
We are interested in investigating whether the interleaving strategy is helpful for training machine learning models. We instantiate the Skillearn framework to an interleaving learning (IL) framework. We assume there are $K$ learning tasks, each performed by a learner model. Each learner has a data encoder and a task-specific head. The data encoders of all learners share the same architecture, but may have different weight parameters. The $K$ learners perform $M$ rounds of interleaving learning with the following order:

$$\underbrace{l_1, l_2, \cdots, l_K}_{\text{Round } 1} \; \underbrace{l_1, l_2, \cdots, l_K}_{\text{Round } 2} \; \cdots \; \underbrace{l_1, l_2, \cdots, l_K}_{\text{Round } m} \; \cdots \; \underbrace{l_1, l_2, \cdots, l_K}_{\text{Round } M}$$

where $l_k$ denotes that the $k$-th learner performs learning. In the first round, we first learn $l_1$, then learn $l_2$, and so on. At the end of the first round, $l_K$ is learned. Then we move to the second round, which starts with learning $l_1$, then learns $l_2$, and so on. This pattern repeats until the $M$ rounds of learning are finished. Between two consecutive learners $l_k$ and $l_{k+1}$, the encoder weights of the latter learner $l_{k+1}$ are encouraged to be close to the optimally learned encoder weights of the former learner $l_k$.
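The ordering of learning stages can be sketched as a simple enumeration. The function names below are hypothetical, and the wrap-around rule (the first learner of a round follows the last learner of the previous round) is an assumption based on the description that each stage is coupled to the stage immediately before it.

```python
def interleaving_schedule(num_learners, num_rounds):
    """List (stage, round, learner) triples in interleaving order, 1-indexed."""
    schedule = []
    stage = 1
    for m in range(1, num_rounds + 1):
        for k in range(1, num_learners + 1):
            schedule.append((stage, m, k))
            stage += 1
    return schedule


def previous_stage(m, k, num_learners):
    """Return the (round, learner) whose encoder regularizes learner k in round m."""
    if m == 1 and k == 1:
        return None  # the very first stage has no predecessor
    return (m, k - 1) if k > 1 else (m - 1, num_learners)


for stage, m, k in interleaving_schedule(num_learners=3, num_rounds=2):
    print(stage, m, k, previous_stage(m, k, 3))
```

With $K$ learners and $M$ rounds this enumerates $M \times K$ stages in total.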

Method
In this section, we present the details of the interleaving learning framework. There are $K$ learners. Each learner learns to perform a task. These tasks could be the same, e.g., image classification on CIFAR-10; or different, e.g., image classification on CIFAR-10, image classification on ImageNet (Deng et al., 2009), object detection on MS-COCO (Lin et al., 2014), etc. Each learner $k$ has a training dataset $D^{(tr)}_k$ and a validation dataset $D^{(val)}_k$. Each learner has a data encoder and a task-specific head performing the target task. For example, if the task is image classification, the data encoder could be a convolutional neural network extracting visual features of the input images, and the task-specific head could be a multi-layer perceptron which takes the visual features of an image extracted by the data encoder as input and predicts the class label of this image. We assume the architecture of the data encoder in each learner is learnable. The data encoders of all learners share the same architecture, but their weight parameters could be different in different learners. The architectures of task-specific heads are manually designed and could be different in different learners.

Table 12: Notations in interleaving learning
$\widetilde{H}^{(m)}_k$: The optimal weight parameters of the task-specific head in the $k$-th learner in the $m$-th round
$\gamma$: Tradeoff parameter

The $K$ learners perform $M$ rounds of interleaving learning with the following order:

$$\underbrace{l_1, l_2, \cdots, l_K}_{\text{Round } 1} \; \underbrace{l_1, l_2, \cdots, l_K}_{\text{Round } 2} \; \cdots \; \underbrace{l_1, l_2, \cdots, l_K}_{\text{Round } m} \; \cdots \; \underbrace{l_1, l_2, \cdots, l_K}_{\text{Round } M}$$

where $l_k$ denotes that the $k$-th learner performs learning. In the first round, we first learn $l_1$, then learn $l_2$, and so on. At the end of the first round, $l_K$ is learned. Then we move to the second round, which starts with learning $l_1$, then learns $l_2$, and so on. This pattern repeats until the $M$ rounds of learning are finished.
Between two consecutive learners $l_k$ and $l_{k+1}$, the encoder weights of the latter learner $l_{k+1}$ are encouraged to be close to the optimally learned encoder weights of the former learner $l_k$. For each learner, the architecture of its encoder remains the same across all rounds; the weights of the encoder and head can differ between rounds. Each learner $k$ has the following learnable parameter sets: 1) the architecture $A$ of the encoder; 2) in each round $m$, a set of encoder weights $W^{(m)}_k$ specific to this round; 3) in each round $m$, a set of task-specific head weights $H^{(m)}_k$ specific to this round. The encoders of all learners share the same architecture and this architecture remains the same in different rounds. The encoders of different learners have different weight parameters. The weight parameters of a learner's encoder are different in different rounds. Different learners have different task-specific heads in terms of both architectures and weight parameters. In the interleaving process, the learning of the $k$-th learner is assisted by the $(k-1)$-th learner. Specifically, during learning, the encoder weights $W_k$ of the $k$-th learner are encouraged to be close to the optimal encoder weights $\widetilde{W}_{k-1}$ of the $(k-1)$-th learner. This is achieved by minimizing an interaction function: $\|W_k - \widetilde{W}_{k-1}\|_2^2$.

There are $M \times K$ learning stages: in each of the $M$ rounds, each of the $K$ learners is learned in a stage. In the very first learning stage, the first learner in the first round is learned. It trains the weight parameters of its data encoder and the weight parameters of its task-specific head on its training dataset. In this learning stage (Table 13), the active learner is the first learner. The active learnable parameters are the weight parameters of the data encoder and the weight parameters of the task-specific head in the first learner in the first round. The supporting learnable parameters include the encoder architecture shared by all learners. The active training dataset is the training data of the first learner. There is no auxiliary dataset. The training loss is the target-task loss defined on the training dataset of the first learner: $L(A, W^{(1)}_1, H^{(1)}_1, D^{(tr)}_1)$. The optimization problem is:

$$\widetilde{W}^{(1)}_1(A) = \arg\min_{W^{(1)}_1, H^{(1)}_1} L(A, W^{(1)}_1, H^{(1)}_1, D^{(tr)}_1).$$

In this optimization problem, $A$ is not learned. After learning, the optimal head is discarded. The optimal encoder weights $\widetilde{W}^{(1)}_1(A)$ are a function of $A$ since the training loss is a function of $A$ and $\widetilde{W}^{(1)}_1$ is a function of the training loss.

Table 13: Learning stage 1 in interleaving learning
Active learners: The first learner
Active learnable parameters: Weights of the data encoder and weights of the task-specific head in the first learner
Supporting learnable parameters: Encoder architecture shared by all learners
Active training datasets: Training dataset of the first learner
Active auxiliary datasets: -
Training loss: The first learner trains the weights of its data encoder and the weights of its task-specific head on its training dataset: $L(A, W^{(1)}_1, H^{(1)}_1, D^{(tr)}_1)$
Interaction function: -
Optimization problem: $\widetilde{W}^{(1)}_1(A) = \arg\min_{W^{(1)}_1, H^{(1)}_1} L(A, W^{(1)}_1, H^{(1)}_1, D^{(tr)}_1)$

In any other learning stage (Table 14), e.g., the $l$-th stage where the learner is $k$ and the round of interleaving is $m$, the active learner is learner $k$. The active learnable parameters include the weights of the data encoder and the weights of the task-specific head in the $k$-th learner in the $m$-th round. The supporting learnable parameters are the encoder architecture shared by all learners. The active training dataset is the training dataset of the $k$-th learner. There is no active auxiliary dataset. The training loss is the target-task loss defined on the training dataset of the $k$-th learner: $L(A, W^{(m)}_k, H^{(m)}_k, D^{(tr)}_k)$. The interaction function encourages the encoder weights $W^{(m)}_k$ at this stage to be close to the optimal encoder weights $\widetilde{W}_{l-1}$ learned in the previous stage. The optimization problem is:

$$\widetilde{W}^{(m)}_k(A) = \arg\min_{W^{(m)}_k, H^{(m)}_k} L(A, W^{(m)}_k, H^{(m)}_k, D^{(tr)}_k) + \lambda \|W^{(m)}_k - \widetilde{W}_{l-1}\|_2^2,$$

where $\lambda$ is a tradeoff parameter. The optimal encoder weights are a function of the encoder architecture. The encoder architecture is not updated at this learning stage. In rounds 1 to $M-1$, the optimal heads are discarded after learning. In round $M$, the optimal heads are retained and will be used in the validation stage.

Table 14
Active learners: The $k$-th learner
Active learnable parameters: Weights of the data encoder and weights of the task-specific head in the $k$-th learner
Supporting learnable parameters: Encoder architecture shared by all learners
Active training datasets: Training dataset of the $k$-th learner
Active auxiliary datasets: -
Training loss: The $k$-th learner trains the weights of its data encoder and the weights of its task-specific head on its training dataset: $L(A, W^{(m)}_k, H^{(m)}_k, D^{(tr)}_k)$
Interaction function: The learner encourages its encoder weights to be close to the optimal encoder weights $\widetilde{W}_{l-1}$ learned in stage $l-1$: $\|W^{(m)}_k - \widetilde{W}_{l-1}\|_2^2$
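In the validation stage (Table 15), the active learners are all $K$ learners. The remaining learnable parameters are the encoder architecture shared by all learners. The validation datasets are the validation datasets of all learners. There is no active auxiliary dataset. The validation loss is the sum of every learner's validation loss calculated using the optimal encoder weights $\widetilde{W}^{(M)}_k(A)$ and head weights $\widetilde{H}^{(M)}_k(A)$ learned in the final round: $\sum_{k=1}^{K} L(A, \widetilde{W}^{(M)}_k(A), \widetilde{H}^{(M)}_k(A), D^{(val)}_k)$. There is no interaction function. The optimization problem is:

$$\min_{A} \sum_{k=1}^{K} L(A, \widetilde{W}^{(M)}_k(A), \widetilde{H}^{(M)}_k(A), D^{(val)}_k).$$

Putting all these pieces together, we instantiate the Skillearn framework to an interleaving learning framework, as shown in Eq.(32):

$$\begin{aligned}
\min_{A} \quad & \sum_{k=1}^{K} L(A, \widetilde{W}^{(M)}_k(A), \widetilde{H}^{(M)}_k(A), D^{(val)}_k) \\
\text{s.t.} \quad & \widetilde{W}^{(M)}_K(A), \widetilde{H}^{(M)}_K(A) = \arg\min_{W^{(M)}_K, H^{(M)}_K} L(A, W^{(M)}_K, H^{(M)}_K, D^{(tr)}_K) + \lambda \|W^{(M)}_K - \widetilde{W}^{(M)}_{K-1}(A)\|_2^2 \\
& \qquad \vdots \\
& \widetilde{W}^{(1)}_1(A) = \arg\min_{W^{(1)}_1, H^{(1)}_1} L(A, W^{(1)}_1, H^{(1)}_1, D^{(tr)}_1)
\end{aligned} \qquad (32)$$

From bottom to top, the $K$ learners perform $M$ rounds of interleaving learning. Learners in adjacent learning stages are coupled via the interaction function. The architecture $A$ is not updated in the learning stages. It is learned by minimizing the validation loss. Table 16 summarizes the key elements of interleaving learning under the Skillearn terminology.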
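The coupling between adjacent learning stages can be illustrated on a toy problem with scalar encoder weights. The sketch below assumes a hypothetical loss $(w - t_k)^2$ for learner $k$, so that each stage objective $(w - t_k)^2 + \lambda (w - \widetilde{w}_{\text{prev}})^2$ has the closed-form minimizer $(t_k + \lambda \widetilde{w}_{\text{prev}}) / (1 + \lambda)$. It leaves out the architecture and the task-specific heads, and the function name is illustrative.

```python
def interleave(targets, num_rounds, lam):
    """Run interleaving on a toy problem; return last-round encoder weights.

    Learner k has the hypothetical loss (w - targets[k])**2, so the stage
    objective (w - t)**2 + lam * (w - w_prev)**2 has a closed-form minimizer.
    """
    w_prev = None
    final = []
    for _ in range(num_rounds):
        final = []
        for t in targets:
            if w_prev is None:
                w = t  # first stage: no interaction term
            else:
                w = (t + lam * w_prev) / (1.0 + lam)
            final.append(w)
            w_prev = w
    return final


print(interleave([0.0, 1.0], num_rounds=2, lam=1.0))
```

Setting `lam` to 0 removes the interaction term, in which case every learner simply returns to its own task optimum in every round.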

Optimization Algorithm
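In this section, we develop an optimization algorithm for interleaving learning by instantiating the general optimization framework of Skillearn in Section 2.3. For each optimization problem $\widetilde{W}_k^{(m)}(A) = \operatorname*{argmin}_{W_k^{(m)}, H_k^{(m)}} L(A, W_k^{(m)}, H_k^{(m)}, D_k^{(\mathrm{tr})}) + \lambda \| W_k^{(m)} - \widetilde{W}_{k-1}^{(m)}(A) \|_2^2$ in a learning stage, we approximate the optimal solution $\widetilde{W}_k^{(m)}(A)$ following that framework.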

Skillearn: Interleaving Learning
Learners: K learners
Learnable parameters: 1) Encoder architecture shared by all learners; 2) In each round, each learner has weight parameters for the data encoder and weight parameters for the task-specific head.
Interaction function: The encoder weights $W_l$ at learning stage $l$ are encouraged to be close to the optimal encoder weights $\widetilde{W}_{l-1}$ at stage $l-1$: $\| W_l - \widetilde{W}_{l-1} \|_2^2$.
Learning stages: 1) In the first learning stage (the first learner in the first round), the learner trains the weights of its data encoder and the weights of its task-specific head on its training dataset: $L(A, W_1^{(1)}, H_1^{(1)}, D_1^{(\mathrm{tr})})$; 2) In other learning stages, the learner trains the weights of its data encoder and the weights of its task-specific head on its training dataset, where the encoder weights are encouraged to be close to the optimal encoder weights trained in the previous stage: $L(A, W_k^{(m)}, H_k^{(m)}, D_k^{(\mathrm{tr})}) + \lambda \| W_k^{(m)} - \widetilde{W}_{k-1}^{(m)}(A) \|_2^2$.
Validation stage: Each learner validates its optimal data encoder and task-specific head learned in the last round on its validation dataset.
Datasets: Each learner has a training dataset and a validation dataset.
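For $\widetilde{W}_1^{(1)}(A)$, the approximation is derived from the training loss $L(A, W_1^{(1)}, H_1^{(1)}, D_1^{(\mathrm{tr})})$ alone. For $\widetilde{W}_k^{(m)}(A)$, the approximation additionally accounts for the interaction term with the previous stage. In the validation stage, we plug the approximations of $\{\widetilde{W}_k^{(M)}(A)\}_{k=1}^{K}$ into the validation loss function, calculate the gradient of the approximated objective w.r.t. the encoder architecture A, and then update A with a gradient step. The update steps from Eq.(34) to Eq.(37) are repeated until convergence. The entire algorithm is summarized in Algorithm 3.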
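One common way to realise such approximations in differentiable NAS is a single gradient-descent step per variable, as in DARTS. The sketch below assumes that choice; the step sizes $\eta_w$ and $\eta_a$ are symbols introduced here for illustration and are not taken from the text, and $D_k^{(\mathrm{tr})}$ and $D_k^{(\mathrm{val})}$ denote the training and validation sets of the k-th learner.

```latex
\begin{align*}
\widetilde{W}_1^{(1)}(A) &\approx W_1^{(1)} - \eta_w \nabla_{W_1^{(1)}}
  L\big(A, W_1^{(1)}, H_1^{(1)}, D_1^{(\mathrm{tr})}\big), \\
\widetilde{W}_k^{(m)}(A) &\approx W_k^{(m)} - \eta_w \nabla_{W_k^{(m)}}
  \Big[ L\big(A, W_k^{(m)}, H_k^{(m)}, D_k^{(\mathrm{tr})}\big)
  + \lambda \big\| W_k^{(m)} - \widetilde{W}_{k-1}^{(m)}(A) \big\|_2^2 \Big], \\
A &\leftarrow A - \eta_a \nabla_A \sum_{k=1}^{K}
  L\big(A, \widetilde{W}_k^{(M)}(A), \widetilde{H}_k^{(M)}(A), D_k^{(\mathrm{val})}\big).
\end{align*}
```

Under this assumption, the heads $H_k^{(m)}$ would be approximated analogously with their own one-step updates, and the gradient with respect to A flows through the chain of stage-wise approximations.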

Experiments
We apply interleaving learning to neural architecture search in image classification tasks. Two tasks are interleaved: image classification on CIFAR-10 and image classification on CIFAR-100. We search the architecture of the data encoder shared by these two tasks. The search space is the same as that in DARTS. For CIFAR-10, which has 10 classes, the task-specific head is a 10-way linear classifier. For CIFAR-100, which has 100 classes, the head is a 100-way linear classifier. Similar to Section 3.2, we first perform architecture search, which finds an optimal cell, and then perform architecture evaluation, which composes multiple copies of the searched cell into a large network, trains it from scratch, and evaluates the trained model on the test set. Architecture search is performed jointly on CIFAR-10 and CIFAR-100 via interleaving. Architecture evaluation is performed separately on CIFAR-10 and CIFAR-100. In the interleaving process, we set the number of rounds to 2. The training and validation datasets of CIFAR-10 and CIFAR-100 are the same as those described in Section 3.2.1. The tradeoff parameter γ is set to 1. The rest of the experimental settings are the same as those described in Section 3.2.2.
Table 17 and Table 18 show the classification error (%), number of weight parameters (millions), and search cost (GPU days) of different NAS methods on CIFAR-100 and CIFAR-10, respectively. As can be seen, when applied to DARTS-2nd, our interleaving learning (IL) method achieves a large improvement on CIFAR-100 and a slight improvement on CIFAR-10. On CIFAR-100, our proposed IL-DARTS-2nd achieves an average error of 17.46%, which is significantly lower than the 20.58% error of DARTS-2nd. These results demonstrate the effectiveness of interleaving learning. In IL, the encoder trained on CIFAR-100 is used to initialize the encoder for CIFAR-10. Likewise, the encoder trained on CIFAR-10 is used to help with the learning of the encoder on CIFAR-100. These two procedures iterate.
In the multi-task learning (MTL) formulation, $D_{100}^{(\mathrm{tr})}$ and $D_{100}^{(\mathrm{val})}$ are the training and validation sets of CIFAR-100, and $D_{10}^{(\mathrm{tr})}$ and $D_{10}^{(\mathrm{val})}$ are the training and validation sets of CIFAR-10. A is the encoder architecture shared by CIFAR-100 and CIFAR-10. λ and γ are both set to 1. Table 19 compares the classification errors achieved by this multi-task learning method and our proposed interleaving learning method. As can be seen, interleaving learning (IL) performs better than MTL. In the inner optimization problem of the MTL formulation, the encoder weights $W_{100}$ for CIFAR-100 and the encoder weights $W_{10}$ for CIFAR-10 are trained independently, without a mechanism for them to benefit each other. In contrast, IL enables $W_{100}$ and $W_{10}$ to help each other train better via the interleaving mechanism. These results further demonstrate the effectiveness of interleaving.
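For a sense of scale, the CIFAR-100 errors quoted in this section for DARTS-2nd (20.58%) and IL-DARTS-2nd (17.46%) can be converted into absolute and relative error reductions. The short calculation below uses only those two reported numbers; the variable names are illustrative.

```python
# Errors (%) on CIFAR-100 as reported in the text.
darts_2nd_err = 20.58
il_darts_2nd_err = 17.46

# Absolute reduction in percentage points and reduction relative to the baseline.
absolute_reduction = darts_2nd_err - il_darts_2nd_err
relative_reduction = absolute_reduction / darts_2nd_err

print(f"absolute reduction: {absolute_reduction:.2f} percentage points")
print(f"relative reduction: {relative_reduction:.1%}")
```

That is a drop of about 3.12 percentage points, or roughly 15% of the baseline error.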

Summary
In this section, we apply Skillearn to formalize the interleaving learning (IL) skill of humans. In IL, a set of models collaboratively learn a data encoder in an interleaving fashion: the encoder is trained by model 1 for a while, then passed to model 2 for further training, then to model 3, and so on; after being trained by all models, the encoder returns to model 1 and is trained again, then moves on to models 2, 3, etc. This process repeats for multiple rounds. Via interleaving, different models transfer their learned knowledge to each other to better represent data and avoid getting stuck in poor local optima. Experiments on neural architecture search on CIFAR-100 and CIFAR-10 demonstrate the effectiveness of interleaving learning.
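The order in which the encoder visits the models can be made explicit with a small schedule generator. This is an illustrative sketch only; the function name and the flat stage index are assumptions introduced here.

```python
def interleaving_schedule(num_models, num_rounds):
    """Yield (stage, round, model) triples, all 1-indexed, in training order."""
    stage = 0
    for rnd in range(1, num_rounds + 1):
        for model in range(1, num_models + 1):
            stage += 1
            yield stage, rnd, model

# Example: 3 models interleaved for 2 rounds.
for stage, rnd, model in interleaving_schedule(3, 2):
    source = "no earlier stage" if stage == 1 else f"stage {stage - 1}"
    print(f"stage {stage}: round {rnd}, model {model}, encoder received from {source}")
```

After the last model of a round, the next stage is the first model of the following round, which is how the encoder returns to model 1.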

Neural Architecture Search
Neural architecture search (NAS), which aims to find the neural network architecture with the best predictive performance, has made remarkable progress recently. In general, there are three paradigms of methods in NAS: reinforcement learning (RL) approaches (Zoph and Le, 2017; Pham et al., 2018), evolutionary learning approaches (Liu et al., 2018b; Real et al., 2019), and differentiable approaches (Cai et al., 2019). In RL-based approaches, a policy is learned to iteratively generate new architectures by maximizing a reward, which is the accuracy on the validation set. Evolutionary learning approaches represent the architectures as individuals in a population. Individuals with high fitness scores (validation accuracy) have the privilege of generating offspring, which replace individuals with low fitness scores. Differentiable approaches adopt a network pruning strategy. On top of an over-parameterized network, the weights of connections between nodes are learned using gradient descent, and weights close to zero are then pruned. Many efforts have been devoted to improving differentiable NAS methods. In P-DARTS, the depth of searched architectures is allowed to grow progressively during the training process. Search space approximation and regularization approaches are developed to reduce computational overheads and improve search stability. PC-DARTS (Xu et al., 2020) reduces the redundancy in exploring the search space by sampling a small portion of a super network. Operation search is performed in a subset of channels, with the held-out part bypassed in a shortcut. Our proposed LCT framework can be applied to any differentiable NAS method.
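The differentiable paradigm described above (learn connection weights by gradient descent on an over-parameterized network, then prune) can be illustrated on a single edge. The sketch below is a toy, DARTS-style relaxation under stated assumptions: the four candidate operations, the regression target, and all function names are hypothetical and chosen only to keep the example small.

```python
import numpy as np

# Hypothetical candidate operations on one edge (illustrative only).
OPS = {
    "identity": lambda x: x,
    "zero": lambda x: np.zeros_like(x),
    "relu": lambda x: np.maximum(x, 0.0),
    "negate": lambda x: -x,
}

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

def mixed_output(x, alpha):
    """Over-parameterized edge: softmax-weighted sum of all candidate operations."""
    w = softmax(alpha)
    outs = np.stack([op(x) for op in OPS.values()])   # shape (num_ops, n)
    return w @ outs, w, outs

def learn_alpha(x, target, steps=500, lr=0.5):
    """Learn the operation weights alpha by gradient descent on a squared error."""
    alpha = np.zeros(len(OPS))
    losses = []
    for _ in range(steps):
        y, w, outs = mixed_output(x, alpha)
        resid = y - target
        losses.append(float(np.mean(resid ** 2)))
        grad_w = 2.0 * outs @ resid / x.size           # d(loss) / d(softmax weights)
        grad_alpha = w * (grad_w - w @ grad_w)         # chain rule through the softmax
        alpha -= lr * grad_alpha
    return alpha, losses

def prune(alpha):
    """Discretize: keep only the operation with the largest learned weight."""
    return list(OPS)[int(np.argmax(alpha))]

x = np.linspace(-2.0, 2.0, 41)
alpha, losses = learn_alpha(x, target=np.maximum(x, 0.0))
print(prune(alpha), softmax(alpha).round(3))
```

In a real search space, the weights of the operations themselves are trained jointly with `alpha`, and pruning is applied to every edge of the cell rather than to a single one.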