Abstract
To satisfy the demand for efficient neural networks with an appropriate tradeoff between model performance (e.g., classification accuracy) and computational complexity, the differentiable neural architecture distillation (DNAD) algorithm is developed based on two core strategies, namely, search by deleting and search by imitating. On the one hand, to derive neural
architectures in a search space where cells of the same type no longer
share the same topology, the super-network progressive shrinking (SNPS)
algorithm is developed based on the framework of differentiable
architecture search (DARTS), i.e., search by deleting. Unlike
conventional DARTS-based approaches, which produce neural architectures with simple structures and derive only one architecture during the search procedure, the SNPS algorithm is able to derive a Pareto-optimal set of architectures with more complex structures (and hence more powerful representation ability) by forcing the dynamic super-network to shrink progressively from a dense structure to a sparse one. On the other hand,
since knowledge distillation (KD) has proven highly effective in training a compact student network with the help of a powerful over-parameterized teacher network, we combine KD with SNPS to derive the DNAD algorithm, i.e., search by imitating. By minimizing the behavioral difference between the dynamic super-network and the teacher network, the over-fitting of one-level DARTS is avoided and well-performing neural architectures are derived. Experiments on CIFAR-10
and ImageNet demonstrate that both SNPS and DNAD derive, within a single search procedure, neural architectures with competitive tradeoffs between model performance and computational complexity.