Abstract
To satisfy the demand for efficient neural networks with an appropriate tradeoff between model performance (e.g., classification accuracy) and computational complexity, the differentiable neural architecture distillation (DNAD) algorithm is developed based on two core strategies, namely, search by deleting and search by imitating. On the one hand, to derive neural
architectures in a search space where cells of the same type no longer
share the same topology, the super-network progressive shrinking (SNPS)
algorithm is developed based on the framework of differentiable
architecture search (DARTS), i.e., search by deleting. Unlike
conventional DARTS-based approaches, which produce neural architectures with simple structures and derive only one architecture during the search procedure, the SNPS algorithm is able to derive a Pareto-optimal set of architectures with more complex structures (and hence more powerful representation ability) by forcing the dynamic super-network to shrink progressively from a dense structure to a sparse one. On the other hand,
since knowledge distillation (KD) has proven highly effective in training a compact student network with the help of a powerful over-parameterized teacher network, we combine KD with SNPS to derive the DNAD algorithm, i.e., search by imitating. By minimizing the behavioral difference between the dynamic super-network and the teacher network, the over-fitting of one-level DARTS is avoided and well-performing neural architectures are derived. Experiments on CIFAR-10
and ImageNet demonstrate that both SNPS and DNAD derive, within a single search procedure, neural architectures with competitive tradeoffs between model performance and computational complexity.