Sample-based Dynamic Hierarchical Transformer with Layer and Head
Flexibility via Contextual Bandit
Abstract
Transformers require a fixed number of layers and heads, which makes them
inflexible to the complexity of individual samples and expensive in
training and inference. To address this, we propose a sample-based
Dynamic Hierarchical Transformer (DHT) model whose layers and heads can
be dynamically configured for each data sample by solving
contextual bandit problems. To determine the number of layers and heads,
we use the Upper Confidence Bound (UCB) algorithm, and we deploy
combinatorial Thompson Sampling to select the specific head
combination given that number. Unlike previous work that
focuses on compressing trained networks for inference only, DHT both
adaptively optimizes the underlying network architecture during
training and provides a flexible network for
efficient inference. To the best of our knowledge, this is the first
comprehensive data-driven dynamic transformer that does not rely on
auxiliary neural networks to implement the dynamic system. According
to the experimental results, we achieve up to 74% computational
savings in both training and inference with a minimal
loss of accuracy.
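
As a rough illustration of the two bandit steps named above, the following is a minimal sketch and not the authors' implementation. It assumes a simplified, non-contextual setting: plain UCB1 picks how many heads to use, and Thompson Sampling with one Beta posterior per head picks which ones. The function names, the reward in [0, 1], and the exploration constant c are all hypothetical choices made for this example.

```python
import math
import random


def ucb1_select(counts, mean_rewards, c=2.0):
    """Return the index of the arm with the highest UCB1 score."""
    # Play every arm once before trusting the confidence bounds.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    total = sum(counts)
    scores = [
        mean_rewards[a] + math.sqrt(c * math.log(total) / counts[a])
        for a in range(len(counts))
    ]
    return max(range(len(counts)), key=scores.__getitem__)


def thompson_select_heads(alpha, beta, k, rng):
    """Sample one Beta posterior per head and keep the k largest samples."""
    samples = [rng.betavariate(a, b) for a, b in zip(alpha, beta)]
    ranked = sorted(range(len(samples)), key=samples.__getitem__, reverse=True)
    return sorted(ranked[:k])


def update(counts, mean_rewards, alpha, beta, arm, heads, reward):
    """Fold an observed reward in [0, 1] back into both bandits."""
    counts[arm] += 1
    mean_rewards[arm] += (reward - mean_rewards[arm]) / counts[arm]
    for h in heads:
        alpha[h] += reward
        beta[h] += 1.0 - reward


rng = random.Random(0)
num_heads = 8
counts = [0] * num_heads          # arm i means "use i + 1 heads"
mean_rewards = [0.0] * num_heads
alpha = [1.0] * num_heads         # uniform Beta(1, 1) prior per head
beta = [1.0] * num_heads

arm = ucb1_select(counts, mean_rewards)
chosen = thompson_select_heads(alpha, beta, arm + 1, rng)
# Hypothetical reward, e.g. accuracy traded off against compute cost.
update(counts, mean_rewards, alpha, beta, arm, chosen, reward=0.7)
```

In the model described above, the choice is conditioned on each sample (the contextual part), which this sketch omits for brevity.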