A Compact and Flexible FPGA Accelerator for Regular and Octave Convolutional Neural Networks
  • See Jin Chuan ,
  • Hui-Fuang Ng ,
  • Hung-Khoon Tan ,
  • Jing-Jing Chang ,
  • Wai-Kong Lee
Universiti Tunku Abdul Rahman (UTAR)

Corresponding Author:[email protected]


Abstract

Convolutional Neural Networks (CNNs) have been widely used in various Internet-of-Things (IoT) applications to offer smart solutions in our daily lives. Many hardware accelerators have been proposed to speed up CNN inference so that it can be deployed on IoT sensor nodes. Recently, octave convolution was proposed to remove spatial redundancy from input feature maps, thus reducing the computation and memory cost of CNNs. Although octave CNN is an attractive candidate for IoT applications, it cannot fully replace regular CNN, which is already widely used in many applications. Hence, it is desirable to have a hardware accelerator that can flexibly support both regular and octave convolution. In this work, we present the first compact and flexible CNN accelerator on a Field-Programmable Gate Array (FPGA) that supports normal, pointwise, and depthwise convolutions for state-of-the-art octave convolution, as well as existing regular convolution. To ensure efficient accelerator utilization, a novel adaptive scheduling scheme is presented to schedule the number of computation tasks based on the varying feature map dimensions across CNN layers. A novel integration technique is presented to execute, in a single computation unit, the varying computational patterns that arise from the multiple branches in octave convolution. An efficient memory scheme and execution scheduling for octave convolution are also proposed to overlap the long data fetch time with the ongoing octave computation for better overall performance. Compared with other state-of-the-art works, our proposed accelerator achieves 1.98x higher computation density for the MobileNetV2 implementation on XC7ZU9EG, as well as about 83.9% lower power for the ResNet-50 implementation on VU9P.
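To make the multi-branch structure of octave convolution concrete, the following is a minimal NumPy sketch of an octave pointwise convolution: the input is split into a full-resolution high-frequency part and a half-resolution low-frequency part, and four weight paths (high-to-high, high-to-low, low-to-high, low-to-low) are combined with pooling and upsampling. All function and variable names here are illustrative assumptions for exposition; this is not the accelerator's implementation.

```python
import numpy as np

def pointwise_conv(x, w):
    # 1x1 convolution: x has shape (C_in, H, W), w has shape (C_out, C_in);
    # result has shape (C_out, H, W).
    return np.einsum('oc,chw->ohw', w, x)

def avg_pool2(x):
    # 2x2 average pooling, halving the spatial resolution.
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def upsample2(x):
    # Nearest-neighbor upsampling, doubling the spatial resolution.
    return x.repeat(2, axis=1).repeat(2, axis=2)

def octave_pointwise_conv(x_h, x_l, w_hh, w_hl, w_lh, w_ll):
    # x_h: high-frequency input at full resolution (C_h, H, W)
    # x_l: low-frequency input at half resolution (C_l, H/2, W/2)
    # Four branches are summed pairwise into high- and low-frequency outputs.
    y_h = pointwise_conv(x_h, w_hh) + upsample2(pointwise_conv(x_l, w_lh))
    y_l = pointwise_conv(avg_pool2(x_h), w_hl) + pointwise_conv(x_l, w_ll)
    return y_h, y_l
```

Note how each branch has a different computational pattern (plain conv, pool-then-conv, conv-then-upsample), which is what makes mapping octave convolution onto a single computation unit non-trivial and motivates the integration technique described above.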