HASP: Hierarchical Asynchronous Parallelism for Multi-NN Tasks
  • Hongyi Li,
  • Songchen Ma,
  • Taoyi Wang,
  • Weihao Zhang,
  • Guanrui Wang,
  • Chenhang Song,
  • Huanyu Qu,
  • Junfeng Lin,
  • Cheng Ma,
  • Jing Pei,
  • Rong Zhao
Rong Zhao
Tsinghua University

Corresponding Author: [email protected]


The rapid development of deep learning has propelled many real-world artificial intelligence (AI) applications. Many of these applications integrate multiple neural network (multi-NN) models to provide different functionalities. Although a number of multi-NN acceleration technologies have been explored, few fully provide the flexibility and scalability required by emerging and diverse AI workloads, especially on mobile devices. Among these technologies, homogeneous multi-core architectures have great potential to support multi-NN execution by leveraging decentralized parallelism and intrinsic scalability. However, the advantages of multi-core systems are underexploited because they typically adopt bulk synchronous parallelism (BSP), which copes poorly with the diversity of multi-NN workloads. This paper reports a hierarchical multi-core architecture with asynchronous parallelism that improves the performance and utilization of multi-NN execution. Its theoretical foundation is hierarchical asynchronous parallelism (HASP), which establishes a programmable, grouped, dynamic synchronous-asynchronous framework for multi-NN acceleration. HASP can be implemented on a typical multi-core processor with minor modifications. We further developed a prototype chip to validate the hardware effectiveness of this design. We also developed a mapping strategy that combines spatial partitioning and temporal tuning, which allows the proposed architecture to improve resource utilization and throughput simultaneously.
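
To make the contrast between a global barrier and grouped synchronization concrete, the following toy timing model may help. It is an illustrative sketch only, not the paper's architecture or evaluation method. It assumes that cores are grouped by network, that cores inside a group synchronize after every step, and that groups do not wait for each other. The workload, the network names, and the functions `bsp_finish_times` and `grouped_finish_times` are all hypothetical.

```python
from typing import Dict, List

# Hypothetical workload: network name -> cores -> per-step compute times.
Workload = Dict[str, List[List[float]]]


def bsp_finish_times(workload: Workload) -> Dict[str, float]:
    """Global barrier after every step: all cores wait for the slowest core."""
    all_cores = [core for cores in workload.values() for core in cores]
    n_steps = max(len(core) for core in all_cores)
    clock = 0.0
    barriers = []  # time at which each global barrier is released
    for step in range(n_steps):
        clock += max(core[step] for core in all_cores if step < len(core))
        barriers.append(clock)
    # A network finishes at the barrier that closes its last step.
    return {
        name: barriers[max(len(core) for core in cores) - 1]
        for name, cores in workload.items()
    }


def grouped_finish_times(workload: Workload) -> Dict[str, float]:
    """Barrier only inside each group (assumed); groups run independently."""
    finish = {}
    for name, cores in workload.items():
        n_steps = max(len(core) for core in cores)
        finish[name] = sum(
            max(core[step] for core in cores if step < len(core))
            for step in range(n_steps)
        )
    return finish


# Two made-up networks with very different per-step costs, two cores each.
workload = {
    "detector": [[4.0, 4.0], [3.0, 4.0]],
    "classifier": [[1.0, 1.0], [1.0, 2.0]],
}

bsp = bsp_finish_times(workload)
grouped = grouped_finish_times(workload)
print("global barrier:", bsp)
print("grouped sync:  ", grouped)
```

In this made-up example the lightweight network is held back by the heavy one under a global barrier, while grouped synchronization lets it finish as soon as its own cores are done. The cores it frees could then be reused, which is the kind of utilization gain the abstract attributes to asynchronous parallelism.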