HASP: Hierarchical Asynchronous Parallelism for Multi-NN Tasks

Hongyi Li, Songchen Ma, Taoyi Wang, Member, IEEE, Weihao Zhang, Guanrui Wang, Chenhang Song, Huanyu Qu, Junfeng Lin, Cheng Ma, Jing Pei, and Rong Zhao

Abstract—The rapid development of deep learning has propelled many real-world artificial intelligence applications. Many of these applications integrate multiple neural networks (multi-NN) to cater to various functionalities. Multi-NN acceleration faces two challenges: (1) competition for shared resources becomes a bottleneck, and (2) heterogeneous workloads exhibit remarkably different computing-memory characteristics and various synchronization requirements. Therefore, resource isolation and fine-grained resource allocation for each task are two fundamental requirements for multi-NN computing systems. Although a number of multi-NN acceleration technologies have been explored, few completely fulfill both requirements, especially in mobile scenarios. This paper reports a Hierarchical Asynchronous Parallel Model (HASP) that enhances multi-NN performance to meet both requirements. HASP can be implemented on a multicore processor that adopts Multiple Instruction Multiple Data (MIMD) or Single Instruction Multiple Thread (SIMT) architectures, with only minor adaptive modification. Further, a prototype chip is developed to validate the hardware effectiveness of this design. A corresponding mapping strategy is also developed, allowing the proposed architecture to simultaneously improve resource utilization and throughput. With the same workload, the prototype chip demonstrates 3.62× and 3.51× higher throughput over Planaria, and 8.68× and 2.61× over Jetson AGX Orin, for MobileNet-V1 and ResNet50, respectively.


Index Terms—Multi-NN, multicore architecture, AI accelerator.

I. INTRODUCTION
The end of Dennard scaling and Moore's law has led to flourishing domain-specific architecture research [1], especially for artificial intelligence (AI). Due to their massive demand for storage and computing power, deep neural networks (DNNs) pose considerable challenges to traditional processors [2], such as CPUs and GPUs. This has spawned hundreds of works on DNN acceleration over the past few years. Most proposed DNN accelerators focus on accelerating a single neural network [2], [3], [4], or even specific neural networks (NNs) [5], [6]; however, many cloud and mobile/edge AI applications require the simultaneous execution of multiple NNs, as shown in Fig. 1(a). Moreover, tasks in mobile and cloud environments differ greatly. Taking the intelligent robot [7] illustrated in Fig. 1(a) as an example, the robot requires the simultaneous deployment of four different NNs for object detection, voice wake-up, and sound localization to achieve complete system functionality. In these circumstances, the NNs deployed on the robot are often fixed and process the input stream at a high throughput, which is remarkably different from cloud computing, where tasks change as tenants are switched and fairness is of greater concern.
Most reported multi-NN accelerators are optimized for the multi-tenancy paradigm of cloud computing, adopting temporal or spatial parallelization strategies. PREMA [8] and AI-MT [9] use time-multiplexing to process tasks. However, time-multiplexing may incur a huge context-switching overhead and is unlikely to execute NNs in parallel. On the other hand, the multiple systolic arrays in Planaria [10], the multiple heterogeneous accelerators in Herald [11], and GPUs with the Multi-Process Service (MPS) [12] have explored spatially co-locating NN tasks to improve resource utilization. Their tile-grained scheduler designs for multi-NN in data centers enable the hardware to handle multi-tenant requests. Nevertheless, this approach introduces inter-task interference, additional data movement, and scheduling latency on mobile machines with long-term multi-NN deployment and high-throughput processing demands, resulting in unstable system performance. Few studies address multi-NN acceleration on mobile and edge devices, and those that do rely on customized architectures with restricted programmability and limited scalability [13]. A programmable, high-performance multi-NN processor for mobile and edge computing is still lacking. We list the characteristics of these previous works in Table I.
A high-performance multi-NN execution system should meet the following requirements: 1) Task isolation for the dynamic and varied execution requirements among NNs (interference-free). In a real scenario, resource contention can be the bottleneck of multi-NN execution. Therefore, isolating NNs from each other is crucial to avoid execution interference. 2) Fine-grained flexibility to alleviate unbalanced memory and computation workloads within NNs. The structures of NNs are becoming increasingly diverse, resulting in a wide variation of the computing/memory ratio between different NNs (Fig. 1b). Fine-grained flexibility enlarges the search space for catering to heterogeneous workloads, and can be further divided into dynamic resource adjustment at run-time and fine-grained resource allocation. Nevertheless, there is a contradiction between these two requirements in various types of multicore processors.
Many MIMD architectures feature near-memory computing via private memory. However, existing MIMD parallel models cannot satisfy both of the requirements above. The BSP model [17] undermines the isolation of NNs with hardware-based global synchronization, which can lead to a manifest performance decrease under workload variation (Fig. 1c). The LogP model [18] attempts to leave most of the execution process to software scheduling, but it significantly raises the design difficulty of deadlock avoidance and performance estimation.
Therefore, a well-designed multi-NN execution mechanism needs to satisfy two requirements: isolation between NNs and high hardware utilization for each NN's processing. However, providing such a mechanism is challenging, because it involves achieving a proper division of responsibility between hardware and software. To address these problems, we propose a Hierarchical Asynchronous Parallelism (HASP) model for multi-NN acceleration, along with a corresponding hardware design and mapping policy.
We summarize the key contributions as follows: 1) We develop a Hierarchical Asynchronous Parallel Model (HASP) for multi-NN (Section III) and discuss its application scope. 2) To realize the HASP model, we develop a multicore chip, with a dynamic and programmable trigger module as the crucial component and an asynchronous communication mechanism, to fully exploit the MIMD-intrinsic advantages for multi-NN (Section IV). 3) A mapping policy is introduced to explore the potential of HASP when deploying multi-NN onto the MIMD chip (Section V). Generally, this paper proposes a systematic multi-NN acceleration scheme that balances inter-NN isolation and performance-optimization flexibility by exposing part of the hardware configuration to the software, in conjunction with the HASP model. Moreover, we design corresponding workloads and experiments to evaluate the effectiveness of HASP. The HASP-based chip demonstrates 3.51× and 3.62× higher throughput over Planaria; 2.61× and 8.68× over Jetson AGX Orin (Orin for short); 1.68× and 3.04× over Orin with Multi-stream; and 1.22× and 4.02× over an RTX 2080Ti, for ResNet50 and MobileNet-V1, respectively. These results suggest that exploring the parallel execution of multi-NN by optimizing the parallel model provides remarkable benefits for multicore architectures.

II. BACKGROUND AND MOTIVATION

A. The Upsurge of MIMD Multicore Accelerators
In addition to GPUs as typical AI accelerators, recent industry trends show a surge of MIMD paradigms in AI chips [19]. Most of these multicore chips and systems are massively parallel processors with independently programmable cores, large and distributed on-chip memory, and a high-bandwidth network-on-chip (NoC).
However, most AI multicore chips are commercial products whose architectural details are unavailable, such as the Cerebras WSE series [20] and Tenstorrent [21]. To the best of our knowledge, only Graphcore's Intelligent Processing Unit reveals its parallel model [16], and only Tenstorrent claims support for model-level parallelism [22].

B. Problems of Current Parallel Model for Multicore Accelerators
The design of the parallel model is critical for extracting the highest performance from multicore hardware for DNN acceleration. Bulk Synchronous Parallel (BSP), adopted by Graphcore's IPU [16], is one of the representative parallel models for multicore architectures. A BSP computing system usually requires processors with local memory, interconnected by communication networks [23].
In a BSP system, a program consists of a series of supersteps comprising three phases: the local computation phase, the communication phase for data exchange, and the barrier synchronization phase. However, BSP cannot fully satisfy the requirements of multi-NN execution. Empirical results on the Graphcore IPU show that after different NN tasks are sliced by phase, the execution time varies greatly between different phases of the same NN and between the corresponding phases of different NNs. If all tasks are performed concurrently, the BSP model, which forces global synchronization, incurs a huge overhead of waiting for the longest phase. Obtaining high hardware resource utilization without a fine-tuned task allocation policy is thus challenging.

C. A Trend Toward Spatial Multi-Tasking
The choice between temporal and spatial multi-tasking greatly influences multi-NN performance. We investigate the general multi-tasking pattern based on an open-source simulator, PREMA_SIM [8]. In Fig. 2(a), temporal sharing exhibits a remarkable performance loss from task switching. The results show that spatial segmentation (1:1:2 for three networks) achieves 1.38× better throughput for Tiny-YOLO(M) than the temporal scheme with 4× the processing elements, and outperforms the temporal scheme with the same resources for all NNs. Fig. 2(b) illustrates the reason: isolated network performance increases slowly as the MAC array is scaled out, indicating that the performance loss caused by resource shrinkage is less significant than that caused by frequent temporal task switching. Fig. 2(c) further explores the pattern of spatial and temporal multi-tasking as the number of MACs increases: spatial multi-tasking has better potential with adequate computing resources. These observations explain the advantage of spatial schemes for multi-NN architecture design.

III. HASP: HIERARCHICAL ASYNCHRONOUS PARALLEL MODEL
In the Introduction, we identified the challenges of multicore processors performing multi-NN computing and revealed the need for a systematic solution, from the parallel model down to the hardware mechanism. We first discuss the problem at the model level: this section proposes the Hierarchical Asynchronous Parallel (HASP) computing model of multicore architecture for multi-NN execution.
Following the considerations above, the HASP model gives a multi-NN hardware architecture two essential capabilities: providing isolated execution environments for NN tasks and performing fine-grained load balancing for high hardware utilization.

A. Providing Isolated Execution Environments
Each task has its own input sequence and latency constraint. Under the BSP model, the length of each phase is determined by the core with the longest latency, so the synchronization of one NN task delays the other NN tasks. We use the following example to illustrate this.
As Fig. 3(a) illustrates, an NN task can be defined as $\langle NN, F[n], L\rangle$, where $NN$ is the neural network, $F[n]$ is the incoming frame sequence (e.g., $F[2]$ denotes the second incoming frame the hardware should process), and $L$ is the upper bound of acceptable latency. For frame $F[n]$, the waiting time before execution $t_1$ and the execution time $t_2$ must satisfy the latency restriction $t_1 + t_2 < L$. In Fig. 3(b), $F[2]$ of task1 arrives before the end of the previous phase and must wait until that phase ends, owing to the synchronization requirement of BSP. Furthermore, task2 is also affected by the execution of task1, which extends its $t_2$. This mutually interfering execution environment makes it hard to satisfy the throughput or latency requirements of both tasks. The first level of HASP provides a solution: divide the cores into groups and assign different tasks to the groups, as shown in Fig. 3(c). We refer to these groups as Environment Groups (Env Groups), and to the grouping scheme as grouped synchronization parallelism (GSP). GSP adopts synchronization within Env Groups and asynchrony between Env Groups. Usually, each NN task should be assigned to one isolated Env Group, while NN tasks with similar throughput requirements or massive inter-NN communication can be assigned to the same Env Group.
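To make the contrast concrete, the following minimal sketch (ours, with hypothetical per-phase execution times) compares the per-frame latency of a fast task under BSP's global synchronization and under GSP's per-group synchronization:

```python
# Hypothetical per-phase execution times (arbitrary units) of two tasks
# co-located on one chip; task2 is the slow one.
task1 = [3, 3, 3]
task2 = [9, 9, 9]

# BSP: global synchronization forces every phase to last as long as the
# slowest task's phase, so task1's t1 + t2 is inflated by task2.
bsp_latency_task1 = sum(max(a, b) for a, b in zip(task1, task2))   # 27

# GSP: each task's Env Group synchronizes only internally, so task1's
# latency depends on task1 alone (interference-free).
gsp_latency_task1 = sum(task1)                                     # 9

print(f"task1 frame latency: BSP={bsp_latency_task1}, GSP={gsp_latency_task1}")
```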

B. Realizing Load Balance for High Hardware Utilization
A huge challenge is improving resource utilization within each Env Group after assigning tasks. NNs usually have an unbalanced ratio of storage to computation [24], which increases the difficulty of balancing the load within the group. Additionally, the communication volume between layers within a model can be much higher than its input or output in classic NNs; for example, the communication between stage 1 and stage 2 of ResNet50 can be 5-6× larger than the input size, as shown in Fig. 4.
An uneven distribution of communication load in Env Groups leads to a significant decrease in overall performance. A common solution, shown in Fig. 5, pipelines upstream computing, communication, and downstream computing to maximize their overlap [22]. In Fig. 5, C-1 represents the first segment of the Conv in L1; C-1, S-1, R-1, and P-1 form the serial pipeline of the first segment.
However, for multicore architectures, this method may aggravate load imbalance. An example is shown in Fig. 6(a): in task1, layers 1 and 2 are mapped to execute in four phases to reduce communication latency. If all cores in the Env Group strictly obey GSP, the fine-grained fragments of layers 1 and 2 have to wait for the cores executing layer 3. In this case, the profit from pipelined communication is eliminated, and load-balanced mapping becomes difficult to achieve. From these observations we conclude that there are two challenges: 1) large communication demands usually arise locally in NN dataflows; 2) global synchronization within an Env Group can degrade task performance. In response, we introduce the second level of asynchronous parallelism in HASP (shown in Fig. 6b). We release the global synchronization within each Env Group and instead use subset synchronization for data exchange, handling massive communication with a local synchronous pipeline and relatively small transfers with asynchronous communication (introduced in Section IV-C).
We formally define the following concepts for utilization measurement and optimization (Table II). Within each Env Group, we assume that a task is deployed on $N$ cores and requires $P$ phases to complete the processing of one input frame. The workload of core $C_i$ in phase $j$ is expressed as $L_{C_i}^j$. For simplicity, we assume that the computation capability of a core is constant and that the workload is measured by the length of the execution time. All cores in the same Env Group synchronize at the end of each phase for data exchange, so the length of phase $j$ is $T_j = \max_i L_{C_i}^j$. We then define the hardware utilization rate $U$; the utilization of phase $j$ is $U_j = \sum_{i=1}^{N} L_{C_i}^j / (N \cdot T_j)$.

TABLE II
TIMING UNITS OF HASP

Nouns | Level      | Definition
Step  | Env Group  | Time for the Core Groups to finish computing and cross-group data exchanges
Phase | Core Group | Minimal temporal execution unit of HASP
To optimize $U$, we divide the Env Group into second-level groups, namely Core Groups, based on the storage and computation distribution of the task. Each Core Group is responsible for one or more layers of the NN. Cores in the same Core Group execute synchronously at the phase level, while execution between Core Groups is asynchronous. To provide dynamic scheduling for pipelining, we set a synchronization point at which all Core Groups in one Env Group finish processing the current stage. This coarse-grained timing unit of the Env Group is referred to as a step, which contains different numbers of phases in different Core Groups (defined in Table II).
With the HASP model, load balancing is more achievable and hardware utilization increases. Because the scale of a Core Group's sub-task is much smaller than the whole task, the model allocates similar or closely connected NN layers to one Core Group, such as a ResBlock in ResNet50 [25] or an Inception module in GoogLeNet [26].
More specifically, we divide the $N$ cores into $H$ Core Groups. $\Phi_h$ is the $h$-th Core Group, and the number of cores in $\Phi_h$ is $N_h$. We define the number of phases in $\Phi_h$ as $P_h$, and the length of its phase $j$ as $T_j^h = \max_{C_i \in \Phi_h} L_{C_i}^j$. By setting multiple Core Groups and pipelining steps, we transform the global optimization of $U$ into several local optimizations of $U_h$, remarkably reducing complexity.
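As an illustration, here is a minimal Python sketch of the utilization definitions above, assuming a workload matrix $L[i][j]$ that holds the execution time of core $C_i$ in phase $j$; the function and variable names are ours:

```python
def phase_utilization(L):
    """L[i][j]: execution time of core C_i in phase j within one group.
    Returns per-phase utilization U_j and the overall utilization U."""
    n_cores, n_phases = len(L), len(L[0])
    # T_j: phase length = slowest core, since the group syncs per phase.
    T = [max(L[i][j] for i in range(n_cores)) for j in range(n_phases)]
    U_phase = [sum(L[i][j] for i in range(n_cores)) / (n_cores * T[j])
               for j in range(n_phases)]
    U = sum(sum(row) for row in L) / (n_cores * sum(T))
    return U_phase, U

# Splitting an Env Group into Core Groups replaces one global optimization
# of U with several smaller, independent optimizations of each group's U_h.
print(phase_utilization([[4, 2], [4, 4], [1, 4]]))
```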
In summary, we develop a hierarchical asynchronous parallel model (HASP) for multi-NN execution on multicore architectures. Through hierarchical grouping, isolated environments are established, and hardware utilization can be optimized under resource limitations (such as the number of cores used).
Two important advantages of HASP are: (1) The decoupling of compile-time and run-time optimization. An important design consideration for isolation in HASP is to reduce multi-NN run-time optimization to single-NN compile-time optimization with specific resources. This design is highly compatible with existing single-NN compilation strategies, as it avoids run-time considerations such as resource competition. (2) Ease of hardware extension. By addressing both isolation and utilization optimization from the perspective of synchronization, we can directly modify the hardware synchronization controller of a multicore processor at small cost to support the HASP model.

IV. HARDWARE IMPLEMENTATION FOR HASP
To validate the hardware feasibility of HASP, we develop a typical multicore chip as a demonstration. Our chip offers MIMD parallelism with a large number of cores, each with local memory, and a 2D-mesh NoC (Fig. 7a and b), adapting to irregular computation and data accesses.
There are two design considerations for supporting HASP: (1) hardware-based orchestration of hierarchical asynchronous execution, and (2) a communication mechanism between asynchronous processing groups. In this article, the trigger module and the instant primitive are the corresponding mechanisms realizing these two functions. After a brief description of the general design of the chip, this section focuses on how the HASP model is supported on such a MIMD multicore processor.

A. The Hardware Architecture
As shown in Fig. 7(a), our demonstration chip adopts a 2D-mesh multicore array and an on-chip NoC architecture with private memory only, including 160 cores. Each chip integrates four inter-chip routing modules on its four sides, used for data buffering, collation, and cross-chip data transmission. Chip registers are used to program necessary parameters, such as Core Grouping information and the number of enabled cores.
Each core in the chip is homogeneous, independent, and capable of completing part of an NN task individually. The core adopts a coarse-grained primitive set to support efficient NN execution. The primitive set contains 14 primitives (PIs) in three types: 1) linear computing primitives, including convolution, scalar-matrix multiplication, vector-matrix multiplication, and matrix-matrix multiplication; 2) nonlinear computing primitives, including pooling, nonlinear activation, and the Leaky Integrate-and-Fire (LIF) neuron model; 3) data operation primitives, including intra-core data movement and inter-core router communication.
• The Core Controller maintains the run-time state to orchestrate the execution of the other modules by dispatching proper control signals. In one phase, the core controller assigns a sequence of primitives, including a linear computing primitive, nonlinear computing primitives (×2), and a router primitive, to dedicated modules and executes them in parallel. This sequence satisfies the requirement of parallel execution of computation and communication within one phase, as proposed by the HASP model. If a data hazard exists, waiting is inevitable and may elongate the corresponding phase, as shown in Fig. 8(a). In each core, primitives can be executed in two ways: static PIs for local computing and instant PIs (Fig. 8b, discussed in Section IV-C) for inter-group communication.

B. Trigger Module for HASP
The trigger module is the most critical extension for supporting HASP on the multicore processor (shown in Fig. 8a, including the trigger controller with registers and the IO controller). According to the proposed synchronization hierarchy, the trigger controller generates two levels of trigger signals, Phase click and Step click, controlling a Core Group and an Env Group, respectively. Our prototype chip supports at most 4 Env Groups and 32 Core Groups. Under this hierarchical grouping, the 160 cores can be arbitrarily divided into up to 32 Core Groups and further allocated to up to four Env Groups. The trigger controller allocates these two-level triggers to the dedicated cores according to the grouping information configured in the chip registers, and deduces the completion of phases or steps from the BUSY signal.
As shown in Fig. 8(a), each step consists of several non-uniform phases. Each step and phase is activated by a Step click pulse and a Phase click pulse, respectively. Cores activated by the same Phase click to execute one phase belong to the same Core Group; likewise for Step click and Env Groups. Cores in different Core Groups or Env Groups execute asynchronously. The IO controller manages the IO ports that are tightly related to the chip execution state: Trigger and Gfinish. Trigger is the request signal from outer equipment (typically sensors or other chips) bundled with the input data. Gfinish is a configurable output signal indicating the completion of the concerned Core Groups, viewed as the valid signal of the computing results. In other words, the Step click pulse is controlled by the chip IOs, supporting dynamic adjustment of the throughput of each NN task.
Within the IO controller, a state machine is maintained for each Core Group, and the signal allocation for Phase click and Step click (mapping 32 pairs of outputs from the Phase Trigger Generator to 160 cores) is accomplished by a configurable mask array.
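The following behavioral sketch (our abstraction, not the chip's RTL) illustrates how such a mask array could fan the 32 Phase click sources out to 160 cores according to register-configured grouping information; the grouping layout is hypothetical:

```python
N_CORES, N_CORE_GROUPS, N_ENV_GROUPS = 160, 32, 4

# Hypothetical register contents: which Core Group each core belongs to,
# and which Env Group each Core Group belongs to.
core_group_of = [i % N_CORE_GROUPS for i in range(N_CORES)]
env_group_of = [g % N_ENV_GROUPS for g in range(N_CORE_GROUPS)]

def phase_click_targets(core_group):
    """Cores pulsed by one Phase click source (one row of the mask array)."""
    return [c for c in range(N_CORES) if core_group_of[c] == core_group]

def step_click_targets(env_group):
    """Cores pulsed by one Step click: all Core Groups of one Env Group."""
    return [c for g in range(N_CORE_GROUPS) if env_group_of[g] == env_group
            for c in phase_click_targets(g)]

print(len(step_click_targets(0)))   # 40 cores in Env Group 0 in this layout
```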
By configuring the adaptation mode of the trigger controller, which determines whether the execution frequency depends on the upstream device, the chip can act as either an autonomous or a subsidiary component of a computing system to satisfy various execution requirements. Additionally, if multiple chips are required to deploy a multi-NN application, the trigger module plays an important role in timing control. For example, by configuring the trigger module, several Core Groups on different chips can cooperatively process one NN in the same step and phase.
In summary, as the approach to realizing the HASP model, the trigger module schedules tasks on the multicore processor and reserves configuration space for different applications and system-interaction requirements. With the help of the trigger module, different Env Groups offer isolated execution domains, conceptually similar to virtualization, and provide flexible programmability to adapt to various task needs. Simultaneously, Core Groups enable locally grouped synchronization to improve utilization.

C. Instant Primitive for Asynchronous Communication
According to HASP, NN tasks are mapped to different Env/Core Groups and run asynchronously. For ease of description, the part of an NN task assigned to a Core Group is called a sub-task. The data communication mechanism across Env/Core Groups is crucial for guaranteeing the correctness and overall performance of multi-NN and single-NN execution. Setting a global synchronization point, as in BSP, is straightforward but wastes time and reduces utilization. To enable efficient data transfer between asynchronous tasks, we propose an instant primitive mechanism (Instant PI) for inter-group data communication. A detailed timing diagram is given in Fig. 8(b).
A typical solution is to use the mailboxes or queues of the Actor model, but deadlocks are unavoidable there, and the hardware design is difficult to reconcile with the synchronization design [27]. Our approach inserts the handshake-establishment process into the normal phase by sending router packets in a specific format, asking the target cores to spare a phase for data transfer. Unlike a normal phase, the occurrence of such a spared phase is event-driven and cannot be fully predicted. The primitives used in this phase are specially managed, like the instructions of an interrupt table in CPUs, and are named Instant Primitives (Instant PIs) to distinguish them from static primitives. The context switching between the Instant PI phase and the normal phase is managed by the core controller.
Fig. 8(b) illustrates a case of multiple tasks exchanging data while running asynchronously; it can equally represent pipelined or branched sub-tasks within one NN task exchanging data. An Instant PI is launched for data transfer under such circumstances. First, core [3,0] in Core Group 2, the data request group (Req_GRP), runs a PI calling for data transfer and sends a request (REQ) packet to core [2,2] in Core Group 3, the data acknowledge group (Ack_GRP), for handshaking. After an intra-group multicast and acknowledgment (ACK) process for internal synchronization, core [2,2] sends an ACK packet back to core [3,0] in Req_GRP. After the same intra-group message passing, all cores in Req_GRP and Ack_GRP start data transmission to the destination cores. After the Instant PI process completes, the cores in the different Core Groups resume their original static PI sequences.
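The handshake order can be summarized by the following sketch (ours; the group membership, message names, and send callback are illustrative, not the chip's packet format):

```python
def instant_pi(req_grp, ack_grp, send):
    """Sketch of the Instant PI handshake; send(src, dst, msg) delivers a packet."""
    req_leader, ack_leader = req_grp[0], ack_grp[0]
    send(req_leader, ack_leader, "REQ")              # 1. request a spared phase
    for core in ack_grp[1:]:                         # 2. Ack_GRP internal sync
        send(ack_leader, core, "ACK-multicast")
    send(ack_leader, req_leader, "ACK")              # 3. acknowledge handshake
    for core in req_grp[1:]:                         # 4. Req_GRP internal sync
        send(req_leader, core, "ACK-multicast")
    for src, dst in zip(req_grp, ack_grp):           # 5. both groups spare one
        send(src, dst, "DATA")                       #    phase for the transfer

instant_pi(req_grp=[(3, 0), (3, 1)],                 # e.g., cores of Core Group 2
           ack_grp=[(2, 2), (2, 3)],                 # cores of Core Group 3
           send=lambda s, d, m: print(f"{s} -> {d}: {m}"))
```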
In conclusion, the Instant PI hides communication overhead and provides high-efficiency data transfer between NN tasks and sub-tasks, improving the overall throughput of multi-NN execution and hardware utilization.

V. MULTI-NN MAPPING POLICY
The mapping policy is critical to the performance of multi-NN tasks on multicore processors. Three issues need to be taken into account in the mapping policy design: 1) The limitation of the core memory size.
2) Deadlock: closed loops in the computing graph need to be packed into isolated Core Groups with specific routing considerations to prevent deadlock. 3) Load balancing: the mapping policy should take full advantage of HASP to achieve load balancing and improve performance.

Mapping workflow. The mapping shown in Fig. 9 represents the procedure from code written in popular deep learning frameworks (Keras [28], PyTorch [29], TensorFlow [30]) to the configuration file for the multicore chip. Open Neural Network Exchange (ONNX) [31] is selected as the intermediate layer, translating NNs into a computing graph to decouple network programming from mapping. Each vertex in the unified graph (vertices represent operators and edges indicate data flow) is compiled into a PI stack, with quantified parameters and combinations for reasonable adjustment. A PI stack example is illustrated in Fig. 10(a): one layer of a ResBlock can be extended to a sequence of several primitives, and at least two phases are needed to perform the layer computation. Finally, the optimized graph is structured into a configuration file containing all the PIs each core should execute. The optimization process is described below.
Mapping policy. The overall mapping policy in Algorithm 1 summarizes the mapping workflow described above and exploits spatial workload balancing and temporal compression.
After each task is translated from the operators of the ONNX graph into a set of PIs, the function BlockSeparate (detailed in Algorithm 2) splits some of the vertices into duplicates and dispatches the task under the limitation of hardware resources. Besides, the feedback commonly used in NNs [32], [33] can introduce a cycle into the graph, which not only hinders algorithms that demand directed acyclic graphs (DAGs) but also creates a potential danger of deadlock. Another task of BlockSeparate is therefore to recursively pack such cycles into a "Super Vertex" so that the graph becomes a DAG. Fig. 10(b) illustrates an example: when deploying a layer on a single core would exceed the single-core storage limit, the convolution operation is split (SplitVertex) and deployed on multiple cores for computation.
Then, if the memory size of one core suffices to accommodate several vertices, these vertices are merged into one vertex by SpatialMerge, as shown in Fig. 10(c). Performing BlockSeparate directly layer by layer creates storage redundancy; SpatialMerge is dedicated to reducing operator fragmentation and trades time for space and bandwidth. Whereas the procedure above performs optimizations within each task, TemporalTuning, based on a greedy algorithm, alleviates load imbalance through primitive-level tuning via core grouping and task division. Finally, the optimized graph set $G_{opt}$ is allocated to the chip, with each vertex in $G_{opt}$ corresponding to one core.
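The splitting step can be illustrated with a minimal sketch of the BlockSeparate idea (our simplification of Algorithm 2; the vertex representation and the even-split rule are assumptions):

```python
import math

CORE_MEM = 144 * 1024   # per-core memory size in bytes (144 KB, from the paper)

def block_separate(graph):
    """graph: list of (vertex_name, mem_bytes). Any vertex too large for one
    core is split (SplitVertex) into equal duplicates that each fit."""
    out = []
    for name, mem in graph:
        n = max(1, math.ceil(mem / CORE_MEM))      # cores needed for this vertex
        out.extend((f"{name}.part{k}", mem / n) for k in range(n))
    return out

print(block_separate([("conv1", 500 * 1024), ("fc", 100 * 1024)]))
# conv1 is split across 4 cores; fc already fits on a single core.
```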
Limitation. Indeed, an important advantage of HASP is its compatibility with existing single-NN mapping strategies. However, the optimization procedure can be reduced to the quadratic assignment problem, an NP-hard problem, which inevitably limits its capabilities. Under these circumstances, our mapping policy aims to find a feasible solution rather than an optimal one.

VI. EVALUATION
In this section, several evaluations are performed to verify the effectiveness of HASP. First, we introduce the experimental preparations, including metrics, benchmarks, and methodologies. Then, we individually evaluate the different levels of our work, including HASP and the mapping policy. To validate that the performance gains come from multicore-based spatial multi-tasking and HASP, the controlled variables of each experiment are summarized below: 1) HASP: keeping other factors the same, we measure the performance of BSP, GSP, and HASP in simulation and conclude that the performance gain comes purely from HASP. 2) Overall architecture: we normalize the computing resources of our chips (multicore) and Planaria [10] (multiple systolic arrays) to exhibit the overall performance advantages. For on-chip execution of practical workloads, the same workloads are also evaluated on GPUs, the typical temporal multi-NN chips for mobile tasks.

A. Evaluation Settings
Metrics. First, we simulate and compare the $U$ and $U_h$ of the synchronization mechanisms in different multicore architectures. We also define the inter-group communication ratio (IGC) to evaluate the proportion of communication between different groups relative to overall communication.

Benchmark. To evaluate performance, we select three multi-NN workloads, as listed in Table III [25], [33], [35], [37], [38]. Different workloads are constructed to represent various scenarios. Workload A represents a practical multi-NN task in an unmanned robot [7], [34], which contains four different kinds of neural models: a modified convolutional neural network based on Tiny-Yolo [35] (Tiny-Yolo(M)), a recurrent neural network [33] (RNN), a spiking neural network (SNN), and a multilayer perceptron (MLP). Workload B combines typical convolutional neural networks from MLPerf [36]. Workload C, an image classification task proposed in [10], is used to evaluate the performance of the multi-NN accelerator. The evaluation results in this paper are all based on these workloads. Some comparisons lack Workload A because some platforms do not support the LIF model.
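As defined, IGC is simply the fraction of communication volume that crosses group boundaries; a minimal sketch (ours, with hypothetical transfers):

```python
def igc(comms, group_of):
    """comms: iterable of (src_core, dst_core, n_bytes);
    group_of: mapping core -> group id."""
    total = sum(b for _, _, b in comms)
    inter = sum(b for s, d, b in comms if group_of[s] != group_of[d])
    return inter / total if total else 0.0

# Two transfers: 64 B inside group 0, 32 B crossing into group 1 -> IGC = 1/3.
print(igc([(0, 1, 64), (0, 5, 32)], {0: 0, 1: 0, 5: 1}))
```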

TABLE III
WORKLOADS
Methodology. Fig. 11 shows the experimental hardware setup for configuring the chip and measuring its performance. The chip evaluation board (red box) is connected to an Intel Arria 10 FPGA through a SERDES interface for configuration and data transmission, and through general-purpose IO for control-signal triggering. The FPGA test board is attached to a host computer through Ethernet for performance-data collection. In the next section, the throughput results compared against the baselines are all generated by this chip-measurement system.
We also develop a cycle-accurate simulator of the chip, with the HASP and Instant PI mechanisms, to evaluate metrics unavailable from chip measurement, including utilization and inter-group communication quantity. To validate the simulator, Fig. 12 compares the latency reported by the simulator and the chip; in most cases the error is within 10%. In the following parts, the utilization results are generated by the simulator.
Moreover, we implement and integrate the HASP mechanism on a GPU based on the hardware simulator GPGPU-Sim 4.0 [39]. The configured reference GPU is a Turing RTX 2060 with 30 SIMT cores (shaders) and 12 memory partitions.
Baselines for HASP. To illustrate the effectiveness of HASP, we perform an ablation experiment between HASP, GSP, and BSP. Our chip can directly perform this evaluation thanks to the configurable trigger module.
Baselines for our HASP-integrated chip: GPUs. We evaluate our chip's performance on real-world application workloads against two kinds of widely used GPUs, especially for edge and mobile [40]: (1) an NVIDIA Jetson AGX Orin, a state-of-the-art mobile-oriented GPU, and (2) an inference server equipped with Intel Xeon E5-2680 CPUs and an NVIDIA RTX 2080Ti GPU. We program the GPUs to run the NNs in parallel for all workloads. For Workload A, we choose PyTorch to implement the NNs on the GPUs because the MLP-LIF is incompatible with inference-acceleration techniques such as TensorRT; however, PyTorch cannot fully leverage the computational capabilities of GPUs for inference. To enhance the fairness of comparison, we implement the NNs of Workloads B and C with INT8 precision in TensorRT to optimize inference performance on the GPUs. We adopt software parallelization strategies on the GPUs, including Multi-stream and MPS; the list of configuration modes can be found in Table IV. Two approaches are adopted to implement multi-NN with Multi-stream. The first executes the NNs in multiple processes, resulting in context switches on the GPUs and temporally shared computing resources. The second runs the NNs within a single CPU process using non-blocking CUDA stream and event mechanisms, including querying multiple NNs asynchronously, so the resources are spatially shared. For the MPS modes, we run the NNs in multiple processes and set the resource ratio to {ResNet50 50%, Tiny-Yolo(M) 25%, MobileNet-V1 25%} for Workload B and {EfficientNet-B0 50%, MobileNet-V1 50%} for Workload C, to avoid the unstable performance of MPS's default mode.
Baseline for our HASP-integrated chip: Planaria. We also compare our chip with Planaria [10], which supports multi-NN execution via a spatial parallelization strategy. However, its INFerence-as-a-Service (INFaaS) scenario differs from ours; thus, we change the task arrival time to a fixed interval close to our scenarios. Planaria adopts 16,384 PEs and operates at a frequency of 700 MHz. Due to the resource limit of one chip, we map large NNs such as ResNet50 onto multiple chips connected by SERDES. For a fair comparison, we normalize the ratio of the PE number and frequency of Planaria to the MAC number and frequency of our chip.

B. HASP Mechanism-Level Performance Evaluation
This section analyzes Fig. 13 to present the low hardware cost of the HASP implementation. The simulator generates the results in Figs. 14 and 15 to illustrate the profits of the HASP mechanism. Finally, we evaluate the performance of the workloads on the prototype chip to exhibit the overall efficiency of our HASP-integrated accelerator.
HASP hardware overhead and chip performance. Fig. 13 presents the area overhead of HASP. The trigger module for hierarchical grouping is integrated outside the cores, in the 5% portion shown on the left. Of these components, only 1% (0.05% of the whole chip) is used for the HASP implementation, showing that such a mechanism can be added to custom architectures at low cost.
The performance of our prototype chip is summarized in Table V, which verifies that the chip is a typical multicore system with a high density of computing power (4.81× TOPS/mm² over Simba). The power is limited to 5 W, which fits mobile applications well.
Performance analysis of HASP. One motivation of HASP is to improve the utilization of the mapped tasks by suppressing redundant waiting time, further elevating throughput. In Fig. 14, we compare HASP with BSP and GSP. Fig. 14 illustrates the workflow of different Core Groups in one step for Tiny-Yolo(M) in Workload A under the HASP and BSP models, respectively. In BSP, all cores belonging to the same task are forced to synchronize after each phase, which results in delay overheads.
On the other hand, the GSP introduced in Section III is evaluated in this practical application by comparing the latency of each Core Group. In GSP, the overall latency of Tiny-Yolo(M) is extended to 2.79×; the latency would grow up to 52.62×, 12.21×, and 9.37× for GRU, MLP-LIF, and MLP, respectively, if synchronization between NNs were mandatory in multi-NN execution.
Utilization and communication evaluation of the mapping policy. The results in Fig. 15 show that our policy guarantees a remarkable $U_h$ of more than 90% for most NNs, but the optimization of $U$ is more complicated. We also evaluate the IGC of each task. The results reveal that the $U$ of most workloads is less than 80%, and IGCs are limited to 40%.
For further analysis, the result of $U_h$ indicates that the hardware resources within each Core Group are used as much as possible in each phase. Nevertheless, from a macro view, $U$ reflects the pipeline uniformity of Core Groups at different stages. In Fig. 15, the results of $U_h$, IGC, and $U$ show that our policy prioritizes low latency and low IGC within each task. Although the limited uniformity may become a bottleneck for throughput, we believe this problem can be handled in different scenarios: for embedded systems, core-level clock gating can reduce redundant power consumption, so a low $U$ can be tolerated.
Another insight concerns the dark-silicon effect [41]: a large number of cores may not be utilized effectively and perpetually because of power limitations, especially in multicore chips. In this paper, we propose the definitions of $U$ and $U_h$ to decouple the different levels of utilization constraints.

C. System-Level Performance Analysis With HASP Integration
Comparison with GPUs. Fig. 16(a) presents a comparison of the throughput of multi-NN execution on GPUs and our chip for Workload A. For Workload A, the most comprehensive workload, used in a real unmanned system [7], our chip improves throughput by 5.70× and 3.84× for Tiny-Yolo(M), and by 1.97× and 1.85× for GRU, compared with Orin and the RTX 2080Ti, respectively. Fig. 16(b) shows that Workload B achieves 2.61× and 8.68× enhancement for ResNet50 and MobileNet-V1 on Orin (mode 1), respectively. Similarly, against Orin (mode 4) and RTX 2080Ti (mode 1), the improvements are 1.68× and 3.04×, and 1.22× and 4.02×, for ResNet50 and MobileNet-V1. For EfficientNet-B0 in Workload C, our chips deliver 3.68×, 2.34×, and 1.72× better throughput than Orin (mode 1), Orin (mode 4), and RTX 2080Ti (mode 1), respectively. Additionally, we analyze the overhead of the parallelization strategies on GPUs (Fig. 17). In this experiment, each NN has exclusive access to the GPU resources in the serial scenario to achieve its best performance, and the performance degradation is calculated as the throughput ratio between the serial and parallel scenarios. The throughput of each NN in parallel and serial execution is shown in the left and right bars, respectively, with the performance degradation marked on each bar. The results reflect the fact that temporal multi-tasking (modes 1, 3) usually causes large context-switch overhead and computational-resource underutilization, while spatial multi-tasking (modes 2, 4, 5, 6) is limited by unisolated memory access, resource contention, and software overheads. Hence, current GPU technologies cannot guarantee the multi-NN performance needs of mobile and edge systems.

Comparison with Planaria. Fig. 18 compares the throughput of our chips with Planaria across Workloads B and C. Workload A is not included in this comparison because the mapper of Planaria does not support MLP-LIF. For Workload B, a typical CNN performance benchmark, our chips improve throughput by 3.51× and 3.62× for ResNet50 and MobileNet-V1, respectively. Although Planaria can execute the depth-wise convolutions in EfficientNet-B0 and MobileNet-V1 with competitive performance, as mentioned in [10], the throughput of these two NNs on our chips outperforms Planaria by factors of 1.12 and 1.94. The results indicate that Planaria yields unbalanced multi-NN performance as a result of its dynamic scheduling, which requires frequent weight reloading from global buffers, making it difficult to meet the multi-NN performance requirements of mobile situations that demand long-term model deployment, such as autonomous machines.

VII. DISCUSSION
The preceding sections focus on design considerations for the MIMD architecture because we believe that, for multi-NN tasks, the MIMD architecture has greater potential to address the corresponding challenges. Therefore, we implement a full-stack design for the MIMD architecture, including the model, the hardware, and the mapping strategy.
However, HASP can improve multi-NN performance not only on MIMD architectures but also on SIMT architectures. Since shared resources (such as shared global memory) can be the bottleneck of multi-NN workloads on GPUs, HASP's per-task resource isolation has the potential to optimize the performance of multi-NN tasks on GPUs. In this section, we use the implementation of HASP on GPUs
as an example to analyze how to isolate NNs on architectures with more complex memory hierarchies, and to demonstrate the generality of HASP for heterogeneous multi-tasking acceleration on multicore architectures. Three differences of the HASP implementation on SIMT compared to MIMD are listed below: 1) Dynamic resource allocation. In contrast to the distributed execution of MIMD, the tasks performed by each core (shader in a GPU) in the SIMT architecture are dynamically assigned in a temporal mapping style. 2) Memory hierarchy. SIMT architectures often have complicated memory hierarchies, including shared memory and a last-level cache. 3) Memory mapping scheme: blocked or interlaced. To adapt to the dynamic resource allocation, the trigger module in the SIMT architecture needs to record each shader's running state in real time, unlike the MIMD architecture, where the trigger targets a specific core. To address this challenge, we design a function-specific interface in the CUDA run-time library. CUDA programmers can explicitly specify the number of shaders and memory partitions (MPs) required for each task and then call the proposed interface before task allocation. At the same time, the trigger module records the occupancy status of each shader and memory partition so that resources can be allocated in real time.
Fig. 19. HASP-enabled memory partitioning: addresses 0x0-0x400 with memory partitions 3-5 for task 1, and 0x400-0x900 with memory partitions 0-2 for task 2.

In current GPUs, as shown in Fig. 19(a), the last-level cache is shared by the SIMT cores (shaders) and blocked into several MPs, and the memory address mapping assigns successive regions of memory to successive memory partitions [39]. We modify this memory mapping to realize memory privatization.
The privatization of a memory partition requires the task to be bound to a specified area of the address space. Fig. 19(b) illustrates the HASP-based mapping scheme, implemented by changing the GPU architecture in GPGPU-Sim. The core idea is to change the mapping of address spaces to MPs from an interlaced interval accessible to all SIMT cores into a specified, private space accessible only to the SIMT cores of the corresponding task. For each block of memory, users simply need to specify the owning task when calling cudaMalloc.
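A minimal sketch (ours, mirroring the change described above rather than GPGPU-Sim's actual code) of the two address-to-partition mappings, with illustrative sizes and partition assignments taken from Fig. 19:

```python
N_MP, LINE = 6, 256                      # memory partitions, bytes per line

def interlaced_mp(addr):
    """Default mapping: successive lines stripe across all partitions,
    so every SIMT core's traffic touches every partition."""
    return (addr // LINE) % N_MP

TASK_MPS = {1: [3, 4, 5], 2: [0, 1, 2]}  # MPs 3-5 private to task 1, 0-2 to task 2

def private_mp(addr, task):
    """HASP mapping: a task's addresses stripe only over its own partitions."""
    mps = TASK_MPS[task]
    return mps[(addr // LINE) % len(mps)]

print(interlaced_mp(0x500))              # 5: any core may hit partition 5
print(private_mp(0x500, task=2))         # 2: task 2 stays inside MPs 0-2
```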
Fig. 20 presents an exploration process on the HASP-enabled GPGPU with a simplified multi-NN demo: S1 is a 1000-item vector addition task with 30 repeats, and S2 is a 500-item vector multiplication task with 10 repeats. We use Multi-stream to emulate multi-tasking execution due to the context limitation of the current GPGPU-Sim version. First, we perform HASP-enabled MPS (shared memory partitions, isolating the shaders only; colored blue) to find the shader partitioning with the lowest latency; then the optimal memory partition ratio is searched under this shader allocation (colored purple).
HASP avoids the cache misses caused by different tasks sharing the last-level cache. In MPS, each task monopolizes the last-level cache during memory access, forcing the data of other tasks to be swapped out. Consequently, a suitable memory partitioning strategy with HASP can yield greater performance profits than MPS.
It is worth noting that our HASP-based experiments show that an extremely unbalanced memory partitioning approach can increase cache misses within a task if the cache allocated to it is too limited. On the one hand, it is important to explore the appropriate split of memory and computing resources on GPUs. On the other hand, compared with the SIMT architecture, the MIMD architecture, with its private memory, inherently avoids the cache-partitioning difficulty in multi-NN tasks. Our studies on the trend and advantages of spatial multi-tasking and pipelining also reveal the performance-improvement potential of MIMD; thus, we believe such architectures for complex artificial-intelligence applications hold substantial research value for the future.
Another noteworthy point is that an important contribution of HASP is to emphasize the importance of hardware-supported isolation and load balancing for multi-NN tasks. This allows multi-NN compilation optimization methods to inherit past work; otherwise, the resource competition of multiple tasks can open a huge gap between compiler-expected and actual performance.

VIII. RELATED WORKS
The growing demand for accelerating AI algorithms has led to extensive research on NN acceleration hardware, from industry [42], [43] to academia [2], [44], [45]. This paper lies at the intersection of multi-NN workload execution and multicore architectures for NN acceleration. Below we discuss the related work by category.
Multi-NN accelerators. In the multi-tenant scenario, PREMA [8], the first work to explore multi-NN execution, designed a preemptive scheduling algorithm with inference-time estimation. AI-MT [9] further proposed a more cost-effective scheme by scheduling memory blocks and merging computing blocks. In contrast to these temporal multi-NN supports, Planaria [10] focused on spatial multi-tasking. Considering multi-NN in mobile and edge scenarios, Herald [11] proposed a heterogeneous architecture combining different types of dataflow for multi-NN workloads. However, none of these works achieves interference-freedom and fine-grained flexibility at the same time.
Single-NN accelerators. A large body of work focuses on improving either throughput or energy efficiency in DNN acceleration architecture design. Several start-ups have dedicated themselves to this roadmap in recent years, such as Graphcore [43], Cerebras [20], and Tenstorrent [21], among others. However, only a few published reports describe the multi-NN support of these systems.

IX. CONCLUSION
In all, HASP is a systematic hardware and software solution. To the best of our knowledge, this paper is the first to identify the fundamental contradictions among the design requirements for multi-NN acceleration, offering a systematic scheme that copes with interference-freedom and fine-grained flexibility of multi-NN at the same time and at minor cost.
Two key technical ideas are responsible for HASP's ability to solve this problem: (1) exposing a semi-transparent hardware configuration interface to the software, enabling run-time isolation configuration, and (2) reducing the performance loss due to synchronization while ensuring programmability through coarse-grained asynchronous execution. Both techniques allow HASP to decouple compile-time and run-time multi-NN optimization through the interference-free feature while maintaining flexibility and enabling performance tuning.
The evaluation results reveal that HASP achieves execution isolation between tasks and reduces extra waiting in processing, thus improving resource utilization and throughput.

Fig. 2. Temporal and spatial selection decided by computing resources. (a) illustrates the performance of multi-NN tasks in temporal/spatial sharing modes. (b) explains the limits of performance gains as computational resources increase, and (c) further demonstrates the advantage of spatial sharing when resources are sufficient.

Fig. 3. The first level of HASP: NN-level isolation. (a) Example of tasks with layer-level latency variation. (b) and (c) illustrate the corresponding execution flow based on BSP and GSP.

Fig. 6. The second level of HASP: subset synchronization for load balancing.

Fig. 7. The architecture of the prototype chip.

Fig. 10. Mapping policy illustration.

Algorithm 1: Mapping Policy for Multi-NN
Input: Task/NN set $S = \{G_i\}$; computing graph $G_i = (v_i, e_i)$, where $v_i$/$e_i$ are the vertex/edge sets of $G_i$ representing operands/data.
Output: Primitive configuration of each core: ConfigList.
Definitions: $G^{(i)}_{PI}$: computing graph with primitive annotation; $G^{(i)}_{DAG}$: translated directed acyclic graph (DAG) of $G_i$.

Algorithm 2: Block Separating Algorithm
Definitions: $G_{DAG}$: the DAG of $G$ after abstracting rings in $G$ into SuperVertexes; $v.corenum$/$v.num$: the number of cores/memory size used to compute vertex $v$; CoreMem: core memory size (144 KB in this paper).
function BLOCKSEPARATE(Graph :

Fig. 15. IGC and utilization of workload mappings generated by the proposed mapping policy.

Fig. 16. Throughput comparison of parallel multi-NN with GPU systems at batch size one. (a) System-level comparison for Workload A. (b) Mechanism-level comparison based on Workloads B and C.

TABLE I
SUMMARY OF PREVIOUS MULTICORE ACCELERATORS AND MULTI-NN ACCELERATORS

Work              | Architecture                 | Multi-tasking
Multi-stream [12] | SIMT (GPU supporting CUDA 7) | Mainly Temporal
MPS [12]          | SIMT (GPU after Volta)       | Spatial
MIG [12]          | SIMT (A100 / H100)           | Spatial
PREMA [8]         | SIMD (Systolic Array)        | Temporal
AI-MT [9]         | SIMD (Systolic Array)        | Temporal
Planaria [10]     | SIMD (Systolic Array)        | Spatial
Herald [11]       | SIMT (Multi-accelerator)     | Spatial
f-CNNx [14]       | Customized                   | Spatial
Simba [15]        | MIMD                         | /
IPU [16]          | MIMD                         | /
Our Target        | Mainly MIMD, SIMT Compatible | Spatial

TABLE IV
GPU MODES FOR COMPARISON (MS: MULTI-STREAM, H: HOST, D: DEVICE)