An ASIC Accelerator for QNN With Variable Precision and Tunable Energy Efficiency

This article presents TULIP, a new architecture for variable-precision quantized neural network (QNN) inference. It is designed with the goal of maximizing energy efficiency per classification. TULIP is constructed by arranging a collection of unique processing elements (TULIP-PEs) in a single-instruction–multiple-data (SIMD) fashion. Each TULIP-PE contains binary neurons that are interconnected using multiplexers. Each neuron also has a small dedicated local register connected to it. The binary neurons are implemented as standard cells and used for implementing threshold functions, i.e., an inner-product and thresholding operation on their binary inputs. The neurons can be reconfigured with a single change in the control signals to implement all the standard operations used in a QNN. This article presents novel algorithms for implementing the operations of a QNN on the TULIP-PEs in the form of a schedule of threshold functions. TULIP was implemented as an ASIC in TSMC 40nm-LP technology. A QNN accelerator that employs a conventional multiply and accumulate-based arithmetic processor was also implemented in the same technology to provide a fair comparison. The results show that TULIP is $30\times$ to $50\times$ more energy-efficient than an equivalent design, without any penalty in performance, area, or accuracy. Furthermore, TULIP achieves these improvements without using traditional techniques such as voltage scaling or approximate computing. Finally, this article also demonstrates how a run-time tradeoff between accuracy and energy efficiency can be made on the TULIP architecture.

machine learning. DNNs are computationally and energetically intensive algorithms that perform billions of floating-point multiply-accumulate operations on very high-dimensional datasets, some involving tens of billions of parameters [1].
Because training of large networks entails much greater computational effort and storage than inference, it is performed on high-performance servers with numerous CPU and GPU cores.
The energy cost and the environmental impact of training and inference of large DNNs are fast becoming unsustainable. For instance, training of the GPT-3 model with 175B parameters using NVidia's A100 with 1024 GPUs would consume 936 MWh of energy and take 34 days at a cost of $4.6M. Models even larger than GPT-3 are being developed [1].
The need for improved energy efficiency of DNNs is not limited to high-performance servers or desktop machines. The latest midrange and high-end mobile SoCs [2], [3] are being equipped with custom NN hardware accelerators to perform inference on mobile devices (mostly smartphones) and edge devices (e.g., IoT devices deployed in numerous spaces) for many of the above applications. The energy efficiency of inference on battery-powered devices is also of critical importance in terms of value to the customer and environmental impact. Given the rapid proliferation of ML techniques, an improvement of several orders of magnitude in energy efficiency over CPU-GPU implementations for training and inference of DNNs is needed for ML technology to be sustainable.
FPGAs and ASICs are the two alternatives to CPU-GPU implementations. The energy efficiency and throughput of FPGA implementations of DNNs lie between those of ASICs and CPU-GPUs [4], [5]. Past and ongoing works on executing DNNs on FPGAs include the development of optimizing compilers that automatically map DNNs expressed in standard frameworks onto FPGAs with the objective of minimizing latency or maximizing throughput subject to constraints on energy, memory bandwidth, and the number of DSPs [6], [7], [8], [9], [10].
ASIC implementations have orders of magnitude higher energy efficiency than CPU-GPU implementations. Analog and mixed-signal solutions implement the inner product of fixed-weight matrices and input vectors by summing currents in crossbar arrays, where the weights are realized by various types of resistive elements (ReRAM [11], MTJ [12], and Flash [13]). This approach continues to be an active area of ongoing research, driven by the constant introduction of novel nonvolatile multistate memory devices.
1937-4151 © 2024 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
Purely digital ASIC implementations are constructed by synthesis of custom logic blocks for specific operations, such as 2-D convolution, inner product, matrix multiplication, and others [14], [15], each optimized for throughput and energy efficiency. A common approach to maximizing energy efficiency is to pare down the functionality of the circuit, e.g., eliminate processing of the integer layers as in XNORBIN [16], or reduce the size of the kernels and/or reduce the bitwidth of operands as in [17] and [18].
A few architectures, however, support complete end-to-end neural networks, i.e., convolution layers and fully connected layers, such as YodaNN [14] (which supports 12-bit inputs and 1-bit weights), UNPU [19] (which supports 1-16 bit variable precision inputs and weights), and BitBlade [20] (which supports bit precision of 2, 4, and 8 for inputs and weights). They also support variable-sized kernels. The UNPU [19] architecture is based on bit-serial processing of the weights with the activations to compute the partial products. It uses lookup-table-based processing elements (LBPEs) and the overall architecture is designed to perform dense matrix multiplications with high parallelism. The independent DNN cores in the UNPU use a fixed-size accumulator to obtain partial sums, which are finally processed in a separate 1-D single-instruction-multiple-data (SIMD) core. The SIMD core performs vector operations such as nonlinear activation or element-wise multiplication to generate the final output. The UNPU includes a RISC controller to orchestrate the intercore communication via an NoC during the DNN operation.
The processing element (PE) in BitBlade [20] introduces a new bitwise summation scheme that reduces the shift and add logic in the PE to reduce overall area and power consumption. The PE of BitBlade consists of 16 2-bit multipliers and summation logic to perform the multiply and accumulate (MAC) operation. The operands are decomposed into chunks of 2 bits to utilize the 2-bit multipliers. The other operations of the quantized neural network (QNN), such as linear activation, are performed using either additional dedicated logic or the host CPU.
YodaNN [14] is an SIMD processor that consists of an array of MAC PEs with on-chip standard-cell memory (that can be synthesized). The design presented in this article, named TULIP, is also a complete end-to-end system. A detailed comparison of YodaNN and TULIP is presented in Section VIII.
Regardless of whether it is an FPGA or ASIC implementation, throughput and energy efficiency can also be improved by modifying the structure of the NN. This includes tuning the hyperparameters [21], [22], modifying the network structure by removing weights and connections [23], [24], or altering the degree of quantization [25], [26]. Another category of methods focuses on reducing the huge energy expenditure for moving data between the processor and off-chip memory, which is especially acute in NNs because of the large number of weights involved. The techniques to mitigate this include maximizing the reuse of data fetched from memory [27], [28], or transferring compressed data from the memory to the processor [29], [30].
Of the many available techniques for modifying NN structure, quantization remains the best way to achieve high energy efficiency and reduce computation time [31], [32], especially for energy-constrained systems. Quantization refers to using smaller bit-widths for the weights and/or the inputs during training, reducing them from 32-bit values to anywhere from 8-bit to 1-bit values. The term BNN refers to neural networks with 1-bit weights and inputs. Anything larger than that, but below full 32-bit precision, is referred to as a QNN. Quantization takes advantage of the fact that the accuracy of NNs is not very sensitive to substantial reductions in bit-widths until some critical value. Depending on the network, 4-bit to 1-bit QNNs for mobile applications provide an excellent tradeoff between energy efficiency and throughput versus accuracy [32].
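To make the notion of bit-width reduction concrete, the following is a minimal sketch of a uniform symmetric quantizer. It is only an illustration: the function name `quantize_uniform`, the clipping range `max_abs`, and the choice of a symmetric post-hoc quantizer are assumptions made here, and the article itself refers to quantization applied during training, which this sketch does not model.

```python
def quantize_uniform(values, bits, max_abs=1.0):
    """Map real values in [-max_abs, max_abs] to signed integer codes.

    Hypothetical helper for illustration only; not the quantization
    scheme used to train the networks evaluated in the article.
    """
    if bits == 1:
        # Binary case (BNN): keep only the sign, encoded as -1 / +1.
        return [1 if v >= 0 else -1 for v in values]
    levels = 2 ** (bits - 1) - 1  # largest positive code, e.g. 7 for 4 bits
    codes = []
    for v in values:
        clipped = max(-max_abs, min(max_abs, v))  # saturate out-of-range values
        codes.append(round(clipped * levels / max_abs))
    return codes


if __name__ == "__main__":
    weights = [1.0, -1.0, 0.0, 0.3]
    for b in (4, 2, 1):
        print(b, quantize_uniform(weights, b))
```

Lower bit-widths collapse nearby values onto the same code, which is the source of both the energy savings and the eventual loss of accuracy.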

II. OVERVIEW OF THIS ARTICLE
This article presents the design of an ASIC for accelerating QNNs. The design, named TULIP, achieves substantial improvements in energy efficiency compared to the state-of-the-art design of QNNs [14]. Energy efficiency is defined as throughput-per-watt, or equivalently, operations-per-joule.
Fig. 1 shows the main components of TULIP. The following is a summary of these components, which will be elaborated upon in the subsequent sections.
1) Fig. 1(a) shows the top-level system diagram of TULIP. It is a scalable SIMD machine that consists of a collection of independent, concurrently executing TULIP processing elements (TULIP-PEs), shown in Fig. 1(b). The architecture of the TULIP-PE is very different from the PEs used in other QNN accelerators [11], [14], [18], [33]. It consists of a small network of binary neurons, whose circuit structure is shown in Fig. 1(c) and described in greater detail in Section III.
2) Briefly, a neuron is a clocked logic cell that computes a threshold function of its inputs on a clock edge. It is a mixed-signal circuit whose inputs and outputs are logic signals, but internally it computes the inner-product and threshold operation of a neuron, i.e., $f(x_1, \ldots, x_n \mid w_1, \ldots, w_n, T) = \left[\sum_{i=1}^{n} w_i x_i \geq T\right]$. Implemented as a standard cell, and after being optimized for robustness and accounting for process variations, a neuron in 40 nm is just a little larger than a conventional D-type flip-flop [34]. The neurons in a TULIP-PE can be configured at run-time to execute all the operations of a QNN, namely the accumulation of partial sums, comparison, max-pooling, and ReLU. Consequently, only a single type of PE is required to implement all the operations in a QNN, and switching between operations is accomplished by supplying an appropriate set of logic signals to its inputs, which incurs no extra overhead in terms of area, power, or delay.
3) Unlike conventional MAC [14] or fixed-size accumulator-based [19] PEs, which are designed to operate at maximum bit-width (determined at design-time), the bit precision of TULIP-PEs can be changed within a single cycle without incurring a delay or energy penalty. The TULIP-PEs enable control over the precision of both inputs (weights and activations) and outputs, unlike [19]. The operation of TULIP-PEs prevents overprovisioning of the hardware for an operation of a certain bitwidth, thus improving the energy efficiency of the overall computation. This characteristic allows for making tradeoffs between energy efficiency and accuracy at run-time.
4) Compared with the state-of-the-art MAC units used in QNN accelerators [14], the TULIP-PE is ≈ 16× smaller and consumes 125× less power. Although it is 9.6× slower, this can be compensated by replicating 16 PEs and operating TULIP in an SIMD mode, executing multiple workloads in parallel that share inputs, which reduces the need to repeatedly fetch data from off-chip memory.
5) Since the neurons in the TULIP-PE have limited fan-in, much larger inner product calculations have to first be decomposed into smaller bit-width operations and then scheduled on the TULIP-PEs. For this, a novel routing-aware, resource-constrained scheduling algorithm is presented that maps the nodes of a QNN onto TULIP-PEs.
6) The combined effect of the low area of the TULIP-PE, the uniform computation at the individual node and network levels, and the mapping algorithm results in an improvement of up to 50× in energy efficiency for QNNs over a MAC-based design for the same area and performance.
This article is organized as follows. Sections III, IV, and VI describe the architecture of the binary neuron, the TULIP-PE, and the top-level architecture of TULIP, respectively. Section V presents the scheduling algorithm needed to execute each node of the QNN on a TULIP-PE. Section VII then describes how the small size of the TULIP-PEs enables us to deploy a number of them in the same space as a conventional processing unit, thereby enabling better weight reuse. Finally, Section VIII presents both a quantitative and a qualitative evaluation of TULIP-PEs and the TULIP architecture against equivalent state-of-the-art architectures.
Note that we presented a preliminary version of this work in [35]. This article includes an updated hardware architecture that extends support for varying precision of QNNs while also significantly improving the overall energy efficiency. We perform an extensive evaluation using multiple NNs and datasets to demonstrate the efficacy of the updated TULIP. This article also provides a generalized formulation for mapping arbitrary compute graphs to the TULIP-PEs.

III. BACKGROUND
A Boolean function $f(x_1, x_2, \ldots, x_n)$ is called a threshold function if there exist weights $w_i$ for $i = 1, 2, \ldots, n$ and a threshold $T$ such that
$$f(x_1, x_2, \ldots, x_n) = 1 \iff \sum_{i=1}^{n} w_i x_i \geq T \qquad (1)$$
where $\sum$ denotes the arithmetic sum. A threshold function is denoted by the pair $(W, T) = [w_1, w_2, \ldots, w_n; T]$. A binary neuron is a circuit that realizes a threshold function defined by (1). Fig. 1(c) shows the design of the binary neuron that is used in TULIP. A detailed description of its operation, the algorithms for optimizing its robustness, performance, power, and area, and its use in ASIC synthesis appear in [34]. As the design of the binary neuron is not the focus of this article, only a summary of its operation is presented here.
The binary neuron shown in Fig. 1(c) has four main components: 1) the left input network (LIN); 2) the right input network (RIN); 3) a sense amplifier (SA); and 4) an output latch (LA). The SA outputs are differential digital signals (N1, N2), with (1, 0) and (0, 1) setting and resetting the latch. The LIN and RIN consist of a set of branches, each branch consisting of two devices in series: one (labeled Z) that provides a configurable conductance between its two terminals, and a MOSFET driven by an input signal $x_i$. The conductance of a branch controlled by $x_i$ serves as a proxy for the weight $w_i$ in (1). Let $G_L(X|W)$ and $G_R(X|W)$ denote the conductance of the LIN and RIN, respectively. For a given threshold function $f$, the conductance of each branch is configured so that $G_L(X|W) > G_R(X|W)$ for all on-set minterms of $f$, and vice versa for all off-set minterms of $f$.
When CLK = 0, the LIN and RIN play no role, (N1, N2) = (1, 1), and the output Y of the latch remains unchanged. Before the clock rises, inputs are applied to the LIN and RIN. Suppose that an on-set minterm is applied. When CLK goes from 0 to 1, both N1 and N2 will start to discharge. However, since $G_L(X|W) > G_R(X|W)$, N1 will discharge much faster than N2, which will also turn off the discharge of N2, resulting in N2 going back to 1. The result is (N1, N2) = (0, 1), which will set the latch output Y = 1. Thus, the binary neuron in Fig. 1(c) may be viewed as a multi-input, edge-triggered flip-flop that computes a threshold function of its inputs on a clock edge. Note that there are a number of choices for realizing the configurable conductance devices, which are explored in [34], [37], and [38].
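The view of the binary neuron as an edge-triggered flip-flop that latches a threshold function can be sketched behaviorally as follows. This is a purely logical model under stated assumptions: the class name `BinaryNeuron` and its methods are hypothetical, integer weights stand in for the branch conductances, and none of the analog behavior (discharge rates, sense amplifier) is modeled.

```python
class BinaryNeuron:
    """Behavioral sketch of a clocked threshold-logic cell (illustrative only)."""

    def __init__(self, weights, threshold):
        self.weights = list(weights)
        self.threshold = threshold
        self.y = 0  # latch output; holds its value between clock edges

    def evaluate(self, inputs):
        """Threshold function: 1 if the weighted sum reaches the threshold."""
        if len(inputs) != len(self.weights):
            raise ValueError("one input bit is required per weight")
        weighted_sum = sum(w * x for w, x in zip(self.weights, inputs))
        return int(weighted_sum >= self.threshold)

    def clock_edge(self, inputs):
        """Model a rising clock edge: latch the threshold function of the inputs."""
        self.y = self.evaluate(inputs)
        return self.y


if __name__ == "__main__":
    majority = BinaryNeuron([1, 1, 1], 2)   # [1, 1, 1; 2] is the 3-input majority
    print(majority.clock_edge([1, 0, 1]))   # two of three inputs are 1
    print(majority.y)                       # output is held until the next edge
```

The same cell realizes AND with $[1, 1; 2]$ and OR with $[1, 1; 1]$, which is the sense in which changing only the applied signals or weights changes the function.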

A. Primitive Operations
The TULIP-PE is designed to implement the nodes of all the layers in a QNN. This is achieved by decomposing the node's operations (multiplication, ReLU, etc.) into K-bit primitive operations. These are addition, comparison, or logic operations that are executed in at most two cycles. They are realized as threshold functions and computed by artificial neurons. N-bit (N > K) operations are executed as a sequence of K-bit operations. In this section, we describe the representation of the primitive operations as threshold functions. The following notation will be used to describe single and multibit values.
1) Characters (e.g., $A$ or $A_0$, etc.) without dimensions specified will denote variables that may either be a single-bit or a multibit value.
2) Square brackets (e.g., $A_{[0]}$, $A_{[K-1:0]}$, etc.) are used to represent bit vectors.
3) Characters having subscripts but no square brackets (e.g., $A_0$) denote single-bit variables.
4) Bit replication is denoted with the variable enclosed in curly braces with the multiplier in the subscript. For instance, $\{A_{[0]}\}_{\times N}$ represents an N-bit vector with all bits equal to $A_{[0]}$.
Equation (2) shows a template that is used to describe the primitive operations using threshold functions. In the expression, $p$ is an integer, $X$ and $Y$ are $p$-bit operands, and $Z_0$ and $Z_1$ are 1-bit values
$$Q(p, Z_0, X, Z_1, Y):\quad Z_0 + \sum_{i=0}^{p-1} 2^i X_{[i]} \;\geq\; Z_1 + \sum_{i=0}^{p-1} 2^i Y_{[i]}. \qquad (2)$$
1) Logic Operations: Primitive logic functions AND, OR, and NOT are threshold functions [36]. The corresponding logic operations on K-bit operands $A$ and $B$ are denoted as $L_K(A, B)$ (binary) or $L_K(A)$ (unary). They are realized as a vector of K threshold functions, one on each pair of corresponding bits (3). As an example, consider a 2-input AND operation between two 1-bit operands $A$ and $B$, which can be calculated using $Q(1, 0, A, 1, \bar{B})$, where $\bar{B} = 1 - B$ is the complement of $B$. By substituting appropriate values in (2), we get $0 + A \geq 1 + \bar{B}$, which in turn can be rewritten as $A + B \geq 2$.
Other K-bit logic operations are similarly defined. These can be computed in one cycle by a neuron cluster in a TULIP-PE. On the other hand, XOR is realized as a two-level threshold network and therefore requires two cycles. In terms of $Q$, it is derived as follows. The XOR operation is represented as the pseudo-Boolean equation $A \oplus B = A + B - 2AB$, which can be written in the form of the inequality $A + B - 2AB \geq 1$. Consequently, by using the representation in (2) and substituting the term $AB$ with (3), we get (4). For instance, an XOR operation between two 1-bit operands $A$ and $B$ can be rewritten using a combination of (4) and (2) as $A + B - 2\,[A + B \geq 2] \geq 1$.
While the carry-out function is a threshold function regardless of the size of the lookahead, the sum function $S_i$ is a threshold function of the carry-out $C_{i+1}$ and the carry-in $C_i$, as shown in Fig. 2. Hence, a K-bit addition, denoted by $ADD_K$, takes two cycles. $C_{i+1}$ and $S_i$ are expressed as
$$C_{i+1} = \Big[\, C_0 + \sum_{j=0}^{i} 2^j \big(A_{[j]} + B_{[j]}\big) \geq 2^{i+1} \Big] \qquad (5)$$
$$S_i = \big[\, A_{[i]} + B_{[i]} + C_i - 2C_{i+1} \geq 1 \,\big]. \qquad (6)$$
To illustrate, consider an addition operation involving three 1-bit operands $A$, $B$, and $C_0$. This operation can be computed using (5) and (6). When we substitute the appropriate values into (2), we obtain the following.
1) The carry bit $C_1$ can be expressed as $C_0 + A \geq 1 + \bar{B}$. This can be further rewritten as $A + B + C_0 \geq 2$.
2) The sum bit $S_0$ can be represented as $C_0 + A \geq \bar{B} + 2C_1$. This, in turn, can be rewritten as $A + B + C_0 - 2C_1 \geq 1$.
For N-bit operands (N > K), addition, comparison, logic, multiplication, and ReLU operations (among other operations) can be realized using K-bit primitive operations. Examples are shown in Fig. 3. These primitive operations can be executed sequentially on a TULIP-PE.
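The 1-bit examples in this subsection can be checked exhaustively with a few lines of code. The sketch below assumes that the template compares $Z_0 + X$ against $Z_1 + Y$ with the operands read as unsigned integers, and that the second operand is supplied in complemented form ($1 - B$); both are readings inferred from the worked AND example, and the helper names `q`, `and1`, `xor1`, and `full_add` are hypothetical.

```python
from itertools import product


def q(z0, x, z1, y):
    """Assumed reading of the template: 1 if z0 + x >= z1 + y, else 0."""
    return int(z0 + x >= z1 + y)


def and1(a, b):
    # 0 + A >= 1 + (1 - B)  is equivalent to  A + B >= 2
    return q(0, a, 1, 1 - b)


def xor1(a, b):
    ab = and1(a, b)                     # first level: the AND term
    return int(a + b - 2 * ab >= 1)     # second level: A + B - 2AB >= 1


def full_add(a, b, c0):
    c1 = q(c0, a, 1, 1 - b)             # carry: A + B + C0 >= 2
    s0 = int(a + b + c0 - 2 * c1 >= 1)  # sum: depends on the carry-out
    return c1, s0


def check_primitives():
    """Compare every threshold form against ordinary Boolean/integer arithmetic."""
    for a, b in product((0, 1), repeat=2):
        if and1(a, b) != (a & b) or xor1(a, b) != (a ^ b):
            return False
    for a, b, c0 in product((0, 1), repeat=3):
        c1, s0 = full_add(a, b, c0)
        if 2 * c1 + s0 != a + b + c0:
            return False
    return True


if __name__ == "__main__":
    print(check_primitives())
```

The two-level structure of `xor1` and of the sum bit in `full_add` mirrors why these primitives need two cycles while AND and the carry need one.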

B. Hardware Architecture of TULIP-PE
A TULIP-PE [Fig. 1(b)] contains four clusters, each cluster containing K neurons. The neurons in each cluster are labeled $N_\kappa$, where $\kappa$ is the fan-in of the neuron in a cluster (indexed left to right). The ith significant bit ($i \in [1, K]$) of a primitive operation is computed by the ith neuron of a cluster. Therefore, the fan-in needed for a cluster's ith neuron is determined by the maximum number of inputs needed to represent the threshold function corresponding to the ith bit of every primitive operation.
As shown in Fig. 1(b), multiplexers are used to connect each neuron to its external inputs, to its neighboring neurons, to its local registers (designed using latches), and to its own output (feedback). In the present implementation of the TULIP-PE, the weights associated with the neurons are chosen so as to allow the implementation of all the primitive operations by simply applying the appropriate signals to each neuron's inputs, and also to ensure that neuron $N_i$ can realize all the functions realizable by $N_j$, $j < i$.
TULIP-PE requires a minimum of four clusters to ensure a single cycle delay between the launch of any two consecutive primitive operations.Considering that each primitive operation can be represented as a two-level (or one-level) computation of threshold functions, only two clusters are needed to perform the computation at any given time (compute mode), while the remaining two clusters are needed to read operands from their respective local registers and share them with the first two clusters (routing mode).The clusters switch between the compute and routing modes depending on the local registers in which the operands are stored and the local register to which the output must be written.
Note that the number of bits that can be processed in each cycle increases with the number of neurons K in each cluster.The larger the K, the better the performance.However, as K increases, the maximum fan-in of the binary neurons in each cluster also increases.Since there is a maximum fan-in limitation of the binary neuron [34], in the present implementation of TULIP-PE, K = 5.
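The way an N-bit addition breaks down into K-bit primitive additions can be sketched as follows. The carry of each slice is computed as a single lookahead threshold test and each sum bit as a threshold of its carry-in and carry-out, in the spirit of the addition primitive described earlier. The function names `add_k` and `add_n` are hypothetical, the operands are unsigned, and the power-of-two weights used here are an idealization that ignores the actual weight and fan-in limits of the neurons.

```python
def add_k(x, y, c_in, k):
    """One K-bit primitive addition written with threshold tests (sketch).

    Returns (k-bit sum, carry-out). x and y must fit in k bits.
    """
    carries = [c_in]
    for i in range(k):
        mask = (1 << (i + 1)) - 1
        # Lookahead carry into bit i+1: c_in + x[i:0] + y[i:0] >= 2**(i+1)
        carries.append(int(c_in + (x & mask) + (y & mask) >= (1 << (i + 1))))
    total = 0
    for i in range(k):
        xi, yi = (x >> i) & 1, (y >> i) & 1
        # Sum bit: threshold function of the operand bits, carry-in and carry-out
        si = int(xi + yi + carries[i] - 2 * carries[i + 1] >= 1)
        total |= si << i
    return total, carries[k]


def add_n(x, y, n, k=5):
    """Add two n-bit unsigned numbers as a sequence of K-bit primitives.

    Returns (n+1-bit result, number of primitive launches).
    """
    result, carry, launches = 0, 0, 0
    for low in range(0, n, k):
        width = min(k, n - low)
        mask = (1 << width) - 1
        s, carry = add_k((x >> low) & mask, (y >> low) & mask, carry, width)
        result |= s << low
        launches += 1
    return result | (carry << n), launches


if __name__ == "__main__":
    print(add_n(200, 100, 8))  # an 8-bit addition needs two 5-bit slices
```

With K = 5, an 8-bit addition uses two slices and a 4-bit addition a single one, so a larger K directly reduces the number of primitive operations that must be scheduled.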

V. REALIZING QNN NODE ON TULIP-PE
A QNN is a directed acyclic graph (DAG), where each node either represents an inner product that involves a sum of multibit products, or a nonlinear activation function (e.g., ReLU, etc.).The multibit products are computed using multibit logic and addition operations (see Fig. 3), which are primitive operations that are performed by a network of neurons.Thus, at the lowest level of granularity, a QNN node is a network of threshold functions that must be scheduled on the neurons (the compute elements) in the TULIP-PE with the objective of minimizing the completion time, subject to the registers and the routing constraints.
The threshold graph scheduling (TGS) problem is the same as the well-studied problem of mapping a dataflow graph (DFG) of computations onto a coarse-grained reconfigurable array (CGRA). There is an extensive body of literature on CGRA architectures and scheduling computations onto them that spans more than two decades. A precise formulation of the CGRA scheduling problem first appeared in [39] and was shown to be NP-complete. In the Appendix, we present a precise formulation of TGS, which is the problem of scheduling a compute graph of threshold functions onto a specific network of neurons that constitute a TULIP-PE.
Since existing approaches to solve the above problem have exponential time complexity in the number of nodes in the compute and resource graphs, they do not scale well. In the following, we present an alternate approach that is efficient and scalable. This is done by increasing the granularity of the nodes in the compute and resource graphs, which results in a drastic reduction in their sizes. The nodes in the compute graph are now primitive operations and the compute units are now clusters. The mapping problem is further simplified because a new operation can be initiated on a cluster on every cycle, i.e., its initiation interval is one. We first compute the register-aware, minimum latency schedule of the primitive graph on the clusters and then configure the neurons on each cluster to compute the function of the primitive node assigned to each cluster. This is illustrated in Fig. 4, which shows a feasible mapping of a primitive graph to a resource graph. The compute graph in Fig. 4 ...

A. Scheduling Primitive Graph on TULIP-PE
Definition 1 (Primitive Graph $G_P(V_P, E_P)$): This is a DAG where each node $v \in V_P$ represents a K-input primitive operation, i.e., K-bit addition, comparison, or logic. Each edge $e \in E_P$ represents a data dependency between the primitive operations.
An integer linear programming (ILP) formulation is presented for the problem of scheduling a primitive graph that represents a single QNN node on a TULIP-PE. The design of the ILP is based on the high-level scheduling algorithm presented in [40], but differs from it because of the unique register and data routing constraints and the fact that the initiation interval of a cluster is one. Table I shows the notation used in the ILP formulation.
The primitive scheduling problem has to establish bindings between operations $v$, time steps $t$, and resources (local registers) $r$, since clusters store outputs in their respective local registers. Such bindings are represented using triple-indexed binary decision variables $\chi_{v,r,t}$. Using these variables, two additional binding variables are derived: $\rho_{v,r}$, which represents the mapping of $v$ to local register $r$, and $\tau_{v,t}$, which represents the mapping of $v$ to time $t$. These variables are used to express resource- and time-specific constraints, respectively.
There are $L$ local registers, each of size $B$ bits. The minimum time required to execute all the primitives on $G_R$ is $T = 2|V_P|$. With the goal of minimizing the makespan $E$ (execution time) of $G_P$ on TULIP-PE, the following constraints are needed to define the set of feasible solutions:
1) Resource Availability Constraints: These constraints are added to ensure that the local registers are not overutilized. The first constraint (12) ensures that the storage used by a local register $r$ at any time $t$ never exceeds the maximum capacity. The second constraint (13) ensures that each primitive's output is stored in only one local register.
2) Precedence Constraints: The constraint in (14) is added to ensure that the data dependency due to the precedence relationship between any two primitives $u$ and $v$ is satisfied in the schedule. For the schedule in Fig. 5, ...
3) Timing Validity Constraints: These constraints ensure that the start and end times of all the nodes are valid and feasible (15), and that the start times of any two nodes are not equal (16). The constraint in (18) is added to identify the end time of the last primitive that will be scheduled on TULIP-PE, so that it can be minimized in the objective function
$$\forall v \in V_P,\quad e_v \leq E. \qquad (18)$$

4) Routing Constraints:
The following constraints ensure that the data-routing capabilities of local registers are not violated, and are explained as follows. A local register can perform either a read or a write operation at any given time, but not both simultaneously. Therefore, two nodes that share an edge cannot be assigned to the same local register. In Fig. 5, since $u$ is the immediate predecessor of $v$, the output of $u$ is stored in a different local register than $v$. While the output of $u$ is read from a local register, the output of $v$ is simultaneously written to a different local register (19). Furthermore, two sibling nodes cannot be assigned the same local register. As shown in Fig. 5, $u$ and $v$ are immediate predecessors of $w$. Therefore, $u$ and $v$ cannot have the same local register. This constraint arises because the local registers supply only one operand to each primitive in TULIP-PE. As a result, we need two separate local registers to provide two operands (20)
$$\forall u, v \in V_P,\ \forall r \in [0, 3],\ u \prec v: \ \ldots$$

TABLE II NUMBER OF ILP DECISION VARIABLES AND THE RUN-TIME REQUIRED TO GENERATE SCHEDULE ON TULIP-PE. HERE, THE COMPUTE GRAPH IS A NEURON THAT COMPUTES $\sum_{i=0}^{N-1} w_i x_i$ FOR VARYING N

The mapping of the primitive operations to the clusters of TULIP-PE is determined by analyzing the decision variables $\rho_{v,r}$ and $s_v$. A node $v$ is executed on the cluster associated with the local register $r$ if $\rho_{v,r} = 1$, at the time specified by $s_v$. The stored data of $v$ is then maintained in the local register until time instance $e_v$. Table II shows the number of decision variables generated and the time required when using the ILP to generate the schedule of compute graphs of neurons that compute $\sum_{i=0}^{N-1} w_i x_i$ for a varying number of inputs (N). This is a one-time cost to obtain the schedule for QNN nodes on a TULIP-PE. The size of the largest neuron is N = 4096 in AlexNet [41].
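To give a feel for the kind of solution the scheduling step produces, the following is a small stand-in that assigns start times and local registers to a primitive graph. It is not the ILP of this section: it uses a greedy list scheduler plus backtracking register assignment, so it is not guaranteed to be optimal, and it ignores the register capacity constraint. The assumed rules are that every primitive takes a fixed `latency` of two cycles, that at most one primitive is launched per cycle, that a node starts only after all its predecessors have finished, and that nodes joined by an edge, or feeding the same successor, use different registers. The function name `schedule_primitives` and all parameter names are hypothetical.

```python
def schedule_primitives(nodes, edges, num_regs=4, latency=2):
    """Greedy sketch: returns (start times, register binding, makespan)."""
    preds = {v: [] for v in nodes}
    for u, v in edges:
        preds[v].append(u)

    # Timing: launch one ready primitive per cycle, earliest feasible slot first.
    start, used_slots, pending = {}, set(), list(nodes)
    while pending:
        best = None
        for v in pending:
            if all(p in start for p in preds[v]):
                t = max((start[p] + latency for p in preds[v]), default=0)
                while t in used_slots:  # start times must be distinct
                    t += 1
                if best is None or t < best[0]:
                    best = (t, v)
        if best is None:
            raise ValueError("the primitive graph must be acyclic")
        t, v = best
        start[v] = t
        used_slots.add(t)
        pending.remove(v)

    # Routing: neighbours and siblings may not share a local register.
    conflicts = {v: set() for v in nodes}
    for u, v in edges:
        conflicts[u].add(v)
        conflicts[v].add(u)
    for v in nodes:
        for a in preds[v]:
            conflicts[a].update(b for b in preds[v] if b != a)

    reg = {}

    def assign(i):
        if i == len(nodes):
            return True
        v = nodes[i]
        for r in range(num_regs):
            if all(reg.get(n) != r for n in conflicts[v]):
                reg[v] = r
                if assign(i + 1):
                    return True
                del reg[v]
        return False

    if not assign(0):
        raise ValueError("no register binding satisfies the routing rules")
    makespan = max(start[v] + latency for v in nodes)
    return start, reg, makespan


if __name__ == "__main__":
    # u and v are immediate predecessors of w, as in the sibling example.
    print(schedule_primitives(["u", "v", "w"], [("u", "w"), ("v", "w")]))
```

In this small example the two siblings and their successor end up in three different registers, which is the behavior the routing constraints are meant to enforce.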
The ILP described above enables TULIP-PE to modify its schedule depending on the number of neurons enabled in each cluster. Fig. 6(a) and (b) show how the schedule of an addition operation can be varied based on the available neurons (denoted by K). For example, assume that we need to execute an addition operation of two 4-bit numbers, X and Y. TULIP-PE uses five cycles (four cycles before the next primitive can be launched) to finish its addition operation if only one neuron (K = 1) is enabled in each cluster. However, if the number of neurons in each cluster is doubled (K = 2), the schedule can be readjusted to finish the addition operation in three cycles (two cycles before the next primitive can be launched). If all five neurons are enabled in each cluster, then TULIP-PE would only require two cycles (one cycle before the next primitive can be launched) to finish the addition operation. This critical feature enables a run-time tradeoff between delay and energy efficiency on the TULIP-PE. Furthermore, if some neurons in the manufactured chip stop working, those neurons can be bypassed by modifying the schedule.
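The three cycle counts quoted above fit a simple pattern, sketched below: the number of launches is the number of K-bit slices, and the result is available one cycle after the last launch. This formula is an inference from the three quoted data points only (K = 1, 2, 5 for a 4-bit addition), not a formula stated in the article, and the function name `add_cycles` is hypothetical.

```python
import math


def add_cycles(n_bits, k_enabled):
    """Return (cycles to finish, cycles before the next primitive can launch).

    Assumed model: ceil(n_bits / k_enabled) slice launches, plus one extra
    cycle because each K-bit addition is a two-level threshold computation.
    """
    launches = math.ceil(n_bits / k_enabled)
    return launches + 1, launches


if __name__ == "__main__":
    for k in (1, 2, 5):
        print(k, add_cycles(4, k))
```

Under this reading, disabling neurons trades delay for the energy of the neurons that are switched off, which is the run-time tradeoff described in the text.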

VI. MAPPING COMPLETE QNN ON TULIP
In the previous section, we described how a single QNN node, which is a DAG of operations, is executed on a single TULIP-PE. We now present the final step of mapping QNN nodes to the TULIP array, taking into account its specific structure, which is shown in Fig. 7. Although the QNN is a DAG, its nodes are arranged in layers with all nodes in a layer performing the same function but on different inputs. As the computations have to proceed layer by layer, the main goal ...
3) Each dimension of the input (L) is reused M times.
However, since the data and computation resources required for an arbitrary layer of a QNN might exceed what is available on the TULIP architecture, the nodes of a QNN must be scheduled so that the cost of refetching inputs and weights to the cache from off-chip memory is minimized, subject to the following constraints.
1) A 2-D array of TULIP-PEs operating in an SIMD fashion, such that the TULIP-PEs in the same row share input pixels, and TULIP-PEs in the same column share weights.
2) A fixed cache size for storing input pixels.
3) A fixed cache size for storing weights.
Similar problems have been addressed in several prior works, targeting different platforms [9], [14], [42], [43]. These works minimize data fetches from the external memory by exploiting the fact that the core computations in all CNNs are convolutions and are expressed as deeply nested loops, which can be unrolled either in software or in hardware. The data fetching scheme described in [14] was for a 1-D array of PEs. In this article, we extend the data flow and the node scheduling algorithm presented in [14] for TULIP in a way that maximizes the reuse of pixels and weights, and achieves high energy efficiency. This allows us to keep the external memory interface uniform between the architectures and allows for a fair comparison between the two architectures.
An illustration of the schedule for the convolution layer of a QNN based on [14] is shown in Fig. 8.Given an image and kernel buffers of a given capacity, a subset of the required data (image pixels and weights) for a convolution operation is loaded from external memory.The computation on the TULIP-PEs is started as soon as the required data is available and partial results are computed.To complete the convolution operation across all input channels (L) and output channels (M), new input pixels and kernels are loaded to the respective on-chip buffers replacing the previous data.
The architecture presented in [14] has one row with C MAC units (columns), whereas TULIP has R rows with C TULIP-PEs (columns), which share image pixels along the row and weights along the column. Therefore, at any given instance TULIP computes the convolution operation for R times more output pixels than [14], with R times higher kernel reuse than [14]. Consequently, the number of external data transfers needed to load the kernel weights into the kernel buffer is reduced by a factor of R in the case of TULIP as compared to [14]. In Section VIII, we show that these comparisons are made between designs with the same area.

VII. ENHANCING DATA-REUSE USING TULIP-PES
This section provides a quantitative analysis to show how the use of TULIP-PEs enhances data reuse, as compared to a MAC unit. This is done by comparing the delay and area complexity when using TULIP-PEs and when using MACs. Let m and n be the number of bits needed to represent inputs and weights, respectively.
To multiply N pairs of weights and inputs, the area complexity of the MAC unit [44] is O(mn), whereas for a TULIP-PE it is O(1). The area complexity of the TULIP-PE is a constant because it performs multiplication sequentially, in a bit-sliced manner.
The delay complexity of the MAC unit [44] is O(N) and that of the TULIP-PE is O(mnN). Although the TULIP-PE is smaller, it is much slower than a MAC unit. However, as explained next, these tradeoffs change when MACs and TULIP-PEs are used in an SIMD architecture.
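To make the O(mn) delay per product concrete, the following Python sketch models a bit-sliced shift-and-add multiplier that consumes one partial-product bit per step through a 1-bit full adder. It is only an illustrative model of sequential, constant-area multiplication and is not the threshold-function schedule used by the TULIP-PE; the names `full_adder` and `bit_serial_multiply` and the step counter are assumptions introduced here.

```python
def full_adder(a, b, cin):
    """1-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout


def bit_serial_multiply(x, w, m, n):
    """Multiply an m-bit unsigned x by an n-bit unsigned w, one bit per step.

    Returns (product, steps), where steps counts 1-bit full-adder uses.
    """
    acc = [0] * (m + n)              # accumulator bits, LSB first
    steps = 0
    for j in range(n):               # one pass per weight bit
        wj = (w >> j) & 1
        carry = 0
        for i in range(m):           # one step per input bit
            pp = ((x >> i) & 1) & wj         # partial-product bit
            acc[i + j], carry = full_adder(acc[i + j], pp, carry)
            steps += 1
        acc[j + m] = carry           # bit j+m is still zero before this pass
    product = sum(bit << k for k, bit in enumerate(acc))
    return product, steps


product, steps = bit_serial_multiply(13, 11, 4, 4)
print(product, steps)
```

The hardware in this model is a single full adder plus storage, independent of m and n, while the step count is exactly m·n per product, which mirrors the O(1) area and O(mnN) delay stated above for N products.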
Consider the following two SIMD architectures. The first is the baseline architecture for reference, which consists of a row of C MAC units. The second is the TULIP architecture, with an R × C grid of TULIP-PEs. The baseline has a gate complexity of O(Cmn) and a delay complexity of O(N/C). Similarly, TULIP has a gate complexity of O(CR) and a delay complexity of O(mnN/CR). TULIP can match the area and delay of the baseline by setting R = mn. However, TULIP is still better than the baseline because the grid arrangement provides more opportunities for weight reuse. If we assume that a workload of R × C graphs will be processed by both architectures, then the baseline would fetch each weight R times, whereas TULIP would fetch each weight just once. As a result, significant energy-efficiency improvements are obtained by enhancing data reuse. The complexity analysis discussed above is summarized in Table III. Note that the concept discussed above has already been used in other design settings. For instance, processor designers often choose to use several slower cores instead of fewer faster cores to enhance energy efficiency without compromising throughput. The work presented in this article uses the same concept, but at the level of PEs. TULIP replaces the traditionally used MAC units with slower but more energy-efficient TULIP-PEs.
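The comparison above can be checked numerically with a small Python sketch. It evaluates the asymptotic expressions with all hidden constants assumed to be 1, so the outputs are unit-free estimates rather than measured gate counts or delays. The function names and the example values of C, m, n, and N are assumptions chosen for illustration.

```python
def baseline_cost(C, m, n, N):
    """Row of C MAC units: gates ~ C*m*n, delay ~ N/C (unit constants assumed)."""
    return {"gates": C * m * n, "delay": N / C}


def tulip_cost(C, R, m, n, N):
    """R x C grid of TULIP-PEs: gates ~ C*R, delay ~ m*n*N/(C*R)."""
    return {"gates": C * R, "delay": m * n * N / (C * R)}


def weight_fetches(R, on_grid):
    """Fetches of each weight for a workload of R x C graphs.

    A single row of C units needs R passes, so each weight is fetched R times.
    An R x C grid shares a weight along a column, so it is fetched once.
    """
    return 1 if on_grid else R


C, m, n, N = 32, 4, 4, 288           # illustrative values only
R = m * n                            # the matching point R = mn
print("baseline:", baseline_cost(C, m, n, N), weight_fetches(R, on_grid=False))
print("TULIP:   ", tulip_cost(C, R, m, n, N), weight_fetches(R, on_grid=True))
```

With R = mn the two cost dictionaries are identical under these assumptions, so the only remaining difference in this model is the number of times each weight is fetched.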

A. Experimental Setup
The TULIP architecture was evaluated using the TSMC 40nm-LP library. Synthesis was done using Cadence Genus, and the design was then placed and routed using Cadence Innovus (Fig. 9). Timing checks were performed using cross-corner analysis at {SS, 125C, 0.81V}, {TT, 25C, 0.9V}, and {FF, 0C, 0.99V}. The VCD file generated using real QNN workloads was used for accurate power analysis by modeling switching activity.
The primitive component of a TULIP-PE is the binary neuron shown in Fig. 1(c). A detailed analysis and design of this cell, along with its advantages over its CMOS functional equivalents, appear in [34]. For instance, a 5-input binary neuron in 40nm is about the size of a high-drive-strength D flip-flop, but it can replace numerous functions that would normally require several levels of conventional CMOS logic. Overall, at the individual cell level, [34] shows that a 5-input binary neuron in 40nm results in improvements in area, power, and delay of [80%, 60%, 40%], respectively, over the performance-optimized, functionally equivalent CMOS circuit. These reductions at the individual cell level lead to significant improvements in throughput and energy of the TULIP-PE and of the TULIP architecture.
The three closest comparison points for TULIP are YodaNN [14], UNPU [19], and BitBlade [20]. The UNPU architecture contains a full processor core, memory controller, etc., and hence it is harder to reproduce. Similarly, the BitBlade architecture also uses a CPU for some operations of the QNN. Furthermore, the data presented in that article are all relative to another benchmark architecture: the energy and throughput numbers are normalized to the chosen benchmark. Thus, it is difficult to perform a meaningful quantitative comparison with UNPU and BitBlade. On the other hand, the YodaNN paper had sufficient details that allowed us to reproduce the architecture reliably. Note that both architectures are fundamentally similar since they use accumulator-based PEs. Since the main focus of this article is to highlight how the TULIP-PE avoids hardware overprovisioning while also supporting a variety of QNN operations, the comparison was done against the YodaNN architecture. YodaNN is a BNN accelerator that was designed in 65nm technology. To present a fair comparison, we implemented the complete design of YodaNN using TSMC 40nm-LP technology and extended the design to support 2- to 4-bit QNNs. Our implementation of YodaNN will be referred to as YodaNN++.
Although YodaNN [14] does not report the throughput and energy efficiency for fully connected layers, we estimated them by performing element-wise matrix multiplications using the MAC units present in their architecture. In summary, TULIP and YodaNN++ were both designed in the same technology, with the same memory organization, with support for 12-bit inputs, up to 4-bit weights and activations, and kernel sizes of 3, 5, and 7.

B. Evaluation of TULIP-PE Against MAC
Table IV compares the baseline 18-bit reconfigurable MAC unit used in YodaNN++ with a TULIP-PE with five neurons in each cluster and a 16-bit local register for each neuron. In large QNN architectures such as AlexNet [45], the input layers are integer, while the other layers are quantized. Consequently, both the MAC unit and the TULIP-PE can support both types of layers. The MAC unit and the TULIP-PE are compared when computing the outputs of the quantized layers. Both modules perform the weighted sum with quantized activations and weights. The MAC unit realizes convolution by multiplying and accumulating one kernel window in each cycle. On the other hand, the TULIP-PE treats convolution as a weighted sum represented as a compute graph of multiplication operations connected to an adder tree. This is important because TULIP realizes adders, multipliers, etc., of custom bit widths, thereby reducing the energy incurred by a MAC unit that uses maximum-width addition and multiplication operations in every cycle.
Table IV shows that the TULIP-PE is 15.8× smaller than the MAC unit and consumes up to 125× less power. However, its delay is 9.5× higher than that of the MAC unit since it performs bit-level addition. As a result, the power delay product of a TULIP-PE is up to 5.8× lower than that of the MAC unit, while the TULIP-PE is also 15.8× smaller. Furthermore, since a MAC unit cannot compute operations such as comparison, max-pooling, etc., the data is sent to other parts of the chip for these operations in the baseline [14]. In contrast, the TULIP-PE preserves data locality and performs the comparison and max-pooling operations using the same hardware, without moving the data to other modules, thus resulting in additional energy savings. The reduced area allows us to have more TULIP-PEs, which leads to higher throughput.
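The benefit of custom bit widths in an adder tree can be illustrated with a short Python sketch. It counts 1-bit additions for a balanced adder tree in which each level is only as wide as its operands, and compares that with a tree in which every adder uses one fixed maximum width. The model assumes unsigned operands whose width grows by one bit per level, it ignores the multiplications, and its function names are hypothetical; it is not a model of the actual TULIP-PE schedule or of the YodaNN++ MAC.

```python
def adder_tree_levels(num_inputs, leaf_width):
    """Return [(adders, operand_width), ...] for a balanced adder tree.

    Assumes unsigned operands, so adding two w-bit values needs w+1 result bits.
    """
    levels = []
    count, width = num_inputs, leaf_width
    while count > 1:
        adders = count // 2
        levels.append((adders, width))
        count = adders + (count % 2)   # an odd operand passes to the next level
        width += 1
    return levels


def bit_ops_custom(levels):
    """1-bit additions when every adder is exactly as wide as its operands."""
    return sum(adders * width for adders, width in levels)


def bit_ops_fixed(levels, max_width):
    """1-bit additions when every adder uses the same maximum width."""
    return sum(adders for adders, _ in levels) * max_width


# 288 products (as in the 288-input weighted sum), each assumed to be 8 bits wide.
levels = adder_tree_levels(288, 8)
print(bit_ops_custom(levels), bit_ops_fixed(levels, 18))
```

Most adders sit in the lowest levels of the tree, where the operands are narrowest, which is why sizing each adder to its operands removes a large share of the bit-level work in this model.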

C. Evaluation of the TULIP Architecture
The implementation of TULIP has 512 TULIP-PEs. This is to ensure that the area of YodaNN++ and that of TULIP are the same. Note that the number of processing units in TULIP can easily be scaled to suit the application. For both designs, the size of the L1 buffer is 2.3 kB, the size of the L2 buffer is 10.5 kB, and the size of the kernel buffer is 24.5 kB.
Tables V and VI show the energy efficiency and throughput values for various neural networks (at varying bit-precisions), accelerated using both TULIP and YodaNN++. For TULIP, two sets of results are presented: 1) TULIP tuned for the best energy efficiency and 2) TULIP tuned for the best throughput.
Here, tuning is done by changing the number of active neurons (cluster size K) in each cluster. Based on Tables V and VI, TULIP shows consistent improvement in energy efficiency over YodaNN++ for all the neural networks. Fig. 10(a) shows that TULIP consistently achieves an order of magnitude improvement in energy efficiency for all variants of the neural networks. This is primarily attributed to the fact that TULIP realizes adders, multipliers, etc., of different bit widths, which eliminates the waste incurred by conventional accumulation methods that use operators sized for the maximum width. This, coupled with the improved weight reuse, results in a substantial improvement in energy efficiency over YodaNN++. Fig. 10(b) shows that throughput can be improved by reducing the bit precision. This is because fewer bits need to be processed for each operation. Fig. 11(b) shows that the throughput increases as the number of neurons in each cluster increases. Although this graph is restricted to the inference of ImageNet classification using ResNet-34, this trend applies to other neural networks as well. The throughput increases because a larger number of neurons allows each operation to execute faster on the TULIP-PE. By choosing the right configuration, it is possible to match the throughput of the baseline (or even improve it) while gaining significant improvements in energy efficiency. The corresponding improvements in energy efficiency are shown in Fig. 11(a).
Fig. 12 demonstrates how the TULIP architecture can be used to trade off energy efficiency and accuracy at run-time for neural networks used for ImageNet classification tasks. As the bit-precision increases, the energy efficiency decreases but accuracy increases. Hence, accuracy can be traded off at run-time against energy efficiency and throughput. This would be particularly useful for energy-constrained mobile devices where high accuracy may not be necessary to make the correct decision, or conversely, where the accuracy could suddenly be increased in a critical situation after operating at a lower precision.
In summary, two key observations can be made from the experimental results.
1) TULIP supports multiple bit-precisions for the weights and activations of a QNN and achieves consistently greater energy efficiency than the baseline architecture at each precision.
2) TULIP enables a high degree of tunability at run-time to trade off energy efficiency, throughput, and accuracy.

IX. CONCLUSION
This article presents a new design of a QNN accelerator, called TULIP, that uses binary neurons as its core compute elements. TULIP and the baseline design YodaNN++ were designed to the layout level and simulated on several well-known neural networks. The simulations were carried out using commercial libraries and design tools and account for all the device, circuit, and layout characteristics. The results show that TULIP can improve the energy efficiency by 30×-50× when compared with the baseline design YodaNN++, with both designs having approximately the same area. These improvements do not rely on standard low-power techniques such as voltage scaling and approximate computing. The improvements in energy efficiency can be attributed to several factors: 1) the use of operators of the required bit width instead of the maximum bit width; 2) the use of artificial neurons to compute complex logic functions within a very small area, thereby allowing for a greater number of PEs for parallel operations; and 3) the ability to reconfigure the function of the neurons without sacrificing performance or energy efficiency. TULIP allows precision, throughput, and energy efficiency to be tuned.

A. Problem of Scheduling QNN Node on TULIP-PE
In this section, we provide a precise formulation of the problem of register-aware scheduling of a QNN node on a TULIP-PE. It finds a mapping between two graphs: 1) a DFG of threshold functions that represents a QNN node and 2) a time-extended resource graph that represents the TULIP-PE.
Definition 2 (Threshold Graph $G_{th}(V_{th}, E_{th})$): This is a DAG where the nodes $V_{th}$ represent threshold functions and an edge $(u, v) \in E_{th}$ means that the output of $u$ is an input of the threshold function $v$.
Definition 3 (Time Extended Resource Graph (TERG) $G_R(V_R, E_R)$): This is a DAG where a node is a pair $(u, t) \in V_R$, $u$ is a local register or a neuron, and $t$ is a time instance at which the resource $u$ is available. An edge between two resources $(u, t)$ and $(v, t+1)$ is represented as $(u, v, t)$ to indicate that the output of $u$ is an input to $v$ at time $t+1$. Edges are absent between resources if their timestamps differ by more than 1 or if there is no physical datapath between their associated neurons (or local registers). The latency of $G_R$ is $\max_{(u,t) \in V_R} \{t\}$.
Definition 4 (Feasible Schedule): A feasible schedule is a pair $(G_R(V_R^*, E_R), M : V_R^* \to V_{th})$, where $G_R$ is a TERG on $V_R^*$ and for each $(u, v) \in E_{th}$, there exists a path $P = \{(r_0, t_0), \ldots, (r_k, t_k), (r_{k+1}, t_{k+1})\}$ in $G_R$ such that $M(r_0) = u$, $M(r_{k+1}) = v$, $t_{i+1} = t_i + 1$ for $0 \le i \le k$, and $k \ge 0$.
Fig. 13 shows an example mapping of a given compute graph containing four nodes to a time-extended resource graph containing two resources, extended over four cycles.
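The definitions above can be exercised with a small Python sketch that builds a TERG from a physical connectivity map and checks whether a given placement of threshold functions is feasible. This is a simplified illustration: the connectivity map `links`, the resource names, and the placement are hypothetical, and the check only verifies that a TERG path exists for every threshold-graph edge (it does not verify that intermediate resources are free).

```python
from itertools import product


def build_terg(links, num_steps):
    """Build a time-extended resource graph.

    links[u] is the set of resources that u can physically drive.
    Returns (nodes, edges) with nodes (u, t) and edges ((u, t), (v, t + 1)).
    """
    nodes = set(product(links, range(num_steps)))
    edges = {
        ((u, t), (v, t + 1))
        for u, t in nodes
        for v in links[u]
        if t + 1 < num_steps
    }
    return nodes, edges


def reachable(edges, src, dst):
    """True if dst can be reached from src by following TERG edges."""
    frontier, seen = [src], {src}
    while frontier:
        node = frontier.pop()
        if node == dst:
            return True
        for a, b in edges:
            if a == node and b not in seen:
                seen.add(b)
                frontier.append(b)
    return False


def is_feasible(threshold_edges, placement, edges):
    """Check that every threshold-graph edge (u, v) has a TERG path."""
    return all(
        placement[u] != placement[v]
        and reachable(edges, placement[u], placement[v])
        for u, v in threshold_edges
    )


# Two resources (one neuron "n1", one register "r1") extended over four cycles.
links = {"n1": {"n1", "r1"}, "r1": {"n1", "r1"}}
nodes, edges = build_terg(links, 4)
graph = [("a", "c"), ("b", "c"), ("c", "d")]
placement = {"a": ("n1", 0), "b": ("n1", 1), "c": ("n1", 2), "d": ("n1", 3)}
print(is_feasible(graph, placement, edges))
```

In this example the output of `a` must wait one cycle in the register before `c` consumes it, which corresponds to a path with one intermediate resource node in the TERG.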
Definition 5 (TGS Problem): Given a threshold graph $G_{th}(V_{th}, E_{th})$, the TGS problem is to construct a feasible schedule of minimum latency.
In the design of the TULIP-PE shown in Fig. 1(b), there are four clusters, each with five neurons and each with a 16-bit register. Consequently, the number of resources will be 320 for each timestamp t. The maximum number of timesteps in the extended resource graph would be the number of levels in the compute graph.
The formulation of the TGS problem presented above is the same as the well-studied problem of mapping a DFG of computations onto a CGRA. There is an extensive body of literature on CGRA architectures and scheduling computations onto them that spans more than two decades. A precise formulation of the CGRA scheduling problem, similar to the TGS problem, first appeared in [39] and was shown to be NP-complete. This was subsequently extended to register-aware mapping in [46], followed by several extensions [47], [48], [49], [50], [51].
Mapping the computations within the innermost loop involves small computation graphs, on the order of tens of nodes. In addition, with the target architecture being a CGRA, the resulting resource graph is also of the same order. This allows the various heuristic algorithms for finding a feasible mapping to enumerate the possibilities by transforming the compute graph [Fig. 14(b)], constructing a compatibility graph [Fig. 14(c)], and then finding a maximal clique of that graph at each time step. The size of the compatibility graph is the product of the sizes of the computation and resource graphs. Such an approach is not possible for the TGS problem because the size of the compatibility graph would be in the tens of thousands, for which a maximal clique has to be computed. For this reason, an alternate approach is presented in Section V.
Note that traditional high-level scheduling algorithms (used for task allocation to CPUs [40], [52]) do not apply when performing scheduling on a TULIP-PE. This is because, unlike high-level scheduling algorithms, routing-aware scheduling algorithms used for CGRAs generate a valid schedule while also honoring the routing constraints that arise due to the physical limitations (bandwidth, connectivity, etc.) of the hardware.
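As a rough check of the scale argument, the Python sketch below multiplies a compute-graph size by the 320 resources per timestamp quoted earlier for one TULIP-PE. The compute-graph sizes are hypothetical examples, and the function name is an assumption introduced here.

```python
def compatibility_graph_size(num_compute_nodes, num_resources):
    """Vertices in the compatibility graph: one per (compute node, resource) pair."""
    return num_compute_nodes * num_resources


RESOURCES_PER_STEP = 320             # figure quoted in the text for one TULIP-PE

for compute_nodes in (10, 50, 100):  # hypothetical compute-graph sizes
    size = compatibility_graph_size(compute_nodes, RESOURCES_PER_STEP)
    print(compute_nodes, size)
```

Even a compute graph with a few tens of nodes therefore yields a compatibility graph with thousands to tens of thousands of vertices per time step, on which a maximal clique would have to be found.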

Fig. 1. TULIP architecture overview. (a) Top-level architecture of TULIP: the controller configures the processing units. The input pixels and weights are sent through image and kernel buffers. The output of the processing units is collected in the output buffers before being sent back to the memory. (b) Architecture of a TULIP-PE, consisting of four clusters and four local registers. Each cluster contains K neurons (K=5). (c) Architecture of a binary neuron.

Fig. 2. 5-bit carry lookahead adder using binary neurons that adds two 5-bit numbers A and B, and a 1-bit carry-in C0. Each box represents (2), such that the left sub-box and right sub-box represent the left- and right-hand side of the equation, respectively.

Fig. 4. Mapping a primitive graph $G_P$ to a resource graph, where each resource is either a cluster or a local register. (a) Primitive graph $G_P$. (b) Mapping to resource graph.
The primitive graph in (a) contains three primitive operations LK, ADDK, and COMPK, which are initialized in consecutive cycles as shown in Fig. 4(b). Operands A and B are stored in local registers 3 and 1. The operation LK is executed in cluster 4 and its result is stored in that cluster's local register. The sum and carry bits of the ADDK operation are calculated in cluster 2 using the data stored in local registers 1 and 4, and the result is stored in local register 2. Finally, the data from local registers 2 and 4 are used to compute COMPK to generate the final output Y.

Fig. 5. Example to illustrate routing constraints in the primitive scheduling problem. The output of each node in the primitive graph $G_P$ is stored in the local registers of the TULIP-PE.

Fig. 6. Addition operation, adder-tree, accumulation, and comparison using the TULIP-PE architecture. Depending on the number of neurons available in each cluster, the scheduler can automatically tune the schedule for the best performance. (a) Addition operation (1-bit per cycle). (b) Addition operation (2-bit per cycle). (c) Addition-tree memory management. (d) Accumulation operation to add partial sums. (e) Comparison operation.

Fig. 7. Representation of QNN as a DAG and data reuse opportunities in 2-D convolution. (a) QNN as a DAG. (b) Convolution operation.

Fig. 8. Convolution schedule used by TULIP on the basis of algorithms presented in [14].

Fig. 11. Improvements of TULIP (with varying number of active neurons in each cluster) over YodaNN++ for ImageNet classification using ResNet-34. (a) Improvements in energy efficiency. (b) Improvements in throughput.

Fig. 13. Scheduling graphs of threshold functions on binary neurons. (a) Compute graph $G_{th}$. (b) Time-extended resource graph $G_R$. (c) Mapping solution of $G_{th}$ to $G_R$.

Fig. 14. Scheduling threshold function graphs on binary neurons. (a) Compute graph $G_{th}$. (b) Equivalent transformed graph $G_{th}$ (A and A indicate buffer functions). (c) Compatibility graph of $G_{th}$ and $G_R$.

TABLE I. NOTATION FOR ILP USED TO SOLVE THE PRIMITIVE SCHEDULING PROBLEM

TABLE IV. COMPARISON OF FULLY RECONFIGURABLE MAC UNIT BASED ON THE YODANN ARCHITECTURE [14], WITH A TULIP-PE (K=5), FOR COMPUTING A 288 INPUT WEIGHTED SUM (32 INPUT CHANNELS, KERNEL = 3×3). TULIP-PE IS 15.8× SMALLER THAN THE MAC UNIT. PDP: POWER DELAY PRODUCT

TABLE V. ENERGY EFFICIENCY [EN. EFF. (TOP/J)] AND THROUGHPUT (GOP/S) OF TULIP AND YODANN++ FOR CIFAR-10 CLASSIFICATION. K INDICATES THE NUMBER OF NEURONS USED IN EACH CLUSTER. TWO VARIANTS OF TULIP ARE SHOWN: ONE IS TUNED FOR ENERGY EFFICIENCY, WHILE THE OTHER IS TUNED FOR PERFORMANCE. (A) ALEXNET.