Cryptensor: A Resource-Shared Co-Processor to Accelerate Convolutional Neural Network and Polynomial Convolution

Practical deployment of convolutional neural networks (CNNs) and cryptography algorithms on constrained devices is challenging due to their huge computation and memory requirements. Developing separate hardware accelerators for AI and cryptography incurs large area consumption, which is not desirable in many applications. This article proposes a viable solution to this issue by expressing both CNN and cryptography as generic-matrix-multiplication (GEMM) operations and mapping them to the same accelerator for reduced hardware consumption. A novel systolic tensor array (STA) design is proposed to reduce data movement, effectively reducing the operand registers by $2\times$. Two novel techniques, input layer extension and polynomial factorization, are proposed to mitigate the under-utilization issue found in existing STA architectures. Additionally, the tensor processing element (TPE) is fused using the DSP unit to reduce the look-up table (LUT) and flip-flop (FF) consumption for implementing multipliers. On top of that, a novel memory-efficient factorization technique is proposed to allow computation of polynomial convolution on the same STA. Experimental results show that Cryptensor achieves 21.6% better throughput for VGG-16 implementation on the XC7Z020 FPGA and up to $8.40\times$ better energy efficiency compared to an existing ResNet-18 implementation on the XC7Z045 FPGA. Cryptensor can also flexibly support multiple security levels in the NTRU scheme with no additional hardware. The proposed hardware unifies the computation of two different domains that are critical for IoT applications, which greatly reduces the hardware consumption on edge nodes.

The Internet of Things (IoT) has enabled many applications in the past decade, including smart agriculture, smart surveillance systems, smart network grids, etc. The combination of AI and edge computing is an emerging trend that aims to reduce inference latency without relying on the cloud server [1]. Designing edge AI systems faces two major challenges. First, such solutions typically run on battery-powered IoT sensors built on constrained resources, e.g., microcontrollers and field-programmable gate arrays (FPGAs). This leads to the exploration of specialized AI accelerators for resource-constrained scenarios in recent work [2]. Besides that, a security solution is required when deploying IoT systems in order to protect the users' privacy. Data encryption using a block cipher (e.g., AES) is a common way of achieving data confidentiality. In IoT systems, the sensor nodes are usually placed in open areas, which exposes them to various malicious attacks (e.g., side-channel attacks) that steal the encryption key. To avoid such a problem, the cloud server may refresh the encryption keys for all the sensor nodes from time to time. These refreshed encryption keys are transmitted to the sensor nodes securely through a key encapsulation mechanism (KEM). Several advanced lattice-based KEMs were also designed recently and have been standardized [3], [4].
The heaviest computation in lattice-based schemes [3] is polynomial convolution, which has $O(n^2)$ computational complexity. Efficient implementation of lattice-based schemes in software and hardware remains an active research topic to date. For instance, there are attempts to accelerate polynomial convolution using a hardware accelerator [5], but this involves additional hardware consumption. Typical IoT sensor nodes need to handle many nontrivial tasks (e.g., communication, data aggregation/compression/encryption, edge computing, etc.). Designing separate convolutional neural network (CNN) [6], [7], [8] and cryptographic co-processors [9], [10] is a common practice, since both features are critical to IoT applications. This ensures high performance for both operations at the expense of a larger hardware area, but is not practical for resource-constrained devices. Hence, exploring opportunities to share hardware resources across different types of computations is an interesting and practical research problem.
Previously, Lee et al. [11] proposed to exploit the GPU tensor core for polynomial convolution using generic-matrix-multiplication (GEMM) computation. This is similar to CNN, which is also typically expressed as a GEMM operation when accelerated using a GPU. However, their technique targets cloud server applications that process polynomial convolutions in batches, which differs from edge nodes that only process one polynomial convolution at a time. Directly using the technique in [11] for a single polynomial convolution reduces the GEMM computation to a matrix-vector computation, which is not efficient to map onto existing CNN accelerators. How to effectively utilize the technique proposed by Lee et al. [11] on edge nodes remains an open problem.
Many CNN accelerators on FPGA [6], [7], [8] are designed based on the systolic array (SA) concept, where each processing element (PE) is separated by operand registers that transfer data in a systolic manner. The SA design ensures high data reusability, which reduces data transactions between the computation unit and on-chip memory. However, a conventional SA design requires a lot of operand registers. This consumes a lot of flip-flops (FFs) in the FPGA, which becomes a significant issue when the number of PEs increases. To address this issue, Liu et al. [12] proposed a systolic tensor array (STA) architecture, where multiple PEs are clustered together to reduce the number of operand registers. However, their implementation [12] was optimized for the ASIC platform and is not directly applicable to FPGA. In particular, it requires high bandwidth (a large number of parallel input data) to maintain optimal performance. This implies that a customized memory interface is required to handle such memory bandwidth, which is challenging for FPGA implementations that typically use standardized memory interfaces. Our Cryptensor is inspired by their method, but we focus on an efficient design for standard FPGA platforms.
Typically, a CNN consists of multiple layers. Hence, existing accelerators are commonly designed to execute the CNN layer by layer. This requires a generic SA design that can support CNN layers with different settings. However, mapping such an SA to different numbers of input channels raises an under-utilization concern [13], because the input layer image has a relatively low number of input channels. It is not compatible with the input channels of deeper CNN layers, which typically grow in multiples of 16 (e.g., 64, 128, 256, etc.) or multiples of 8 (e.g., 16, 24, 32, etc.) in recent CNN architectures [14], [15], [16]. To resolve this issue, McDanel et al. [17] proposed to reduce the input height and width and increase the input channels of the input layer image, but this requires retraining of the CNN architecture. Our work resolves the under-utilization of the SA due to the input layer without retraining the CNN.
The main contributions of this work are as follows.
1) The first unified co-processor hardware architecture, Cryptensor, is proposed to accelerate both CNN and lattice-based cryptography. Cryptensor can compute CNN operations (convolution and fully connected) efficiently and support cryptography operations (polynomial convolution) without additional hardware resources, which is not found in other state-of-the-art CNN accelerators.
2) Efficient STA design for FPGA implementation. A specialized memory scheme (Section III-C) is proposed to satisfy the high input bandwidth requirement of the STA design using FPGA built-in resources, instead of the customized memory interface solution required in [12]. An optimization of the STA that fuses the computation data path (Section IV-A) is proposed, effectively reducing the LUTs and FFs for multiplier implementation compared to the existing STA design [12].
3) To resolve the under-utilization problem of the CNN input layer for STA [13], a technique is proposed to extend the dimensions of the input feature map and kernel map on the input layer (Section IV-B). The extended feature map and kernel map ensure that all PEs can be fully utilized for the computation in the CNN input layer. Unlike the input reshaping approach [17], our method does not require retraining.
4) To ensure full utilization of the STA cores for polynomial convolution in cryptography and to reduce memory access, a factorization technique (Section IV-C) is proposed to restructure the input polynomials. Compared to the batch processing technique [11], our approach expresses a single polynomial convolution as a GEMM computation, which is more efficient to map to the STA.
With these proposed techniques, Cryptensor is 21.6% faster than [18] for VGG-16 on XC7Z020 and achieves up to 8.40× better energy efficiency when implementing ResNet-18 on XC7Z045 [19].
Cryptensor is very useful for IoT applications with edge computing, as it allows low-latency inference on the edge node while reusing the hardware to offer security features. The remainder of this article is organized as follows. Section II introduces the background and related work. Section III presents an overview of the proposed accelerator, Cryptensor. Section IV discusses the optimization techniques applied to Cryptensor. Experimental results are presented in Section V, followed by the conclusion in Section VI.

II. BACKGROUND AND RELATED WORK

A. Convolution in CNN
CNN extracts critical feature information from the input image using a stack of layers. Typical layers in a CNN are the convolutional (CONV) layer, activation (ACT) layer, pooling (POOL) layer, and fully connected (FC) layer. When an input image first enters the CNN, it is processed by the CONV layers, in which convolution operations are performed between the input image and the kernel filters to construct feature maps. The feature maps are then forwarded to the ACT layer, where nonlinearity is introduced to the features through activation functions. One of the most commonly used activation functions in CNN is ReLU. After this, the POOL layer performs down-sampling to reduce the dimensions of the feature maps, allowing the subsequent CONV layer to compute faster with smaller feature maps. Finally, the FC layer combines the features learned in the previous layers and outputs them for further processing. In a CNN, most of the computations are concentrated in the CONV and FC layers. Note that the operations in the FC layer can also be expressed in the form of convolutions [20]. Hence, accelerating the CONV layers can greatly improve the performance of the CNN. In view of that, this section only focuses on analyzing the operations of the CONV layer. Equation (1) shows the computation of the CONV layer
$$O(n, x, y) = \sum_{m=0}^{M-1} \sum_{p=0}^{P-1} \sum_{q=0}^{Q-1} I(m, x \times S + p, y \times S + q) \times F(n, m, p, q) \qquad (1)$$
Convolution in CNN can be visualized as sliding multiply-and-accumulate (MAC) operations between kernel maps (F) and input feature maps (I) to compute output feature maps (O). N is the total number of output channels, M is the total number of input channels, and P and Q represent the kernel height and width of F, respectively. To construct O, F slides through I at the relevant input locations, calculated as x × S + p for the height and y × S + q for the width position. S is the stride, which defines how many input locations to skip between each sliding operation, whereas x and y are the height and width positions of the output to be calculated for O, respectively. The fetched data from I at (m, x × S + p, y × S + q) are multiplied with the weight of F at (n, m, p, q) and accumulated along m to obtain the output at O(n, x, y). This process is repeated until every location of every output channel N of O is computed.
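For illustration, (1) can be evaluated directly as the following loop nest (a minimal Python sketch; the array shapes and the absence of padding are assumptions made only for this example, not a description of the accelerator):

```python
import numpy as np

def conv_layer(I, F, S=1):
    """Direct evaluation of (1): O(n, x, y) = sum_{m,p,q} I(m, x*S+p, y*S+q) * F(n, m, p, q).
    I: input feature map of shape (M, H, W); F: kernel map of shape (N, M, P, Q); S: stride."""
    M, H, W = I.shape
    N, _, P, Q = F.shape
    X = (H - P) // S + 1          # output height (no padding in this sketch)
    Y = (W - Q) // S + 1          # output width
    O = np.zeros((N, X, Y), dtype=np.int64)
    for n in range(N):
        for x in range(X):
            for y in range(Y):
                acc = 0
                for m in range(M):            # accumulate along input channels
                    for p in range(P):
                        for q in range(Q):
                            acc += int(I[m, x * S + p, y * S + q]) * int(F[n, m, p, q])
                O[n, x, y] = acc
    return O
```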

B. Polynomial Convolution in Cryptography
Polynomial convolution is widely used in lattice-based cryptography, and it may exist in a cyclic or nega-cyclic form. NTRU KEM, one of the notable finalist candidates in the NIST standardization [4], uses polynomials with degrees of 509, 677, and 821 for three different security levels, respectively [21]. The NTRU public key encryption scheme is also an IEEE standard [3], where the polynomial degree is slightly different from the NIST version. The NTRU scheme is designed over a ring $R_q := \mathbb{Z}_q[x]/(x^N - 1)$, so multiplication over this ring is essentially a polynomial convolution in a cyclic form. It can be expressed as
$$a(x) \times b(x) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} a_i b_j x^{(i+j) \bmod N} \pmod{q} \qquad (2)$$
Note that in the majority of lattice-based cryptography, including NTRU, polynomial convolution is the most time-consuming operation, with a computational complexity of $O(N^2)$, where N refers to the degree of the polynomial. For this reason, a number of hardware architectures were proposed recently to accelerate polynomial convolution on FPGA [22], [23]. Such approaches can drastically reduce the computation time of the underlying lattice-based cryptographic schemes, but they also introduce significant hardware consumption.
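For reference, the schoolbook cyclic convolution over $R_q$ in (2) can be sketched as follows (illustrative Python only; the toy parameters are placeholders and not an NTRU parameter set):

```python
def cyclic_poly_mul(a, b, q):
    """Schoolbook polynomial convolution over Z_q[x]/(x^N - 1), as in (2).
    a, b: coefficient lists of length N (a[i] is the coefficient of x^i)."""
    N = len(a)
    c = [0] * N
    for i in range(N):
        for j in range(N):
            c[(i + j) % N] += a[i] * b[j]   # exponent wraps around because x^N = 1
    return [v % q for v in c]               # reduce mod q once at the end

# Toy degree-4 example; real parameter sets use much larger degrees (e.g., 509/677/821 in [21]).
print(cyclic_poly_mul([1, 2, 3, 4], [1, 0, -1, 1], 2048))
```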

C. Similarities in Convolution and Polynomial Convolution
A closer look into polynomial convolution reveals that its computational steps are similar to those of the CNN convolution. In (1), F at index (p, q) is multiplied and accumulated with all I fetched using p and q as well. The multiplication results are then further accumulated along index m to finally generate an output at O(n, x, y). Similarly, in (2), all $a_i$ are multiplied with all $b_j$ and accumulated to generate the coefficient of $x^{i+j}$. Comparing (1) and (2), it can be seen that multiplications between two series of inputs are performed, then accumulation takes place to generate the series of final outputs. With this similarity, our work proposes a solution that reuses the same CNN hardware co-processor to compute the polynomial convolution for IEEE NTRU [3]. The mod operation in (2) only needs to be performed once after the coefficient of $x^{i+j}$ is generated; hence, it does not affect mapping both algorithms onto the same co-processor. However, (2) only accumulates over two indexes (i and j), whereas (1) accumulates over three indexes (first over indexes p and q, then further along index m). To resolve the potential hardware under-utilization raised by this slight difference, a factorization technique is presented in Section IV-C. The proposed technique can be generalized to support other lattice-based schemes (e.g., Dilithium [24]) that also use polynomial convolution.

D. Related Works
Implementing a CNN accelerator efficiently on FPGA is challenging, as most of the workload is concentrated in the heavy and repetitive convolution layers of the CNN. Most recent works [25], [26], [27] revolve around optimizing the CNN loop [refer to (1)]. Such techniques study the effect of unrolling at different loop levels, loop tiling, and loop interchanging. The loop study then determines the optimal number and layout of PEs to be implemented. However, the customized PE layout resulting from loop unrolling could introduce difficulty for the placer and router during FPGA implementation. Without careful design, it could result in a less efficient FPGA mapping in terms of hardware area or computation performance.
Aside from loop optimization, several works [6], [7], [8] have proposed to express CNN convolution as a GEMM operation. To perform CNN convolution using GEMM, the input feature map and kernel map patches need to be unrolled and converted to matrices. This conversion is typically performed by Image-to-Column (im2col) or Image-to-Row (im2row), software routines that unroll the input feature map based on information such as kernel size, padding, and striding. This slightly increases the computational overhead, but the result can be mapped to an efficient SA architecture easily. The generic design of the SA makes it easier for FPGA mapping, as compared to the customized PE array in loop-based optimization. It is worth noting that, since the SA has high data utilization, it is also good for reducing data movement between on-chip and off-chip memory, which is taxing on energy consumption [28].
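As an illustration of this transformation, a minimal im2row sketch is shown below (Python, for exposition only; one common unrolled ordering is used here, and the exact ordering produced by the Matrix Construction Unit may differ):

```python
import numpy as np

def im2row(I, P, Q, S=1):
    """Unroll an input feature map I (M, H, W) into Matrix A for GEMM.
    Each row is one kernel-sized patch flattened into P*Q*M values."""
    M, H, W = I.shape
    X, Y = (H - P) // S + 1, (W - Q) // S + 1
    A = np.zeros((X * Y, P * Q * M), dtype=I.dtype)
    for x in range(X):
        for y in range(Y):
            patch = I[:, x * S:x * S + P, y * S:y * S + Q]      # (M, P, Q)
            A[x * Y + y, :] = patch.transpose(1, 2, 0).reshape(-1)
    return A

def conv_as_gemm(I, F, S=1):
    """Matrix B holds one flattened kernel per column; GEMM yields Matrix C,
    whose columns correspond to the output channels."""
    N, M, P, Q = F.shape
    A = im2row(I, P, Q, S)
    B = F.transpose(2, 3, 1, 0).reshape(P * Q * M, N)            # same unrolled ordering as A
    return A.astype(np.int64) @ B.astype(np.int64)               # shape (X*Y, N)
```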
Several accelerator designs targeting NTRU polynomial convolution were proposed recently [9], [10]. Farahmand et al. [9] proposed a high-speed architecture that uses a lot of hardware resources, which is more suitable for deployment on high-end servers. On the other hand, the architecture proposed by Khan et al. [10] targets low area consumption, which is more suitable for implementation on sensor nodes. These two works are highly optimized for NTRU polynomial convolution but do not support other operations (e.g., CNN).

III. FULL SYSTEM OVERVIEW
This section describes the proposed unified architecture, Cryptensor, which supports the convolution operations that exist in CNN and lattice-based cryptography. We first present the top level of the proposed architecture. Following that, the internal design of the unified convolution accelerator is discussed. Finally, the data computation and storage patterns used by the computation core are described in detail.

A. Proposed Convolution Accelerator
Fig. 1 shows the top-level overview of the proposed Cryptensor convolution accelerator, which implements the GEMM core as the main computation unit. The GEMM core can perform both CNN and polynomial convolutions. The mapping of both operations to the GEMM core is presented in Sections III-B and IV-A. Our work targets only 8-bit CNN, which is commonly supported in CNN frameworks such as TensorFlow [29] and PyTorch [30]. The input data and kernel weights of the CNN are expected to be quantized to 8-bit before computation. For polynomial convolution, NTRU is selected, and the coefficients of the polynomial and the small polynomial are stored in 16-bit and 8-bit, respectively. Two input matrices (A and B) are expected to produce an output matrix (C), with each output data item in 32-bit.
In this work, matrix A represents either a block of the input feature map (CNN convolution mode) or the polynomial (polynomial convolution mode); matrix B represents either the kernel (CNN convolution mode) or the small polynomial (polynomial convolution mode); matrix C represents either the output feature map (CNN convolution mode) or the output polynomial (polynomial convolution mode).
The reading and construction modules for both matrix A and matrix B are built with the Input Fetching Unit, Intermediate Input Buffer, Matrix Construction Unit, and Data Streamer Unit. The Input Fetching Unit transfers input data from off-chip memory (DDR) and stores it into the Intermediate Input Buffer, which is implemented using the on-chip memory of the FPGA (BRAM). Due to the limited BRAM on the FPGA, it may not be possible to fetch the entire input feature map or polynomial for processing. Hence, the Input Fetching Unit is designed to support blocked matrix fetching for both CNN and polynomial convolution modes. Note that input data for the CNN input feature map is stored in 8-bit, while the polynomial coefficients for polynomial convolution are in 16-bit. Due to the nature of burst mode in DDR [31], [32], every read initiated by the Input Fetching Unit always returns 8-bit × 16 data in CNN mode and 16-bit × 8 data in polynomial convolution mode. The blocked matrices for the input feature map (IBlk_IFM) and kernel map (IBlk_F) assume the dimensions shown in (3) and (5), respectively. For (3), C_IFM refers to the total input columns to be fetched for IBlk_IFM, calculated as P × Q × M (P, Q, and M are originally from (1)). R_IFM refers to the maximum input rows of IBlk_IFM to be fetched. R_IFM × C_IFM must be less than or equal to the Intermediate Input Buffer size (8 groups × 4 BRAMs × 2048 locations × 16-bit) for matrix A. For (5), R_F represents the total input rows to be fetched for IBlk_F, which is the same as C_IFM for IBlk_IFM. C_F refers to the maximum input columns to be fetched for IBlk_F. R_F × C_F must be less than or equal to the Intermediate Input Buffer size (8 groups × 8 BRAMs × 2048 locations × 16-bit) of matrix B. Before CNN convolution, the matrix size is determined by first checking whether the original input matrix (in DDR) fits the Intermediate Input Buffer for matrix A and matrix B. If the original input matrix is too large for the Intermediate Input Buffer, we manually split it into smaller blocked matrices and calculate the new blocked matrix size. This calculation can be performed offline, as the input and output sizes of each CNN convolution layer are already predetermined by the target CNN architecture. For polynomial convolution, the expected input and output sizes are also predetermined; hence, the input block size can be precalculated before starting the polynomial convolution.
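As a rough illustration of this offline blocking step (a Python sketch; the buffer capacities follow the sizes quoted above, while the function interface and the example layer are assumptions):

```python
# Offline sizing sketch for the blocked matrices (illustrative only).
# Buffer capacities follow the text: matrix A buffer = 8 groups x 4 BRAMs x 2048 x 16-bit words,
# matrix B buffer = 8 groups x 8 BRAMs x 2048 x 16-bit words.
A_BUF_WORDS = 8 * 4 * 2048
B_BUF_WORDS = 8 * 8 * 2048

def block_sizes(P, Q, M, out_rows, N):
    """Return (R_IFM, C_IFM) for IBlk_IFM and (R_F, C_F) for IBlk_F.
    out_rows = number of im2row rows (output locations) of the layer, N = output channels."""
    C_IFM = P * Q * M                               # one unrolled patch per row
    R_IFM = min(out_rows, A_BUF_WORDS // C_IFM)     # as many rows as fit in the A buffer
    R_F = C_IFM                                     # rows of B match the columns of A
    C_F = min(N, B_BUF_WORDS // R_F)                # as many kernel columns as fit in the B buffer
    return (R_IFM, C_IFM), (R_F, C_F)

# e.g., a 3x3 kernel layer with 128 input channels, 28x28 outputs, 128 output channels
print(block_sizes(P=3, Q=3, M=128, out_rows=28 * 28, N=128))
```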
In practice, to compute CNN convolution using GEMM, the input feature maps and kernels need to be transformed into matrix form, using the Image-to-Column (im2col) or Image-to-Row (im2row) software routines. The matrix A construction unit is designed to perform the im2row transformation (refer to Section III-B) on the input feature map after fetching it from DDR memory, following the kernel window size in CNN convolution mode. It is designed to support striding (stride-1 or stride-2), padding (no-pad, pad-1, or pad-2), and various common kernel sizes (1 × 1, 3 × 3, 5 × 5, and 7 × 7). Such flexible support allows the reading of the polynomial during polynomial convolution mode by configuring the matrix construction with the setting of stride-1, no padding, and 1 × 1 kernel size. Similarly, the matrix B construction unit transforms the kernel and the small polynomial to matrix form during CNN convolution and polynomial convolution mode, respectively. Fig. 2 shows the input and data flow during GEMM computation. The data read from the Intermediate Input Buffer are requested based on the desired im2row data orientation. This is achieved by reading the desired location based on the read address generated by the Matrix Construction Unit. The im2row data stream is then passed into the GEMM core through the Data Streamer Unit. After GEMM computation, the results are stored into the Intermediate Output Buffers. The Intermediate Output Buffers (with a size of 16 BRAMs × 1024 locations × 32-bit) are designed to hold a small block of output data before performing post-processing operations, such as max pooling (2 × 2 and 3 × 3), activation function (ReLU), and requantization (Requant). Each of these post-processing operations can be bypassed based on computation needs. Output from these post-processing modules is written back to the off-chip memory.

B. GEMM Core Internal and Systolic Scheme Selection
The GEMM core contains PEs that calculate matrix-matrix multiplication. Cryptensor is designed for processing quantized CNN with 8-bit input data and kernel weights, as well as 16-bit and 8-bit coefficients for polynomial convolution in cryptography, respectively. Note that existing lattice-based cryptography performs polynomial convolution between a polynomial and a small polynomial, in which the small polynomial has much smaller coefficients. For instance, NTRU [3] uses one polynomial with 11-bit coefficients and another polynomial with 2-bit coefficients. Each PE is designed to perform a 16-bit × 8-bit multiplication and accumulate up to 32-bit. The GEMM core is based on the SA architecture, which transfers input operands between adjacent PEs during computation for high data reusability. However, a conventional SA requires a lot of operand registers between adjacent PEs to propagate the data for computational reuse. A quick review of the GEMM operation for CNN convolution is presented here. As illustrated in Fig. 3, matrix A is the input feature map after transformation by im2row. Each row in matrix A represents an unrolled local patch of the input feature map, stacked along its input channels. For matrix B, each column represents the kernel map for a different output channel. Similarly, within each column, the kernel weights are unrolled and stacked along their input channels. After GEMM, the output matrix C contains the output feature map to be processed by the next CNN layer. Each of its columns represents a single output feature map, arranged side by side between output channels. The GEMM expression for polynomial convolution can be found in Section IV-C.
The two common SA designs are input operand stationary and output stationary [28]. For the input operand stationary design, the first input operands (typically the 8-bit weights from matrix B) for the MAC are loaded into the PEs before computation. Then, the second input operand (typically the 8-bit input data from matrix A extended to 16-bit) and the 32-bit output result (output activation for matrix C) are streamed from the adjacent PE during calculation. For the output stationary design, two operands (the 16-bit extended input data and the 8-bit weight from matrix A and matrix B, respectively) are streamed into the adjacent PE, and the 32-bit output result (output activation for matrix C) stays in the accumulator, to be read out after computation ends. With a lower hardware resource design in mind, Cryptensor employs the output stationary scheme, which requires two smaller operand registers (16-bit and 8-bit) between PEs, as opposed to the larger 32-bit accumulation register between PEs required by the input stationary scheme.
Fig. 4 shows the SA grid layout of the internal GEMM core for the output stationary design. The GEMM in Fig. 3 can be mapped to this SA grid, with operand A equivalent to the row data of matrix A and operand B equivalent to the column data of matrix B. The 32-bit output activations for matrix C remain inside the accumulator of each PE, to be read out from the GEMM core after computation completes. To further reduce the number of operand registers, we follow the implementation scheme by Liu et al. [12], where multiple PEs are clustered together to form a single larger tensor PE (TPE). To perform clustering, we remove the intermediate operand registers between the selected PEs. For instance, to create a 2 × 2 TPE, the operand A and operand B registers are removed between PE0, PE1, PE4, and PE5. The clustering not only reduces the number of operand registers but also increases the number of operations performed per clock cycle. To further increase the computation capacity of the TPE, we propose to introduce an extra multiplier into each PE of the design in Fig. 4. Note, however, that the number of multipliers in each PE affects the adder performance. The adder would need to process an extra input for every multiplier introduced to each PE, which could result in a large and slow adder tree structure. Hence, we propose to use only two multipliers in each PE.
Fig. 4. GEMM core internals with an output stationary SA consisting of four columns and four rows of PEs.
Fig. 5. GEMM core internals after adopting the technique from [12]. The illustrated STA layout consists of two rows and two columns of TPEs (dashed bounded boxes). Each PE can perform two multiplications and one addition, compared to the SA PE in Fig. 4.
Fig. 5 shows the proposed STA grid design. While the total number of operand registers for a TPE (e.g., PE0, PE1, PE4, and PE5) is the same compared to Fig. 4, a reduction of operand registers is still achieved. For Fig. 4, the multiplier-to-operand-register ratio is 1:2, where the single multiplier in each PE (e.g., PE0) is connected to two operand registers (one operand A and one operand B). For Fig. 5, however, the ratio is 1:1, where the four multipliers of a pair of PEs (e.g., PE0 and PE1) are connected to only four operand registers (two operand A and two operand B). As such, the number of operand registers is effectively reduced by 2×. GEMM calculation using the proposed STA is illustrated in Fig. 6. However, the scheme proposed by Liu et al. [12] requires high input bandwidth to maintain optimal GEMM computation performance. Section III-C discusses how our memory storage and access scheme is designed to meet this high input bandwidth requirement.
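The effect of clustering can be modeled functionally in a few lines (a behavioral Python sketch, not the RTL; tile and operand sizes are assumptions): each 2 × 2 TPE keeps a 2 × 2 output tile stationary in its accumulators and, since every PE holds two multipliers, consumes two elements of the reduction dimension per clock cycle.

```python
import numpy as np

def tpe_output_stationary(A, B):
    """Functional model of one 2x2 TPE computing a 2x2 output tile.
    A: (2, K) rows of Matrix A, B: (K, 2) columns of Matrix B, K even.
    Two multipliers per PE -> two reduction steps are consumed per 'cycle'."""
    K = A.shape[1]
    C = np.zeros((2, 2), dtype=np.int64)          # output stays in the accumulators
    for k in range(0, K, 2):                      # one loop iteration ~ one clock cycle
        for i in range(2):
            for j in range(2):
                C[i, j] += int(A[i, k]) * int(B[k, j]) + int(A[i, k + 1]) * int(B[k + 1, j])
    return C                                      # read out only after accumulation ends

A = np.arange(8).reshape(2, 4)
B = np.arange(8).reshape(4, 2)
assert (tpe_output_stationary(A, B) == A @ B).all()
```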

C. Proposed Data Access Pattern and Storage Scheme
This section describes the proposed techniques to meet the high input bandwidth requirement of the STA [12]. Before computation starts, we first read the desired block of the input feature map and kernel map from the off-chip DDR memory and buffer it in the on-chip intermediate buffer, which is implemented using block RAMs (BRAMs) for high-speed access. Each data item of the input feature map and kernel map is assumed to be quantized to 8-bit for DDR storage. Similarly, the coefficients of the polynomial and small polynomial are assumed to be stored in 16-bit and 8-bit format, respectively. Referring to Fig. 6, each clock cycle needs at least 4 × 16-bit and 4 × 8-bit data from matrix A and matrix B, respectively. To meet this input requirement, our on-chip intermediate buffer is designed to access multiple BRAMs in parallel to fetch the required data for computation. To improve the data transfer rate, we design a storage scheme that ensures high data locality and conforms to the computation pattern. This reduces unnecessary DDR memory transactions while avoiding complex addressing hardware that could increase the hardware area of the FPGA implementation when reading from the BRAMs.
Recall that the GEMM core in this work generates multiple output activations for multiple output channels in parallel (refer to Section III-B). Since the output stationary systolic design is adopted, the MAC result is expected to remain in the accumulator of each PE until all input channels are summed up to generate the final output activation. The input operands for each PE are as follows.
1) The full depth of the local region (relevant to the currently computed output activation) in the input feature map.
2) The whole kernel map for the current output activation.
To ensure data locality, we propose to store the data in an input-channel-first format for both the kernel and input feature maps. In the input-channel-first storage scheme, the values of the input operands are arranged first along the channel dimension, followed by the width and height dimensions, respectively. Fig. 7 illustrates the expected input feature map storage pattern in DDR memory following the input-channel-first storage scheme. As illustrated in Fig. 7, each DDR location can address 16 × 8-bit input data. The 16 is due to the DDR [31] memory performing burst reads of 8 × 16-bit. This scheme ensures high data locality by storing all the input channels of each input location as close as possible, solving the potential issue of DDR memory and BRAM access discussed earlier. Furthermore, since each input location is stored following its input channels, the read address can be calculated easily in a sequential manner. This reduces the addressing complexity for read access during the matrix construction process. It should be noted that the input-channel-first storage scheme is also applicable to the polynomial convolution scenario. As illustrated in Section II-C, each cyclic polynomial can be represented as input data stacked along its input channels. The input channel dimension is equivalent to the coefficients of each cyclic polynomial. Using the same representation, no additional addressing scheme or hardware implementation is required when switching between CNN convolution and polynomial convolution modes.
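A minimal sketch of the input-channel-first flattening and the resulting sequential addressing is given below (illustrative Python; the 16-values-per-word grouping follows the burst size quoted above, while the function names are assumptions):

```python
import numpy as np

def to_channel_first_words(I, word_channels=16):
    """Flatten an (M, H, W) feature map so channels vary fastest, then group
    every 16 channel values into one DDR word (16 x 8-bit per burst read)."""
    M, H, W = I.shape
    flat = I.transpose(1, 2, 0).reshape(-1)            # order: (h, w, channel), channel fastest
    return flat.reshape(-1, word_channels)             # one row per DDR word

def ddr_word_index(h, w, c, W, M, word_channels=16):
    """Sequential word address of channel c at spatial location (h, w)."""
    return ((h * W + w) * M + c) // word_channels

I = np.arange(64 * 8 * 8, dtype=np.uint8).reshape(64, 8, 8)   # e.g., 64 channels, 8x8 map
words = to_channel_first_words(I)
assert words[ddr_word_index(2, 3, 17, W=8, M=64)][17 % 16] == I[17, 2, 3]
```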
Fig. 8 illustrates how the input feature map is stored in the BRAMs after it is read from the DDR RAM. A total of eight BRAM groups are created for each of the input data and kernel, with each group consisting of four BRAMs and eight BRAMs (each with 2048 locations × 16-bit), respectively. The number of BRAM groups is selected to conform to input channel dimensions that are divisible by 16 (e.g., 64, 128, 256, etc.), which are found in recent CNN architectures [14], [15]. For the input data, four BRAMs per group are selected to buffer the full input feature map of at least one CNN layer, typically starting from the layer with 128 input channels × 28 height × 28 width in [15]. Similarly, eight BRAMs per group are selected for kernel buffering, for CNN layers with kernel map dimensions up to 128 output channels × 128 input channels × 3 height × 3 width in [14]. Each of these BRAMs is configured in 16-bit storage mode, following the finalized operand size of our GEMM core (refer to Section IV-A). During each read, a single read address activates one BRAM from each BRAM group in parallel to output 1 × 16 input channels for each input data and kernel weight fetch. As illustrated earlier in Section III-A, the data output from the BRAMs is directed to the Data Streamer Unit and loaded into the desired PISO unit. Next, the PISO unit shifts out its data word by word, and the matrix construction unit continues to fill the next PISO unit while the previous PISO unit is offloading its data.

IV. OPTIMIZING THE PROPOSED CONVOLUTION ENGINE
This section discusses further optimizations to improve the performance of the proposed convolution engine. We first discuss how the digital signal processing (DSP) unit can be used to fuse the multipliers and reduce hardware resource consumption without trading off computation performance. Then, we propose modifications to our data storage and access scheme to improve the hardware utilization of the input layer, which has relatively few input channels. Finally, we present a memory-efficient computation strategy with improved PE utilization for polynomial convolution between a polynomial and a small polynomial.

A. Fusing Multipliers Into DSP for Better Performance
The DSP unit [33] in Xilinx FPGAs is capable of performing a 25-bit × 18-bit multiplication. However, since our design assumes a lower-bit-width (16-bit × 8-bit) MAC datapath, the PE could be mapped to look-up tables (LUTs) and FFs instead of the DSP unit by the compiler tool. This mapping could lower the accelerator performance, as it takes more effort to route and place the LUTs and FFs for a short delay path. The DSP unit is built-in hardware on the FPGA with its own dedicated routing path. The overall delay path can be effectively reduced if the multipliers are mapped onto it. For better hardware performance and utilization, our work revisits the DoubleMAC [34] technique to fuse the multipliers of the Cryptensor PEs, previously implemented using LUTs and FFs, into the DSP. Note that the input feature map data and kernel map weights of an 8-bit quantized CNN are usually represented using unsigned 8-bit values and signed 8-bit values, respectively. Similarly, the coefficients of the two polynomials of the cryptography scheme (e.g., NTRU) are represented using unsigned 11-bit and signed 2-bit values.
Fig. 9 shows how multiplications between two signed multiplicands (in 8-bit) and one shared unsigned multiplier (in 16-bit) can be performed using a single DSP. In Fig. 9, the operations +20 × 17 = +340 and −8 × 17 = −136 are performed concurrently, with 17 as the common unsigned multiplier. Before performing the multiplication, +20 and −8 are packed together to form a 25-bit input multiplicand using the DSP pre-adder. The −8 is placed at bits 7∼0 and sign-extended up to bit 24. Similarly, +20 is also sign-extended, but is shifted left to bits 24∼16. For the 18-bit multiplier, we zero-extend the common multiplier, 17, before multiplication. The final multiplication result for the left input lane (+340) is extracted from bits 31∼16, whereas bits 15∼0 hold the result for the right input lane (−136). However, the intermediate result for the left input lane (+339) differs from the expected final result (+340). This happens because the sign extension of −8 during input packing borrows one from +20, generating the packed multiplicand as +19, −8. This condition can be detected before computation and corrected using the post-adder after the multiplication in the DSP. Fig. 10 illustrates an example of the GEMM core layout with two rows and one column of TPEs (dashed bounded box) after applying this technique. The merged PE now consists of two MAC computation paths, with a single DSP and an accumulator for each path. This means each TPE (dashed bounded box in Fig. 10) now performs a 2 × 2 GEMM computation using fewer LUT and FF resources. This is because merging the two neighboring column TPEs (originally shown in Fig. 5) with the multiplication DSP mapping technique remaps every pair of LUT- and FF-implemented multipliers into a single high-performance DSP unit. GEMM calculation using the proposed merged TPE is illustrated in Fig. 11.
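The packing arithmetic of Fig. 9 can be checked with a short sketch (Python, illustrative only; it models the integer arithmetic rather than the actual DSP48 configuration, and assumes each individual product fits in 16 bits, which holds for the operand ranges used in this work):

```python
def double_mac(w_hi, w_lo, x):
    """Two signed 8-bit multiplicands (w_hi, w_lo) times one unsigned multiplier x,
    packed into a single wide multiplication as in Fig. 9."""
    packed = (w_hi << 16) + w_lo              # pre-adder: w_lo sign-extends, borrowing from w_hi
    product = packed * x                      # one wide multiplication
    lo = product & 0xFFFF                     # bits 15..0 -> right lane
    if lo >= 0x8000:                          # reinterpret as signed 16-bit
        lo -= 0x10000
    hi = (product >> 16) & 0xFFFF             # bits 31..16 -> left lane (may be off by one)
    if hi >= 0x8000:
        hi -= 0x10000
    if w_lo < 0 and x != 0:                   # borrow detected before computation,
        hi += 1                               # fixed with the post-adder
    return hi, lo

assert double_mac(20, -8, 17) == (340, -136)  # the worked example from Fig. 9
assert double_mac(-3, 5, 200) == (-600, 1000)
```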

B. Optimization for CNN First Layer
Typically, the first layer of a CNN assumes an input with three input channels, because images are usually stored in the RGB format with three color channels. However, our proposed input-channel-first storage scheme is more efficient for convolution layers with at least 16 input channels. The small number of input channels of the first layer cannot efficiently utilize the input bandwidth (1 × 16 input channels of input data per BRAM read, as discussed in Section III-C), and may reduce the performance of the GEMM core or even generate inaccurate results. To resolve this issue, McDanel et al. [17] propose input reshaping for the first layer. Both the input image and kernel map of the first layer are rearranged to reduce their height and width but increase their input channel dimension. However, their proposed technique [17] needs to retrain the CNN model, as it involves modifying the input layer of the CNN architecture. To overcome such a limitation, we propose a similar approach that introduces an extra input channel filled with zeros for the input feature map and kernel map. To further ensure uniform memory fetching, we also propose to increase the kernel map width by padding it with zero values. This reduces the address generation complexity and eliminates the need to handle misaligned input data fetches before streaming into the GEMM core for computation. The extension caters only to the first layer and is not applied to the other CNN layers. The extra zero channel is likewise appended to the kernel map to maintain a consistent input channel dimension with the input feature map for the convolution calculation. The extended input feature map is stored as 4 input locations × 4 input channels, where each input value is an 8-bit data item, at every DDR memory location in Fig. 12(b).
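A minimal sketch of the zero-channel extension and the 4-location × 4-channel packing is shown below (illustrative Python; the packing order is an assumption consistent with Fig. 12, not the exact DDR layout):

```python
import numpy as np

def extend_first_layer(img):
    """Pad a (3, H, W) RGB image with one all-zero channel so the channel count
    becomes 4, then pack every 4 consecutive input locations x 4 channels into
    one 16-byte DDR word."""
    C, H, W = img.shape
    ext = np.concatenate([img, np.zeros((4 - C, H, W), dtype=img.dtype)], axis=0)  # (4, H, W)
    # channel-first flattening: the 4 channels of each (h, w) location sit together
    flat = ext.transpose(1, 2, 0).reshape(-1)          # (H*W*4,)
    return ext, flat.reshape(-1, 16)                   # one row = 4 locations x 4 channels

img = np.random.randint(0, 256, (3, 224, 224), dtype=np.uint8)
ext, ddr_words = extend_first_layer(img)
assert ext.shape == (4, 224, 224) and ddr_words.shape[1] == 16
assert (ext[3] == 0).all()                             # the extra channel is all zeros
```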
Fig. 13 shows how the 4 input locations × 4 input channels of data are stored in the eight BRAM groups. Unlike the BRAM storage pattern for input feature maps with larger input channels (refer to Fig. 8), each BRAM address (dashed bounded box) now points to four different input locations. This implies that a single address access is no longer guaranteed to fetch the required input location, due to the padding and striding settings of the CNN kernel window scanning. To resolve this issue, four separate address generators are created to read 4 × 4 inputs in parallel for computation, so that it follows the existing addressing scheme in Section III-C. Another issue with the first layer is that its kernel map size still does not align with the 4 × 4 addressing scheme after input channel padding. For example, VGG-16 [14] uses a 3 input channel × 3 kernel height × 3 kernel width kernel. After input channel padding, the kernel map is in 4 × 3 × 3 dimension. When this kernel map is fetched and stored into BRAM, it becomes 36 sequential data items, which is not divisible by 16 and thus does not conform to the 4 × 4 addressing scheme. As such, we propose to extend both the input channel and the width of the kernel map for uniform BRAM access. Unlike the input channel extension of the input feature map and kernel map, the kernel width is not extended directly on the kernel map for storage; instead, it is zeroed upon data fetch from BRAM. For instance, when computing a 3 × 3 kernel, we configure the address generator to read in 3 × 4 dimension, but zero the extra dimension before passing the fetched data into the Data Streamer Unit. Similarly, the address generator is configured to read in 7 × 8 dimension for a 7 × 7 kernel (the first layer of ResNet [15]).
The scheme proposed in Section III-C fetches 1 input location × 16 input channels. Now, with our dimension extension technique on the input channel and kernel map width, it is extended to support fetches of 4 input locations × 4 input channels for the input layer. This enables uniform memory access from BRAM when generating the GEMM core input stream. Similarly, the proposed storage pattern reduces the DDR and BRAM addressing complexity and does not require misaligned input handling. Misaligned input handling would be necessary if the storage followed the original first layer data dimension, which cannot easily be rearranged to be divisible by 16. The extension to the input channel of the kernel map can be done offline and hence does not incur overhead to the computation.
Fig. 14. GEMM polynomial convolution technique as proposed by [11]. Location "?" in Matrix B is empty for the single polynomial convolution scenario, reducing it from a matrix-matrix to a matrix-vector computation.
For the input image, we require preprocessing to introduce the extra input channel to its existing RGB storage format. However, our technique only needs to zero the extra input channel, which is a trivial operation that does not affect the overall computational performance. It is also worth mentioning that the extra input channel of the input feature map and kernel map is not computed by our GEMM core, because it can be easily detected and skipped by the Data Streamer Unit. Compared to the technique in [17], our proposed technique does not affect the computation result or computation performance, and most importantly, does not need to retrain the CNN.

C. Memory Efficient Polynomial Convolution
The GEMM polynomial convolution technique by Lee et al. [11] is useful for cloud servers, which typically process a large number of small polynomials at the same time. However, embedded systems (e.g., edge and sensor nodes) only need to compute one encapsulation/decapsulation per communication session. This means that only one polynomial convolution is processed at each instance. Directly using the technique from [11] reduces the matrix-matrix multiplication of GEMM polynomial convolution to a matrix-vector multiplication, which under-utilizes the GEMM accelerator. To mitigate this issue, we propose a factorization technique that processes a single polynomial convolution as a matrix-matrix multiplication.
Fig. 14 shows a small example of the technique proposed by Lee et al. [11] to perform polynomial convolution using matrix-matrix multiplication. First, it rotates the polynomial coefficients by one position at a time to generate the cyclic patterns that form a large matrix (Matrix A). The total number of cyclic polynomials required is equal to the degree of the polynomial. For instance, a degree-4 polynomial has four coefficients, so four cyclic polynomials are generated. The small polynomials are directly processed without any transformation (Matrix B). These four small polynomials are multiplied with the four cyclic polynomials of Matrix A to generate four output polynomials (Matrix C). However, only one polynomial convolution is processed on edge nodes, so Matrix B contains only one small polynomial. That means the "?" locations of Matrix B in Fig. 14 are empty. Computing the GEMM operation in this manner under-utilizes Cryptensor, which is designed to perform at least 2 × 2 GEMM computation.
Fig. 15 shows our proposed factorization technique, which is derived from the observation of computational patterns in polynomial convolution. In the original form, the output coefficients are generated from the cyclic product of two polynomials. However, if we rearrange the order of the product summations (rearranged form in Fig. 15), a cyclic pattern can be observed among the intermediate product terms. Further decomposing these product terms yields two small matrices that represent both polynomials. This essentially translates the original large matrix-vector product in Fig. 14 into a small matrix-matrix multiplication. Through this decomposition, it is further observed that the input matrix size and output matrix size are correlated with the factor of the output polynomial degree. Take the example of the convolution between two degree-4 polynomials: the technique proposed by Lee et al. [11] involves two 4 × 4 matrices. Our proposed factorization technique decomposes the original 4 × 4 matrices into 2 × 4 and 4 × 2, reducing each dimension by a factor of two.
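The sketch below (illustrative Python) demonstrates one factorization that is consistent with this description and with the dimensions quoted above; the exact matrix layout of Fig. 15 may differ, so the construction should be read as an assumption. The point it verifies is that a single cyclic convolution can be computed as an (N/f × N) by (N × f) matrix-matrix product, here with f = 2.

```python
import numpy as np

def cyclic_conv_reference(a, b, q):
    """Schoolbook cyclic convolution: c_k = sum_j a_{(k-j) mod N} * b_j (mod q)."""
    N = len(a)
    return [sum(a[(k - j) % N] * b[j] for j in range(N)) % q for k in range(N)]

def cyclic_conv_factorized(a, b, q, f=2):
    """Single polynomial convolution expressed as a small matrix-matrix GEMM.
    Matrix A: first N/f rows of the circulant matrix of a -> (N/f, N)
    Matrix B: b and its rotations by multiples of N/f      -> (N, f)
    Output C: (N/f, f), column t holding coefficients c_{k + t*N/f}."""
    N = len(a)
    step = N // f
    A = np.array([[a[(k - j) % N] for j in range(N)] for k in range(step)])
    B = np.array([[b[(j + t * step) % N] for t in range(f)] for j in range(N)])
    C = (A @ B) % q
    # stitch the output tile back into a length-N coefficient vector
    return [int(C[k % step, k // step]) for k in range(N)]

a, b, q = [3, 1, 4, 1], [2, 0, -1, 1], 17
assert cyclic_conv_factorized(a, b, q) == cyclic_conv_reference(a, b, q)
```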
By transforming the matrix-vector multiplication into an equivalent matrix-matrix form, our technique reduces memory accesses. On top of that, with a flexible decomposition factor, our technique can be applied to different GEMM cores, such as a matrix-vector core, a tall-thin matrix-matrix core, or a square matrix-matrix core as proposed in Section III-B. The proposed factorization technique also does not affect the computation result, since summing the product terms in any order yields the same result.
Although the proposed factorization technique does not introduce additional hardware, neither polynomial comes in the factorized pattern naturally. Rearranging the polynomial coefficients can be performed in software, which introduces a slight overhead. However, this overhead can be eliminated if hardware support is available. Since the cyclic form assumes a fixed pattern, memory addresses can be calculated easily to read the polynomial in a cyclic manner. This dedicated addressing scheme can be integrated into our matrix construction unit (refer to Section III-A) to enable on-the-fly matrix generation for both polynomials, with a small additional hardware consumption. Since the matrix construction and GEMM core operate in parallel, no computational overhead is incurred in preparing these matrices if the hardware approach is used.

V. EXPERIMENTAL RESULTS

A. Experimental Setup
To evaluate the proposed Cryptensor, we configured our design to perform a CNN image classification task and polynomial convolution (cryptography task). For the CNN application, we configured our accelerator to perform image classification on the ImageNet dataset using the ResNet-18 and VGG-16 architectures, with a batch size of one and an input size of 224 × 224. Both weights and input data are quantized to 8-bit following the framework in [35]. Note that since we do not propose any modification to the CNN architecture, the accuracy achieved is the same as reported in [35] when executing the CNN model using our proposed Cryptensor. For the cryptography application, we performed experiments on the NTRU parameter sets from the IEEE standard [3]. The polynomial coefficients are 11-bit and zero-extended to 16-bit in storage for ease of access. Similarly, the 2-bit small polynomial coefficients are sign-extended to 8-bit for storage. Note that we do not implement the entire NTRU scheme, as Cryptensor is only used to accelerate the polynomial convolution, which is the most time-consuming part.

B. Evaluation of the First Layer Optimization
Table I shows the comparison of our proposed first layer optimization with the work by McDanel et al. [17]. They proposed to rearrange the input feature map by reducing the input height and width but extending the number of input channels. However, as the number of input channels increases, the total trainable parameters of the first layer kernel map also increase. Hence, it is impossible for them to execute the VGG-16 and ResNet-18 models, which originally have only three input channels, without retraining. Furthermore, the change to the input channels also increases the total operations of the first layer of ResNet-18. Referring to Table I, 1.31× extra operations are introduced compared to the original ResNet-18 model. In contrast, our approach also extends the input channels from three to four for aligned memory access, but the extra channel is padded with zero values. Since the zero values do not affect the original accuracy, our technique does not need to retrain the model. Furthermore, this extra input channel with zero values is automatically skipped when computing the first layer using our accelerator. Hence, our technique does not introduce extra computation, in contrast to the technique by McDanel et al. [17].

C. Evaluation of the Proposed Implementation Strategies
Table II shows the comparison between three different hardware implementation strategies: the conventional SA (SA_MULT), the improved STA [12] (STA_MULT), and our proposed optimized STA (STA_DSP). For both *_MULT strategies, a 2 × 2 PE layout is implemented, and their PEs follow the design in Fig. 5 of Section III-B. Note that the multipliers inside these PEs are implemented in LUTs and FFs. The proposed STA_DSP assumes a layout of 2 × 1 PEs, and each PE follows the proposed design in Fig. 10 of Section IV. This strategy fuses all multipliers in the PE onto the high-performance DSP units of the FPGA. In general, a progressive reduction of LUTs and FFs is observed between the implementation strategies. From SA_MULT to STA_MULT, a reduction of 10.81% and 18.75% is observed for LUTs and FFs, respectively. This reduction of resources is due to the removal of operand registers between adjacent PEs in STA_MULT. In addition, when the proposed STA_DSP is used, a further reduction of 72.75% LUTs and 44.23% FFs is observed compared to STA_MULT. This happens because the multipliers are fused onto the DSP, reducing resource consumption compared to STA_MULT, which uses LUTs and FFs to implement the multipliers. However, LUTs and FFs are still required by the proposed STA_DSP to implement the correction circuitry that recovers from the effect of packing multiple signed MACs into a single DSP.

D. Accelerator Resource Consumption
The final GEMM core layout implemented in our proposed accelerator consists of four columns and four rows of TPEs. Each TPE is designed using the technique described in Section IV-A, where each TPE is created with 4 PEs and is capable of performing a 2 × 4 GEMM at every clock cycle. The proposed accelerator is designed in Verilog RTL, synthesized, and implemented using Vivado 2018.2, targeting two Zynq-7000 series devices, namely, the XC7Z020 and XC7Z045, and one Virtex UltraScale (xcvu440-flga2892-3-e) FPGA by Xilinx. Table III shows the implementation results of our proposed design. For the smaller XC7Z020 FPGA, our proposed design achieves 125 MHz, with 14926 LUTs and 12238 FFs. This implementation is slower, but uses fewer LUTs and FFs compared to the larger XC7Z045 and XCVU440. However, when implementing on the larger FPGAs, a higher frequency (200 and 250 MHz) is achieved, using ∼14.0% more LUTs and ∼2.8% more FFs for XC7Z045 and ∼12.0% more LUTs and ∼2.5% more FFs for XCVU440. This difference in performance is because larger FPGAs have more routing resources, allowing the place-and-route algorithm to perform more radical optimizations at the expense of increased LUTs and FFs. It is also worth mentioning that the peak memory bandwidth required averages 0.17 GB/s, which is only 7.1% of the 2.40 GB/s [36], [37] maximum memory bandwidth supported by the Zynq-7000 series and Virtex UltraScale FPGA devices, respectively. This shows that our proposed accelerator is suitable for integration into soft-core systems with minimal impact on the memory bandwidth.

E. Evaluation on CNN Convolution
Table IV shows the comparison of the proposed Cryptensor with existing CNN accelerators. The throughput (GOPS) of our accelerator is obtained through RTL simulation. For comparison purposes, we normalized all systems to 8-bit. For instance, we multiply the DSP efficiency (GOPS/DSP), logic efficiency (GOPS/kLUT), and energy efficiency (GOPS/W) by two for comparison works that use a 16-bit datapath, as a 16-bit datapath can ideally deliver 2× the performance of our 8-bit datapath. In general, Cryptensor achieves the best GOPS/W and GOPS/kLUT compared to the existing works. Additionally, our ResNet-18 implementations on XC7Z020 and XC7Z045 achieve the best GOPS/DSP when compared to Chang et al. [7] and Xiao and Liang [19]. For the case of VGG-16, our implementation on XC7Z020 delivers 21.6% higher throughput compared to Venieris and Bouganis [18]. Additionally, our design also achieves 2.72× higher GOPS/W compared to their work [18]. Another work that also implements VGG-16 [38] achieves higher throughput than Cryptensor, but it also consumes 95.0% more LUTs than our XC7Z045 implementation. Cryptensor achieves 3.59× better GOPS/kLUT compared to their work [38]. On top of that, the lower LUT consumption of Cryptensor also leads to 2.71× better GOPS/W compared to Liang et al. [38]. Similarly, Cryptensor has a higher GOPS/DSP at 0.545 compared to Li et al. [39], which is only at 0.440. Since Cryptensor is designed as an efficient hardware accelerator for resource-constrained IoT applications, we attempt to strike a balance between reasonable computational performance and minimal hardware area. Furthermore, the existing accelerators only focus on CNN acceleration, but our proposed Cryptensor can also support computation for cryptography without adding additional hardware. This is a critical feature for many applications like IoT, which is not found in any of these existing works.

F. Evaluation on Polynomial Convolution
Table V shows the results for two IEEE [3] NTRU parameter sets: 1) ees761ep1, with 761 polynomial coefficients, and 2) ees1499ep1, with 1499 polynomial coefficients, which correspond to two different security levels. This shows that Cryptensor can flexibly support multiple security levels using the same hardware. The standalone co-processor proposed by Khan et al. [10] is 5.6× faster than the proposed Cryptensor for the ees1499ep1 parameter set. However, Cryptensor achieves a better throughput-to-area efficiency (2.38) compared to Khan et al. [10] (1.69). Our architecture uses fewer LUT resources due to the efficient STA architecture. Additionally, despite the longer computation time, the throughput of Cryptensor is compensated by its efficient architecture. This is supported by the maximum supported frequency of Cryptensor, which is 250 MHz after implementation, 2.24× higher than that of Khan et al. [10] at only 111.45 MHz. Based on the experimental results, we see that although Cryptensor consumes more computation cycles, it remains compact in area and operates at a higher frequency, thus achieving a more area-time efficient design.
Note that in practice, Cryptensor reuses the existing CNN core to compute the polynomial convolution in NTRU, which does not utilize any additional hardware. This is achieved by using the novel factorization technique proposed in Section IV-C. Considering the case where a separate co-processor is developed to compute the NTRU polynomial convolution following the technique proposed by Khan et al. [10], one would need an additional 58.5k LUTs, which is almost 3.4× more than Cryptensor. Additionally, our factorization technique is a software approach, making it a universal technique that is applicable to other GEMM accelerators as well. To conclude, Cryptensor is a flexible design that trades off computation time for area efficiency, a useful feature for supporting various IoT applications with different security requirements.

VI. CONCLUSION
Deep learning and cryptography algorithms are essential to IoT applications. Developing individual accelerator architectures for these algorithms consumes significant area on FPGA platforms. In this article, a novel unified co-processor architecture is proposed to compute both the convolution in CNN (deep learning) and in lattice-based schemes (cryptography). A technique is proposed to improve the utilization of the STA in the CNN accelerator. By employing the factorization technique proposed in this article, we are able to reuse the same CNN accelerator to compute the polynomial convolution used in many lattice-based schemes, including NTRU [3], [21].

Fig. 1. Full system overview of our proposed convolution accelerator.

Fig. 2. Input and output data flow during GEMM computation. The Matrix B data flow is identical to that of Matrix A.

Fig. 7. Proposed storage pattern. (a) Input feature map with ICH = 64. (b) Arrangement of the input feature map in DDR following the input-channel-first format. Each DDR address can access 16 × 8-bit input data.

Fig. 8. Input feature map storage pattern in BRAMs. Data read from DDR are equally distributed to eight BRAM groups. In our proposed work, each group is arranged as four BRAMs × 2048 locations × 16-bit. Every single BRAM address (represented by a dashed bounded box) can read 1 × 16 input channels of input feature map data.

Fig. 10. GEMM core after fusing with the neighboring PE. The illustrated STA layout has two rows and one column of TPEs (dashed bounded box). Each PE within a TPE consists of two computation data paths that perform four multiplications and two additions.

Fig. 11. Merged TPE after applying the multiplier fusion technique. Each merged TPE (dashed bounded box) performs a 2 × 2 GEMM computation at every clock cycle using only four DSP units, compared to using eight multipliers in the illustration in Fig. 6.

Fig. 12. (a) Original input feature map for the first layer extended to four input channels and (b) its storage pattern in DDR. Each DDR read addresses four input locations, each with four input channels.

Fig. 13. Extended input feature map storage after fetching from DDR, equally distributed to eight groups of BRAMs. Every single BRAM address (represented by a dashed bounded box) now reads 4 × 4 input channels of input feature map data for the input layer.

TABLE II. COMPARISON OF HARDWARE IMPLEMENTATION STRATEGY

TABLE III. IMPLEMENTATION RESULT OF THE PROPOSED ACCELERATOR