Priority Queue VLSI Architecture for Sequential Decoder of Polar Codes

Abstract—VLSI architectures for a stack or priority queue (PQ) are required in the implementation of stack or sequential decoders of polar codes. Such decoders provide good BER performance while keeping complexity low. Extracting the best and the worst paths from the PQ is the most complex operation in terms of both latency and complexity, because it requires a full search over the priority queue. In this work we propose a low-latency, low-complexity parallel hardware architecture for the PQ, based on a systolic sorter and simplified sorting primitives. Simulation results show that only a small BER degradation is introduced compared to ideal full sorting networks. The proposed PQ architecture is implemented in an FPGA, and synthesis results are presented for all components of the PQ.


I. INTRODUCTION
Polar codes were proposed by Arıkan [1] in 2009 and are already utilized in the 5G NR standard as a FEC scheme for control channels [2]. The main reason for such rapid adoption by industry is that polar codes are the first codes proven to achieve channel capacity while still having explicit algorithms for code construction, encoding and decoding.
Paper [1] also proposed the Successive Cancellation (SC) decoder, which exploits the features of the polarization transform to find a solution. The main drawbacks of this decoding algorithm are low error correction capability for codes of practical lengths and the sequential nature of the decoding process. The latter causes high decoding latency that depends linearly on the codeword length N, since the SC decoder constructs the solution codeword bit by bit by traversing one path in a binary decoding tree.
To overcome the first drawback, Successive Cancellation List (SCL, [3]) and Successive Cancellation Stack (SCS, [4], [5], [6]) decoders were proposed. The SCL decoder constructs the L best paths in parallel and tends to maximum-likelihood decoding as L increases; the cost is higher computational complexity. The SCS decoder uses the same best-path search width as SCL, but relies on a sorted stack to find and continue only the single best path in the stack during each iteration. This approach results in lower computational complexity with the same error correction capability, but greatly increases decoding latency compared to SCL. The sorter is the main bottleneck of the SCS decoder, since it may become quite complex.
There are several other proposals that aim at SCL decoder complexity reduction by eliminating unnecessary calculations. For example, in [7] the authors propose a hybrid scheme where many separate SC decoder cores work in parallel, but in case of a failed CRC check in one of them, the decoder bank switches to SCL mode. To flatten decoder throughput, an extra input buffer is used. Paper [8] presents an input-distribution-aware SCL decoder that disables some of the SC processing cores when decoding "good" codewords, thus dynamically changing the list size in order to reduce energy consumption. Other papers [9], [10], [11] also present SCL decoders that adjust the list size, in order to retain the good error correction capability of SCL on the one hand and lower the average computational complexity on the other.
All these techniques do reduce the average complexity and/or energy consumption, but they are not capable of reducing chip area and, similarly to the SCS decoder, they result in higher decoding latency. These solutions also yield less flexible architectures and require extra control logic to switch between list sizes and modes. The Sequential (SQ) decoder [12], [13] can be seen as a modification of the SCS decoder that uses several ingenious improvements to reduce decoding latency. A basic hardware architecture of the sequential decoder is proposed in [14]. One of the most important blocks in the SQ decoder is the Priority Queue (PQ), or stack in SCS decoder terms. This block contains partial paths, and during each decoder iteration it outputs the single best path; thus its delay is an essential part of the overall decoding delay.
To the best of our knowledge, no papers about the SCS decoder provide a detailed algorithm of stack processing or a hardware architecture, especially for the best path extraction operation. In this paper we propose a new hardware architecture for the PQ, based on a folded systolic sorter, and provide its detailed description and analysis in terms of complexity and decoder BER performance. The PQ sorter is based on the systolic sorting algorithm presented in [15].
The rest of this paper is structured as follows. The SQ decoding algorithm and the general SQ decoder structure are described in Section II. The details of the proposed PQ architecture are given in Section III. Sequential decoder performance analysis is presented in Section IV. Section V deals with the FPGA implementation. Finally, conclusions are presented in Section VI.

II. POLAR CODES SEQUENTIAL DECODER
A. Sequential Decoding Algorithm

The SC decoder was proposed by Arıkan in [1]. In the logarithmic domain this algorithm uses two functions, namely the Q function and the P function (Eq. 1), to traverse the decoding tree [14]. The decoding tree leaves correspond to codeword bits, so traversing the tree corresponds to bit-by-bit successive decoding. In (Eq. 1), a and b are LLR values taken from other functions' outputs or from the channel, and ps is the partial sum calculated from the decoded bits. At each iteration the decoder makes a one-bit step along the codeword, obtaining the estimate of this bit and incrementing the decoding phase φ. In this case only one path (partial solution codeword) is considered and no path memory is required. If an error occurs in one bit, it will not be corrected and may cause errors in some of the following bits of the current codeword.
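Eq. 1 is not reproduced here; as an illustration, the standard min-sum formulation of the two SC decoding functions can be sketched as follows (the function names `q_func`/`p_func` are ours, not from the paper):

```python
import math

def q_func(a, b):
    """Check-node (Q) update in the min-sum approximation:
    sign(a) * sign(b) * min(|a|, |b|)."""
    return math.copysign(1.0, a) * math.copysign(1.0, b) * min(abs(a), abs(b))

def p_func(a, b, ps):
    """Variable-node (P) update: the partial sum ps (a previously decoded
    bit) selects the sign with which the first LLR operand is combined."""
    return b + (a if ps == 0 else -a)
```

The sign of the LLR output of these functions gives the hard estimate of the current bit at each leaf of the decoding tree.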
The sequential decoder proposed in [12] keeps N_PQ paths of different lengths in the PQ. During each iteration the PQ outputs the best path in terms of metric, and this path is extended using the operations in Eq. 1. If the current phase φ refers to a frozen bit, the path is not split and returns to the PQ after a metric update. In the case of a non-frozen bit the path has two possible continuations; metrics are calculated for both of them, and then both are stored in the PQ.
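One iteration of this pop-extend-push loop can be sketched with a software priority queue; the metric-update helper `llr_penalty` and the frozen-set predicate are placeholders for the Eq. 1 computations performed by the Metric Processor:

```python
import heapq

def extend_best_path(pq, phi_is_frozen, llr_penalty):
    """One sequential-decoder iteration on a software PQ.  heapq is a
    min-heap, so metrics are stored negated (best path = smallest key).
    The penalty values are illustrative stand-ins for the Eq. 1 updates."""
    neg_metric, phase, bits = heapq.heappop(pq)   # best path leaves the PQ
    metric = -neg_metric
    if phi_is_frozen(phase):
        # frozen bit: a single continuation returns to the PQ
        heapq.heappush(pq, (-(metric - llr_penalty(phase, 0)),
                            phase + 1, bits + (0,)))
    else:
        # information bit: both continuations are stored
        for u in (0, 1):
            heapq.heappush(pq, (-(metric - llr_penalty(phase, u)),
                                phase + 1, bits + (u,)))
    return pq
```

In hardware the same pop-extend-push cycle is performed by the sorter and Metric Processor operating in parallel, as discussed in Section II-B.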
The path metric is calculated as the total penalty of the path, as in [16]. In order to bound the computational delay, both the SCS and SQ decoders use the search width parameter L. The decoder maintains counters q_φ, which indicate how many paths have passed phase φ. If some counter q_i becomes equal to L, then all paths with phase less than i have to be deleted from the PQ. This parameter has the same meaning as the list size in the SCL decoder, but in the SQ decoder it limits the computational delay instead of the decoder area.
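The counter mechanism can be sketched as follows (paths are modelled as `(neg_metric, phase, bits)` tuples; the function name is illustrative):

```python
def update_counters_and_prune(pq, counters, phase_passed, L):
    """After a path passes phase `phase_passed`, increment its counter q;
    once the counter reaches the search width L, drop every stored path
    whose phase is smaller -- it can no longer be among the L survivors."""
    counters[phase_passed] += 1
    if counters[phase_passed] == L:
        pq[:] = [p for p in pq if p[1] >= phase_passed]   # p[1] is the phase
    return pq
```

In the proposed hardware this deletion is performed by the Phase Check Blocks described in Section III, rather than by rebuilding the queue.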
In the SQ decoder, as in the stack one, the number of decoding iterations depends on the input LLR vector and can vary over a wide range. The main cause is that longer paths on average have worse metrics than shorter ones, simply because they have had more opportunities for metric decrease during decoding. Thus the decoder tends to return to shorter paths and prolong them unreasonably often. The bias function Ψ(φ) from Eq. 2 was suggested in [17] to remedy this imperfection by increasing the metrics of longer paths, thereby reducing the mean number of iterations.
As a result, with the bias function the mean number of iterations tends to that of the SCL decoder, although individual values may be several times higher. The bias function values are obtained through modelling and depend on the signal-to-noise ratio.
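The effect of the bias on path selection can be illustrated as below; the exact Ψ(φ) values of Eq. 2 are obtained by modelling, so the linear bias table used here is a placeholder:

```python
def best_path_index(paths, psi):
    """Select the path with the largest biased metric M(path) + Psi(phase).
    Longer paths receive a larger bias, so the decoder is not drawn back
    to short paths too often.  `paths` holds (metric, phase) tuples."""
    return max(range(len(paths)),
               key=lambda i: paths[i][0] + psi[paths[i][1]])
```

With a zero bias table the shorter path with the better raw metric wins; with a growing Ψ(φ) the longer path is preferred, which is exactly the behaviour that reduces the mean iteration count.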

B. General SQ Decoder Structure
In the most general view, the SQ decoder consists of two main subblocks. The first is the Metric Processor (MP), which implements the operations from Eq. 1 to calculate path metrics at each iteration. The MP in [14] uses a fully parallel structure in order to reduce the latency of one iteration and to avoid the intermediate result storage proposed in [13].
The second subblock is the Priority Queue, which stores and sorts candidate paths and provides input data to the MP at every iteration. The PQ stores all required information about each path, sorts the paths according to their metrics, and outputs the single best path each iteration. The PQ sorting delay strongly affects the overall decoding delay, because the MP latency is usually less than that of the sorting operation.
In this paper we do not dive into MP implementation details, but we can formulate several requirements that the MP imposes on the PQ:
• the MP and PQ operate in parallel;
• each iteration, the PQ has to provide the best and the worst paths in terms of metric, for continuation and deletion respectively;
• each iteration, the MP processes the best path from the PQ;
• each iteration, one or two paths enter the PQ, and a new search is needed in the next iteration to find the path with the best metric;
• the PQ stores the N_PQ paths themselves, their metrics and current phases;
• the PQ has to delete paths shorter than φ if some counter q_φ becomes equal to L.
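The requirements above amount to the following behavioural interface; this is a software model of the contract between MP and PQ, not the RTL port list:

```python
class PriorityQueueModel:
    """Behavioural model of the PQ as seen by the Metric Processor.
    Method and field names are illustrative.  Paths are stored as
    (metric, phase, bits) tuples, best metric = largest."""

    def __init__(self, size):
        self.size = size            # N_PQ
        self.paths = []

    def push(self, *new_paths):
        """One or two extended paths enter each iteration; the worst
        stored path is evicted when the queue overflows."""
        self.paths.extend(new_paths)
        self.paths.sort(reverse=True)       # best first
        del self.paths[self.size:]          # drop the worst on overflow

    def best(self):
        """Best path, forwarded to the MP for continuation."""
        return self.paths[0]

    def prune(self, phi_min):
        """Delete paths shorter than phi_min (q-counter mechanism)."""
        self.paths = [p for p in self.paths if p[1] >= phi_min]
```

The hardware PQ of Section III implements `best()` and the implicit worst-path eviction with a systolic sorter instead of a full re-sort.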

III. PROPOSED PRIORITY QUEUE ARCHITECTURE
Since the paths themselves are binary vectors of length up to N, it is preferable to store them in RAM. In this case the PQ splits into several blocks, presented in Fig. 1. The Path RAM stores the paths, i.e. binary vectors that represent parts of possible solution codewords, and has a classical dual-port RAM architecture. The q_φ Counters block consists of a set of counters. It outputs the φ_min value; all paths with phase less than φ_min are deleted from the Path RAM and Sorter blocks.

The Early Output (EO) block is inserted in order to reduce the PQ delay. The PQ finds the overall best path among those stored in the RAM and the two paths that have just arrived from the MP. Since the best path is quite often one of the newly arrived ones, the PQ does not need to save it to the RAM, but simply forwards it to the MP for further processing. This situation is very common for "good" words, which are only weakly corrupted by noise. Thus this block compares three paths to find the best one and forwards it to the MP. The two other paths are forwarded to the RAM (one of them may already be there).
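Behaviourally, the EO block is a three-way compare; a minimal sketch (names are ours, and one MP input may be absent on frozen-bit iterations):

```python
def early_output(best_in_ram, new_a, new_b=None):
    """Early Output block: compare the best RAM path with the one or two
    paths just arrived from the MP.  The overall best is forwarded
    straight to the MP; the others go to (or stay in) the RAM.
    Paths are (metric, payload) tuples, best metric = largest."""
    candidates = [p for p in (best_in_ram, new_a, new_b) if p is not None]
    best = max(candidates, key=lambda p: p[0])
    rest = [p for p in candidates if p is not best]
    return best, rest
```

Forwarding a freshly arrived winner directly saves both the RAM write and the subsequent sorter search on that iteration.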
The Sorter block is the most crucial one in the PQ, since its performance and complexity largely determine those of the whole decoder. In this paper we use a sorter structure based on the algorithm from [15], which results in the parallel architecture depicted in Fig. 2. The main advantage of this algorithm and architecture is a very low resorting delay when only two elements change their values at a time and only the single best element is needed after resorting.
In a hardware implementation of the SQ decoder, the Metric Processor block requires several clock cycles for metric calculation, and during its processing the PQ needs to perform one iteration of sorting and provide the result to the EO block. The two layers of Compare and Select blocks from Fig. 2 require only one or two clock cycles for processing, due to their simple structure. Together with reading data from the RAM, this means that a direct implementation of Fig. 2 requires 2 (or 3) clock cycles for processing. For codes of length 512 and higher this delay is less than that of the MP block. The proposed architecture uses a set of universal 3-input sorting blocks (CAS3, depicted in Fig. 3), which rearrange 3 inputs into a sorted set of 3 outputs according to their metrics. Since the CAS3 block only needs to place the maximum-metric input on its first output, we propose to let the other two outputs remain unsorted. This simplification greatly reduces the resource count of the CAS3 block and increases its scalability, while the BER performance remains almost unchanged (Figs. 5, 6). CAS2 blocks have a trivial structure of one metric comparator and multiplexers for all bus elements.
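The simplified CAS3 primitive can be modelled as follows; only the first output is guaranteed sorted, which is the simplification proposed above:

```python
def cas3(x, y, z):
    """Simplified 3-input compare-and-select: the first output carries
    the maximum-metric input; the remaining two are passed through
    unsorted, which removes a comparator/multiplexer stage.
    Inputs are (metric, payload) tuples."""
    triple = (x, y, z)
    i = max(range(3), key=lambda k: triple[k][0])     # index of max metric
    rest = tuple(triple[k] for k in range(3) if k != i)
    return (triple[i],) + rest
```

A fully sorting CAS3 would need to order all three outputs; dropping that ordering is what keeps the block small while, per the simulation results, leaving the BER almost unchanged.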
Phase check blocks (PCB) from Fig. 4 are used on each sorter input to reset the sorter contents when decoding of a new word starts, and to compare path phases with the current φ_min value and delete paths with smaller phases, reducing the average number of decoder iterations. In order to reduce complexity, the latter function is performed only on the N_PCB upper sorter inputs. In Section IV we provide results for various N_PCB values and show that in most cases this value can be much smaller than N_PQ.
Path deletion is done by forcing the path metric to minus infinity, i.e. to the minimum possible negative value for the chosen fixed-point bit width. In the case of sign-magnitude arithmetic, forcing is done by setting the maximum possible magnitude with the negative sign. Since deleted paths remain in the sorter until they sink to its bottom, some decoder performance degradation occurs compared to a full sorter output.
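A minimal sketch of this deletion-by-saturation, assuming a two's-complement metric (the 16-bit width is illustrative, not a design parameter from the paper):

```python
METRIC_WIDTH = 16                     # illustrative fixed-point width
METRIC_MIN = -(1 << (METRIC_WIDTH - 1))   # most negative representable value

def delete_path(entry):
    """Mark a path as deleted by forcing its metric to the most negative
    representable value; the entry then sinks to the sorter bottom and
    is eventually overwritten by incoming paths."""
    metric, phase, bits = entry
    return (METRIC_MIN, phase, bits)
```

No extra "valid" flag is needed in the sorter datapath: the saturated metric alone guarantees the entry loses every comparison.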
The systolic sorter requires a number of iterations to find the best and the worst paths, while these outputs have to be provided at every iteration. A suboptimal path may therefore be output during the first several iterations after path deletion via the phase counter mechanism. However, the decoder performance degradation caused by these errors is negligible (Fig. 5).
In order to further reduce resource consumption, we propose the folded sorter (Fig. 4), which uses only half of the CAS3 blocks of the original unfolded one (Fig. 2). This single layer of CAS3 blocks is used twice per sorting iteration. This adds up to one clock cycle of delay, but as long as the overall PQ delay is lower than the MP delay, it does not reduce decoder throughput. One could further reduce the number of CAS3 blocks by folding the sorter vertically, but this would require much more complex multiplexing and control logic, so the resource gain would be smaller. Another option is to use a one-layer sorter and give it extra clock cycles to perform more than one sorting iteration during one decoding iteration. When the q_i phase pass counters are used, this reduces the performance degradation compared to the ideal sorter.
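The folding idea of reusing one physical compare layer twice per iteration can be sketched with a 2-input odd-even transposition pass; the real sorter of [15] uses 3-input CAS3 blocks and a different interconnect, so this only illustrates the time-multiplexing:

```python
def sort_iteration_folded(vals):
    """One sorting iteration using a single physical compare layer twice:
    first on even-offset pairs, then on odd-offset pairs.  Repeated
    application sorts the list in descending order (best values bubble
    toward index 0, like best paths toward the sorter top)."""
    def layer(v, offset):
        v = list(v)
        for i in range(offset, len(v) - 1, 2):
            if v[i] < v[i + 1]:
                v[i], v[i + 1] = v[i + 1], v[i]
        return v
    return layer(layer(vals, 0), 1)      # same hardware, two passes
```

In hardware the two passes share the comparator layer through multiplexers, which is where the factor-of-two block saving comes from.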

IV. PERFORMANCE ANALYSIS
In this section we provide simulation results for the performance of the sequential decoder equipped with the proposed PQ. The code under consideration is the polar subcode from [18] with N = 512, R = 1/2 and 24 dynamic frozen bits. The size of the PQ is 256. The performance is analyzed in terms of BER, FER and the average number of decoder iterations. We also compare the performance of the proposed PQ with that of a PQ using an ideal sorter. Here "ideal" means that a full sort of all paths in the PQ is done on each iteration, and this operation does not require any clock cycles.
The BER and FER performance of the sequential decoder with various modifications of the proposed PQ is shown in Fig. 5. A number of cases is considered:
• the performance of the decoder with an ideal sorter in the PQ, with and without the heuristic bias function (for reference);
• the performance of the decoder with the proposed PQ, which uses the systolic sorter, for various numbers of PCB blocks.
There is a quite small BER and FER degradation of less than 0.05 dB for the decoder with the proposed PQ architecture at high and low SNR values, compared to the case of the PQ with the ideal sorter. The error rate performance also does not depend greatly on the number of PCBs in the considered range N_PCB = 6 ... 256.
Fig. 6 shows the average number of iterations for different variants of the proposed PQ. First of all, we note the dramatic decrease in the number of iterations when the heuristic bias function is applied, as introduced in [16]. Moreover, the proposed PQ architecture yields approximately the same results as the PQ with the ideal sorter. Only a small degradation (< 1%) in the number of iterations can be seen at low SNR values. An important fact here is that the number of iterations does not depend on the number of PCB blocks in the considered range. This allows using just 6 PCB blocks, reducing the complexity of the PQ.
Note that at high SNR values the average number of iterations is less than the code length 512. We propose to start the decoding process not from phase 0, but from the phase that corresponds to the first non-frozen element. This allows a slight additional reduction of the average number of iterations.

V. FPGA IMPLEMENTATION

In general, the total resource usage depends on the PQ length. The block memory consumption does not depend on the sorter complexity and is defined only by the PQ size. The block memory stores only the decoded bits of each path, while path phases and metrics are stored inside the systolic sorter. The number of flip-flops depends on the PQ size and the path metric bit width, as the registers are placed at the systolic sorter outputs. The main logic consumption comes from the systolic sorter's Max Units, whose number is proportional to the PQ size. The PCB blocks also require additional logic, but their number can be quite small (as few as 6), still providing very small BER degradation (Fig. 5).

VI. CONCLUSION
This paper presents a hardware architecture of a priority queue consisting of early output logic, RAM and a systolic sorter. It ensures low latency in outputting the best and the worst paths at each decoder iteration. The folded systolic sorting approach, combined with simplified basic sorting units, provides low resource consumption. The simplification of the basic sorting units causes a BER degradation of less than 0.05 dB at a BER level of 10^-4, and the average number of iterations grows by less than 1%. The proposed architecture can be used in stack or sequential decoders of polar codes.