Efficient Error Detection Schemes for ECSM Window Method Benchmarked on FPGAs

Elliptic curve scalar multiplication (ECSM) stands as a crucial subblock in elliptic curve cryptography (ECC), which represents the most widely used prequantum public key cryptography. Hardware constructions of cryptographic systems utilizing ECSM have been subject to permanent or transient errors. In cryptographic systems, it is important to validate the correctness of the underlying computation performed on hardware or software to identify such errors. In this article, we present new fault detection schemes in window method scalar multiplication, which, to the best of our knowledge, has not been previously investigated. Our approach involves introducing refined algorithms and implementations that can effectively counter both permanent and transient errors. We assess this by simulating a fault model, ensuring that the evaluations conducted reflect the obtained results. As a result, we achieve a significantly extensive coverage of errors. Finally, we benchmark our proposed error detection scheme on ARMv8 and field-programmable gate array (FPGA) to demonstrate the implementation and resource overhead. On Cortex-A72 processors, we maintain a clock cycle overhead of under 3%. In addition, when implementing our error detection method on different FPGAs, including Zynq Ultrascale+, Artix-7, and Kintex Ultrascale+, we achieve comparable throughput while introducing a mere 2% increase in area compared with the original hardware implementations.


Efficient Error Detection Schemes for ECSM Window Method
Benchmarked on FPGAs Kasra Ahmadi , Saeed Aghapour , Mehran Mozaffari Kermani , and Reza Azarderakhsh Abstract-Elliptic curve scalar multiplication (ECSM) stands as a crucial subblock in elliptic curve cryptography (ECC), which represents the most widely used prequantum public key cryptography.Hardware constructions of cryptographic systems utilizing ECSM have been subject to permanent or transient errors.In cryptographic systems, it is important to validate the correctness of the underlying computation performed on hardware or software to identify such errors.In this article, we present new fault detection schemes in window method scalar multiplication, which, to the best of our knowledge, has not been previously investigated.Our approach involves introducing refined algorithms and implementations that can effectively counter both permanent and transient errors.
We assess this by simulating a fault model, ensuring that the evaluations conducted reflect the obtained results.As a result, we achieve a significantly extensive coverage of errors.Finally, we benchmark our proposed error detection scheme on ARMv8 and field-programmable gate array (FPGA) to demonstrate the implementation and resource overhead.On Cortex-A72 processors, we maintain a clock cycle overhead of under 3%.In addition, when implementing our error detection method on different FPGAs, including Zynq Ultrascale+, Artix-7, and Kintex Ultrascale+, we achieve comparable throughput while introducing a mere 2% increase in area compared with the original hardware implementations.

I. INTRODUCTION
In 1985, Miller [1] and Koblitz [2] each independently introduced the application of elliptic curves in the field of cryptography.Elliptic curve cryptography (ECC) has garnered significant attention among public key cryptographic algorithms due to its shorter key sizes.To ensure efficient modular reduction, the National Institute of Standards and Technology (NIST) has suggested using Solinas prime as the modulus for short Weierstrass curves.It is noted that, researchers in [3] have raised concerns about potential backdoors in the NIST curves and have put forward Curve25519 and Curve448, which were developed by Hamburg [4], [5], as a viable alternative variants.Shor's [6] algorithm indicates that quantum computing has the potential to break the majority of current cryptographic systems.The emergence of quantum computers has raised concerns about the security of ECC, and it is expected to be eventually replaced with postquantum cryptography (PQC).Despite the need for transitioning to PQC due to potential quantum threats, ECC remains widely used for several reasons.Efficient algorithms have been introduced for the elliptic curve scalar multiplication (ECSM) operation, such as the binary method, double-and-add-always, Montgomery ladder, and window method [7].To resist side channel attacks, it is essential to design the scalar multiplication algorithm to be regular, ensuring that its operation flow remains independent of the input scalar.

A. Related Works
A number of prior studies focus on the implementations and fault detection in various arithmetic aspects of both classical and PQC, encompassing ECSM, among others [8], [9], [10], [11], [12], [13], [14].While fault detection procedures have been explored for cryptosystems, there has been limited research focusing on fault detection at the algorithm level of ECC.
Dominguez-Oviedo and Hasan [15] and Dominguez-Oviedo [16] addressed this gap by presenting fault detection schemes at the algorithm level for scalar multiplication on general curves.Their research introduced error detection schemes for the double-and-addalways and Montgomery ladder ECSM algorithm for nonsupersingular elliptic curves.It was found that the overhead of error detection algorithm on double-and-add-always is about 27% for projective coordinates and less than 1% for affine coordinates.Nevertheless, unlike prior studies, the process of selecting coordinates has not significantly influenced our proposed error detection approach for window method multiplication.In other words, such adoptions to other coordinates can also be performed.

B. Major Contributions
To the best of our knowledge, no prior research has explored fault detection in the context of window method scalar multiplication.The potential also exists for our error detection method to be utilized in the τ -nonadjacent form (NAF) conversion of the scalar within Koblitz curves, in cases where the window method ECSM is applied.
1) We have proposed an algorithm level fault detection scheme on window method ECSM.Our proposed fault detection algorithm has been mathematically proven to determine the extent of error coverage it provides.As such, we have simulated error injection on our proposed approach.According to the simulation results, our proposed scheme demonstrates the ability to detect a wide range of fault variants with a high level of error coverage.2) We have implemented our proposed fault detection scheme on ARMv8 and field-programmable gate array (FPGA) architectures to assess its viability.For less constrained usage models, we selected Cortex-A72 from ARMv8 family, which is used in Raspberry Pi 4. Our results show that we can achieve very high error coverage while adding only 3% more clock cycles on software.Furthermore, our error detection technique is executed on different FPGAs, including Zynq Ultrascale+, Artix-7, and Kintex Ultrascale+, delivering comparable throughput with a mere 2% rise in area as opposed to the initial method.

A. Window Method
Given the availability of additional memory, the window method significantly improves the efficiency of scalar multiplication by 1063-8210 © 2023 IEEE.Personal use is permitted, but republication/redistribution requires IEEE permission.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.
Algorithm 1 Window Method Scalar Multiplication utilizing a table of precomputed points.Window algorithms share similarities with binary algorithms, but they differ in that each iteration of the former concentrates on a window of ω scalar bits, rather than processing individual bits.In other words, each loop iteration handles a digit of the scalar represented in radix 2 ω .The window method, which is described in Algorithm 1, involves two stages.The first stage is the precomputation, which is a one-time process, and the second stage is the window method arithmetic.

III. ERROR DETECTION IN WINDOW METHOD ECSM
In this section, we introduce the proposed error detection scheme in window-based scalar multiplication.The process involves coherency check (CC) validation, which examines the intermediate or final output to ensure it adheres to a valid pattern in any given algorithm.Essentially, such validation is capable of confirming the specific relationship between intermediate variables and the output of any routine.
Our error detection approach, which is described in Algorithm 2, involves modifying Algorithm 1 to validate the occurrence of various patterns within a window of a specific length.For a window length of ω, there will be 2 ω different patterns in the binary form of a scalar.In our proposed error detection scheme, we keep track of the count for each pattern both before and during the ECSM operation.The function known as pattern_counter, outlined in Algorithm 3 and utilized in line 9 of Algorithm 2, is responsible for computing the occurrences of scalar patterns with respect to length of window, ω, before commencing the primary operation of window method scalar multiplication.In line 19 of Algorithm 2, the occurrences of scalar patterns are computed once again and stored in post_pattern_counter.At the conclusion of Algorithm 2, if there is a disparity between the values of pre_pattern_counter and post_pattern_counter, it indicates the occurrence of at least one error during the ECSM operation.This coherency checking process takes place at line 21 of Algorithm 2.

IV. SIMULATIONS AND MATHEMATICAL ANALYSIS
Within this section, we introduce the employed fault model and subsequently conduct a mathematical analysis and derive proofs for

A. Fault Model
The window method calculation in Algorithm 2 may experience the occurrence of random or burst errors during its execution.To simulate faults, we utilized the pseudo-program counter (PC) model.According to this model, after executing each instruction, Algorithm 4 Fault Calculation Function a counter (PC) is incremented.In this particular model, we made assumptions that each line of code is an instruction, and all instructions carry an equal workload.Consequently, if a specific instruction within an algorithm executes more, the likelihood of encountering an error during that particular instruction increases.Algorithm 4 outlines our fault calculation function, which is responsible for randomly choosing faulty instructions prior to executing Algorithm 2.
Once the fault calculation function has been executed, a check is performed before each instruction execution.This check determines whether the counter corresponds to a line of faulty code or not.If the current counter is associated with a faulty instruction, it indicates that a fault should occur, leading to a random alteration of the output produced by that particular instruction.We assumed any fault introduced would result in a modification of the instruction's output.

B. Mathematical Analysis
Lines 10-13 and 16 in Algorithm 2 execute log k times, while lines 15 and 18 execute a maximum of log k/ω times, where ω represents the length of window and k is the scalar.In Algorithm 4, line 2 computes the possible_fault_domain, which corresponds to the maximum value of the PC counter.Our error detection scheme is capable of identifying errors that occur in all the instructions except for the point_double and point_add operations (corresponding to lines 16 and 18).The cause for this is that any alterations made to the instructions, with the exception of point_double and point_add, have a direct impact on the post_pattern_counter array.This array is later compared with the pre_pattern_counter array at the conclusion of Algorithm 2.
The probability of undetected fault is α = (log k + log k/ω)/ (5 log k + 2 log k/ω).Error detection ratio in our scheme will be 1 − α m , where m is the number of faults, which will happen in the window method scalar multiplication.It is evident that the error detection ratio increases in scenarios with a higher number of faults.

C. Simulation Results
Using Python 3, we conducted simulations utilizing the presented fault models and error detection methods.The simulation used million samples, each utilizing varying ω.As described in Table I, employing a smaller window length leads to a higher error detection ratio in cases of single fault occurrence.Moreover, as the number of faults increases, we can achieve an approximately 100% error detection ratio across all window lengths.Burst errors can be identified with higher error coverage as well.

V. IMPLEMENTATION BENCHMARKS AND COMPARISON
In order to demonstrate the effectiveness and error detection capabilities of our proposed scheme, we opted to evaluate it using short Weierstrass curve and employed the binary form of the scalar for the analysis.We performed an implementation benchmark with ω = 3, on the Cortex-A72 processor and various FPGAs, including Zynq Ultrascale+, Artix-7, and Kintex Ultrascale+ to assess the overhead of our approach.This analysis clearly illustrates that our methods maintain a reasonable overhead while successfully achieving a high ratio of fault detection.

Implementation Results
Through the implementation of our design on ARMv8 and FPGA architectures, we gained valuable insights into the efficacy of our schemes across diverse platforms.For hardware implementation, we have selected AMD/Xilinx FPGA, including Zynq Ultrascale+, Artix-7, and Kintex Ultrascale+.We used high-level synthesis (HLS) Vitis development environment to synthesize our proposed schemes to register transfer-level (RTL) hardware description.We used Vitis resource allocation pragmas to achieve the respective area and performance strategies and to control the trade-off between the time and the required resources.The same pragma rules and clock were used in the baseline and the proposed error detection scheme.In the context of the Vivado synthesis tool, area is defined as a combination of Slices and Digital Signal Processing Units (DSPs).In addition, the equivalence ratio used is that one DSP is considered equal to 100 Slices.The latency was evaluated using identical test vectors for each hardware implementation.Tables II-IV present the area, timing, power, and energy derivations of our proposed error detection scheme based on the aforementioned area and timing effort strategies on three different AMD/Xilinx FPGAs.Our proposed error detection designs successfully reach an approximate maximum operational frequency of 60 MHz across all the FPGAs.Our proposed error detection scheme incurs a maximum overhead of 2% in terms of additional area and adds a latency of up to six clock cycles at most.Our implementation code in HLS and simulation code are available in our GitHub account. 1or the ARMv8 architecture, we chose the Raspberry Pi 4 as our platform, equipped with quad 1.5-GHz Cortex-A72 cores.To ensure a suitable environment, we installed Raspberry Pi OS lite 32 bit with kernel version 6.1 as the operating system on the Raspberry Pi 4. For evaluating performance on the ARMv8 architecture, we utilized the Performance Application Programming Interface (PAPI), a well-established framework for measuring system performance [18].Table V illustrates the clock cycle counts for both the proposed error detection scheme and the baseline approach.

B. Implementation Optimization
We note that the performed implementations are utilized for deriving the overhead of the error detection schemes used, to assess their suitability for constrained usage models; nevertheless, implementations on other hardware platforms (other FPGA families and devices and also ASIC) would result in similar overheads, because the proposed schemes are platform oblivious.While HLS transitioned the design abstraction from RTL to C/C++, it is often essential, in practice, to engage in substantial source code rewriting, including the insertion of pragmas, to achieve satisfactory performance.We achieved our intended area and timing efforts implementation by strategically placing the following pragmas within our program.
1) Pragma HLS Inline: We placed this pragma to limit the resource allocation to control the area usage.Inlining promotes cross optimization among various C functions arranged hierarchically.Differing from a basic optimization method that handles each function independently, this pragma encourages resource sharing to improve performance and reduce the usage of digital resources.2) Pragma HLS Pipeline: We placed this pragma to enable instruction-level pipelining to increase the throughput and clock frequency within the doubling and addition modules, which are the most time-consuming components in the window method ECSM.However, it is worth noting that this optimization comes with the trade-off of consuming additional digital resources.3) Pragma HLS Interface: We used this pragma to integrate our scheme into Vivado as an external IP.This pragma not only simplified the incorporation process but also resulted in an overall performance boost and more efficient utilization of resources, effectively making our IP an integral part of the FPGA design.

VI. DISCUSSION
The presented error detection scheme can be utilized on various curves, such as Montgomery (Curve448 and Curve25519) and Weierstrass, all of which employ the fixed window method ECSM.In addition, it applies to both the binary and NAF of the scalar.Moreover, it has the potential to be employed in the τ -NAF conversion of the scalar within Koblitz curves, where the window method ECSM is utilized.It is important to note that our error detection approach is not reliant on particular coordinates.This signifies that whenever the window method algorithm is chosen for scalar multiplication due to application requirements and hardware limitations, our error detection approach remains applicable.

A. Challenges and Shortcomings
A potential add-on to this work could be an extension, so that our proposed error detection scheme can identify errors occurring within the doubling or addition components.Our research focuses specifically on the algorithmic level of the window method ECSM.
To address this issue, multiple error detection schemes [8], [9] have been performed for multiplication, addition, inversion, and squaring, which can be integrated into our work.Furthermore, another method to overcome the aforementioned issue is to easily confirm that the doubling and the addition's output lie on a valid elliptic curve.This procedure is referred to as point verification (PV) [15].

VII. CONCLUSION
In this article, we proposed an error detection scheme for window method ECSM operation, which is a crucial operation in ECC-based cryptosystems.By computing the occurrences of possible scalar patterns in a window length, we provide a coherency function between the scalar as an input and intermediate outputs on the window method scalar multiplication.Through conducting simulations and mathematical analysis, we showed that our proposed error detection scheme achieved high error coverage.Furthermore, we implemented our proposed design on FPGAs and ARMv8.In terms of overhead, the implementation resulted in negligible additional cost in hardware and software.

Algorithm 2 Algorithm 3
Proposed Error Detection Scheme in Window Method Scalar Multiplication Pattern_Counter Function each error detection scheme in relation to this fault model.Following that, we present the simulation results obtained through Python implementation, utilizing the proposed fault model and error detection schemes.

TABLE I ERROR
DETECTION RATIOS OF OUR PROPOSED SCHEME WITH 1, 2, 4, 8, AND 16 FAULTS FOR WINDOW LENGTH (ω) OF 2, 3, 5, 7, AND 10 INCURRED IN THE IMPLEMENTATION OF ALGORITHM 2 BASED ON THE SIMULATION RESULTS