Outperforming Sequential Full-Word Long Addition With Parallelization and Vectorization

The article presents algorithms for parallel and vectorized full-word addition of big unsigned integers with carry propagation. Because carry propagation introduces data dependencies between the digits of the operands, software parallelization and vectorization of non-polynomial addition of big integers have long been considered impractical. The presented algorithms are based upon parallel and vectorized detection of carry origins within the elements of vector operands, masking of the bits which correspond to those elements, and subsequent scalar addition of the resulting integers. The acquired bits can consequently be taken into account to adjust the sum using the proposed generalization of the Kogge-Stone method. Essentially, the article formalizes and experimentally verifies a parallel and vectorized implementation of carry-lookahead adders applied at an arbitrary granularity of data. This approach is noticeably beneficial for manycore, CUDA and vectorized implementations using AVX-512 with masked instructions. Experiments show that the parallel and vectorized implementations of the proposed algorithms can be multiple times faster than a sequential ripple-carry adder or adders based on redundant number systems such as the one used in the GNU Multiple Precision library.


INTRODUCTION
Use of big-integer arithmetic is required in many fields of technology and science such as cryptography [1], digital signal processing [2] and high-precision scientific computations [3].
Additive operations provide a foundation for most of the operations of big-integer arithmetic. Although most modern published research is focused on big-integer multiplication, addition plays a significant role in it. For instance, every integer multiplication algorithm in [4] and [5], i.e., the schoolbook multiplication, Comba's method, the Toom-Cook multiplications (including Karatsuba's method), as well as the asymptotically more efficient Fourier-transform-based algorithms over rings or floating-point algebras (e.g., Schönhage-Strassen [6] and Fürer multiplication [7]), relies on a separation of the multiplicands into smaller values, which in turn have to be multiplied, using the same or a different algorithm, and then recombined using long addition to produce the product.
Specifically, the family of Toom-Cook algorithms computes the values of polynomials, which represent the multiplicands and the product, at several different points on the curves defined by the polynomials. Then the final product is computed as the value of a polynomial which results from interpolation given the smaller products at the chosen set of points. Given a suitable choice of points, the problem of multiplication is reduced in size. The algorithm thus involves long addition during the computation of the polynomial values, during their smaller pointwise multiplications, as well as during the interpolation of the result to yield the final product.
The Fourier-transform-based multiplication algorithms [6], [7] and [8] rely on the convolution theorem to compute the product. Depending on the algorithm, the multiplicands are represented using a ring of either integer or complex values. The choice of the ring, as well as of its principal root of unity used in the Fourier transform, defines the choice of the discrete (DFT) or fast (FFT) Fourier transform algorithm as well as the implementation of elementwise multiplication, addition and remainder computations. First, the DFT by itself, whether direct or using one of the FFT algorithms, requires at best O(n log n) (in the case of the radix-2 Cooley-Tukey FFT) and at worst O(n^2) additions. The size of the additions again depends on the choice of the principal root of unity and the underlying ring of integers, or on the required precision of the floating-point representation of complex values, with the latter formally described in [7] and [2]. Particularly, the rings of integers used in [6] and [7] are equipped with arithmetic operations, including addition, modulo 2^n + 1. In order to implement those, one could use an approach described in [9] particularly for such moduli. The approach represents addition and subtraction modulo 2^n + 1 in terms of additive operations modulo 2^n, which can be implemented using the algorithms proposed in this paper. Furthermore, the acquired product, which results from the inverse Fourier transform of the pointwise multiplication of the transformed multiplicands, needs to be recombined into the final scalar product by adding up parts of the pointwise products to perform the multiplicative carry operation, which again requires long addition.
Therefore, the importance of long addition is at least comparable to that of long multiplication, both of which provide the foundation for many algorithms and implementations as stated above. One way to improve the performance of these operations is to compute parts of the initial problem in parallel. While in many respects multiplication can be conducted in parallel, such as the evaluation of polynomials at different points or of the components of the Fourier transform of the inputs, the requirement to propagate excessive higher-order digits makes addition (as well as subtraction and the additive steps of multiplication) poorly parallelizable due to data dependencies. This, in particular, hinders vectorization of addition, and implementers of long addition (e.g., the GNU multiprecision library, as of version 6.2.0) choose scalar ripple-carry addition (e.g., the ADD/ADC instructions in the case of the x86 and x86-64 architectures) in favor of SIMD. Indeed, the number of operations required to correctly propagate and take into account the carry flag after the addition of each element of the addends is at least equal to the number of elements within the addends, and thus vectorization and parallelization might yield no performance benefits.
Yet detection of digits which can produce carry flags can partially be performed in parallel, which provides an opportunity to increase performance and energy efficiency of adders at a cost of increasing complexity of implementation.
A detailed overview of methods to implement adders at bit-level granularity can be found in [10]. The most straightforward way to implement addition of long integers is based upon a ripple-carry adder which adds up the digits of the addends sequentially, propagating carry bits. Many implementations rely on this method because it is cheap and easy to implement, both as a program and as a system of logic gates with minimal wiring and respective latencies. This adder can easily outperform (see Section 6) other, more complicated methods, especially when the size of the addition is small.
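As a point of reference, the sequential method can be sketched in portable C. This is a minimal illustration of the ADD/ADC pattern; the function and variable names are ours and not taken from any of the cited implementations:

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* Sequential ripple-carry adder over 64-bit words, least-significant
   word first; mirrors what an ADD/ADC instruction pair does in scalar
   x86-64 code. */
uint64_t ripple_carry_add(const uint64_t *x, const uint64_t *y,
                          uint64_t *sum, size_t n) {
    uint64_t carry = 0;
    for (size_t k = 0; k < n; ++k) {
        uint64_t t = x[k] + y[k];      /* may wrap modulo 2^64      */
        uint64_t c1 = t < x[k];        /* carry out of x + y        */
        sum[k] = t + carry;
        carry = c1 | (sum[k] < t);     /* carry out of (x + y) + c  */
    }
    return carry;                      /* outgoing carry bit        */
}
```

Each iteration consumes the carry produced by the previous one, which is exactly the data dependency discussed throughout this section.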
Yet detection of the digits which will or might produce carry bits is easily parallelizable when no carry propagation resulting from adding up less significant digits is performed. A general approach to such an operation is given in [11] and [12], both of which provide algorithms commonly known as carry-lookahead adders. These adders precompute sums without carry propagation and then perform a parallel reduction to iteratively adjust the result to account for carry propagation. This approach is widely used in implementations of addition when performance and energy efficiency are important [13], [14], [15] and [16].
The disadvantage of carry-lookahead methods is the complexity of their implementation. For instance, in [17] the authors combine the two approaches, i.e., Kogge-Stone [11] and Brent-Kung [12], to find a balance between the sparsity of implementations of the former and the latency of the latter caused by an increased number of sequential operations. Another way to balance the performance and complexity of addition is to use a tree-like structure of the addition algorithm which conditionally adjusts elements of the carry-less sum based on whether less significant digits yield a carry; if they do not, the adjustment can be skipped [15], in which case a performance gain may take place. Carry-select adders operate similarly regarding the selection of either a sum or its incremented value based on the value of an inbound carry bit produced as a result of the lower-order addition. Works such as [18] and [19] propose designs of such adders with the primary purpose of reducing the complexity of the adders.
Another approach to implementing parallel addition is to express the sum in a redundant number system in which a given value can be represented in more than one way. Such are the carry-save [10] and borrow-save [20] adders, both of which are roughly equivalent in performance. These allow addition with no interdependence of digits provided that the representation of at least one of the addends and of the sum remains in the respective redundant number system. Though fast and scalable by themselves, the conversion between the redundant and non-redundant number systems will at best make their performance equivalent to that of ripple-carry adders, because the conversion requires adjustments of digits in the non-redundant number system, which is linear in time, and also requires extra memory (and I/O operations) to store the values in the redundant number system.
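The trade-off can be illustrated with a hypothetical word-level carry-save accumulator in C: accumulation is free of cross-digit dependencies, but converting back to the non-redundant form costs a linear carry-propagation pass. All identifiers are illustrative, not taken from [10] or [20]:

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* Word-level carry-save accumulation: digit sums and their deferred
   carries are kept in two separate vectors, so each position is
   updated independently of the others. */
void csa_accumulate(uint64_t *s, uint64_t *c, const uint64_t *x, size_t n) {
    for (size_t k = 0; k < n; ++k) {   /* each position is independent */
        uint64_t t = s[k] + x[k];
        c[k] += (t < x[k]);            /* record, do not propagate, the carry */
        s[k] = t;
    }
}

/* Conversion back to the non-redundant form requires a linear,
   sequential carry propagation, which is the cost noted in the text. */
uint64_t csa_resolve(uint64_t *s, uint64_t *c, size_t n) {
    uint64_t carry = 0;
    for (size_t k = 0; k < n; ++k) {
        uint64_t t = s[k] + carry;
        uint64_t out = (t < s[k]);
        s[k] = t;
        carry = out + c[k];            /* carries recorded at k feed k + 1 */
    }
    return carry;
}
```

Repeated calls to `csa_accumulate` are cheap and vectorizable; only the final `csa_resolve` pays the ripple-carry price.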
These works focus primarily on hardware implementations of adders and employ bit-level parallelism in order to improve the performance and energy efficiency of addition. Also, in the existing literature the discussed adders are mostly implemented for short operands, and addition of arbitrary-length integers is poorly represented [21]. This paper, on the other hand, provides ways to implement parallel carry-lookahead addition of arbitrarily long integers at the machine-word level using a set of common arithmetic instructions on word-sized scalars and vectors thereof.
To that end, the paper formalizes and presents two algorithms for parallel addition of long full-word unsigned integers using widespread scalar multiprocessors and vectorization. Essentially, the results are a generalization of carry-lookahead adders, especially the Kogge-Stone adder, to adding up numbers represented as arbitrary-length vectors of machine words under the assumption that a pair of words can be added up in constant time by the underlying hardware. The capabilities required from a vector processor in order to implement such addition with performance gains are also investigated.
The paper concludes with an experimental evaluation of the performance of the proposed adders implemented for several underlying architectures and compares the results with a sequential implementation of long addition using the Intel ADC instruction. The latter can be thought of either as a ripple-carry adder (and is regarded as such in the paper) or, given the equivalence between the multiplexer of a carry-select adder and the ADC instruction, as a carry-select adder, since both sequentially select/produce a sum of their operands adjusted given a carry bit.
The results show a significant performance boost of the proposed parallel and vectorized algorithms for the addition of arbitrarily long full-word integers. Besides parallelism, the performance is gained because there is no need to convert data between different number systems or to perform unnecessary data movement, the latter of which is shown to be highly impactful on overall performance. The other main contribution of the paper is that it is agnostic about the (full-word) number system used and does not limit or fix the sizes of the addition problem, which makes the results applicable in a wide range of use cases.
Throughout the paper the following notation is used. Vectors which represent long integers of n words are in the group ⟨Z_W^n, ⊞⟩, which is isomorphic to ⟨Z_{W^n}, +⟩. Particularly, each scalar v_i (0 ≤ i < n) can be either an unsigned integer from ⟨Z_W, +⟩ or a signed integer provided that the latter has a W's-complement representation. Additionally, the branchless carry detection algorithm for signed integers presented in Section 2 relies on the bit representation of v_i. In this (and only in this) case, W is assumed to be a power of two.
Also, separate b-bit scalars are used by the presented algorithms to perform carry propagation. Therefore, the underlying platform is assumed to provide a way to implement integral b-bit arithmetic, i.e., addition in ⟨Z_{2^b}, +⟩. There are three addition operations used in the paper: ordinary scalar addition "+" defined for an additive group ⟨Z, +⟩ which contains the long integers being added up; bitwise XOR, denoted ⊕, of elements in ⟨Z_{2^n}, ⊕⟩ (i.e., addition homomorphic to that of the polynomials in {p ∈ Z_2[x] | deg(p) < n}); and a vector addition ⊞ of n-length vectors in Z_W^n (with the respective scalars added up modulo W) to form an additive group ⟨Z_W^n, ⊞⟩. The latter is isomorphic to the group ⟨Z_W[x], ⊞⟩ of polynomials of degrees less than n.

DETECTION OF CARRY ORIGIN
The first step of the addition algorithm is to detect origins of set carry flags resulting from addition of vector elements. This detection is easily performed using a comparison of vector values. For this, two cases need to be considered with respect to signed and unsigned comparison provided by vector processors. For instance, Intel AVX-512 and ARM SIMD extensions provide instructions (e.g., VPCMPUD and CMHS) for unsigned comparison. On the other hand, some processors only allow comparison of signed integers, as is the case with SSE-SSE4.2 and AVX2.
Addition of two unsigned scalars x and y overflows if and only if x + y < x, where x, y ∈ Z_W and the addition is performed modulo W. Since there is no dependency here on anything other than x and y, this detection, when applied to multiword integers, can be done in parallel using elementwise unsigned comparison. That is, given a pair of n-element vectors V = (v_0 ... v_{n-1})^T and U = (u_0 ... u_{n-1})^T, as well as a function CMPGT(V, U) = ((v_0 > u_0) ... (v_{n-1} > u_{n-1}))^T which maps (Z_W^n)^2 → Z_2^n, the detection of the carry flags is performed by the call C(V, U) = CMPGT(V, V ⊞ U) (or, alternatively, C(V, U) = CMPGT(U, V ⊞ U)).
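In scalar C, the same elementwise test might look as follows. This is a sketch; packing the comparison results into an integer bit mask anticipates Section 3 and is our choice of layout, not a prescribed one:

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* Elementwise detection of carry origins for unsigned words: the
   addition v + u (mod 2^64) overflows iff the wrapped sum is smaller
   than v. Each element is independent, so a compiler can vectorize
   this loop. Requires n <= 64 so the mask fits one word. */
uint64_t carry_mask_unsigned(const uint64_t *v, const uint64_t *u, size_t n) {
    uint64_t c = 0;                     /* bit k set iff element k generates a carry */
    for (size_t k = 0; k < n; ++k) {
        uint64_t s = v[k] + u[k];       /* carry-less vector sum element */
        c |= (uint64_t)(s < v[k]) << k; /* CMPGT(V, V (+) U) per element */
    }
    return c;
}
```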
In the case of signed vector comparison, the detection is a little more complicated as it requires considering three cases (Fig. 1). As stated in Section 1, the rest of this section assumes that the elements of the vectors are represented with a set of bits. If this condition is met, one can proceed as follows. First, if the most significant bits of both addends, x and y, are reset, then their addition will never yield a set carry flag, because the addition of the lower-order digits will yield a carry of at most 1 which, added to the zero highest-order bit of the sum, will never produce a new carry. Second, if the most significant bits of both addends are set, then their arithmetic sum will always be either 2 (binary 10) or 3 (binary 11), with the left bit becoming the value of the carry flag and the right bit equal to the value of the incoming carry (resulting from the addition of the less significant bits). Otherwise, i.e., when the most significant bits of x and y differ, the resulting carry flag will always be equal to the carry produced by the addition of the lower-order bits, and to the complement of the sum of the most significant bits of the operands and that carry, modulo 2. Indeed, if the most significant bit of one operand is set and the most significant bit of the other operand is reset, then the arithmetic sum of the most significant bits will equal one plus the carry of the lower-order bits, that is, 1 (binary 01) or 2 (binary 10), with the left bit becoming the new carry flag, which is the opposite of the right bit. Assuming that the most significant bit is a sign bit, the latter check can be performed by an arithmetic comparison of the sum with zero.
Similarly to the unsigned case, these bitwise operations introduce no data dependencies and thus can be applied to SIMD vectors. In this case, given a similarly defined vector comparison CMPGT (although for signed integers), the three cases above combine into C(V, U) = VOR(VAND(CMPGT(O, V), CMPGT(O, U)), VANDN(CMPGT(O, V ⊞ U), VOR(CMPGT(O, V), CMPGT(O, U)))), where O is a zero n-element vector, and VAND, VOR and VANDN are respectively bitwise AND, OR and AND-NOT operations upon the respective elements of the operands. AND-NOT complements the first operand and evaluates the bitwise conjunction of the result with the second operand.
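A scalar model of this three-case rule, using only sign tests as a stand-in for signed vector comparison, could look like this (hypothetical identifiers; a carry comes out of v + u iff both sign bits are set, or exactly one is set and the sign bit of the wrapped sum is clear):

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* Carry detection when only signed comparison is available (as with
   SSE2/AVX2 PCMPGT): the most significant bit is read as a sign bit.
   Requires n <= 64 so the mask fits one word. */
uint64_t carry_mask_signed(const int64_t *v, const int64_t *u, size_t n) {
    uint64_t c = 0;
    for (size_t k = 0; k < n; ++k) {
        uint64_t s = (uint64_t)v[k] + (uint64_t)u[k]; /* wrapped sum */
        unsigned vn = v[k] < 0;            /* sign (msb) of v[k]      */
        unsigned un = u[k] < 0;            /* sign (msb) of u[k]      */
        unsigned sn = (unsigned)(s >> 63); /* sign (msb) of the sum   */
        unsigned carry = (vn & un) | ((vn ^ un) & (sn ^ 1u));
        c |= (uint64_t)carry << k;
    }
    return c;
}
```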

CARRY PROPAGATION
The main problem which hinders efficient parallelization and vectorization of addition is the need for carry propagation: adding up separate pairs of digits can cause additional carry flags to appear, which creates a data dependency of the higher-order digits of the result upon the lower-order digits. Nonetheless, once the carry flags which stem from the addition of the vector elements (as well as, possibly, a carry flag which originates from a prior addition of less significant parts of the original long integers which the vectors are part of) are determined, the following approach can be used to replace n checks and increments of vector elements with ⌈n/b⌉ operations which accumulate the bits produced by the function C(V, U) (given above) in an integral value stored in b-bit registers which allow arithmetic addition or subtraction. Any general-purpose integral registers (or the mask registers K0-K7 of AVX-512) can be used for this.
For brevity, in this section it is assumed that ⌈n/b⌉ = 1, which is the case for many vector processors. Otherwise, e.g., in a parallel implementation of addition, the Kogge-Stone method [11] can be used to perform a parallel reduction of the b-bit values as shown in Section 4. Since the addition of any number of less significant digits of a number can never produce a carry value greater than 1 (the proof of that can be found in many classical sources, e.g., in [22]), the mask produced by the function C(V, U) can never have bits set at the same positions as the positions of the sum elements which need to be incremented during carry propagation. This allows uniting (using bitwise OR) the mask resulting from C(V, U), shifted once to the left, with the bit mask which corresponds to the positions of the sum elements which generate carry flags as a result of carry propagation, and no information loss will take place.
For that, one first needs to determine the elements of the vector sum V ⊞ U which may produce a carry flag as a result of carry propagation. Since the carry value is at most one, those elements have the defined value W - 1, i.e., all bits of those elements are set. Therefore, a vectorized comparison of the sum with the vector I = (W - 1 ... W - 1)^T, using an approach similar to that used above, can be applied here to obtain a mask I(V, U). In order to formally associate scalar values with their vector counterparts, we introduce the bijection ν_m^n : Z_m^n → Z_{m^n}, ν_m^n(V) = Σ_{k=0}^{n-1} v_k m^k (together with its inverse ν_m^{-n}), between the respective naturally isomorphic sets, as well as the injection γ_m^n : Z_2^n → Z_m^n which embeds a vector of bits elementwise.

Theorem. Given a pair of vectors V, U ∈ Z_W^n representing respectively the unsigned integers ν_W^n(V) = Σ_{k=0}^{n-1} v_k W^k and ν_W^n(U) = Σ_{k=0}^{n-1} u_k W^k, as well as a bit z ∈ Z_2, the sum S = Σ_{k=0}^{n} s_k W^k of the two integers and z equals the adjusted vector sum

S = ν_W^{n+1}(A ⊞ (γ_W^{n+1} ∘ ν_2^{-(n+1)})(ε)),  (3)

where A ∈ Z_W^{n+1} is the carry-less sum V ⊞ U with a zero appended as the most significant element,

ε = (i + 2c + z) ⊕ i,  (4)

i = ν_2^n(I(V, U)), and c = ν_2^n(C(V, U)).

Proof. First, let n = 1. Then, to take into account a possible carry flag, consider both ν_W^1(V) = v_0 and ν_W^1(U) = u_0 as elements of Z_{W^2}. Adding up ν_W^1(V) and ν_W^1(U) produces the sum

S' = cW + s'_0, with s'_0 = (v_0 + u_0) mod W and c = ν_2^1(C(V, U)).  (5)

Then adding z to the sum yields

S = S' + z = (c + i')W + ((s'_0 + z) mod W),  (6)

where i' is a carry bit resulting from adding z to the first element s'_0 ∈ Z_W of S' and equals z ∧ (s'_0 = W - 1), i.e., i' = z ∧ i with i = ν_2^1(I(V, U)). Note that c and i can never be set simultaneously (an element which generates a carry cannot equal W - 1, since v_0 + u_0 ≤ 2W - 2 < W + (W - 1)), so c ∧ i = 0, and for fixed values of s'_0 and W the bits form a group under bitwise XOR which is isomorphic to a group of linear polynomials in Z_2[x]. Evaluating (4) for n = 1 thus gives ε = (i + 2c + z) ⊕ i = 2(c ⊕ (z ∧ i)) + z, whose lower bit z is exactly the required adjustment of s'_0 in (6) and whose higher bit c ⊕ (z ∧ i) = c + i' is the resulting carry s_1. Therefore, applying the map γ_W^2 ∘ ν_2^{-2} to ε in (4) yields (5) and (6), and thus (3) holds for n = 1.

Then, from (5) and (6), the proof can proceed by induction: if (3) and (4) hold for n, then they also hold for n + 1. Representing the addends ν_W^{n+1}(V) and ν_W^{n+1}(U) as respectively v_1 W + v_0 and u_1 W + u_0 (with v_0, u_0 ∈ Z_W and v_1, u_1 ∈ Z_{W^n}) yields the sum

ν_W^{n+1}(V) + ν_W^{n+1}(U) + z = (v_1 + u_1) W + (v_0 + u_0 + z).  (7)

Based on the induction hypothesis and on the proof for the base case n = 1, the low-order term produces the digit s_0 together with an outgoing carry bit which serves as the incoming z of the n-digit sum v_1 + u_1, while the corresponding masks are composed by shifting: the bits of i and c produced for v_1 and u_1 occupy positions 1 through n of the masks of the (n + 1)-digit sum. Since ∀a, b, k ∈ N_0 : (a ⊕ b)·2^k = a·2^k ⊕ b·2^k, and ∀a, b ∈ N_0, ∀c, d ∈ Z_{2^n}, ∀k ≥ n : (a ⊕ b)·2^k + (c ⊕ d) = (a·2^k + c) ⊕ (b·2^k + d), the last term of (7) can be rearranged so that the evaluations of (4) for the low-order digit and for the n high-order digits combine into a single evaluation of (4) for all n + 1 digits. Substitution of this into (7) yields (3). □

Computation (3) provides a way to parallelize addition as described in the following sections. This can provide a performance increase if, first, ν_W^{n+1} is cheap (e.g., when the map is just a re-representation of the same data and does not, for instance, involve data transfers, which can be expensive, as is the case with CUDA); and, second, if the computations of ε and (γ_W^{n+1} ∘ ν_2^{-(n+1)})(ε) are efficient. The impact of these computations essentially determines the overall efficiency of the adders, as shown below.
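Under full-word assumptions (W = 2^64, and n ≤ 63 so that the masks i and c fit one machine word), the scheme built around the identity ε = (i + 2c + z) ⊕ i can be sketched in scalar C; the function name is ours, and a production version would execute both loops with vector instructions:

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* Word-granularity carry-lookahead sketch for W = 2^64, n <= 63:
   c marks elements that generate a carry, i marks elements equal to
   2^64 - 1 (which propagate one), and eps = (i + 2*c + z) ^ i marks
   exactly the elements to increment; bit n of eps is the carry out. */
uint64_t add_with_masks(const uint64_t *v, const uint64_t *u,
                        uint64_t *sum, size_t n, uint64_t z) {
    uint64_t c = 0, i = 0;
    for (size_t k = 0; k < n; ++k) {                /* independent per element  */
        sum[k] = v[k] + u[k];                       /* carry-less vector sum    */
        c |= (uint64_t)(sum[k] < v[k]) << k;        /* generate mask, C(V, U)   */
        i |= (uint64_t)(sum[k] == UINT64_MAX) << k; /* propagate mask, I(V, U)  */
    }
    uint64_t eps = (i + 2 * c + z) ^ i;             /* scalar carry propagation */
    for (size_t k = 0; k < n; ++k)                  /* again independent        */
        sum[k] += (eps >> k) & 1;                   /* adjust marked elements   */
    return (eps >> n) & 1;                          /* outgoing carry bit       */
}
```

Both loops are free of cross-element dependencies and so vectorize; only the single scalar expression for eps is sequential.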

PARALLELIZATION
Using (3) one can add up two long integers in parallel as follows.
Addition of two integers X, Y ∈ Z_{W^n}, with W > 2 and n > 0, involves parallel carry-less addition ν_W^{-n}(X) ⊞ ν_W^{-n}(Y) using T threads of execution in time O(n/T). Consequently, the carry propagation has to be performed using (4). The computation of ε ∈ Z_B (where B = 2^b) also requires a possibly long addition of the bit masks i and 2c + z followed by a bitwise XOR with i. The latter is a carry-less addition of binary digits which requires no special effort for parallelization. The sum 2c + z also requires no carry propagation because z is a one-bit value, and 2c is simply a linear bit shift to the left (towards the most significant position) by one, therefore 2c + z = (c << 1) ⊕ z. But the computation of the sum i + (2c + z) again requires carry propagation which, in turn, can also be performed using (3). The original addition of size n is then reduced to an addition of ⌈n log_W 2⌉-sized values of i and 2c + z. Likewise, the addition of i and 2c + z is reduced to an addition of size ⌈n (log_W 2)^2⌉. This reduction continues up until the size of the addition is reduced to 1, which requires one scalar addition in Z_W, i.e., ⌈n (log_W 2)^{maxk}⌉ = 1 for some maxk ∈ N_0. Therefore, the total number of base-W digits to add up is

Σ_{k=0}^{maxk} ⌈n (log_W 2)^k⌉,  (8)

since for all k such that k < maxk it follows that

⌈n (log_W 2)^k⌉ > 1.  (9)

Since the addition at each level (without ε) can be performed in parallel by T threads, and the same parallel addition is performed on the addends for each k, the total time required to perform the addition with T threads can therefore be estimated as

O((n/T) Σ_{k=0}^{maxk} (log_W 2)^k) = O(n/T),  (10)

because Σ_{k=0}^{∞} (log_W 2)^k = 1/(1 - log_W 2) is a constant for W > 2. Note that in order to achieve (10), one should exclude race conditions and limit the use of blocking synchronization of threads accessing different parts of the generated i and c. This can be achieved by choosing (possibly increasing) the values ⌊n / (T (log_2 W)^k)⌋ to be a multiple of the size of the words which i and c are composed of.
In particular, if i and c are implemented as vectors in Z_W^n, then ⌊n / (T (log_2 W)^k)⌋ should be a multiple of ⌈log_2 W⌉ for every thread to set bits within a subset of its own words.
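The size reduction can be illustrated numerically. The following hypothetical helper iterates n_{k+1} = ⌈n_k / L⌉ with L = log_2 W (which upper-bounds the level sizes ⌈n (log_W 2)^k⌉ above) and returns maxk along with the total number of digits added across all levels:

```c
#include <stddef.h>
#include <assert.h>

/* Illustrative computation of how the addition problem shrinks on
   each recursion level: a level-k mask addition has roughly
   ceil(n / L^k) words, where L = log2(W) is the number of bits per
   word; maxk is the first level whose size reaches a single word. */
size_t reduction_levels(size_t n, size_t L, size_t *total_digits) {
    size_t k = 0, t = 0;
    while (n > 1) {
        t += n;                /* digits added at this level      */
        n = (n + L - 1) / L;   /* ceil(n / L): bits packed into words */
        ++k;
    }
    t += 1;                    /* final single-word addition      */
    if (total_digits) *total_digits = t;
    return k;                  /* maxk                            */
}
```

For W = 2^64 (L = 64), even an addend of 4096 words collapses to a scalar in two levels, which is why the geometric series in (10) is dominated by the first term.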
The above derivation can also be applied to (8) with T = 1. More precisely, the space required by this algorithm excluding the 3n words of the addends and the sum (i.e., the space required to store the values of i, c and ε mod 2^n) can be estimated with the upper bound

S_f ≤ 3 Σ_{k=1}^{maxk} ⌈n (log_W 2)^k⌉.

This bound is particularly useful to assess the maximal n such that the total occupied space S does not exceed the available memory G, i.e.,

S ≤ 2n + (n + 1) + S_f + maxk ≤ G,  (11)

which is composed of 2n words of the addends, n + 1 words of the result with an extra word for the carry bit, S_f words of c, i and ε mod 2^n, as well as maxk words for holding the carry bits of ε on each iteration of the algorithm (cf. Fig. 2). Let L = log_2 W. Now, if n ∈ Z_N, then from (9) it follows that maxk ≤ ⌈log_L(N - 1)⌉.

VECTORIZATION
As shown in the paper, scalar operations on the bit masks i and c, as if those masks were ordinary word-sized integers, give a way to increase the performance of vectorized addition and outperform existing scalar ripple-carry implementations. This relies on the capability of the underlying parallel (vector) platform to efficiently implement both maps used in the theorem statement, i.e., ν_W^n (as well as its inverse ν_W^{-n}) and γ_W^n (or the composition γ_W^n ∘ ν_2^{-n}). This is the case with the new masked forms of the AVX-512 instructions, which consequently provide a performance boost to carry-lookahead adders implemented using (3).
Equation (3) can also be used to improve the performance of addition with vectorization if, besides the addition, the maps ν_m^n and γ_m^n are or can be implemented efficiently with a vector processor. While support for the former applied to the results of comparisons (1) and (2) is widespread (e.g., PCMPEQ/PCMPGT in SSE2 and CMEQ/CMHI in ARM), support for the latter map is noticeably limited, and while its implementation with lookup tables is possible, this requires access to external memory and thus can be inefficient. Experimentally, a significant performance boost was only achieved with masked operations of Intel AVX-512 upon the integral mask registers K0-K7. The implementation of the critical loop using AVX-512 in MASM syntax is shown in Fig. 3, in which 512-bit vector values are considered to be in Z_{2^64}^8 and mask values in Z_{2^16} (to take the extra carry bit of (4) into account).
In other cases, implementations of ν_2^n(I(V, U)) and ν_2^n(C(V, U)) are rather straightforward using widespread vector comparison instructions (see Section 2). Such are, for instance, (V)PCMPEQ and (V)PCMPU on Intel [23, Section 5.2] (Volume 2C) and CMEQ and CMHI on ARM [24], which set the bits of the elements of a destination register according to how the respective elements of the operands meet the comparison predicate.
On the other hand, the inverse transformation, (γ_W^n ∘ ν_2^{-n}), used in (3), is not widespread, which makes its implementation less efficient. Since the lengths of vector registers are usually small, the transformation can in many cases be implemented with a lookup table which maps short scalar integers to the respective vector values. The obvious disadvantage of this is the need for a data structure in external memory, access to which can be expensive, though such a table can be small enough to reside in cache. Such an implementation for AVX-2 with 256-bit vector registers (and n = 4) exhibited performance (see Fig. 10) comparable to the sequential adder.
AVX-2, as well as SSE-SSE4.2, only provides signed comparison of integers by means of the (V)PCMPGTx instructions. Therefore, in order to implement C(V, U) in (4), one has to employ (1) for signed carry detection.
The workaround for the lack of hardware means to easily implement (γ_W^n ∘ ν_2^{-n}) is to use lookup tables (LUTs), which is viable due to the shortness of the ymm and xmm registers. The AVX-2 implementation for this paper (Fig. 4) uses a lookup table which can be represented as a matrix in Z_{2^64}^{16×4} composed of the values m_{ε,b} = -(⌊ε · 2^{-b}⌋ mod 2), with the ε-th row loaded into an AVX-2 register, where ε is the result of the evaluation of (4). The ε-th row contains a binary representation of the number ε using the alphabet {0, -1} (mod 2^64). Likewise, SSE-4.2 can use a similar matrix in Z_{2^64}^{4×2}. In general, a LUT implementing the map (γ_W^n ∘ ν_2^{-n}) requires n · 2^n elements of the set Z_W.
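A portable sketch of such a LUT for n = 4 follows: the rows m_{ε,b} = -(⌊ε · 2^{-b}⌋ mod 2) hold 0 or all-ones words, so a row can be applied to the carry-less sum by subtraction (subtracting -1 increments an element). The layout and names are illustrative, not the paper's AVX-2 code:

```c
#include <stdint.h>
#include <assert.h>

/* Lookup table standing in for (gamma . nu^-1) with n = 4: row eps
   holds the four mask words -(bit b of eps) mod 2^64. */
static uint64_t lut[16][4];

void build_lut(void) {
    for (unsigned eps = 0; eps < 16; ++eps)
        for (unsigned b = 0; b < 4; ++b)
            lut[eps][b] = (uint64_t)0 - ((eps >> b) & 1u); /* 0 or 2^64 - 1 */
}

/* Adjust a 4-word carry-less sum in place given the propagated mask eps. */
void apply_adjustment(uint64_t s[4], unsigned eps) {
    for (unsigned b = 0; b < 4; ++b)
        s[b] -= lut[eps][b];   /* s[b] + 1 wherever the bit is set */
}
```

With vector registers, the subtraction is a single instruction once the row has been loaded, which is why the table access is the only potentially expensive step.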
If vector registers are small, such as the 128-bit SSE registers which fit two 64-bit scalars, carry detection and propagation, i.e., the computation and addition of (γ_W^n ∘ ν_2^{-n})(ε) in (3) and (4), can be done without a LUT using horizontal shuffling of scalars. However, this makes the computation of the adjustment (γ_W^n ∘ ν_2^{-n})(ε) for the vector sum sequential, and thus it is neither scalable nor efficient. Indeed, such an implementation (cf. Fig. 8) showed no performance gain compared to the sequential scalar implementation using the ADC instruction, nor to the implementation which utilizes a LUT similarly to the AVX-2 implementation described above.
The most beneficial way to combine the vector algorithm with the one shown in Fig. 2 is for a thread to reduce its whole subvector to a pair of single bits, c (obtained from ε of the sum of the most significant elements of the subvectors) and i (which can be obtained from the accumulated bitwise AND of ν_2^n(I(v_j, u_j)) over the elements v_j and u_j of the subvectors, with i set should the result equal 2^n - 1), write those bits to respective vectors of words, such that each word holds one bit, and then use a parallel reduction to pack those bits together. The elements of the result can then again be combined to obtain ε using the parallel algorithm (Fig. 2) and/or the vector algorithm (Fig. 3). If the latter is used, one would need to include a way to double the value of the integers representing the bits c, which can be accomplished by using bit-shift instructions (e.g., (V)PSLL in x86-64 or SHL in ARM) together with element permutation across lanes (respectively, (V)PSHUF and VEXT).

EXPERIMENTS
Measurements of the performance of the algorithms were conducted using a system with an Intel Core i9-10980XE CPU with 18 physical cores and Hyper-Threading enabled, and one NVIDIA Tesla P100 CUDA card, operated by Ubuntu Workstation 20.04.3 LTS. The measurements were taken as differences between two values of the CPU counter returned by the RDTSC instruction, as well as differences between clock values of the underlying system (the C++ chrono::steady_clock facility and CUDA events, the latter used to measure the performance of CUDA kernel code). The source code of a program which performs all the measurements and generates the data tables and every plot provided in this section is available in the repository [25]. The different implementations of the parallel and vector addition algorithms were given randomized data sets representing addends of sizes from 64 bytes to 8 gigabytes. Measurements were taken ten times per experiment and then averaged; the presented results are arithmetic means.
Measurement of the performance of the parallel algorithm (cf. Fig. 2) shows an increase in the performance of long addition (Fig. 5).
However, the measurements show limited scalability of the parallel algorithm, both with and without vectorization (Fig. 6), respectively achieving speedups of about 2.1 and 2.2 when the algorithm is executed to add up two 8 GiB values with 11 threads on a machine with 18 physical cores. When the sizes of the addends are close to cache sizes, the results are slightly better without vectorization: the maximal measured speedup of a 16 MiB addition (which is close to the L2 cache size) performed using the parallel scalar adder is 4.41 when executed using 16 threads, which is close to the number of physical CPU cores in the machine.
This demonstrates that the limited scalability of the implementation is due to contention of the threads accessing global memory (in a quad-channel configuration), even if the threads write to memory addresses far enough apart to exclude false sharing. Indeed, assuming thread contention, which implies hardware blocking of threads accessing the memory with a performance impact proportional to the number of threads, one can apply the model (11) and (12) in [26] for critical sections, as shown in [27], to obtain asymptotic limits for the speedup s. Assuming no asymptotically impactful sequential part of the parallel algorithm (due to (10)), a simplified upper limit of the speedup can be derived from the model in [26] in terms of f_cont, the fraction of time spent by threads during blocking access to memory, and p_cont, the probability of contention. Even though the model (11) in [26] is purely theoretical, as it is described in terms of base core equivalents, which model CPU cores as defined in [28], it can be applied to a real-world manycore CPU and the parallel algorithm.

[Figure: time of the parallel algorithm of Fig. 2 performed using 36 logical cores (dashed line), compared to the time taken to perform the same parallel algorithm but such that each thread performs vectorized addition (Fig. 3) producing a pair of bits i and c to add up later recursively using (4) (dot-dash line), as well as the times of sequential addition using the ADC instruction (solid line).]

For that, one has to take into account the way the threads (and, respectively, the CPU cores) access the words of the vectors V, U and ν_W^{-(n+1)}(S). In order to reduce contention of the threads, whose access to global memory and cache can be serialized, the vectors are split into subvectors with adjacent elements, if possible, processed in a single thread. This causes the elements of V and U to reside in a cache line read by a single CPU thread and to take advantage of the cache.
If the addends are n words in size, the thread t, 1 ≤ t ≤ T, is thus assigned n_t = ⌈n/T⌉ elements of the vectors (⌊n/T⌋ elements if t > n mod T), starting from the index (t − 1)⌈n/T⌉ if t ≤ n mod T, and (n mod T)⌈n/T⌉ + (t − 1 − n mod T)⌊n/T⌋ otherwise. Provided that a cache line is 64 bytes in size, a thread reading an 8-byte word from global memory causes the respective CPU core to cache the word together with seven adjacent words which are reused later. In the case of the AVX-512 adder, the thread reads 64 bytes, i.e., one cache line, of V or U into a ZMM register. If T > 4, the threads contend for cache lines read from global memory using the four channels of the platform in question. Therefore, in this case the probability of contention is the probability of five or more threads, out of T, accessing the global memory, which adheres to the binomial distribution

p_cont = Σ_{j=5}^{T} C(T, j) · p_R^j · (1 − p_R)^{T−j},

where p_R is the probability of a single thread reading V or U from global memory at a random point in time during the parallel addition process. As soon as a cache line of a vector addend has been read (and seven of the eight 64-bit words of the corresponding vector sum V ⊕ U are acquired), the cache line can be discarded, as its words no longer participate in the addition algorithm. Instead, the thread writes the vector sum into a memory buffer to be adjusted later using the evaluated ε. This introduces a point at which cache misses occur, as well as thread contention for a cache of limited size: ideally, the vector sum would remain in cache until after it is adjusted. However, if the vector sum is sufficiently large, as in the bottom plot of Fig. 6, a part of the sum will have to be replaced with S_f words of i, c and ε (see Section 4). One can verify that S_f < 2n whenever n ≥ 10, i.e., if the size of an addend processed by a thread is greater than or equal to 80 bytes, the entirety of the corresponding vectors i, c and ε, for all k, fits in cache whenever n fits in the cache, which in turn means that the evaluation of ε does not have to involve interactions with global memory.
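The block distribution of elements over threads described above can be sketched as follows. This is an illustrative reconstruction under the stated assumptions (1-based thread indices, threads t ≤ n mod T receiving one extra element); the struct and member names are hypothetical, not taken from the paper's implementation.

```cpp
#include <cstdint>

// Block distribution of n words over T threads: threads 1..(n mod T)
// receive ceil(n/T) elements, the remaining threads receive floor(n/T).
// start() yields the 0-based index of the first element assigned to
// thread t, 1 <= t <= T.
struct BlockDist {
    std::uint64_t n;  // words per addend
    std::uint64_t T;  // number of threads

    std::uint64_t count(std::uint64_t t) const {
        return n / T + (t <= n % T ? 1 : 0);
    }
    std::uint64_t start(std::uint64_t t) const {
        std::uint64_t q = n / T, r = n % T;
        return t <= r ? (t - 1) * (q + 1)
                      : r * (q + 1) + (t - 1 - r) * q;
    }
};
```

Adjacent elements land in the same thread, so each cache line of V and U is read by exactly one CPU core, as the text requires.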
However, one has to take into account thread contention for access to all three levels of cache when Hyper-Threading is enabled, because threads of a single CPU core share the cache. In this example, for simplicity, the assumption is made that no such contention takes place. To take cache contention into account, one can employ the model of L2 cache described in [29].
Also, the reduction of the parallel algorithm (cf. Fig. 2) requires a barrier synchronization at every level k, which is usually implemented such that each of the T threads blocks at a mutex, and one of the T threads signals a monitor associated with the mutex. Thus the probability of a thread blocking at the barrier is taken as 1.
After a thread has evaluated ε, it reads the value of the formerly obtained vector sum, partially from global memory and partially from cache, adjusts it with respect to ε and finally writes the result into global memory with probability p_cont,W of contention for the quad-channel memory. Again, p_cont,W is distributed binomially based on the probability p_W of a thread writing the result at a random point in time.
Given that, the speedup can be estimated using the following derivation from (11) in [26] (taking the time of the adder run in a single thread as 1):

s ≤ 1 / (f_R · p_cont + f_bar + f_W · p_cont,W),

where f_R, f_bar and f_W are respectively the fractions of time spent by a thread reading the addends from memory, waiting at the barrier and writing the result into global memory. This shows that the speedup and efficiency of the parallelization significantly depend on the ability of the underlying platform to perform data transactions in parallel with minimal contention, as demonstrated in Fig. 6.
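The contention-limited speedup estimate can be sketched numerically for a finite thread count. This is a hedged reconstruction, not the paper's exact formula: the fractions f_R · p_cont, f_bar and f_W · p_cont,W are treated as serialized while the remaining work scales with T; the function name and signature are illustrative.

```cpp
// Speedup estimate under the contention model: fR, fbar and fW are the
// fractions of single-thread time spent reading the addends, waiting at
// the barrier and writing the result; pContR and pContW are the read and
// write contention probabilities; T is the thread count. The serialized
// part bounds the speedup as T grows (Amdahl-style limit).
double speedup_bound(double fR, double fbar, double fW,
                     double pContR, double pContW, int T) {
    double serial = fR * pContR + fbar + fW * pContW;
    double parallel = 1.0 - serial;
    return 1.0 / (serial + parallel / T);
}
```

As T → ∞ the estimate approaches 1 / serial, which is why the measured speedups in Fig. 6 plateau well below the core count.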
Nonetheless, there is a steady performance gain when using the parallel adder compared to the sequential adder and, likewise, when using parallel addition with AVX-512 vectorization compared to all other adders.
As for vectorization on one CPU, the increase in performance is visible until the impact of interactions with memory overtakes performance gains.
To compare the performance of the AVX-512 implementation to that of the sequential adder, the relative measurement 1 − t_AVX-512 · t_ADC^{−1} (where t_ADC is the time taken by the ADC-based adder, and t_AVX-512 is the time of the AVX-512-based adder) is used in Fig. 7. The clock frequency is fixed at 1.5 GHz. The peaks correlate with the cache sizes of the CPU (the level 1 data cache is 1.125 MiB in size, the level 2 cache is 18 MiB and the level 3 cache is 24.75 MiB).
Additionally, an implementation of the parallel algorithm (Fig. 2) was developed for CUDA devices as follows. First, the input data is split into subvectors to fit the available global memory on a device. In order to determine the size of each subvector, one can employ (12), in which G is the available device global memory, in words, N is 2^64 (assuming n is a 64-bit integer) and W is 2^32 (a device word is considered 32 bits in size), which yields the limit n ≤ (31G − 1730)/96 for n per device. The word size of the device (log_2 W) is chosen to be 32 bits because, first, a word of CUDA shared memory is 32 bits in size, second, the number of shared memory banks is 32 (or 16 for devices of compute capability 1.x) and, third, the CUDA warp size is 32 threads. Therefore, a 32-bit word allows the employment of fast shared memory, individual to each CUDA streaming multiprocessor, to efficiently evaluate i, c and ε in (4), as described below, avoiding bank conflicts.
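The subvector sizing above amounts to a couple of lines of integer arithmetic. The following sketch applies the stated bound n ≤ (31G − 1730)/96; the function names are illustrative, and the helper assumes 31G > 1730 (i.e., a non-trivial amount of device memory).

```cpp
#include <cstdint>

// Largest number of addend words n that fits a device with G words of
// global memory, per the bound n <= (31G - 1730)/96 derived in the text
// (integer division rounds the bound down).
std::uint64_t max_words(std::uint64_t G) {
    return (31 * G - 1730) / 96;
}

// Number of subvectors needed to process `total` words n words at a time.
std::uint64_t subvector_count(std::uint64_t total, std::uint64_t n) {
    return (total + n - 1) / n;  // ceil(total / n)
}
```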
Each subvector is then processed by CUDA threads constituting ⌈n/1024⌉ blocks of 1024 threads each. The process consists of the evaluation of 1024 elements of the vector sum V ⊕ U per CUDA block, written into the device global memory, as well as the evaluation of 1024 bits of i_1 and c_1 (in terms of Fig. 2). Since each of the bits of i_1 and c_1 initially occupies one 4-byte word, the threads of the same block subsequently perform an in-place bitwise-OR-based parallel reduction to pack those 1024 bits into 32-bit words. Provided that the capacity of the shared memory is sufficient to hold 2 · 1024 · 4 bytes = 8 KiB of data (two collections of 1024 bits, each bit initially occupying four bytes), the reduction can be, and is, performed upon elements of i_1 and c_1 located in shared memory, rather than global memory, until 32 words of 32-bit size are obtained. The latter are written to global memory. The CUDA kernel stops after having completed the evaluation of the subvector of V ⊕ U composed of 1024 word-sized elements per block, as well as 32 words per block for each of i_1 and c_1. Subsequently, the host performs the parallel evaluation of ε recursively as follows (cf. Fig. 2). Given k, such that 1 ≤ k < max k, the host first invokes a CUDA kernel to perform the carryless evaluation of ε′ as described in Section 4, as well as of the bits of i_{k+1} and c_{k+1}. The latter two are evaluated in the process involving the shared memory and the parallel bitwise-OR-based reduction described above.
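The packing reduction can be illustrated with a sequential sketch. This is not the kernel code: on the device, each round below maps to a synchronized step of the block's threads over shared memory, whereas here the rounds are plain loops; the function name is illustrative, and the input size is assumed to be a power of two of at least 32.

```cpp
#include <cstdint>
#include <vector>

// Pack predicate bits, each initially stored as a whole 32-bit word
// holding 0 or 1, into 32-bit words via a pairwise OR/shift reduction:
// each round halves the element count and doubles the packed bit width,
// e.g. 1024 x 1-bit -> 512 x 2-bit -> ... -> 32 x 32-bit, so word j of
// the result holds input bits 32j..32j+31 (bit p at position p mod 32).
std::vector<std::uint32_t> pack_bits(std::vector<std::uint32_t> bits) {
    for (std::uint32_t width = 1; width < 32; width *= 2) {
        std::size_t half = bits.size() / 2;
        for (std::size_t j = 0; j < half; ++j)
            bits[j] = bits[2 * j] | (bits[2 * j + 1] << width);
        bits.resize(half);
    }
    return bits;
}
```

For a 1024-thread block this takes five rounds, matching the reduction to 32 words described above.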
After the kernel is complete, the host recursively invokes kernels to compute ε_{k+1} and adjust the value of ε′_k using (3) and (4). In this process, given the global (grid-based) index t of a CUDA thread, the value for z_k is either the input bit z (if k = 1 and t = 0, i.e., the thread 0 is processing the first words of i_1 and c_1 resulting from operations upon the first respective words of V and U), or zero (if k > 1 and t = 0), or (otherwise) the value of the most significant bit of the lower-order word (concurrently read by the thread t − 1) of c_k. The lookup of a bit of ε_{k+1} is implemented as a test of the bit t mod 32 of the word ⌊t/32⌋ of ε_{k+1} residing in the global memory of the CUDA device. The value k = max k represents the recursion base case, in which the computation of ε_{max k} only involves operations upon the word-sized i_{max k} and c_{max k}. Finally, the computation upon the subvectors of V and U completes when ε has been evaluated, and the value of V ⊕ U has been adjusted with respect to (3) to compute the result.
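The carry mechanics that the kernels evaluate in parallel can be shown in a sequential host-side sketch. This is an illustration of the underlying carry-lookahead recurrence, not the paper's kernel code: per word, c flags a generated carry and i flags a word that propagates an incoming carry (a carryless sum of all ones), and the carry into each word follows e' = c | (i & e); the function name is illustrative.

```cpp
#include <cstdint>
#include <vector>

// Full-word addition with separated carry resolution: first a carryless
// per-word sum with generate (c) and propagate (i) flags, then a sweep
// that resolves the carry e into each word and adjusts the sum. In the
// parallel implementations the second phase is what the recursion over
// k evaluates; here it is unrolled sequentially for clarity. The final
// value of e (the carry out of the whole sum) is discarded.
std::vector<std::uint64_t> add_words(const std::vector<std::uint64_t>& v,
                                     const std::vector<std::uint64_t>& u,
                                     bool z /* input carry bit */) {
    const std::size_t n = v.size();
    std::vector<std::uint64_t> s(n);
    std::vector<bool> c(n), i(n);
    for (std::size_t j = 0; j < n; ++j) {       // carryless vector sum
        s[j] = v[j] + u[j];                     // wraps modulo 2^64
        c[j] = s[j] < v[j];                     // generate: overflow
        i[j] = s[j] == ~std::uint64_t{0};       // propagate: all ones
    }
    bool e = z;                                 // carry into word j
    for (std::size_t j = 0; j < n; ++j) {       // adjust with respect to e
        s[j] += e ? 1 : 0;
        e = c[j] || (i[j] && e);
    }
    return s;
}
```

Because i and c are single bits per word, they are exactly the quantities the CUDA blocks pack into 32-bit words for the recursive evaluation of ε.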
If the memory capacity of the employed CUDA device is sufficient to hold both addends, the sum, as well as the values of i_k, c_k and ε for each k, 1 ≤ k ≤ max k, then one invocation of a CUDA kernel which implements the adder is enough to obtain the final sum S in (3). On the other hand, if there are multiple subvectors, there has to be a loop which processes the corresponding subvectors on every iteration. This loop carries a dependency on the respective carry bits, and on every iteration it is required to exchange data with the host, which significantly impacts the performance (Fig. 9). However, given multiple CUDA devices, the theorem and the parallel algorithm (Fig. 2) can be used again for parallelization. The efficiency of this approach has not been verified experimentally, but it is expected to benefit the implementation provided that data can be exchanged with the devices in parallel.
The performance of the described implementation based on a single NVIDIA Tesla P100 CUDA device was measured and is shown in Fig. 9 in comparison to the sequential ADC-based ripple-carry adder. The CUDA implementation was compiled using the nvcc compiler version 11.4 for compute capability 3.5 (gpu-architecture and gpu-code) and 64-bit architectures. The "CUDA" plot demonstrates the total time taken by the implementation to perform the addition, including times for data exchange between the host and the device. "CUDA (LED)" (load extra data) demonstrates times excluding the impact of data transfers during the addition of the first subvector but including data transfers for the addition of the remaining subvectors. "CUDA (kernel)" shows accumulated times taken exclusively by kernel computations, measured using CUDA events. Times denoted as "CUDA" and "CUDA (LED)" were acquired using std::chrono::steady_clock of the C++ standard library.
Also, every implementation of long addition presented in the paper, as well as the other SIMD-based adders used, significantly outperforms addition using the GNU Multiple Precision library (GMP) version 6.2.0, even if the measurements do not take into account data transfers performed with mpz_import and mpz_export. This is shown in Fig. 10.

CONCLUSION
Addition of full-word long integers in parallel, and especially using vectorization, is widely believed to be impractical due to the challenging carry propagation which establishes data interdependence between the digits of the addends and the sum. On the one hand, this can be true, because parallelization and vectorization can be applied only partially and at the cost of a loss of algorithmic simplicity. Nonetheless, there are various applications, e.g., in high-precision scientific computations, including those based on floating-point arithmetic, digital signal processing and cryptography, where addition can become performance-critical. In such cases the complexity of the implementation can become acceptable, or it can be mitigated as shown in [12], [15] and [17].
However, the existing sources focus on hardware implementations of adders and employ bit-level parallelism requiring specialized hardware to perform addition with low latency. The current paper, on the other hand, focuses on using unspecialized scalar and vector processors with a minimal set of common arithmetic instructions to achieve similar performance gains; hence the proposed algorithms and their formal analysis. In particular, if the word length is one bit, i.e., W = 2, one can easily derive bit-level carry-lookahead full adders (described, for instance, in [14]) from the statement of Theorem 1. The described parallel algorithm shown in Fig. 2 corresponds to the one presented in [11] and [17].
Experiments conducted to measure the performance of the results agree with the prediction (10) in the paper. Notably, AVX-512, with its hardware implementation of the map g (see (2)) via masked operations, helps to increase the performance of addition significantly, especially when the addition is implemented on a manycore CPU. The parallel implementations without the use of vectorization, although not very scalable, provide a significant performance boost compared to sequential scalar addition implementing the ripple-carry algorithm, even when optimization techniques such as loop unrolling are enforced upon the latter.
Performance gains also take place when addition is performed on a CUDA device, as long as transfers of the parameters between the device and the host are not involved.
Surprisingly, all of the implementations significantly outperformed addition implemented in the GNU Multiple Precision library (version 6.2.0), even if import and export of the operands and the result are not taken into account. A brief analysis of the GMP source code, as well as of its disassembly, showed that much time is spent on restructuring data (with reallocations) and on performing the addition in several different ripple-carry loops.
Compared to adders described in the literature, the proposed adders are agnostic to the sizes of the addends, easily scale with respect to problem size and full-word data representation, and provide noticeably better performance if parallelism and/or vectorization are employed.

Fig. 9. Measured performance of the CUDA implementation of addition using NVIDIA Tesla P100 compared to sequential addition on a CPU using the ADC instruction. The "CUDA" plot includes times taken by all data transfers between the host and the device, "CUDA (LED)" excludes times required to transfer the first subvectors of the addends and the sum, and "CUDA (kernel)" shows accumulated times taken by CUDA kernels to perform the addition, not taking into account times for data transfers.