Multiple-Precision BLAS Library for Graphics Processing Units

Abstract. The binary32 and binary64 floating-point formats provide good performance on current hardware, but they also introduce a rounding error in almost every arithmetic operation. Consequently, the accumulation of rounding errors in large computations can cause accuracy issues. One way to prevent these issues is to use multiple-precision floating-point arithmetic. This paper presents a new library of basic linear algebra operations with multiple precision for graphics processing units. The library is written in CUDA C/C++ and uses the residue number system to represent the multiple-precision significands of floating-point numbers. The supported data types, memory layout, and main features of the library are described, and experimental results showing the performance of the library are presented.


Introduction
It is no surprise that floating-point operations suffer from rounding errors during calculations. Such errors are inherent to the limited significand length of the binary32 and binary64 formats from the IEEE 754 standard [1]. For many applications, these errors do not prevent obtaining correct results. Moreover, for some applications, such as deep learning, the best option is to use even lower precision formats, e.g., the 16-bit (half-precision) format [2]. However, for a rapidly growing number of scientific computing applications, the natively supported IEEE 754 formats are not enough, and a higher level of precision is required [3][4][5][6]. Such applications rely on multiple-precision libraries, which allow one to perform arithmetic operations on numbers represented with hundreds or thousands of digits.
This paper describes a library that provides new parallel algorithms and implementations for a number of basic linear algebra operations, similar to the BLAS routines [7], with multiple precision. The library, called MPRES-BLAS, is designed for high-performance computing systems equipped with modern graphics processing units (GPUs). The library uses the residue number system (RNS) [8,9] to represent multiple-precision numbers. In the RNS, a set of pairwise coprime (mutually independent) moduli is given; a number is represented by its residue with respect to each modulus, and addition, subtraction, and multiplication operate on the residues individually, as shown in Fig. 1 and illustrated by the sketch below. This introduces parallelism into multiple-precision arithmetic and makes the RNS a promising number system for many-core architectures such as GPUs.
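As a minimal illustration of this digit independence (our own example; the moduli below are arbitrary pairwise coprime values, not the MPRES-BLAS moduli set), consider adding two integers in an RNS with four moduli:

#include <stdio.h>

#define RNS_SIZE 4
static const int moduli[RNS_SIZE] = {23, 25, 27, 29}; /* pairwise coprime, M = 450225 */

/* Convert a small nonnegative integer to its RNS digits */
static void to_rns(long x, int digits[RNS_SIZE]) {
    for (int i = 0; i < RNS_SIZE; i++)
        digits[i] = (int)(x % moduli[i]);
}

/* Add two RNS numbers: no carries propagate between digits,
   so each iteration could run in its own GPU thread */
static void rns_add(const int a[RNS_SIZE], const int b[RNS_SIZE], int c[RNS_SIZE]) {
    for (int i = 0; i < RNS_SIZE; i++)
        c[i] = (a[i] + b[i]) % moduli[i];
}

int main(void) {
    int a[RNS_SIZE], b[RNS_SIZE], c[RNS_SIZE], check[RNS_SIZE];
    to_rns(12345, a);
    to_rns(67890, b);
    rns_add(a, b, c);
    to_rns(12345 + 67890, check); /* 80235 < M, so the sum is representable */
    for (int i = 0; i < RNS_SIZE; i++)
        printf("digit %d: %d (expected %d)\n", i, c[i], check[i]);
    return 0;
}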

Related Work
One approach to obtaining greater precision and accuracy is to use floating-point expansions, in which an extended-precision number is represented as an unevaluated sum of several ordinary floating-point numbers. An example of such an expansion is the double-double format, capable of representing at least 106 bits of significand: each double-double number is an unevaluated sum of two binary64 numbers. In turn, the quad-double format represents 212 bits of significand using four binary64 numbers. The algorithms used to compute floating-point expansions are called error-free transformations [10]; a classic example is sketched below. Double-double arithmetic is used in the XBLAS [11] and QPBLAS [12] packages for CPUs, as well as in the QPBLAS-GPU package for GPUs [13]. The ExBLAS package [14] contains a number of optimized implementations of accurate and reproducible linear algebra operations for parallel architectures such as Intel Xeon Phi and GPUs. Reproducibility is defined as the ability to obtain a bit-wise identical result from multiple runs of the code on the same input data. To ensure reproducibility, ExBLAS uses error-free transformations and long fixed-point accumulators that can represent every bit of information of the input floating-point format (binary64). The use of long accumulators replaces non-associative floating-point operations with fixed-point operations that are associative.
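A classic error-free transformation (our sketch in plain C, not code from any of the packages above) is Knuth's TwoSum, which produces both the rounded sum of two binary64 numbers and the exact rounding error, and serves as a building block of double-double arithmetic:

#include <stdio.h>

/* TwoSum: s = fl(a + b) and t is the exact rounding error, so that
   a + b = s + t holds exactly (round-to-nearest, no overflow; must not
   be compiled with value-unsafe optimizations such as -ffast-math). */
static void two_sum(double a, double b, double *s, double *t) {
    double sum = a + b;
    double bb  = sum - a;             /* the part of b absorbed into sum */
    *t = (a - (sum - bb)) + (b - bb); /* what was rounded away */
    *s = sum;
}

int main(void) {
    double s, t;
    two_sum(1.0, 1e-30, &s, &t); /* s == 1.0; t recovers the lost 1e-30 */
    printf("s = %.17g, t = %.17g\n", s, t);
    return 0;
}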
The paper [15] presents highly optimized GPU implementations of the DOT, GEMV, GEMM, and SpMV operations, which are included in the BLAS-DOT2 package. In these implementations, internal floating-point operations are performed with at least twice the precision of the input and output data: for binary32 data, the computation is performed in the binary64 format, whereas for binary64 data, the computation uses the Dot2 algorithm [16], which is based on error-free transformations. Nakata et al. proposed MPACK (aka MPLAPACK) [17], a package of multiple-precision linear algebra routines. It consists of two modules, MBLAS and MLAPACK, which are multiple-precision versions of BLAS and LAPACK for CPUs, respectively. MPACK supports several libraries, such as GMP, MPFR, and QD, for the underlying multiple-precision arithmetic. In addition, MPACK provides double-double implementations of the GEMM and SYRK routines for GPUs.
There are also a number of extended- and multiple-precision arithmetic libraries for GPUs. Support for double-double and quad-double arithmetic is implemented in GQD [18], which allows one to perform basic arithmetic operations and a number of mathematical functions with extended precision. GQD mainly uses the same algorithms as the QD library for CPUs. To represent extended precision numbers, GQD uses the vector types double2 and double4 available in CUDA.
CAMPARY [19] uses n-term floating-point expansions (a generalization of the double-double and quad-double formats to an arbitrary number of terms) and provides flexible CPU and GPU implementations of multiple-precision arithmetic operations. Both the binary64 and binary32 formats can serve as basic blocks of the floating-point expansion, and the precision (expansion size) is specified as a template parameter. In general, each addition and multiplication of n-component expansions in CAMPARY requires $3n^2 + 10n - 4$ and $2n^3 + 2n^2 + 6n - 4$ standard floating-point operations, respectively.
GARPREC [18] and CUMP [20] support arbitrary precision on GPUs using the so-called "multi-digit" format. This format stores a multiple-precision number as a sequence of digits coupled with a single exponent, the digits themselves being machine integers; a sketch is given below. The GARPREC algorithms come from David Bailey's ARPREC package for CPUs, whereas CUMP is based on the GNU MP Bignum library (GMP). In both GARPREC and CUMP, each multiple-precision operation is performed by a single thread, and an interval memory layout is used in order to exploit the coalesced memory access feature of the GPU.
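The contrast with floating-point expansions can be made concrete with a small sketch (our illustration, not the actual GARPREC or CUMP definitions): in the multi-digit format all digits share a single exponent, whereas in an expansion every component carries its own exponent:

typedef struct {
    int           sign;     /* sign of the number */
    long          exp;      /* single exponent shared by all digits */
    unsigned int *digits;   /* n_digits machine words of the significand */
    int           n_digits; /* precision in machine words */
} multi_digit_t;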
In [21], we proposed new multiple-precision arithmetic algorithms using the residue number system and adapted them for efficient computations with multiple-precision vectors on GPUs. Similar algorithms for dense multiple-precision matrices were then proposed and implemented. All these algorithms are used in the MPRES-BLAS library, which is discussed in this paper. Table 1 summarizes the software packages considered in this section.

Data Types and Conversions
In MPRES-BLAS, a floating-point number is an object consisting of a sign, a multiple-precision significand, an integer exponent, and some additional information about the significand. We denote such an object as follows:

$x = \langle s, X, e, I(X/M) \rangle, \qquad (1)$

where s is the sign, X is the significand, e is the exponent, and I(X/M) is the interval evaluation of the significand (the additional information). The significand is represented in the RNS by the residues $(x_1, x_2, \dots, x_n)$ relative to the moduli set $\{m_1, m_2, \dots, m_n\}$ and is considered as an integer in the range $0$ to $M-1$, where $M = \prod_{i=1}^{n} m_i$. The residues $x_i = X \bmod m_i$ are machine integers. The binary number X corresponding to the given residues $(x_1, x_2, \dots, x_n)$ can be derived using the Chinese remainder theorem [8]: $X = \bigl| \sum_{i=1}^{n} M_i \, | x_i M_i^{-1} |_{m_i} \bigr|_M$, where $M_i = M/m_i$, $M_i^{-1}$ is the multiplicative inverse of $M_i$ modulo $m_i$, and $|\cdot|_m$ denotes the remainder modulo $m$.

Unlike addition, subtraction, and multiplication, which operate on the residues individually, comparison, sign determination, overflow detection, scaling, and general division are time-consuming operations in the RNS, since they require estimating the magnitudes of numbers. These operations are called non-modular operations. The classic technique for performing them is based on the Chinese remainder theorem and consists in computing the binary representations of numbers and then analyzing them. For large dynamic ranges this technique becomes slow. MPRES-BLAS uses an alternative method for implementing non-modular operations, based on computing the interval evaluation of the fractional representation of an RNS number [22]. This method is designed to be fast on modern massively parallel general-purpose computing platforms such as GPUs. The interval evaluation I(X/M) is defined by two bounds, $\underline{X/M}$ and $\overline{X/M}$, that localize the value of X scaled by M. The bounds are represented in a working precision floating-point format with an extended exponent in order to avoid underflow when M is large. Computing $\underline{X/M}$ and $\overline{X/M}$ requires only modulo $m_i$ operations and standard floating-point operations. Using interval evaluations, efficient algorithms have been developed for a number of non-modular operations in the RNS [22].
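To see why interval evaluations make non-modular operations cheap, consider magnitude comparison (our sketch, with plain doubles standing in for er_float_t and its extended exponent):

/* Compare the magnitudes of two RNS significands X_a and X_b using only
   their interval evaluations I(X_a/M) = [a_lo, a_up], I(X_b/M) = [b_lo, b_up] */
static int rns_cmp_magnitude(double a_lo, double a_up, double b_lo, double b_up) {
    if (a_up < b_lo) return -1; /* certainly X_a < X_b */
    if (b_up < a_lo) return  1; /* certainly X_a > X_b */
    return 0;                   /* intervals overlap: fall back to a slower
                                   exact comparison, e.g., via the CRT */
}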
In [21], basic algorithms are proposed for performing arithmetic operations on numbers of the form (1), and the rounding error bound for these operations has the standard form

$\mathrm{fl}(x \circ y) = (x \circ y)(1 + \delta), \qquad |\delta| \le u,$

where $u < 4/\sqrt{M}$ and M is the product of the RNS moduli. Hence, the user can set an arbitrary arithmetic precision by changing the set of RNS moduli used.
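For example, for a moduli set whose product is $M \approx 2^{256}$, the bound yields $u < 4/\sqrt{2^{256}} = 2^{-126}$, i.e., about 126 reliable bits of significand; enlarging the moduli set increases M and tightens u accordingly.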
The C data type corresponding to a multiple-precision number of the form (1) is mp_float_t, defined as a structure of the following form (the fields mirror those of the array type described below):

typedef struct {
    int digits[RNS_MODULI_SIZE]; /* Significand digits (residues) */
    int sign;                    /* Sign */
    int exp;                     /* Exponent */
    er_float_t eval[2];          /* Interval evaluation of the significand */
} mp_float_t; /* Multiple-precision value */

where er_float_t is the C data type representing a working precision floating-point value with an extended exponent.
The mp_float_t type is used mainly in the host code, and MPRES-BLAS provides a set of functions for converting data between mp_float_t and double, as well as between mp_float_t and the mpfr_t type from the GNU MPFR library.
To store an array of multiple-precision numbers in the GPU memory, MPRES-BLAS uses the data type mp_array_t, defined as the following structure:

typedef struct {
    int * digits;
    int * sign;
    int * exp;
    er_float_t * eval;
    int4 * buf;
    int * len;
} mp_array_t; /* Array of multiple-precision values */

The fields of this structure are as follows, where n is the precision (the size of the RNS moduli set) and L is the length of the multiple-precision array (an example of addressing these fields follows the list):

- digits: the digits (residues) of the significands (an integer array of size n × L); all digits belonging to the same multiple-precision number are arranged consecutively in memory;
- sign: the signs (an integer array of size L);
- exp: the exponents (an integer array of size L);
- eval: the interval evaluations of the significands (an array of size 2L, where the first L elements hold the lower bounds of the interval evaluations and the second L elements hold the upper bounds);
- buf: a buffer (an int4 array of size L used in arithmetic operations to pass auxiliary variables between CUDA kernels);
- len: the number of items that the array holds, i.e., L; for a vector of length N, the array must contain at least 1 + (N − 1) × |incx| items, where incx specifies the stride between items; for a matrix of size M × N, the array must contain at least LDA × N items, where LDA is the leading dimension of the matrix and must be at least max(1, M).
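For instance, the parts of the i-th element (0-based) of an mp_array_t instance arr of length L with n-digit significands are located as follows; this is a direct transcription of the layout just described, not code from the library:

int        sign_i   = arr.sign[i];        /* sign of element i */
int        exp_i    = arr.exp[i];         /* exponent of element i */
int       *digits_i = &arr.digits[i * n]; /* n consecutive residues of element i */
er_float_t eval_lo  = arr.eval[i];        /* lower bound of I(X/M) */
er_float_t eval_up  = arr.eval[L + i];    /* upper bound of I(X/M) */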
Using the mp_array_t structure instead of an array of mp_float_t structures ensures coalesced memory access when dealing with multiple-precision vectors and matrices; details can be found in [21]. MPRES-BLAS provides the following helper functions for working with mp_array_t instances (a usage sketch follows the list):

- mp_array_init: allocate a multiple-precision array in the GPU memory;
- mp_array_clear: release the GPU memory occupied by an array;
- mp_array_host2device: convert a regular multiple-precision array allocated on the host (with elements of type mp_float_t) to an mp_array_t instance allocated on the GPU;
- mp_array_device2host: convert an mp_array_t instance allocated on the GPU to a regular multiple-precision array allocated on the host.
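The intended host-side flow is sketched below; only the function names come from the list above, so the argument lists here are assumptions rather than the actual MPRES-BLAS signatures:

mp_float_t hx[1024]; /* regular multiple-precision array on the host */
mp_array_t dx;       /* multiple-precision array in the GPU memory */

mp_array_init(&dx, 1024);            /* allocate GPU storage for 1024 items */
mp_array_host2device(&dx, hx, 1024); /* upload and repack the host data */
/* ... launch multiple-precision BLAS kernels that read and write dx ... */
mp_array_device2host(hx, &dx, 1024); /* repack the results back into mp_float_t */
mp_array_clear(&dx);                 /* release the GPU memory */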

Data Layout
The BLAS routines in MPRES-BLAS follow the Fortran convention of storing matrices in the column major format. Thus, for an M by N matrix A, the address of the kth digit of the element $a_{ij}$ in 0-based indexing is given by

$n \cdot (i + j \cdot \mathrm{LDA}) + k,$

where n is the size of the RNS moduli set, $0 \le k < n$, $0 \le i < M$, and $0 \le j < N$. Figure 2 illustrates the column major layout of a 3 × 4 multiple-precision matrix using the mp_array_t data type. In this example, n = 4, i.e., the significand of each element $a_{ij}$ consists of four digits: $X = (x_1, x_2, x_3, x_4)$. We use the dot symbol (.) to access the parts of a multiple-precision number. The symbols lo and up denote the lower and upper bounds of the interval evaluation, so that lo := $\underline{X/M}$ and up := $\overline{X/M}$. Note that we use 1-based indexing in this example.
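In code, this addressing rule reads as follows (an illustrative helper, not part of the library API):

/* Index of the kth digit of a_ij in the digits array of a column major
   multiple-precision matrix with leading dimension lda (0-based indexing). */
size_t digit_index(size_t i, size_t j, size_t k, size_t lda, size_t n) {
    return n * (i + j * lda) + k;
}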

Functionality
The current version of MPRES-BLAS (ver. 1.0) provides 16 GPU-based multiple-precision BLAS functions listed in Table 2. All the functions support the standard BLAS interface, except that some functions take additional arguments that are pointers to auxiliary buffers in the global GPU memory. Implementation details, performance data, and accuracy of the ASUM, DOT, SCAL, and AXPY functions are given in [21]. Table 3 shows the forward error bounds for the GEMV, GEMM, GE_ADD, and GER routines. In this table, each matrix is assumed to be a square N by N matrix.
We note that not all of the functions listed in Table 2 can be ill-conditioned and thus require higher numeric precision. For example, as shown in [21], the SCAL and ASUM functions have relative forward error bounds of u and u(N − 1)/(1 − u(N − 1)), respectively. However, supporting such functions allows one to eliminate time-consuming conversions between the IEEE 754 formats and the multiple-precision format (1) in the intermediate steps of a computation.
Table 3. Error bounds for the functions from MPRES-BLAS. According to [23], the quantity $\gamma_N$ is defined as $\gamma_N := Nu/(1 - Nu)$.

Function | Absolute error bound | Relative error bound
GEMV ($\alpha Ax + \beta y$) | $\gamma_{N+2} \cdot (|\alpha A| \cdot |x| + |\beta y|)$ | $\gamma_{N+2} \cdot (|\alpha A| \cdot |x| + |\beta y|) \,/\, |\alpha Ax + \beta y|$
GEMM ($\alpha AB + \beta C$) | $\gamma_{N+2} \cdot (|\alpha A| \cdot |B| + |\beta C|)$ | $\gamma_{N+2} \cdot (|\alpha A| \cdot |B| + |\beta C|) \,/\, |\alpha AB + \beta C|$
GE_ADD ($\alpha A + \beta B$) | $\gamma_{2} \cdot (|\alpha A| + |\beta B|)$ | $\gamma_{2} \cdot (|\alpha A| + |\beta B|) \,/\, |\alpha A + \beta B|$
GER ($\alpha xy^T + A$) | $\gamma_{3} \cdot (|\alpha xy^T| + |A|)$ | $\gamma_{3} \cdot (|\alpha xy^T| + |A|) \,/\, |\alpha xy^T + A|$

To achieve high performance, GPU kernels must be written according to some basic principles and techniques stemming from the specifics of the GPU architecture [24]. One such technique is blocking: the idea is to operate on blocks of the original vector or matrix, which are loaded once into shared memory and then reused, so that the number of off-chip memory accesses is reduced. However, when working with multiple-precision numbers, shared memory becomes the limiting factor for occupancy, since the size of each number may be too large. Another problem is that multiple-precision arithmetic algorithms contain both serial and parallel sections: while one thread computes the exponent, sign, and interval evaluation, all the other threads are idle. This results in divergent execution paths within a warp and can lead to significant performance degradation. To resolve this problem, MPRES-BLAS follows the approach proposed in [21], according to which each multiple-precision operation is split into several parts, each performed by a separate CUDA kernel (__global__ function): computing the signs, exponents, and interval evaluations; computing the significands in the RNS; and rounding the results.
Thus, each BLAS function consists of a sequence of kernel launches. In some cases, such a decomposition may increase the total number of global memory accesses, since all intermediate results have to be stored in the global memory. However, it eliminates branch divergence in the sequential parts of multiple-precision operations. Furthermore, each kernel has its own execution configuration, which makes it possible to use all available GPU resources.
As an example, Fig. 3 shows a flowchart of the multiple-precision GE_ADD operation (C ← αA + βB); a code sketch of this decomposition is given below. Other BLAS operations have a similar structure. An exception is the reduction operations, in which each multiple-precision operation is performed within a single thread and communication between threads is carried out through shared memory.
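The decomposition can be pictured as follows; this is a hypothetical sketch, and the kernel names, signatures, and launch configurations are illustrative rather than the actual MPRES-BLAS code:

/* Stage 1: one thread per element computes the sign, exponent,
   and interval evaluation of the resulting element. */
__global__ void esi_kernel(mp_array_t c, mp_array_t a, mp_array_t b) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < *c.len) {
        /* serial part of the multiple-precision operation for c[i] */
    }
}

/* Stage 2: one thread per RNS digit performs modulo m_(i mod n)
   arithmetic on the corresponding residues. */
__global__ void digits_kernel(mp_array_t c, mp_array_t a, mp_array_t b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n * (*c.len)) {
        /* parallel digit-wise part of the operation */
    }
}

/* Stage 3: round the significands where required. */
__global__ void round_kernel(mp_array_t c) {
    /* rounding of the intermediate results */
}

void mp_ge_add(mp_array_t c, mp_array_t a, mp_array_t b, int len, int n) {
    const int T = 128; /* each kernel gets its own execution configuration */
    esi_kernel<<<(len + T - 1) / T, T>>>(c, a, b);
    digits_kernel<<<(len * n + T - 1) / T, T>>>(c, a, b, n);
    round_kernel<<<(len + T - 1) / T, T>>>(c);
}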

Performance Evaluation
In this section, we evaluate the performance of the GEMV, GE_ADD, and GEMM functions from MPRES-BLAS for various problem sizes and levels of numeric precision. In the experiments, we used an NVIDIA GeForce GTX 1080 GPU. The host machine was equipped with an Intel Core i5-7500 CPU and 16 GB of RAM and ran Ubuntu 18.04.4 LTS. The GPU codes were compiled with nvcc 10.2.89 using the options -O3 -use_fast_math -Xcompiler=-O3,-fopenmp,-ffast-math. For comparison, we also evaluated the GARPREC (only for GEMV), CUMP, and CAMPARY libraries; see Section 2 for details. The input vectors, matrices, and scalars were initialized with random numbers over [−1, 1]. Our measurements do not include the time spent transferring data between the CPU and the GPU, nor the time spent converting data into the internal multiple-precision representations.

Figure 4 shows the experimental results. In many cases, MPRES-BLAS outperforms implementations based on existing multiple-precision libraries for GPUs, and its relative speedup grows as the precision of the computation increases. At 106-bit precision, the performance of MPRES-BLAS is on average 6 times lower than that of CAMPARY. This is because the RNS-based algorithms implemented in MPRES-BLAS are designed for arbitrary (mostly large) precision and carry overhead for computing interval evaluations and for storing intermediate results in the global memory. On the other hand, MPRES-BLAS is less sensitive to precision than CAMPARY, and at 848-bit precision MPRES-BLAS performs on average 9 times better than CAMPARY.

Conclusion
In this paper, we presented MPRES-BLAS, a new library of basic linear algebra operations with multiple precision for CUDA-compatible GPUs, based on the residue number system. The current version of MPRES-BLAS (ver. 1.0) supports 16 multiple-precision operations from levels 1, 2, and 3 of the BLAS. Some operations, such as SCAL, NORM, and ASUM, typically do not require precision higher than that provided by the natively supported floating-point formats. Nevertheless, supporting such functions allows one to eliminate time-consuming intermediate conversions between the natively supported formats and the RNS-based multiple-precision format used.

In addition to the linear algebra kernels, MPRES-BLAS implements basic multiple-precision arithmetic operations for CPU and GPU (through the mp_float_t data type), so it can also be regarded as a general purpose multiple-precision arithmetic library. Furthermore, the library provides a number of optimized RNS algorithms, such as magnitude comparison and power-of-two scaling, and also supports extended-range floating-point arithmetic with working precision, which prevents underflow and overflow in computations involving extremely large or small quantities. The functionality of MPRES-BLAS will be supplemented and improved over time.