
A native tensor-vector multiplication algorithm for high performance computing

posted on 2022-03-15, 04:40 authored by Pedro J. Martinez-Ferrer, A. N. Yzelman, Vicenç Beltran
Tensor computations are important mathematical operations for applications that rely on multidimensional data. The tensor-vector multiplication (TVM) is the most memory-bound tensor contraction in this class of operations. This paper proposes an open-source TVM algorithm which is much simpler and more efficient than previous approaches, making it suitable for integration into the most popular BLAS libraries available today. Our algorithm has been written from scratch and features unit-stride memory accesses, cache awareness, mode obliviousness, full vectorization and multi-threading, as well as NUMA awareness for non-hierarchically stored dense tensors. Numerical experiments are carried out on tensors up to order 10 using various compilers and hardware architectures equipped with traditional DDR and high-bandwidth memory (HBM). For large tensors, the average performance of the TVM ranges between 62% and 76% of the theoretical bandwidth on NUMA systems with DDR memory and remains independent of the contraction mode. On NUMA systems with HBM, the TVM exhibits some mode dependency but reaches performance figures close to peak values. Finally, the higher-order power method is benchmarked with the proposed TVM kernel and delivers on average between 58% and 69% of the theoretical bandwidth for large tensors.
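For readers unfamiliar with the operation, the abstract's central primitive can be illustrated with a minimal sketch: a mode-k TVM contracts an order-d tensor with a vector along mode k, yielding an order-(d-1) tensor. The reference implementation below uses NumPy's `tensordot` purely for illustration; the paper's contribution is a native, cache-aware, vectorized kernel, not this function.

```python
import numpy as np

def tvm(A, v, k):
    """Mode-k tensor-vector multiplication: contract tensor A with
    vector v along mode k, producing an order-(d-1) tensor.
    Illustrative only; NumPy's tensordot stands in for the paper's
    hand-tuned kernel."""
    return np.tensordot(A, v, axes=([k], [0]))

# Example: an order-3 tensor of shape (2, 3, 4) contracted along mode 1
# with an all-ones vector reduces to a (2, 4) matrix whose entries are
# the sums over that mode.
A = np.arange(24.0).reshape(2, 3, 4)
v = np.ones(3)
y = tvm(A, v, 1)  # shape (2, 4)
```

The operation is memory-bound because every tensor element is read exactly once and contributes a single multiply-add, which is why the paper reports performance as a fraction of theoretical memory bandwidth rather than FLOP/s.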


Supported by MCIN/AEI/10.13039/501100011033 and ESF/10.13039/501100004895 under Grant RYC2019-027592-I

HPC Technology Innovation Lab, a Barcelona Supercomputing Center and Huawei research cooperation agreement (2020)



Submitting Author's Institution

Barcelona Supercomputing Center (BSC)

Submitting Author's Country

  • Spain
