The CUR Decomposition of Self-attention
  • Chong Wu,
  • Maolin Che,
  • Hong Yan
Chong Wu
Department of Electrical Engineering and Centre for Intelligent Multidimensional Data Analysis, City University of Hong Kong

Corresponding Author: [email protected]

Maolin Che
School of Mathematics, Southwestern University of Finance and Economics; Department of Electrical Engineering and Centre for Intelligent Multidimensional Data Analysis, City University of Hong Kong
Hong Yan
Department of Electrical Engineering and Centre for Intelligent Multidimensional Data Analysis, City University of Hong Kong

Abstract

Transformers have achieved great success in natural language processing and computer vision. Their core building block is the self-attention mechanism. Vanilla self-attention has quadratic complexity in the sequence length, which limits its application to vision tasks. Most existing linear self-attention mechanisms sacrifice some performance in exchange for reduced complexity. In this paper, we propose CURSA, a novel linear approximation of vanilla self-attention that achieves both high performance and low complexity. CURSA is based on the CUR decomposition, which replaces the multiplication of large matrices with multiplications of several small matrices to achieve linear complexity. Experimental results on image classification tasks show that CURSA outperforms state-of-the-art self-attention mechanisms with better data efficiency, faster speed, and higher accuracy.
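To make the idea concrete, the sketch below shows a generic CUR/Nyström-style linear approximation of softmax self-attention in NumPy. It is only an illustration of the general technique named in the abstract, not the paper's CURSA algorithm: the function name `cur_style_attention`, the uniform landmark sampling, and the pseudoinverse link matrix are assumptions made for this example.

```python
# Illustrative sketch only: a CUR-style linear approximation of softmax
# self-attention. NOT the authors' CURSA algorithm; landmark selection by
# uniform sampling and the pseudoinverse link matrix are assumptions.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cur_style_attention(Q, K, V, m=16, seed=0):
    """Approximate softmax(Q K^T / sqrt(d)) V in O(n*m*d) instead of O(n^2*d).

    Q, K, V: (n, d) arrays; m: number of sampled landmark rows/columns.
    """
    n, d = Q.shape
    rng = np.random.default_rng(seed)
    idx = rng.choice(n, size=min(m, n), replace=False)  # landmark indices
    Q_s, K_s = Q[idx], K[idx]

    scale = 1.0 / np.sqrt(d)
    C = softmax(Q @ K_s.T * scale)    # (n, m): sampled columns of the attention matrix
    R = softmax(Q_s @ K.T * scale)    # (m, n): sampled rows of the attention matrix
    W = softmax(Q_s @ K_s.T * scale)  # (m, m): their intersection
    U = np.linalg.pinv(W)             # small link matrix

    # Multiply small factors right to left; the full (n, n) matrix is never formed.
    return C @ (U @ (R @ V))

# Usage: compare against exact attention on a small example.
n, d = 256, 32
rng = np.random.default_rng(1)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
approx = cur_style_attention(Q, K, V, m=32)
exact = softmax(Q @ K.T / np.sqrt(d)) @ V
print("relative error:", np.linalg.norm(approx - exact) / np.linalg.norm(exact))
```

Because every factor has at most m rows or columns, the cost grows linearly with the number of tokens n, which is the complexity regime the abstract refers to.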
18 Apr 2024: Submitted to TechRxiv
24 Apr 2024: Published in TechRxiv