The CUR Decomposition of Self-attention
  • Chong Wu,
  • Maolin Che,
  • Hong Yan
Chong Wu
Department of Electrical Engineering and Centre for Intelligent Multidimensional Data Analysis, City University of Hong Kong

Corresponding Author: [email protected]

Maolin Che
School of Mathematics, Southwestern University of Finance and Economics; Department of Electrical Engineering and Centre for Intelligent Multidimensional Data Analysis, City University of Hong Kong
Hong Yan
Department of Electrical Engineering and Centre for Intelligent Multidimensional Data Analysis, City University of Hong Kong

Abstract

Transformers have achieved great success in natural language processing and computer vision. Their core building block is the self-attention mechanism. Vanilla self-attention has quadratic complexity in the sequence length, which limits its application to vision tasks. Most existing linear self-attention mechanisms sacrifice some performance in exchange for reduced complexity. In this paper, we propose CURSA, a novel linear approximation of vanilla self-attention that achieves both high performance and low complexity. CURSA is based on the CUR decomposition, which replaces the multiplication of large matrices with multiplications of several small matrices to achieve linear complexity. Experimental results on image classification tasks show that CURSA outperforms state-of-the-art self-attention mechanisms with better data efficiency, faster speed, and higher accuracy.
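To make the idea concrete, the sketch below shows a generic CUR/Nyström-style linear approximation of softmax self-attention in NumPy. It is only an illustration of the general technique named in the abstract, not the paper's CURSA algorithm: the function name `cur_style_attention`, the uniform landmark sampling, and the pseudoinverse link matrix are assumptions made for this example.

```python
# Illustrative sketch only: a CUR-style linear approximation of softmax
# self-attention. NOT the authors' CURSA algorithm; landmark selection by
# uniform sampling and the pseudoinverse link matrix are assumptions.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cur_style_attention(Q, K, V, m=16, seed=0):
    """Approximate softmax(Q K^T / sqrt(d)) V in O(n*m*d) instead of O(n^2*d).

    Q, K, V: (n, d) arrays; m: number of sampled landmark rows/columns.
    """
    n, d = Q.shape
    rng = np.random.default_rng(seed)
    idx = rng.choice(n, size=min(m, n), replace=False)  # landmark indices
    Q_s, K_s = Q[idx], K[idx]

    scale = 1.0 / np.sqrt(d)
    C = softmax(Q @ K_s.T * scale)    # (n, m): sampled columns of the attention matrix
    R = softmax(Q_s @ K.T * scale)    # (m, n): sampled rows of the attention matrix
    W = softmax(Q_s @ K_s.T * scale)  # (m, m): their intersection
    U = np.linalg.pinv(W)             # small link matrix

    # Multiply small factors right to left; the full (n, n) matrix is never formed.
    return C @ (U @ (R @ V))

# Usage: compare against exact attention on a small example.
n, d = 256, 32
rng = np.random.default_rng(1)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
approx = cur_style_attention(Q, K, V, m=32)
exact = softmax(Q @ K.T / np.sqrt(d)) @ V
print("relative error:", np.linalg.norm(approx - exact) / np.linalg.norm(exact))
```

Because every factor has at most m rows or columns, the cost grows linearly with the number of tokens n, which is the complexity regime the abstract refers to.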
18 Apr 2024: Submitted to TechRxiv
24 Apr 2024: Published in TechRxiv