ST-GCN-AltFormer: Gesture Recognition with a Spatial-Temporal Alternating Transformer
Recently, with the rapid development of 3D motion estimation technology and the continuous improvement of low-cost depth cameras, skeleton-based dynamic gesture recognition has received increasing attention. This paper proposes a novel neural network architecture, the spatial-temporal alternating graph convolutional Transformer (ST-GCN-AltFormer), which combines spatial-temporal graph convolution with an in-parallel spatial-temporal alternating Transformer for dynamic gesture recognition. First, the joint stream is fed into the spatial-temporal graph convolution (ST-GCN) to extract low-level features of gesture actions. Second, the spatial and temporal correlations between joints are modelled by an in-parallel spatial-temporal alternating Transformer (ST-TS) module, in which a spatial Transformer (STR) extracts the spatial correlations of joints within each frame and a temporal Transformer (TTR) extracts the inter-frame correlations between joints. The STR and TTR outputs are alternately fused to obtain the final gesture prediction. The performance of the proposed method is validated through experiments on three public dynamic gesture datasets: SHREC'17 Track (two evaluation protocols), DHG-14/28 (two evaluation protocols), and LMDHG. Compared with state-of-the-art methods, our method achieves the highest recognition accuracy: $97.3\%$ and $95.8\%$ on the two SHREC'17 Track protocols, $94.3\%$ and $92.8\%$ on DHG-14/28, and $98.03\%$ on LMDHG, respectively.
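To illustrate the alternating scheme described above, the following is a minimal NumPy sketch, not the authors' implementation: learned projection weights, multi-head attention, and the ST-GCN front end are omitted, and the block sizes and residual wiring are assumptions. It only shows how spatial attention (over joints within a frame, as in STR) and temporal attention (over frames per joint, as in TTR) can be applied alternately to a skeleton feature tensor.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # x: (tokens, dim); single-head scaled dot-product attention,
    # query/key/value projections omitted for brevity (assumption).
    d = x.shape[-1]
    scores = softmax(x @ x.T / np.sqrt(d))
    return scores @ x

def alternating_st_transformer(seq, n_blocks=2):
    """seq: (T frames, J joints, C channels).
    Alternates spatial attention (over joints within each frame, STR-style)
    with temporal attention (over frames for each joint, TTR-style)."""
    T, J, C = seq.shape
    x = seq.copy()
    for _ in range(n_blocks):
        # Spatial step: attend across the J joints of each frame.
        for t in range(T):
            x[t] = x[t] + self_attention(x[t])        # residual connection
        # Temporal step: attend across the T frames of each joint.
        for j in range(J):
            x[:, j] = x[:, j] + self_attention(x[:, j])
    return x

# Hypothetical ST-GCN output: 32 frames, 22 hand joints, 64-dim features.
feats = np.random.randn(32, 22, 64)
out = alternating_st_transformer(feats)
print(out.shape)  # (32, 22, 64)
```

The alternation keeps each attention map small: spatial attention is J x J per frame and temporal attention is T x T per joint, instead of one (T*J) x (T*J) map over all tokens at once.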
Email Address of Submitting Author: flp@zjut.edu.cn
ORCID of Submitting Author: 0000-0001-7752-4839
Submitting Author's Institution: College of Information Engineering, Zhejiang University of Technology
Submitting Author's Country: