WiGRUNT: WiFi-enabled Gesture Recognition Using Dual-attention Network

Abstract—Gestures constitute an important form of nonverbal communication in which bodily actions deliver messages alone or in parallel with spoken words. Recently, there has been an emerging trend of WiFi-sensing-enabled gesture recognition due to its inherent merits of being device-free, covering non-line-of-sight scenarios, and being privacy-friendly. However, current WiFi-based approaches mainly rely on domain-specific training since they do not know "where to look" and "when to look". To this end, we propose WiGRUNT, a WiFi-enabled gesture recognition system using a dual-attention network, to mimic how a keen human being intercepts a gesture regardless of environment variations. The key insight is to train the network to dynamically focus on the domain-independent features of a gesture in the WiFi Channel State Information (CSI) via a spatial-temporal dual-attention mechanism. WiGRUNT is rooted in a Deep Residual Network (ResNet) backbone to evaluate the importance of spatial-temporal clues and exploit their inbuilt sequential correlations for fine-grained gesture recognition. We evaluate WiGRUNT on the open Widar3 dataset and show that it significantly outperforms its state-of-the-art rivals by achieving the best-ever in-domain and cross-domain performance.


INTRODUCTION
A gesture is a movement, usually of the body or limbs, that conveys an idea, sentiment or attitude. It dates back to our hominid ancestors, who are believed to have been "better pre-adapted to acquire language-like competence using manual gestures than using vocal sounds" [1]. Nowadays, as mankind steps into the information era, the situation remains much the same in the context of human-computer interaction (HCI), since we still heavily rely on gestures to deliver messages, commands or even emotions to the computers surrounding us [2]. Thus, there is a compelling demand for an effective gesture recognition system that can accurately recognize gestures to ensure the timely communication and reaction of computers to their users [3].
Gesture recognition has drawn attention from both the academic and industrial communities in recent years. Various signals have been exploited in prior works, such as vision [4], wearable sensors [5], RFID [6], and Doppler radar [7]. As the primary sense of human beings, vision is also the most important signal for computers in gesture recognition [8]. In general, a gesture is captured as a dynamic movement on video, which is then decomposed into a set of features taking individual frames into account for recognition [9]. Vision sensing is convenient since it allows non-invasive gesture recognition at a distance, but it usually requires sufficient lighting or its recognition accuracy may be severely degraded. Also, the line-of-sight constraint may demand deploying multiple cameras working cooperatively to eliminate blind zones, even to cover a small area. Another major option is wearable sensors such as accelerometer (ACC) or electromyography (EMG) sensors, which have no lighting requirements while delivering excellent gesture recognition performance [5]. But wearable sensors, as their name suggests, need to be worn in close proximity to the user to ensure valid and reliable sensory data. The same issue also exists for RFID-based techniques, which attach cheap, passive RFID tags to clothes or the body to gather movement readings [6]. Doppler-radar-based approaches are non-contact: by exploiting the Doppler effect to detect movements, they can recover the corresponding gesture via machine learning [7]. However, radar devices are currently the most costly to deploy and use compared to the other approaches. In summary, current gesture recognition methods using the above signals usually need specialized devices with non-negligible deployment overheads, hindering their practicality and flexibility in daily life. Also, privacy concerns cannot be overlooked.
To solve these issues, WiFi sensing has recently emerged as an efficient alternative for gesture recognition [10]. It can fully leverage the ubiquitous WiFi infrastructure to provide a low-cost, device-free and privacy-friendly solution. The theoretical underpinning is that human beings are to the WiFi signal (either 2.4 GHz or 5 GHz) just like mirrors to light, i.e., a movement will reflect the WiFi signal and cause multi-path distortions in the channel response. By modeling or mapping such distortions to the corresponding movements, a gesture can be recovered.
WiGest [11] is among the first WiFi-sensing-enabled gesture recognition systems, with a focus on hand gestures. By modeling each gesture as a manually defined pattern in the Received Signal Strength (RSS), it utilizes a matching method to achieve an average 87.5% accuracy using only one access point (AP). WiGest is quite inspiring; however, it has two major drawbacks, i.e., the coarse-grained RSS indicator and the poor scalability in the number of gestures. WiFinger [12] tackles these demerits by employing the fine-grained Channel State Information (CSI) as the indicator and by relying on machine learning to accommodate more gestures, respectively. WiFinger lays the foundation of learning-based WiFi gesture recognition for the following research [12], [13], [14]. However, the learning-based approaches face a critical challenge, i.e., dependence on domain-specific training with respect to locations, orientations and environments.
Recently, WiDar3 [15] has aimed at this issue and proposed a new feature named the Body-coordinate Velocity Profile (BVP), which describes the power distribution over different velocities, to achieve cross-domain gesture recognition. Similarly, WiHF [16] derives a domain-independent motion change pattern of arm gestures, rendering their unique characteristics and the user's performing style. These works are very enlightening, but the handcrafted cross-domain features in [15], [16] are unlikely to fully recover the domain-independent clues of a gesture spread over the spatial-temporal dimension of CSI. In particular, a gesture recorded in CSI usually involves multiple pairs of transmitting and receiving antennas that are placed diffusedly to increase the spatial granularity. For each pair of antennas, the signal distortion caused by a gesture is likely distributed over multiple subcarriers (representing different central frequencies). Therefore, there exists a surging demand for an agile learning framework that can automatically and adaptively focus on and extract such critical gestural cues scattered over spatial-temporal CSI profiles.
To this end, we propose WiGRUNT, a device-free and privacy-friendly gesture recognition system that dynamically concentrates on the domain-independent features via a dual-attention mechanism. The key idea of WiGRUNT, in a nutshell, is to mimic how a keen person intercepts a gesture regardless of environment changes by exploring its sequential correlations in the time and space dimensions. Firstly, for a human being, vision constitutes the most reliable signal for gesture recognition. Similarly, instead of processing the WiFi signal directly, we propose a novel CSI visualization method that leverages the CSI-ratio method [17] for effective denoising and then fuses all CSI subcarriers from spatially distributed pairs of antennas into normalized time-series images, so as to leverage cutting-edge techniques developed for vision sensing such as neural network backbones and pre-training. Secondly, a well-trained human being would subconsciously isolate and perceive a gesture in any environment via its inherent characteristics. Likewise, we design a dual-attention CSI network (DACN) that can automatically focus on the sequential correlations of a gesture despite domain variations. In particular, we embed a Deep Residual Network (ResNet) backbone, well-recognized for handling the correlation decaying issue, with a fine-tuned spatial-temporal attention module, just as a human being determines "where to look" and "when to look" for a gesture. DACN assigns a weight to each pixel denoting how much attention it deserves, forming an attention map for each image that indicates the distribution of the domain-irrelevant clues of a gesture in the spatial-temporal dimension. It then combines the images and the corresponding attention maps for gesture recognition.
We evaluate WiGRUNT on the open Widar3 dataset and show that WiGRUNT significantly outperforms its state-of-the-art rivals by achieving the best-ever performance, i.e., 99.67% in-domain recognition accuracy and 96%, 92.6%, and 93.15% recognition accuracy across locations, orientations, and environments, respectively. Moreover, we share all the source codes at https://github.com/purpleleaves007/WiGRUNT to facilitate further validation and optimization.
The main contributions of WiGRUNT are summarized as follows:
• To the best of our knowledge, we are among the first to explore the attention mechanism for effective gesture recognition in WiFi sensing, which enables the best-ever in-domain or cross-domain recognition performance compared to the state-of-the-art rivals.
• We propose a simple but effective CSI visualization method to fuse multiple CSI streams from spatially distributed devices into time-series images to provide a fine-grained representation of a gesture.
• We design the dual-attention CSI network (DACN) based on a ResNet backbone to dynamically focus on the domain-independent informative clues of a gesture spread over the spatial-temporal dimension.

The rest of this paper is organized as follows: Section 2 reviews some representative prior works for gesture recognition using different signals, especially WiFi. Section 3 introduces the dual-attention CSI network rooted in a ResNet backbone, followed by experimental evaluation on the Widar3 dataset in Section 4. Finally, Section 5 summarizes the paper.

RELATED WORK
Half a million years ago, our hominin ancestors began to hunt large animals like mammoths through precisely coordinated hunting using vocal and gestural signals [22]. Since then, gestures have played an important role in our daily life for information and emotion exchange. As human beings enter the information era, a new communication entity rises, i.e., the computer. Primary input devices like mice and keyboards are cumbersome and pose location constraints for users. As wireless communication blossoms, pervasive computing [23] aims to relieve such limits and allow seamless access to the surrounding computers anytime and anywhere, raising a new challenge: how to convey messages, commands or even emotions to the ambient computers? Again, gestural language naturally becomes a tempting option. However, current gesture recognition solutions in the context of HCI are mainly based on signals like vision [4], [24], [25], wearable sensors [5], [26], [27], RFID [6], [28], [29], and Doppler radar [7], [30], [31], which incur non-negligible deployment and maintenance overheads, limiting their practicality and scalability in the real world. Recently, WiFi sensing has emerged as a low-cost, device-free and privacy-friendly alternative and drawn much attention from the academic community. Table 1 compares and summarizes some representative prior works related to WiFi-sensing-enabled gesture sensing and recognition. They can be roughly divided into two categories: modeling-based and learning-based. The former normally relies on manual characterization of the mapping between signal distortions and gestures, while the latter generally leverages machine learning for gesture recognition.
Modeling-based: WiGest [18] sets up a handcrafted pattern for each gesture in RSS and designs a similarity matching method for recognition. It is an inspiring work, but the coarse-grained RSS indicator severely limits its accuracy. WiMU [13] pushes the research further by achieving multi-user gesture recognition using fine-grained CSI. It depends on an exhaustive search matching combinations of known gestures to the collected samples for gesture recognition. The idea is quite interesting, but the way of manually defining patterns naturally leads to a scalability issue. WiDraw [18] uses Angle-of-Arrival (AoA) measurements for hand tracking and allows users to draw in the air with bare hands with an average tracking error of less than 5 cm. But its practicality remains unjustified since it needs over 25 WiFi transceivers surrounding the user. QGesture [19] employs phase information and achieves a similar performance using only two receiving antennas, but it needs to know the initial hand position before tracking.
Learning-based: These works can be further divided into shallow learning [12], [20] and deep learning [14], [15], [16]. The former uses handcrafted features to train a shallow learner to classify gestures. WiKey [20] is among the first to realize keystroke recognition based on WiFi sensing, but it is very sensitive to environmental changes. WiFinger [12] uses WiFi CSI to recognize 9 sign language gestures, but the user is constrained to the middle of the LoS path between the transmitting and receiving antennas. In general, shallow learning only needs a small training dataset, but its performance is limited. Consequently, deep learning emerges as an effective alternative. For example, WiSign [14] aims at American Sign Language recognition, using amplitude and phase CSI profiles processed by a Deep Belief Network (DBN). However, the deep-learning-based approaches face a critical challenge, i.e., dependence on heavy domain-specific training.
WiDar3 [15] targets this issue and proposes a domain-independent feature, BVP, that describes the power distribution over different velocities to achieve cross-domain gesture recognition. WiDar3 is among the first to reveal and address this cross-domain issue in gesture recognition, and its well-honed dataset lays the foundation for a fair comparison among different recognition frameworks. Based on the Widar3 dataset, WiHF [16] derives a domain-independent motion change pattern of arm gestures to obtain unique features for cross-domain recognition. These works are quite enlightening, but the handcrafted cross-domain features can hardly cover the domain-independent clues of a gesture in the spatial-temporal dimension. This drives us to design a new learning framework that can automatically extract and explore such critical gesture cues scattered in CSI profiles.

SYSTEM DESIGN
In this part, we first model CSI-based gesture perception and then introduce WiGRUNT in detail. Figure 1 provides an overview of the architecture of WiGRUNT, which consists of two modules, i.e., the CSI preprocessing module and the gesture recognition module. The first module denoises and visualizes raw CSI data as time-series images, while the second module leverages the DACN for gesture recognition.
CSI Preprocessing Module: Upon receiving CSI measurements from spatially distributed WiFi transceiver pairs, our system first denoises the raw data using the CSI-ratio method [17], [32]. Then, it extracts the phase tensor from all subcarriers to form a two-dimensional matrix and visualizes it into time-series images to provide a fine-grained description of gestures in spatial-temporal dimension.
Gesture Recognition Module: The time-series images generated in the first module will be processed in our DACN for gesture recognition. DACN roots in a ResNet backbone mounted with a dual-attention module to exploit the sequential correlations of a gesture for cross-domain recognition.

Modeling the CSI-based Gesture Perception
WiFi CSI describes the signal's attenuation along its propagation paths, such as scattering, multi-path fading or shadowing, and power decay over distance. In the frequency domain, it can be characterized as [33]:

Y = HX + N,

where Y and X are the received and transmitted signal vectors, respectively, N is the additive white Gaussian noise, and H is the channel matrix representing CSI. CSI is a superposition of the signals of all propagation paths, and its Channel Frequency Response (CFR) can be represented as:

H(f, t) = Σ_{m∈Φ} a_m(f, t) e^{-j2π d_m(t)/λ},

where f and t represent the center frequency and time stamp, respectively, and m indexes the multi-path components. a_m(f, t) and d_m(t) denote the complex attenuation and propagation length of the m-th multi-path component, respectively. Φ denotes the set of multi-path components, and λ is the signal wavelength.
In the case of CSI-based gesture recognition, the multi-path components consist of dynamic and static paths [34]:

H(f, t) = H_s(f, t) + H_d(f, t) = H_s(f, t) + Σ_{m∈Φ_d} a_m(f, t) e^{-j2π d_m(t)/λ},

where H_s(f, t) and H_d(f, t) denote the static and dynamic components, respectively. Φ_s represents the set of static paths, e.g., those reflected off walls, furniture and static body parts, while Φ_d denotes the set of dynamic paths, e.g., those reflected off moving body parts. When detecting gestures, the movements of hands and arms change the dynamic propagation distance d_m(t) for m ∈ Φ_d, thus altering the phase shift e^{-j2π d_m(t)/λ} of the dynamic paths and finally affecting H(f, t). In short, a gesture can be portrayed by the change of phase shift in CSI.
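To make the model above concrete, the following minimal sketch (not from the paper; the carrier frequency, path lengths, and attenuations are illustrative assumptions) simulates one static path plus one dynamic path whose length changes during a gesture, and shows that the movement surfaces as phase fluctuation in H(f, t):

```python
import numpy as np

c = 3e8                      # speed of light (m/s)
f = 5.32e9                   # an assumed 5 GHz WiFi center frequency
lam = c / f                  # wavelength λ

t = np.linspace(0, 1, 1000)  # 1 s of CSI samples
d_static = 4.0               # static path length (m), assumed
d_dynamic = 3.0 + 0.5 * t    # hand moves 0.5 m over the gesture

# Static and dynamic components of the CFR (attenuations assumed)
H_s = 1.0 * np.exp(-2j * np.pi * d_static / lam)
H_d = 0.3 * np.exp(-2j * np.pi * d_dynamic / lam)
H = H_s + H_d                # superposition H(f, t)

# The gesture shows up as phase (and amplitude) fluctuation over time
phase = np.unwrap(np.angle(H))
```

Because the dynamic path sweeps several wavelengths, the phase of H(f, t) oscillates noticeably even though the static component dominates, which is exactly the signature the later sections try to isolate.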

CSI Preprocessing
As demonstrated in the previous section, the CSI phase profile can characterize gestures. Unfortunately, on commodity WiFi devices the transmitter and receiver are not synchronized, so there exists a time-varying random phase offset e^{-jθ_offset}:

H(f, t) = e^{-jθ_offset} (H_s(f, t) + A(f, t) e^{-j2π d(t)/λ}),

where A(f, t), e^{-j2π d(t)/λ} and d(t) denote the complex attenuation, phase shift and path length of the dynamic component, respectively. This random offset prevents us from directly using the CSI phase information.
Therefore, we need to eliminate e^{-jθ_offset}. Fortunately, for a commodity WiFi card, this random offset remains the same across different antennas on the same WiFi Network Interface Card (NIC), as they share the same RF oscillator. It can thus be eliminated by the CSI-ratio model [17], [32]:

H_q(f, t) = H_1(f, t) / H_2(f, t),

where H_1(f, t) and H_2(f, t) are the CSI of two receiving antennas. When the two antennas are close to each other, the random offsets in the numerator and denominator cancel out while the gesture-induced phase dynamics are preserved. Therefore, the phase P extracted from H_q can be used to describe gestures:

P(f, t) = angle(H_q(f, t)),

where angle(·) denotes the phase extraction function.

TABLE 2: Definitions of symbols in the attention-based recognition network (the superscript * means that the item is a function or process).

Name | Definition
L | The classification loss
F* | The processing of the backbone network
La | True label of the input data
P_A | The matrix generated by the attention process
W_f | The parameter of function f
CE* | Generates the classification loss via the cross-entropy operation
ImP | The input of the attention-based recognition neural network

After removing the random phase offset, we get a 4-dimensional tensor H ∈ C^{N×M×K×T} as shown in Fig. 2a. Naturally, the key challenge here is to automatically and adaptively isolate the informative cues of a gesture scattered over such a spatial-temporal dimension. As the main sense of human beings, vision also constitutes a major signal for gesture recognition and draws significant efforts in academia. Hence, instead of processing this high-dimensional data directly, WiGRUNT relies on the visualization method [34] that maps H ∈ C^{N×M×K×T} into time-series images ImP (heatmaps in Fig. 2b), which are then fed into our DACN to get the corresponding attention maps (also in Fig. 2b) evaluating how much attention an information clue deserves for fine-grained gesture recognition.
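The denoising and visualization steps above can be sketched with synthetic CSI (a toy NumPy example; the antenna gains, subcarrier count, and simulated phase ramps are all assumptions): the shared offset cancels in the ratio H_1/H_2, and the resulting phase matrix is normalized into a [0, 1] heatmap image.

```python
import numpy as np

rng = np.random.default_rng(0)
T, K = 500, 30                        # time samples, subcarriers (assumed)

# A random per-packet phase offset, identical on both antennas of one NIC
theta = rng.uniform(0, 2 * np.pi, size=(T, 1))
clean1 = np.exp(-2j * np.pi * np.linspace(0, 3.0, T))[:, None] * np.ones((1, K))
clean2 = 0.8 * np.exp(-2j * np.pi * np.linspace(0, 2.5, T))[:, None] * np.ones((1, K))

H1 = clean1 * np.exp(-1j * theta)     # both antennas see the same offset
H2 = clean2 * np.exp(-1j * theta)

Hq = H1 / H2                          # CSI-ratio: the offset cancels
P = np.angle(Hq)                      # phase P used to describe gestures

# Normalize the phase matrix into a [0, 1] image for the heatmap input
img = (P - P.min()) / (P.max() - P.min())
```

The ratio removes the unknown offset exactly because it multiplies numerator and denominator alike, which is the essence of the CSI-ratio model.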

WiFi Meets Attention: Realizing Recognition Using WiFi With an Attention-Based Neural Network
Psychologists believe that a human being well-trained in a culture would subconsciously keep attention on a gesture during communication by exploring its inherent sequential correlation [35]. Likewise, WiGRUNT aims to mimic this phenomenon by designing a new learning framework that can automatically focus on and exploit the inbuilt, domain-independent features of a gesture. Fig. 3 shows the basic structure of an attention-based neural network. In our case, the time-series images ImP include fine-grained descriptions of a gesture scattered over the time and space (including frequency) dimensions. For each image of ImP, its pixels should be evaluated for their importance to the recognition of the gesture, so that WiGRUNT can keep focusing on the essential cues of a gesture while suppressing the rest of the information, i.e.,

P_A = f(ImP; W_f) ⊗ ImP = A ⊗ ImP,

where A denotes an attention map, and W_f is the parameter of the function f, initialized randomly. The dimensions of A and ImP are the same, and each pixel in A corresponds to the weight of the pixel in ImP; the larger the weight, the more important the pixel in ImP. Then, P_A is processed by the backbone network F to get the classification result R:

R = F(P_A),

and the classification loss L is generated by comparing R with the true label La via the cross-entropy operation:

L = CE(R, La).

After obtaining the loss, the neural network calculates the partial derivative ∂L/∂W_f of the loss L with respect to the parameter W_f of the attention function f based on the back-propagation algorithm. Consequently, the network updates the parameter W_f based on gradient descent:

W_f ← W_f − α ∂L/∂W_f,

where α represents the learning rate. The process of calculating the loss L and adjusting the parameter W_f iterates until the model converges. In this way, the attention module adaptively learns how to generate an accurate attention map A, which assigns larger weights to the critical pixels in ImP, allowing the network to pay more attention to essential cues when recognizing gestures.
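The attention-training loop above can be illustrated with a deliberately tiny stand-in (all names hypothetical; the backbone F and cross-entropy loss are replaced by a simple scoring function and a squared error, and the analytic back-propagation gradient by a numerical one). Here only the top half of the toy image matters to the "backbone", so the learned attention map A ends up weighting those pixels more heavily:

```python
import numpy as np

rng = np.random.default_rng(1)
ImP = rng.random((8, 8))              # toy input "phase image"
B = np.zeros((8, 8))
B[:4, :] = 1.0                        # stand-in backbone: only the top half counts
target = 0.8 * ImP[:4, :].sum()       # score the optimal attention should reach

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def loss(W_f):
    A = sigmoid(W_f)                  # attention map A = f(ImP; W_f)
    P_A = A * ImP                     # element-wise weighting, P_A = A ⊗ ImP
    R = (B * P_A).sum()               # stand-in for backbone F and scoring
    return (R - target) ** 2          # stand-in for the cross-entropy loss L

W_f = np.zeros((8, 8))
alpha, eps = 0.5, 1e-5
for _ in range(300):                  # iterate until (near) convergence
    g = np.zeros_like(W_f)            # numerical dL/dW_f replaces backprop
    for i in range(8):
        for j in range(8):
            Wp = W_f.copy(); Wp[i, j] += eps
            Wm = W_f.copy(); Wm[i, j] -= eps
            g[i, j] = (loss(Wp) - loss(Wm)) / (2 * eps)
    W_f -= alpha * g                  # gradient-descent update of W_f
A = sigmoid(W_f)                      # final attention map
```

After training, the loss is near zero and the attention weights on the informative (top-half) pixels exceed those on the uninformative ones, which is the behavior the text describes.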
Expanding from the above basic network, in this paper we propose the Dual-Attention CSI Network (DACN) for the CSI-based cross-domain gesture recognition task. It utilizes two temporal-spatial attention modules to generate attention maps for the input CSI phase map and for the feature map extracted from the backbone network ResNet18, respectively. With the help of these attention maps, WiGRUNT realizes zero-effort cross-domain gesture recognition; both modules are described in detail below.

Dual-Attention CSI Network based Gesture Recognition
Given a phase map ImP ∈ R^{C×H×W} as input, where C, H, W are the number of channels, height and width of the map, respectively (in WiGRUNT they are set to 3 (R, G, B channels), 224 and 224, respectively), our DACN sequentially infers a 2D temporal-spatial attention map A_tsa ∈ R^{1×H×W} and a 1D temporal-spatial attention map A_tsb ∈ R^{C1×1×1}, as illustrated in Figure 4, where C1 is 512 when using ResNet18 [36] as the backbone network to extract features. The overall process of our DACN can be summarized as:

ImP′ = A_tsa(ImP) ⊗ ImP ⊕ ImP,
ImP′′ = ResNet18(ImP′),
ImP′′′ = A_tsb(ImP′′) ⊗ ImP′′,

where ⊗ and ⊕ denote element-wise multiplication and summation, respectively. Figure 4 depicts the computation process of each attention map, and the following describes the details of each attention module.

TABLE 3: Definitions of symbols in defining the DACN (the superscript * means that the item is a function or process).

Name | Definition
A_tsa | The attention map that focuses on pixels in ImP
A_tsb | The attention map that marks the importance of different channels in the feature map
ImP | The heat map of the phase matrix P
ImP′ | The output of the first attention processing, and the input of ResNet18
ImP′′ | The output of ResNet18, and the input of the second attention processing
ImP′′′ | The output of the second attention processing, and the input of recognition processing
σ* | The sigmoid function
f*(7×7) | The convolution operation with a filter size of 7 × 7
BN* | The batch normalization processing
MLP* | Multi-layer perceptron
AvgPool* | Average pooling
MaxPool* | Max pooling
ResNet18* | The operation of the ResNet18 neural network
Softmax* | Classification process
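At the shape level, the overall DACN pipeline can be sketched as follows (the ResNet18 backbone is replaced by a hypothetical stand-in that only reproduces its 3×224×224 → 512×7×7 mapping, and the attention weights are random placeholders rather than learned parameters):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(2)
ImP = rng.random((3, 224, 224))          # input phase map (C, H, W)

# First attention: one weight per pixel, shared across channels (1 x H x W)
A_tsa = sigmoid(rng.standard_normal((1, 224, 224)))
ImP1 = A_tsa * ImP + ImP                 # ImP' = A_tsa ⊗ ImP ⊕ ImP

def fake_backbone(x):
    # stand-in for ResNet18: pool 224x224 down to 7x7, lift to 512 channels
    pooled = x.reshape(3, 7, 32, 7, 32).mean(axis=(2, 4))   # (3, 7, 7)
    return np.tile(pooled.mean(axis=0), (512, 1, 1))        # (512, 7, 7)

ImP2 = fake_backbone(ImP1)               # ImP'' in R^{512x7x7}

# Second attention: one weight per channel (C1 x 1 x 1)
A_tsb = sigmoid(rng.standard_normal((512, 1, 1)))
ImP3 = A_tsb * ImP2                      # ImP''' = A_tsb ⊗ ImP''
```

The broadcasting rules do the work here: a 1×H×W map reweights every pixel across channels, while a C1×1×1 map reweights every channel across pixels, matching the two attention granularities in Table 3.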

Temporal-Spatial Attention Module A
We generate the temporal-spatial attention map A_tsa by utilizing the inter-spatial relationship of the inputs [37]. A_tsa directly focuses on 'where' the informative parts are, i.e., it marks the importance of different pixels in ImP. DACN uses a standard convolution layer that convolves the input ImP to produce our 2D temporal-spatial attention map; through the convolution operation, we can calculate the local inter-spatial relationship of the inputs. The temporal-spatial attention A_tsa is computed as:

A_tsa(ImP) = σ(BN(f^{7×7}(ImP))),

where σ denotes the sigmoid function, f^{7×7} represents a convolution operation with a filter size of 7 × 7, and BN indicates the batch normalization operation. Figure 5a shows the CSI phase waveform after noise reduction. As explained above, gestures cause the CSI phase to change, and the more significantly the phase waveform changes, the more information it contains. From Figure 5a, we can easily locate the more critical periods in the time dimension for gesture recognition, and we mark them with black boxes. In the input map, Figure 5b, we use black boxes to mark the same periods. The temporal-spatial attention map A_tsa is shown in Figure 5c, where a redder color means a larger weight. It can be seen that the critical pixels in the black boxes, recording a gesture from different antenna pairs in different time periods in Figure 5b, turn hotter in Figure 5c, which demonstrates the effectiveness of our temporal-spatial attention module A.
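A minimal NumPy rendition of module A might look like this (a sketch, not the trained layer: the 7×7 kernel is a random stand-in for the learned filter, and per-map standardization stands in for batch normalization):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv7x7(x, k):
    # 'same' convolution with zero padding, single in/out channel
    H, W = x.shape
    xp = np.pad(x, 3)
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = (xp[i:i + 7, j:j + 7] * k).sum()
    return out

rng = np.random.default_rng(3)
ImP = rng.random((16, 16))               # toy single-channel phase map
kernel = rng.standard_normal((7, 7)) * 0.1

z = conv7x7(ImP, kernel)                 # f^{7x7}(ImP)
z = (z - z.mean()) / (z.std() + 1e-5)    # batch-norm stand-in (per map)
A_tsa = sigmoid(z)                       # one weight in (0, 1) per pixel
out = A_tsa * ImP + ImP                  # weight the input, keep the residual
```

The residual addition mirrors the design choice discussed below: multiplying alone could zero out pixels entirely, whereas A ⊗ ImP ⊕ ImP guarantees no pixel is suppressed below its original value.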
In addition to finding important antenna pairs and periods, our attention module can also mark essential subcarriers. As shown in Figure 5c, for the second Transmit-Receive (TR) pair, the subcarrier with a smaller index is not as important as the subcarrier with a larger index between 400-600 ms. Figure 6 shows the phase waveforms of the second and twenty-fifth subcarriers of the second TR pair. It can be seen that in the 25th subcarrier, the waveform fluctuates significantly during 400-600 ms, while in the second subcarrier the fluctuation is not obvious, meaning that this subcarrier is not sensitive to gestures and thus deserves less attention.
After obtaining the attention map, we multiply it with the input and then add the original input map, as shown in Figure 5d. We add the original input map after the multiplication because the multiplication alone would completely ignore some pixels and make subsequent training on the ResNet18 (18 layers) backbone unstable.

Temporal-Spatial Attention Module B
In this module, we produce a temporal-spatial attention map by exploiting the inter-channel relationship of features [38]. The feature map output from the backbone network ResNet18 has 512 channels, and the dimension of each channel matrix is 7 × 7. As each channel of a feature map is considered a feature detector [39], different channels pay attention to different parts of the input. A_tsb indirectly focuses on 'where' the informative parts are, that is, it marks the importance of the different channels in the feature map.
To compute the channel attention efficiently, we first aggregate the spatial information of the feature map using average-pooling and max-pooling operations to generate two different spatial context descriptors. Both descriptors are then forwarded to a shared multi-layer perceptron (MLP) to produce our 1D attention map A_tsb. Finally, we use element-wise summation to merge the output features and generate the attention map through the sigmoid operation. The 1D temporal-spatial attention is computed as:

A_tsb(ImP′′) = σ(MLP(AvgPool(ImP′′)) + MLP(MaxPool(ImP′′))),

where σ denotes the sigmoid function, and AvgPool and MaxPool denote average pooling and max pooling, respectively. Figure 7 shows channels 1, 3 and 212 of the feature map before and after being weighted by temporal-spatial attention map B. The information concerned by channel 3 has a high degree of coincidence with the black boxes in Figure 5, the degree of coincidence of channel 1 is low, and channel 212 focuses on areas that are completely unimportant. Therefore, our attention module assigns the largest weight to channel 3, a smaller weight to channel 1 and a near-zero weight to channel 212. The weighted feature map then goes through an average pooling layer and a softmax layer to obtain the final recognition result.
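Module B can be sketched similarly (a NumPy sketch with random placeholder MLP weights; the reduction ratio of 16 in the shared MLP is an assumption borrowed from common channel-attention designs rather than stated here):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(4)
C, H, W = 512, 7, 7
feat = rng.random((C, H, W))             # stand-in for the ResNet18 output

avg = feat.mean(axis=(1, 2))             # AvgPool: one descriptor per channel
mx = feat.max(axis=(1, 2))               # MaxPool: one descriptor per channel

# Shared two-layer MLP applied to both descriptors (weights are placeholders)
W1 = rng.standard_normal((C, C // 16)) * 0.05
W2 = rng.standard_normal((C // 16, C)) * 0.05

def mlp(d):
    return np.maximum(d @ W1, 0.0) @ W2  # ReLU hidden layer

A_tsb = sigmoid(mlp(avg) + mlp(mx))      # one attention weight per channel
weighted = feat * A_tsb[:, None, None]   # reweight the 512 channels
```

Sharing the MLP across the two pooled descriptors keeps the parameter count low while letting the module exploit both average and peak channel activity, which is the efficiency argument made in the text.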

IMPLEMENTATION AND EVALUATION
Dataset: The public dataset WiDar3 [15] is constructed for fair comparisons among different learning frameworks. Therefore, as in state-of-the-art prior works like WiHF [16], we rely on Widar3 to evaluate WiGRUNT. WiDar3 contains 15375 samples collected from 3 environments, and its detailed description is shown in Table 4. In our evaluation, the division of the dataset is kept the same as in [16]. In Sections 4.1 and 4.2, we use 4500 samples (6 users × 5 positions × 5 orientations × 6 gestures × 5 instances, where position means the position of the subject and orientation means the direction the subject faces) to evaluate the in-domain, cross-location, and cross-orientation performance of WiGRUNT, as in other studies. To verify the performance of the system with more users and more gestures, in Section 4.5 we also evaluate the in-domain, cross-location, and cross-orientation performance of WiGRUNT with all the data from the 1st environment (10125 samples, 9 users × 5 positions × 5 orientations × 9 gestures × 5 instances).
For in-domain, cross-location, and cross-orientation evaluation, 80% of the data are used as the training set and 20% as the test set, and we perform 5-fold cross-validation. For cross-location evaluation, we choose one position for testing and the remaining four for training each time; the in-domain and cross-orientation evaluations are organized similarly. For cross-environment evaluation, we use the data from all 3 environments, which contains 12000 samples (16 users × 5 positions × 5 orientations × 6 gestures × 5 instances). We use the data from two environments for training and the remaining environment for testing, and perform 3-fold cross-validation.
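The leave-one-domain-out protocol described above can be sketched as follows (a sketch of the cross-location case only; the sample counts are illustrative, not the actual dataset sizes):

```python
# One (location, gesture) label pair per sample; counts are illustrative.
samples = [(loc, g) for loc in range(5) for g in range(6) for _ in range(10)]

def cross_location_folds(samples, n_locations=5):
    # one fold per location: that location tests, the other four train
    for held_out in range(n_locations):
        train = [s for s in samples if s[0] != held_out]
        test = [s for s in samples if s[0] == held_out]
        yield train, test

folds = list(cross_location_folds(samples))
```

Cross-orientation and cross-environment splits follow the same pattern with the held-out attribute swapped, so the tested domain never appears during training.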
The cross-domain performance is shown in Table 5. The first interesting observation is that WiGRUNT yields a stable performance across different locations, e.g., the recognition accuracy ranges from 93% to 98.89% with a standard deviation of 0.01913. However, the recognition accuracy across different orientations is quite diversified. For instance, WiGRUNT only obtains 89.33% accuracy for orientation 1, while it reaches 96% for orientation 4. The same phenomenon also occurs in WiHF and WiDar3 (70.17% in WiHF and close to 78% in WiDar3 for orientation 1). The reason is that in the case of orientation 1, gestures might be shadowed by the human body. Nevertheless, the performance degradation between the best and worst orientations of WiGRUNT (6.89%) is much lower than that of WiHF and WiDar3 (19.89% and over 10%, respectively), demonstrating the superiority of WiGRUNT over its state-of-the-art rivals.

Comparative Study
Compared to Widar3 and WiHF, WiGRUNT does not need to extract handcrafted features, so the preprocessing step is simplified. We show the specific processing flows of these three approaches in Table 6. WiDar3 first denoises the CSI data, then performs time-frequency analysis and motion tracking on the noise-reduced data to generate the BVP feature, and finally uses a neural network to recognize gestures with BVP. WiHF is similar to WiDar3, but the extracted feature is the motion change pattern instead of BVP. Considering that a handcrafted feature may lose gestural information, WiGRUNT extracts features adaptively based on the attention mechanism, omitting cumbersome feature extraction steps while simplifying the system implementation.
As shown in Figure 8 and Table 7, WiGRUNT significantly outperforms its state-of-the-art rivals in both in-domain and cross-domain scenarios, even WiHF with the HuFu dataset (tailored from the WiDar3 dataset). In the case of using the same dataset (WiHF with HuFuM), WiGRUNT is superior to the current best solutions by 5.75%, 5.68%, 10%, and 0.75% in terms of the in-domain, cross-location, cross-orientation, and cross-environment evaluations, respectively. We believe the reason is that, though the handcrafted features are quite ingenious, they may not be able to cover all the domain-independent sequential correlations scattered over the spatial-temporal dimension. For instance, the BVP feature is subtly designed, but it somehow ignores which clues are important and which are not. WiHF addresses this issue by directly evaluating these clues, but the features provided by WiHF only focus on the period of the motion change and filter out the information in the remaining periods, leading to a certain information loss.

Impact of Different Attention Modules
In this part, we evaluate the performance of WiGRUNT with different combinations of the two attention modules to assess their effectiveness. The results are shown in Table 8. One interesting observation is that the attention mechanism only provides a negligible enhancement in the in-domain scenario, i.e., a 0.2% improvement, but it proves quite helpful in the cross-domain scenarios. Moreover, ResNet equipped with either temporal-spatial attention module A or module B outperforms the basic ResNet-18 network. Temporal-spatial attention module A directly focuses on essential clues that are conducive to gesture recognition in the temporal and spatial dimensions by assigning weights to the pixels of the input image. Temporal-spatial attention module B indirectly focuses on important information in the temporal and spatial dimensions by assigning weights to each channel of the feature map. Furthermore, using the two temporal-spatial attention modules simultaneously achieves better performance than either single module. The experimental results show that both attention modules in WiGRUNT provide important clues, and these clues are complementary.
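The weighting idea behind the two modules can be sketched in NumPy as follows. This is an illustration only, not the authors' implementation: the weight arrays are random stand-ins for parameters that would be learned end-to-end, and the function names are ours.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(feat, w_spatial):
    # Module A (illustrative): one weight per pixel of the
    # temporal-spatial plane, broadcast over all channels.
    # feat: (C, H, W), w_spatial: (H, W)
    mask = sigmoid(w_spatial)           # per-pixel weights in (0, 1)
    return feat * mask[None, :, :]      # re-weight every pixel

def channel_attention(feat, w_channel):
    # Module B (illustrative): one weight per feature-map channel,
    # indirectly emphasizing informative temporal-spatial content.
    # feat: (C, H, W), w_channel: (C,)
    gate = sigmoid(w_channel)           # per-channel weights in (0, 1)
    return feat * gate[:, None, None]   # re-weight every channel

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4))      # toy feature map
ws = rng.standard_normal((4, 4))        # stand-in for learned spatial weights
wc = rng.standard_normal(8)             # stand-in for learned channel weights
y = channel_attention(spatial_attention(x, ws), wc)
print(y.shape)  # (8, 4, 4): attention preserves the feature-map shape
```

Because both modules only rescale the feature map, they compose freely and can be inserted into a ResNet backbone without changing any tensor shapes.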

Impact of Pre-training With ImageNet
In this part, we design experiments to show that pre-training the network on image datasets from the CV field is still effective in our task. Table 9 illustrates the performance of WiGRUNT with and without pre-training on the super-large-scale ImageNet dataset. The performance of the model pre-trained with ImageNet is higher than that of the model without pre-training, and it is even close to 10% higher for cross-environment gesture recognition. As shown in Table 3, in the case of cross-environment gesture recognition, the number of samples in the 2nd and 3rd environments is small, and pre-training can significantly alleviate the model's inability to learn parameters from such a small training dataset.
Pre-training is commonly used in CV thanks to the existence of many well-constructed and well-labeled image datasets. On the one hand, pre-training acts as an initial parameter setter and plays a vital role in subsequent supervised training. On the other hand, it relieves the over-fitting problem on a small dataset. ImageNet is the world's largest labeled image dataset, containing 22,000 categories and 15 million images, and it is a dominant paradigm to initialize the backbones of object detection and segmentation models [40]. It is also a major gain of our CSI visualization method, which renders CSI data as images compatible with such pre-trained backbones.

Impact of No. of Gestures and Users
To verify the performance of WiGRUNT with an increased number of users/gestures, we evaluate our system with more users and more gestures (the default number is 6). The evaluation results are shown in Table 10: the in-domain accuracy remains above 98% even when the number of gestures or users increases to 9. As the number of gestures and users grows, the performance of WiGRUNT is not significantly affected; in particular, the increase in the number of users has almost no impact on system performance. In some cases, increasing the number of users/gestures can even improve the accuracy, possibly because the newly added gestures/users are easier to recognize. We compare the effect of an increased number of gestures on system performance with WiHF [16] in Figure 9, where we can see that WiGRUNT is less affected by the increase in the number of gestures.

CONCLUSION
A gesture is defined as "a movement of part of the body, especially a hand or the head, to express an idea or meaning". While our hominid ancestors used gestures to communicate with their fellows with or without vocal expression, we in the informatics era also rely on gestures to deliver messages to the computers around us. This paper proposes WiGRUNT, a device-free and noncontact gesture recognition solution leveraging the ubiquitous WiFi infrastructure. It roots in an attention-based ResNet backbone to dynamically focus on the informative clues of a gesture spread over the spatial-temporal dimension and exploit their inherent sequential correlations for cross-domain gesture recognition. WiGRUNT has been evaluated on the open Widar3 dataset and achieves the best-ever performance in-domain or cross-domain compared to its state-of-the-art rivals.
Xiang Zhang was born in Anhui, China, in 1996. He received the B.E. degree from the Hefei University of Technology, where he is currently pursuing the Ph.D. degree. His research interests include intelligent information processing, wireless sensing, and affective computing.
Yantong Wang received the B.E. degree from Shanghai Normal University in 2016 and the master's degree from the Hefei University of Technology, where she is currently working toward the Ph.D. degree. Her research interests include affective computing and sensorless sensing.
Meng Wang received the bachelor's degree from Hefei University of Technology, where she is currently pursuing the master's degree. Her current research interests include intelligent information processing, wireless sensing and machine learning.
Zhi Liu (S'11-M'14-SM'19) received the B.E. degree from the University of Science and Technology of China, China, and the Ph.D. degree in informatics from the National Institute of Informatics. He is currently an Associate Professor at The University of Electro-Communications. His research interests include video network transmission, vehicular networks, and mobile edge computing. He is an editorial board member of Springer Wireless Networks and has been a Guest Editor of Mobile Networks & Applications, Springer Wireless Networks, and IEICE Transactions on Information and Systems. He is a senior member of IEEE.