Wearable Motion Capture: Reconstructing and Predicting 3D Human Poses From Wearable Sensors

Reconstructing and predicting 3D human walking poses in unconstrained measurement environments has the potential to support health monitoring systems for people with movement disabilities, by assessing progression after treatment and providing information for assistive device control. The latest pose estimation algorithms rely on motion capture systems that capture data from IMU sensors and third-person view cameras. However, third-person views are not always available to outpatients on their own. Thus, we propose the wearable motion capture problem of reconstructing and predicting 3D human poses from wearable IMU sensors and wearable cameras, which aids clinicians' diagnoses of patients outside the clinic. To solve this problem, we introduce a novel Attention-Oriented Recurrent Neural Network (AttRNet) that contains a sensor-wise attention-oriented recurrent encoder, a reconstruction module, and a dynamic temporal attention-oriented recurrent decoder, to reconstruct the 3D human pose over time and predict the 3D human poses at the following time steps. To evaluate our approach, we collected a new WearableMotionCapture dataset using wearable IMUs and wearable video cameras, along with musculoskeletal joint angle ground truth. The proposed AttRNet shows high accuracy on the new lower-limb WearableMotionCapture dataset, and it also outperforms the state-of-the-art methods on two public full-body pose datasets: DIP-IMU and TotalCapture.


I. INTRODUCTION
PEOPLE with movement disorders face multiple disadvantages while walking, such as increased strain on the lower back, increased metabolic cost, and gait asymmetry. Appropriately monitoring the progression of walking can mitigate these disadvantages and prevent secondary issues such as joint arthritis, risk of falls, and vascular diseases, because frequent assessment enables timely follow-up treatment. Current monitoring procedures are only available at the clinical site. Due to the absence of feasible technologies, it is extremely challenging to monitor the progress of treatment after outpatient discharge. Thus, there is a need to assess walking poses outside the clinic, which will not only significantly reduce medical expenditure by preventing unnecessary visits, but also enable patients to receive appropriate treatment without delay between regular visits.
Challenges: Most commonly, the motion capture systems [1] are used to achieve a highly accurate understanding of the human pose, but the numerous wearable markers and extra setup of motion capture cameras in the laboratory make this approach infeasible in an unconstrained daily environment.
Several works [2], [3], [4], [5], [6], [7] focused on the reconstruction of human poses from third-person view RGB or RGB-D cameras. However, these methods based on third-person views are not always possible for outpatients alone. Thus, there is a need to reconstruct and predict human poses from wearable sensors only, so discharged patients can walk freely in their daily lives.
Some works relied on numerous IMU sensors (e.g., 17 or more) to obtain an accurate human pose reconstruction [8], but wearing many sensors is uncomfortable and impractical in daily living. Recently, several works [9], [10], [11], [12] used a reduced set of IMU sensors for human pose reconstruction. However, motion capture from sparse inertial sensors is inherently ambiguous and challenging.
Research question: The above challenges lead to a research question: When discharged patients walk in their daily lives, how can we design a feasible and effective approach to accurately sense their poses with a small set of wearable sensors, so that clinicians can assess their patients' walking functions outside clinics and researchers can design intelligent prosthetic devices to assist outpatients with real-time optimal control?
Our contributions: As shown in Fig. 1, we propose to handle the wearable motion capture problem with two tasks: (1) reconstructing the 3D human pose outdoors over time for clinical diagnosis; and (2) predicting the 3D human poses at the following time steps for real-time assistive device control.
Currently, there is no public dataset that contains both wearable IMU and wearable camera data, along with ground truth of 3D walking poses. To develop the wearable motion capture algorithm, we collected a dataset under varying walking conditions (e.g., on the treadmill, on the ground, slope, stairs) from 10 subjects with different walking speeds. Though our collaborative research projects aim at people with lower limb amputations, our wearable motion capture can be applied to reconstruct and predict upper limb and full-body poses too. Hence, we also compare the performance of our wearable motion capture method on two related full-body pose datasets: DIP-IMU [13] (contains IMU-only data for 10 subjects) and TotalCapture [4] (contains IMU data and videos from third-person view cameras on 5 subjects).
Our main contributions are fourfold:
• We propose to handle the wearable motion capture problem of reconstructing and predicting 3D human poses from wearable IMU sensors and wearable cameras, which aids clinicians' diagnoses of people with movement disabilities. Prior works that require vision data from third-person views for pose reconstruction are not always possible for outpatients alone; thus, we propose to reconstruct and predict walking poses via on-body camera and IMU sensors.
• We propose a novel Attention-Oriented Recurrent Neural Network (AttRNet) that contains a sensor-wise attention-oriented recurrent encoder, a reconstruction module, and a dynamic temporal attention-oriented recurrent decoder, to reconstruct the 3D human pose over time and predict the 3D human poses at the following few time steps.
• We introduce a new dataset containing data from wearable IMU and wearable camera sensors with 3D human pose ground truth. To the best of our knowledge, no prior work has addressed 3D human pose reconstruction or prediction from the fusion of both wearable IMU and wearable camera sensors. This dataset will be made available online (footnote 1).
• Our approach is able to generalize to both multi-modal and single-modal sensor input, and it can be applied to both lower limb pose and full-body pose analysis. Our proposed approach outperforms the state-of-the-art methods on two full-body pose datasets [4], [13].

IMU-based human motion capture:
Wearable IMUs (e.g., Xsens [15]) show remarkable stability and accuracy in capturing human motion [16], [17], [18], [19], [20]. Previously, Roetenberg et al. [8] introduced a motion tracking algorithm using IMU sensors. Recently, Huang et al. [13] proposed a Recurrent Neural Network (RNN) based algorithm to reconstruct human poses from sparse inertial measurements, and also introduced an IMU-based human motion capture dataset. More recently, Nagaraj et al. [19] introduced an RNN-ensemble approach for human pose estimation from IMU sensors. Most of these IMU-sensor-based algorithms employ straightforward recurrent neural networks to reconstruct human poses. However, since different sensors on the human body have different capabilities to capture different joint movements at a specific time, and the feature at a past time step is highly related to a certain future time step, straightforward recurrent neural networks might not be sufficient to compute discriminative features and predict future poses from the observed input sequences. In contrast, we introduce a novel Attention-Oriented Recurrent Neural Network (AttRNet), which contains a sensor-wise attention-oriented recurrent encoder, a reconstruction module, and a dynamic temporal attention-oriented recurrent decoder, to reconstruct and predict 3D human poses by encoding highly discriminative features from the observed input sequences.
Despite the success of 2D/3D pose estimation, 3D pose prediction is still under-explored. Previously, some traditional methods such as the hidden Markov model [38] and the Gaussian model [39] were developed. Recently, some deep-learning-based algorithms introduced recurrent networks [40], [41], [42], [43], [44] and feed-forward networks [45], [46], [47] to predict future 3D poses. However, most of these vision-based pose analysis tasks rely on video data from third-person view cameras, which are not always possible for outpatients alone. Thus, there is a need to reconstruct and predict human poses from wearable sensors.
Most works with egocentric vision focused on objects and activities in front of the cameras, such as the detection of objects [48], [49], gaze [50], and visible hands and arms [51]. In contrast, we are interested in the movement information of the wearable cameras for reconstructing and predicting the 3D walking poses of the camera-carrying subject.
Hybrid approaches for human motion capture: The hybrid approach mainly fuses the IMU and the vision modalities to learn richer features for human motion capture. Previously, Malleson et al. [3] proposed a real-time optimization approach to fuse multi-view data and IMU data to perform real-time motion capture. Recently, Trumble et al. [4] introduced an algorithm for fusing multi-view videos with IMU sensor data to estimate 3D human poses. Marcard et al. [5] proposed a graph-based optimization approach that jointly optimizes vision and IMU data on an SMPL model. More recently, DeepFuse [2] introduced an IMU-aware network for real-time 3D human pose estimation from multi-view images. Most of these hybrid approaches require vision data from third-person view cameras. However, third-person views for pose reconstruction are not always possible for patients alone outdoors. In contrast, we introduce our AttRNet to reconstruct the 3D pose over time and predict the 3D poses at the following time steps from both wearable IMUs and wearable cameras.
Note that different from the existing motion capture problem, in our wearable motion capture, both the IMU and camera sensors are worn on the human body, as shown in Fig. 1 and summarized in Table I.

A. Problem Statement
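Suppose that the observed IMU data are $I_{[T-\Delta_1+1,\,T]} \in \mathbb{R}^{\Delta_1 \times M \times D_{IMU}}$ and the observed video data are $V_{[T-\Delta_1+1,\,T]} \in \mathbb{R}^{\Delta_1 \times N \times D_{Video}}$, where $M$ and $N$ are the number of IMU and camera sensors, respectively, $\Delta_1$ represents the temporal interval in which we can go back to the past from the current time step $T$, and $D_{IMU}$ and $D_{Video}$ are the feature dimensions extracted from each IMU and camera sensor at each time step, respectively. The goal of our proposed approach is to reconstruct the 3D human pose over time and predict the 3D human poses at the following time steps given the observed IMU and video data. Mathematically, we aim to obtain the following reconstruction and prediction functions:
$$\hat{P}_{[T-2,\,T]} = f_{Reco}\left(I_{[T-\Delta_1+1,\,T]},\, V_{[T-\Delta_1+1,\,T]}\right),$$
$$\hat{P}_{[T+1,\,T+\Delta_2]} = f_{Pred}\left(I_{[T-\Delta_1+1,\,T]},\, V_{[T-\Delta_1+1,\,T]}\right),$$
where $\hat{P}_{[T-2,\,T]} \in \mathbb{R}^{3 \times J \times D_{Joints}}$ denotes the reconstructed pose at the current time $T$ ($\hat{P}_T$) and the reconstructed poses of the past two time steps ($\hat{P}_{T-2}$ and $\hat{P}_{T-1}$) for pose dynamics calculation, $\hat{P}_{[T+1,\,T+\Delta_2]} \in \mathbb{R}^{\Delta_2 \times J \times D_{Joints}}$ are the predicted future poses, $J$ is the number of joints in the pose model, $D_{Joints}$ is the coordinate dimension of each joint, and $\Delta_2$ is the temporal interval in which we aim to predict the future poses.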

B. Method Overview
We propose a novel Attention-Oriented Recurrent Neural Network (AttRNet) to jointly reconstruct the 3D pose over time and predict the 3D poses at the following time steps in an online setting from both wearable IMU sensors and wearable cameras, as shown in Fig. 2. In our AttRNet, we introduce an attention-oriented recurrent encoder-decoder and a reconstruction module. Our attention-oriented recurrent encoder performs sensor-wise attention at each time step and embeds features over different time steps of the observed input sequences. Our attention-oriented recurrent decoder, in turn, outputs a series of future poses by dynamically computing the relevant information from the encoded observed features using a dynamic temporal attention module. We also design a reconstruction module that reconstructs the current pose to support the initial pose prediction of the decoder. In the following sections, we discuss the IMU and video features, our proposed attention-oriented recurrent encoder, the reconstruction module, and the attention-oriented recurrent decoder in detail.

C. IMU and Video Features
IMU features: An IMU returns 3-channel acceleration and 3-channel angular velocity from the accelerometer and the gyroscope, respectively. We calculate the orientation as a 3 × 3 rotation matrix from the raw IMU data, which is then flattened into a 9-element vector.
Video features: For the videos from the wearable cameras on the legs, we compute histogram of optical flow features, where the optical flow vectors are quantized into orientation bins and the magnitude of a bin is aggregated from the magnitudes of the flow vectors inside that bin. Formally, at time $t$, the histogram of optical flow feature of the $n$-th camera sensor is $V^n_t \in \mathbb{R}^{1 \times D_{Video}}$, where $D_{Video}$ is the number of bins.
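The histogram of optical flow feature described above can be illustrated with a minimal sketch. It assumes the flow field is already available as a list of (dx, dy) vectors, that bins evenly partition the full circle, and that aggregation means summing magnitudes; the function name and the bin count are hypothetical and not taken from the paper.

```python
import math


def histogram_of_optical_flow(flow_vectors, num_bins=8):
    """Quantize (dx, dy) flow vectors into orientation bins.

    Each bin accumulates the magnitudes of the flow vectors whose direction
    falls inside it (sum aggregation is an assumption of this sketch).
    """
    histogram = [0.0] * num_bins
    bin_width = 2.0 * math.pi / num_bins
    for dx, dy in flow_vectors:
        magnitude = math.hypot(dx, dy)
        if magnitude == 0.0:
            continue  # a zero vector has no direction
        angle = math.atan2(dy, dx) % (2.0 * math.pi)  # map the angle to [0, 2*pi]
        index = min(int(angle / bin_width), num_bins - 1)  # guard the upper edge
        histogram[index] += magnitude
    return histogram


# Hypothetical flow field with five vectors, quantized into four bins.
example_flow = [(3.0, 4.0), (-3.0, 4.0), (-6.0, -8.0), (4.0, -3.0), (3.0, 4.0)]
example_feature = histogram_of_optical_flow(example_flow, num_bins=4)
print(example_feature)
```

The returned list plays the role of one feature vector $V^n_t$, with its length equal to the number of bins.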

D. Attention-Oriented Recurrent Encoder
Our attention-oriented recurrent encoder contains sensor-wise attention modules to compute sensor-wise attention scores at each time step and recurrent encoders to embed attention-weighted features over different time steps of the observed input sequences.
Sensor-wise attention module: Since different sensors on the human body (e.g., Fig. 1(a)) have different capabilities to capture different joint movements at a specific time, we compute sensor-wise attention for both IMU and video features, i.e., we introduce Sensor-wise Attention Modules (SAM) to learn attention scores for different sensors and update the sensor-wise features with those attention scores. Formally, for a specific time $t$, the SAM takes the IMU features from the $M$ IMUs, $I_t = [I^1_t; \ldots; I^M_t] \in \mathbb{R}^{M \times D_{IMU}}$, and computes the attention score vector $u^I_t \in [0, 1]^{M \times 1}$. The SAM consists of two fully-connected layers with a ReLU layer between them. The second fully-connected layer outputs the attention score vector, which is then passed through a sigmoid function that constrains the attention scores to lie between 0 and 1. The attention-weighted IMU feature at time $t$, $I^a_t$, is:
$$I^a_t = I_t \otimes u^I_t,$$
where $\otimes$ represents element-wise multiplication, i.e., each column of the matrix $I_t$ is multiplied element-wise with the column vector $u^I_t$. Similarly, for the video features from the $N$ wearable cameras at time $t$, we compute the attention-weighted video features $V^a_t$:
$$V^a_t = V_t \otimes u^V_t,$$
where $V_t$ is the video feature from the $N$ camera sensors and $u^V_t$ is the corresponding attention score vector.
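The sensor-wise attention step can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the M × D feature matrix is flattened before the first fully-connected layer, the second layer emits one logit per sensor, and the helper names, layer sizes, and random weights are hypothetical rather than the authors' implementation.

```python
import math
import random


def linear(x, weights, bias):
    """Fully-connected layer y = W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sensor_wise_attention(features, w1, b1, w2, b2):
    """Return (scores, weighted) for an M x D feature matrix of one time step."""
    flat = [v for row in features for v in row]             # flatten M x D (assumption)
    hidden = [max(0.0, h) for h in linear(flat, w1, b1)]    # first FC layer + ReLU
    logits = linear(hidden, w2, b2)                         # second FC layer, one logit per sensor
    scores = [sigmoid(z) for z in logits]                   # squash scores into [0, 1]
    weighted = [[s * v for v in row] for s, row in zip(scores, features)]
    return scores, weighted


# Hypothetical setup: M = 3 sensors, D = 2 features each, hidden width 4.
random.seed(0)
M, D, HIDDEN = 3, 2, 4
w1 = [[random.uniform(-0.5, 0.5) for _ in range(M * D)] for _ in range(HIDDEN)]
b1 = [0.0] * HIDDEN
w2 = [[random.uniform(-0.5, 0.5) for _ in range(HIDDEN)] for _ in range(M)]
b2 = [0.0] * M
imu_features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
scores, weighted = sensor_wise_attention(imu_features, w1, b1, w2, b2)
print(scores)
```

Each row of `weighted` is the corresponding sensor's feature row scaled by that sensor's score, which matches the row-wise scaling expressed by the element-wise product in the text.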
Recurrent encoders: The recurrent encoders separately encode the attention-weighted IMU and video features of different time steps and then fuse them into a single vector, which will be used to reconstruct the pose over time and decode the poses at the following time steps. Given the attention-weighted IMU features, the recurrent encoder passes these features through a Bidirectional Gated Recurrent Unit (B-GRU). The video features are encoded in the same way, and the encoded IMU and video features are fused into a single vector $h^e_T$, which will be fed into the reconstruction module and the decoder.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.

E. Reconstruction Module
We design a reconstruction module to reconstruct the current pose and the poses of the past two time steps, which are used to compute the pose dynamics. Formally, the reconstruction module takes the encoded feature vector $h^e_T$ and predicts $\hat{P}_{[T-2,\,T]} \in \mathbb{R}^{3 \times J \times D_{Joints}}$. The reconstruction module consists of two fully-connected layers with a ReLU layer between them. The output of the final fully-connected layer is reshaped to $\hat{P}_{[T-2,\,T]}$, which represents the poses of the last three time steps of the observed sequences. Since this module reconstructs the current pose and the poses of the past two time steps, we call it the reconstruction module.
Note, although both the sensor-wise attention module and the reconstruction module consist of two fully-connected layers and a ReLU layer between them, the detailed designs of these two modules are different since they do not share parameters and their goals are different.

F. Attention-Oriented Recurrent Decoder
Our attention-oriented recurrent decoder outputs a series of future poses by dynamically computing the relevant information from the encoded features of different time steps using the dynamic temporal attention module.
Recurrent decoder: The recurrent decoder aims to predict the future 3D poses $\hat{P}_{[T+1,\,T+\Delta_2]}$. More specifically, from time step $T+1$ to $T+\Delta_2$, our recurrent decoder predicts one future pose per time step, i.e., at time step $T+1$ it predicts the pose $\hat{P}_{T+1}$; at time step $T+2$ it predicts the pose $\hat{P}_{T+2}$; and so on. The recurrent decoder consists of a GRU, and the hidden state at each step is updated using the GRU update rules:
$$h^d_\tau = \mathrm{GRU}\left(h^d_{\tau-1},\, z_{\tau-1}\right), \quad (5)$$
where the input $z_{\tau-1}$ is computed from the outputs of the dynamic temporal attention modules, the previously predicted pose, and the pose dynamics of the previously predicted poses (explained shortly). The hidden states $h^d_\tau$ and $h^d_{\tau-1}$ are the current and previous hidden states of the decoder, respectively. During the first step of the decoder, the output of the recurrent encoder $h^e_T$ is used as the previous hidden state. Given the hidden state $h^d_\tau$, the pose $\hat{P}_\tau$ is predicted using a fully-connected layer, as follows:
$$\hat{P}_\tau = w^P_\tau h^d_\tau,$$
where $w^P_\tau$ is a trainable parameter. In the following, we discuss in detail how we compute $z_{\tau-1}$ from the D-TAM and the pose dynamics.
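For reference, one common form of the GRU update rules is written out below with the decoder input $z_{\tau-1}$ and the hidden state $h^d_{\tau-1}$. This follows the gate convention documented for PyTorch's `torch.nn.GRU`, which is an assumption here because the text does not spell out the gates. The gate symbols $r_\tau$, $u_\tau$, $n_\tau$ and the weight and bias names are illustrative; the update gate is called $u_\tau$ only to avoid a clash with the input $z_{\tau-1}$.

```latex
\begin{aligned}
r_\tau   &= \sigma\left(W_r z_{\tau-1} + b_r + U_r h^d_{\tau-1} + c_r\right) && \text{(reset gate)} \\
u_\tau   &= \sigma\left(W_u z_{\tau-1} + b_u + U_u h^d_{\tau-1} + c_u\right) && \text{(update gate)} \\
n_\tau   &= \tanh\left(W_n z_{\tau-1} + b_n + r_\tau \odot \left(U_n h^d_{\tau-1} + c_n\right)\right) && \text{(candidate state)} \\
h^d_\tau &= \left(1 - u_\tau\right) \odot n_\tau + u_\tau \odot h^d_{\tau-1} && \text{(new hidden state)}
\end{aligned}
```

Here $\sigma$ is the logistic sigmoid and $\odot$ is the element-wise product.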

Dynamic temporal attention module (D-TAM):
The recurrent encoder encodes the observed input sequences and outputs a global representation as a single vector (e.g., $h^{IMU}_T$ as the global feature representation of all observed IMU data). However, the global representation of the encoder might not be sufficient to predict future poses, due to the variations of human poses. Since humans walk with repetitive patterns, as shown in Fig. 3, the encoded feature at a past time step is highly related to a certain future time step. Thus, we configure the recurrent decoder with a Dynamic Temporal Attention Module (D-TAM) that dynamically computes the relevant information from the encoded features to predict future poses at different time steps. We employ D-TAM for both encoded IMU and video features, as described below.
Given the hidden state of the decoder $h^d_{\tau-1}$ at time $\tau-1$ and the encoded IMU features of all the observed time steps, we first generate the query ($q^{IMU}_{\tau-1}$), the keys ($K^{IMU}_{\tau-1}$) and the values ($X^{IMU}_{\tau-1}$) for the dynamic temporal attention on the encoded IMU features using trainable parameters, where $f$ is the feature dimension. Then, we compute the output as a weighted sum of the values, where $O^{IMU}_{\tau-1} \in \mathbb{R}^{1 \times f}$ is the output of the D-TAM computed from the encoded IMU features and the current hidden state of the decoder.
TABLE II EXISTING 3D HUMAN POSE DATASET VS OUR WEARABLEMOTIONCAPTURE DATASET
Similarly, we apply D-TAM on the video features and compute the output as a weighted sum of the values, where $O^{Video}_{\tau-1}$ is the output of the D-TAM computed from the encoded video features and the current hidden state of the decoder.
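The query/key/value attention over encoded time steps can be sketched as below. The exact projection and scoring formulas are not recoverable from the text, so this sketch assumes standard scaled dot-product attention: linear projections, scores divided by the square root of the feature dimension, and a softmax over the observed time steps. All function and parameter names are hypothetical.

```python
import math


def project(vector, matrix):
    """Row vector times matrix: (1 x d) @ (d x f) -> (1 x f)."""
    return [sum(v * row[j] for v, row in zip(vector, matrix))
            for j in range(len(matrix[0]))]


def softmax(logits):
    """Softmax with max-subtraction for numerical stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_temporal_attention(decoder_state, encoded, w_q, w_k, w_v):
    """Attend over the encoded time steps given the decoder hidden state.

    decoder_state: previous decoder hidden state (length d_dec)
    encoded: list of encoded feature vectors, one per observed time step
    w_q, w_k, w_v: projection matrices (assumed form of the trainable parameters)
    """
    query = project(decoder_state, w_q)                # 1 x f
    keys = [project(h, w_k) for h in encoded]          # steps x f
    values = [project(h, w_v) for h in encoded]        # steps x f
    f = len(query)
    logits = [sum(q * k for q, k in zip(query, key)) / math.sqrt(f) for key in keys]
    weights = softmax(logits)                          # one weight per time step
    output = [sum(w * value[i] for w, value in zip(weights, values)) for i in range(f)]
    return output, weights


# Hypothetical example: two observed time steps, identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
encoded_steps = [[1.0, 0.0], [0.0, 1.0]]
out, attn = dynamic_temporal_attention([5.0, 0.0], encoded_steps, identity, identity, identity)
print(attn)
```

In this toy setup the decoder state is aligned with the first encoded step, so that step receives the larger attention weight; with a zero decoder state the weights become uniform.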
Pose dynamics: Since first-order and second-order pose motions such as velocity and acceleration carry important motion dynamics, we use them in addition to the pose at time $\tau-1$ to predict the pose at time $\tau$. For the pose dynamics at time $\tau-1$, we compute the velocity as $V_{\tau-1} = \hat{P}_{\tau-1} - \hat{P}_{\tau-2}$ and the acceleration as $A_{\tau-1} = \hat{P}_{\tau-1} - 2\hat{P}_{\tau-2} + \hat{P}_{\tau-3}$. Finally, we concatenate the output of each D-TAM, the pose, the velocity, and the acceleration to generate the input for the decoder (i.e., $z_{\tau-1}$ in (5)), as follows:
$$z_{\tau-1} = \left[O^{IMU}_{\tau-1};\; O^{Video}_{\tau-1};\; \hat{P}_{\tau-1};\; V_{\tau-1};\; A_{\tau-1}\right].$$
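A small sketch of the pose dynamics and of the decoder input assembly follows. It assumes backward finite differences, with the velocity taken as the latest pose minus the one before it (a sign convention chosen so that the acceleration equals the change in velocity), and flat lists of joint values; the function names and sample numbers are hypothetical.

```python
def pose_dynamics(pose_tm3, pose_tm2, pose_tm1):
    """First- and second-order backward differences of flattened poses.

    pose_tm1 is the most recent pose (time tau-1), pose_tm3 the oldest (tau-3).
    """
    velocity = [a - b for a, b in zip(pose_tm1, pose_tm2)]
    acceleration = [a - 2.0 * b + c for a, b, c in zip(pose_tm1, pose_tm2, pose_tm3)]
    return velocity, acceleration


def decoder_input(o_imu, o_video, pose, velocity, acceleration):
    """Concatenate D-TAM outputs, pose, velocity and acceleration into one vector."""
    return list(o_imu) + list(o_video) + list(pose) + list(velocity) + list(acceleration)


# Hypothetical two-joint poses at three consecutive time steps (degrees).
p3, p2, p1 = [0.0, 0.0], [1.0, 2.0], [3.0, 5.0]
vel, acc = pose_dynamics(p3, p2, p1)
z = decoder_input([0.1], [0.2], p1, vel, acc)
print(vel, acc, z)
```

With these sample poses the previous velocity is [1, 2] and the current velocity is [2, 3], so the acceleration is their difference, which is the consistency the second-order formula encodes.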

G. Loss
The proposed AttRNet is trained with its hidden state output at each step supervised. The loss function in the proposed AttRNet is composed of two terms:
$$L = L_{Reco} + L_{Pred},$$
where $L_{Reco}$ and $L_{Pred}$ are the reconstruction and prediction losses, respectively.

A. Dataset
We conduct experiments on the wearable IMU+camera dataset on lower limbs that we collected, WearableMotionCapture, and on two public full-body pose datasets: DIP-IMU [13] and TotalCapture [4]. The datasets are summarized in Table II.
WearableMotionCapture: We recorded data from 10 subjects (6 male, 4 female; age: 23.9 ± 2.91 years; height: 1.65 ± 0.06 m). All participants provided written informed consent before participating in the experiment. The Institutional Review Board (IRB) of the University of Central Florida (UCF) approved the study protocol (IRB ID: STUDY00002011). The IMU data were recorded at a sampling frequency of 148 Hz, while the video data were collected at 30 fps. We down-sample the IMU data to match the frame rate of the video data. Each subject was instructed to walk in several scenarios (Fig. 4). Each participant walked on both the treadmill and the ground at four different speeds (slow, normal, fast, and very fast). For each participant, we also recorded two trials of walking on stairs and two trials of walking on a slope. Furthermore, we collected two trials for each participant in which the participant walked on a round path and on a random path while avoiding two obstacles placed on the walking path. Overall, we collected 14 walking scenarios for each subject. Thirty-two reflective markers were placed on the participant based on a modified Helen-Hayes marker set [52] for ground-truth collection. Three-dimensional marker trajectories were captured by motion capture cameras. We obtain the ground truth of joint angles by applying OpenSim [53], an open source musculoskeletal analysis tool, to the marker tracking data from the motion capture cameras. We use the musculoskeletal model to represent the 3D human walking pose (Fig. 1(c)). The musculoskeletal model is a skeleton model consisting of bones that are connected by joints. In total, we collected 140 pose sequences from 10 subjects, amounting to 327 minutes of IMU data along with 588 K video frames.
DIP-IMU [13]: The DIP-IMU dataset consists of 10 subjects (9 male, 1 female), each performing motions in five different categories, including controlled motion of the extremities (arms, legs), locomotion, natural full-body activities (e.g., jumping jacks, boxing), and interaction tasks with everyday objects. These motions are recorded from 17 IMU sensors. We follow the train-test splits provided by the dataset to evaluate our method.
TotalCapture [4]: The TotalCapture dataset consists of 5 subjects (4 male and 1 female), each performing several activities such as walking, acting, range of motion and freestyle motions, which are recorded using 13 wearable IMU sensors and 8 third-person view RGB cameras. Since we aim to reconstruct and predict human poses from wearable sensors, we only use the wearable IMU sensor data for comparisons in our experiment. We follow the train-test splits provided by the dataset to evaluate our method.
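The down-sampling of the 148 Hz IMU stream to the 30 fps video rate mentioned for WearableMotionCapture could be done in several ways; the text does not say which. The sketch below assumes simple nearest-sample selection without filtering, and the function name is hypothetical. It also checks that 327 minutes at 30 fps is consistent with the reported 588 K video frames.

```python
def downsample_indices(num_samples, source_hz=148.0, target_hz=30.0):
    """For each target frame, pick the index of the nearest source sample.

    Nearest-sample selection is an assumption of this sketch; the paper does
    not specify the down-sampling method.
    """
    duration_s = num_samples / source_hz
    num_frames = int(duration_s * target_hz)
    return [min(num_samples - 1, round(k * source_hz / target_hz))
            for k in range(num_frames)]


# One second of IMU data (148 samples) maps to 30 video-rate samples.
one_second = downsample_indices(148)
print(len(one_second), one_second[:5])

# Sanity check on the dataset size: 327 minutes of video at 30 fps.
total_frames = 327 * 60 * 30
print(total_frames)
```

The frame count comes out at 588,600, in line with the roughly 588 K frames stated for the dataset.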

B. Implementation Details
Our recurrent encoders are constructed using the Bidirectional GRU (B-GRU). The hidden state dimension of each B-GRU is set to 256. The decoder is configured with Dynamic Temporal Attention Modules (D-TAM) and GRUs with a hidden state of dimension 512. The channel numbers between the two fully-connected layers are set to 64 and 1024 for the sensor-wise attention and reconstruction modules, respectively. We use PyTorch to implement our proposed pose reconstruction and prediction model, and it takes about 60 minutes to train our network on the WearableMotionCapture dataset on a single Tesla V100 GPU.

C. Evaluation on WearableMotionCapture
Experimental setup: We consider two evaluation mechanisms:
• Half and half evaluation: We first randomly shuffle the samples of each subject, and then use one half of the dataset to train the model and keep the other half for testing.
• Leave-one-out evaluation: We use the samples from 9 out of 10 subjects for training, and the samples of the remaining subject are reserved for testing. We repeat this process 10 times for 10 different testing subjects and report the average.
Evaluation metric: We employ the Mean Absolute Error (MAE) as the evaluation metric for the pose reconstruction at the current time $T$:
$$E_{reco} = \frac{1}{L_{test} \times J} \sum_{l=1}^{L_{test}} \sum_{j=1}^{J} \left| \hat{P}^j_{T,l} - P^j_{T,l} \right|,$$
where $E_{reco}$ is the MAE of the reconstructed pose, $\hat{P}^j_{T,l}$ is the predicted angle of joint $j$ at time $T$ from the $l$-th testing sample and $P^j_{T,l}$ is the ground truth. $L_{test}$ is the number of testing samples generated from all pose sequences of the testing data by sliding temporal windows with stride 1. For example, for half and half evaluation (i.e., the test set has 294 K time steps from 70 pose sequences), we generate around $L_{test} = 294\,\text{K} - 70 \times (\Delta_1 + \Delta_2)$ testing samples (i.e., each sample is defined as observing $\Delta_1$ time steps to predict the following $\Delta_2$ time steps). Similarly, we employ the MAE as the evaluation metric for the future pose prediction of $\Delta_2$ time steps:
$$E_{pred} = \frac{1}{L_{test} \times \Delta_2 \times J} \sum_{l=1}^{L_{test}} \sum_{t=T+1}^{T+\Delta_2} \sum_{j=1}^{J} \left| \hat{P}^j_{t,l} - P^j_{t,l} \right|.$$
Reconstruction and prediction performances: As summarized in the first row of Table III, for the pose reconstruction, our AttRNet achieves an MAE of 4.65° for half and half evaluation, and 7.49° for leave-one-out evaluation. For the future pose prediction, our AttRNet achieves an MAE of 4.73° and 7.58° for half and half evaluation and leave-one-out evaluation, respectively, slightly higher than the reconstruction error.
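The sliding-window sample count and the MAE metric described above can be sketched as follows. The sketch assumes each sample spans $\Delta_1 + \Delta_2$ consecutive time steps and that poses are flat lists of joint angles in degrees; the function names and the toy numbers are hypothetical. The exact count per sequence is its length minus the window length plus one, which agrees with the approximate formula in the text up to one sample per sequence.

```python
def count_windows(sequence_lengths, delta1, delta2, stride=1):
    """Number of sliding-window samples over a set of pose sequences.

    Each sample observes delta1 time steps and predicts the next delta2.
    Sequences shorter than one full window contribute no samples.
    """
    window = delta1 + delta2
    return sum((n - window) // stride + 1 for n in sequence_lengths if n >= window)


def mean_absolute_error(predicted, ground_truth):
    """MAE over all samples and joints (angles in degrees)."""
    total, count = 0.0, 0
    for pred_pose, true_pose in zip(predicted, ground_truth):
        for p, g in zip(pred_pose, true_pose):
            total += abs(p - g)
            count += 1
    return total / count


# Hypothetical toy data: two sequences, and two samples with two joints each.
num_samples = count_windows([100, 50], delta1=20, delta2=5)
error = mean_absolute_error([[1.0, 2.0], [3.0, 4.0]], [[2.0, 2.0], [1.0, 5.0]])
print(num_samples, error)
```

For the prediction metric, the same function applies if each sample's pose list is extended to hold the joints of all $\Delta_2$ future time steps.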
From Table III, we observe that if a testing subject provides some calibration data, the pose sensing algorithm performs better, as shown by the half and half experiment, where half of the data from all subjects is used for training and the other half for testing. On the other hand, if a testing subject exhibits new pose patterns beyond the current training dataset, the pose sensing algorithm has a slightly larger error, as shown by the leave-one-out experiment, where a testing subject does not provide any calibration data and the training is performed on other subjects. Since humans vary in their walking patterns, developing a one-size-fits-all algorithm to sense human poses could be challenging in real-world applications. To remedy this, in the future we plan to investigate customized sensing algorithms for individuals, including transfer-learning algorithms that adapt a pose sensing algorithm trained on our dataset to an individual using a small calibration set captured from that individual in the lab.
Effect of our method on subjects with different body shapes: Although the kinematics of people tend to differ a lot, they have no direct relation to body shape. For example, two people with the same height and weight do not necessarily walk with the same strategy. The main reason that the method performs differently across subjects is their different walking patterns. As a result, our model performs differently in the leave-one-out experiment compared to the half and half experiment. Because the test subject's data are absent during training, the model does not have any prior knowledge of the walking pattern of that specific test subject. This causes performance degradation compared to the half and half evaluation.
Ablation studies on different modules and modalities: To systematically evaluate our method and study the contribution of each algorithm component, we perform a number of ablation experiments: (i) our AttRNet without the Sensor-wise Attention Module (SAM); (ii) our AttRNet without SAM or the Dynamic Temporal Attention Module (D-TAM); (iii) our AttRNet without SAM, D-TAM or pose dynamics; (iv) and (v) our AttRNet with SAM, D-TAM and pose dynamics on IMU-only and video-only input, respectively. As shown in Table III, each algorithm component contributes to the performance of our AttRNet for both half and half and leave-one-out evaluations.
Ablation studies on different walking scenarios: The ablation studies on different walking scenarios are shown in Table IV. Since the walking patterns on the treadmill are repetitive, pose reconstruction and prediction for walking on the treadmill are relatively easier and the performances are better compared to other scenarios. The walking motions change continuously when the subjects try to avoid obstacles on the ground, making the pose reconstruction and prediction more difficult. For example, if the obstacle is close, people reduce their step length and the turn in the walking path becomes more acute. On the other hand, if the obstacle is far away and there is enough time to deviate around it, the path becomes wider and the step lengths may stay similar before the subject starts to dodge the obstacle. However, even though the walking patterns change continuously in most of the walking scenarios, our AttRNet still reconstructs and predicts the poses well. Over all scenarios in Table IV, our maximal relative reconstruction and prediction errors are 9.16/360 = 2.54% and 9.28/360 = 2.58%, respectively.
Per joint evaluation: The pose reconstruction and prediction errors for different joints are shown in Table V. The reconstruction and prediction performances for the joint 'Pelvis list' are the best compared to other joints, while we get the lowest performances for the joint 'Pelvis rotation'. In our dataset, the subjects continuously rotate or change their walking direction. Therefore, reconstructing and predicting the joint angles related to rotation (e.g., 'Pelvis rotation', 'Hip rotation left', 'Hip rotation right' and 'Lumbar rotation') becomes more difficult.
Parameter analysis: We perform experiments on different temporal intervals ($\Delta_1$ and $\Delta_2$) of the observed and future time steps. On the left side of Table VI, we show the prediction performance of our method for different temporal intervals $\Delta_1$ of the observed sequences. The prediction error decreases as the number of observed time steps increases, and it saturates at $\Delta_1 = 50$ (we believe the reason is that this interval covers a few full cycles of human walking gait in the recent past, which are sufficient for the prediction at the following time steps). On the other hand, the right side of Table VI shows the performance for different temporal intervals $\Delta_2$ in the future, where the prediction error increases as the poses are predicted further into the future.

Comparison with baseline models:
We compare our AttRNet with some baseline models, as shown in Table VII. Since our AttRNet is based on recurrent networks to reconstruct the pose over time and predict the poses at the following time steps, we compare it with recurrent network-based baseline models such as the Recurrent Neural Network (RNN), Bidirectional RNN (B-RNN), Long Short-Term Memory (LSTM), Bidirectional LSTM (B-LSTM), Gated Recurrent Unit (GRU), and Bidirectional GRU (B-GRU). Our AttRNet achieves superior performance compared to these recurrent network-based baseline models. The performance improvement of our AttRNet over the baseline models validates that our sensor-wise attention-oriented recurrent encoder can effectively encode highly discriminative features from the observed input sequences, and that our dynamic temporal attention-oriented recurrent decoder can dynamically compute the most relevant information from the encoded observed features.

DIP-IMU evaluation:
The DIP-IMU dataset is designed to predict 3D full-body human poses from wearable IMU sensors. Following the literature [13], [54], we adopt two different modes: (1) the offline mode, where the full sequence is available; and (2) the online mode, where our AttRNet observes the past 20 time steps, and predicts 1 current time step and 5 future time steps in a sliding window manner. Table VIII shows the comparison of our AttRNet with other methods on the DIP-IMU dataset [13] for 3D human pose prediction, where the results are compared for both the offline and online modes. Over all scenarios, our method achieves superior performance and establishes new state-of-the-art results on the DIP-IMU dataset for 3D human pose prediction.
TotalCapture evaluation: The TotalCapture dataset is designed to reconstruct full-body poses from wearable IMUs and multiple third-person view video cameras. Table IX shows that AttRNet outperforms the current best results on the IMU data of TotalCapture [4]. Note that, for a fair comparison, we compare only on wearable sensors, and TotalCapture evaluates only positional errors. We follow the train-test splits provided by each dataset to evaluate our method.
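As an illustration of a positional error, the sketch below computes the mean Euclidean distance between predicted and ground-truth 3D joint positions. The text states only that TotalCapture evaluates positional errors; the specific metric, the function name `mean_joint_position_error`, and the frames x joints x 3 input layout are assumptions made for this example.

```python
import math
from typing import Sequence


def mean_joint_position_error(
    pred: Sequence[Sequence[Sequence[float]]],
    truth: Sequence[Sequence[Sequence[float]]],
) -> float:
    """Mean Euclidean distance over all frames and joints.

    Both inputs are assumed to be nested sequences shaped
    frames x joints x 3, in the same units (e.g., centimetres).
    """
    if len(pred) != len(truth):
        raise ValueError("pred and truth must have the same number of frames")
    total, count = 0.0, 0
    for frame_pred, frame_truth in zip(pred, truth):
        if len(frame_pred) != len(frame_truth):
            raise ValueError("frames must have the same number of joints")
        for joint_pred, joint_truth in zip(frame_pred, frame_truth):
            total += math.dist(joint_pred, joint_truth)  # per-joint distance
            count += 1
    if count == 0:
        raise ValueError("no joints to evaluate")
    return total / count
```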

E. Qualitative Analysis
We present some qualitative results on test samples of the WearableMotionCapture dataset in Fig. 5, where our AttRNet takes data from wearable IMUs and wearable cameras and reconstructs 3D walking poses. The proposed method successfully reconstructs different walking poses in different scenarios (e.g., treadmill, stairs, slope, and ground with obstacles). The related video demos can be accessed at the link given in footnote 2. The joint angle error for avoiding obstacles is the highest in Table IV, but this difference is not obvious in the visual evaluation in Fig. 5(b) at individual time steps.
Although our project focuses on the lower limb, which has potential applications for people with movement disabilities such as stroke survivors, lower-limb amputees, and children with cerebral palsy, the proposed method can also be generalized to full-body pose reconstruction using wearable sensors. We present some qualitative results on test samples of walking, acting, freestyle, and fighting pose sequences from the TotalCapture dataset in Fig. 6, where our AttRNet takes data from wearable IMU sensors and reconstructs full-body 3D pose sequences. The proposed method also performs well on full-body pose reconstruction.

V. DISCUSSIONS AND FUTURE WORKS
In our assistive walking project, we aim to create a proactive prosthetic device that can positively affect the lives of the 1.6 million people with amputation. However, the development of such a device requires an algorithm to reconstruct and predict walking poses in an uncontrolled daily-living environment. Prior works require vision data from third-person views for pose reconstruction, which are not always available for amputees alone outdoors. Thus, we propose the wearable motion capture problem of reconstructing and predicting 3D human poses from wearable IMU sensors and wearable cameras, which aids prosthetic device control and clinicians' diagnoses of amputees outside clinics. For this challenging problem, we collected a new WearableMotionCapture dataset and proposed a novel Attention-Oriented Recurrent Neural Network (AttRNet) to reconstruct the 3D human pose over time and predict the 3D human poses at the following time steps.
Although our AttRNet achieves promising performance from the fusion of IMU and video data, wearing cameras might raise privacy concerns for amputees. However, the usage of IMU sensors does not have any camera-related privacy

VI. CONCLUSION
In this article, we proposed the wearable motion capture problem of reconstructing and predicting 3D human poses from wearable cameras and IMUs. To solve this problem, we developed a novel Attention-Oriented Recurrent Neural Network (AttRNet), which contains a sensor-wise attention-oriented recurrent encoder, a reconstruction module, and a dynamic temporal attention-oriented recurrent decoder, to reconstruct the current pose and predict the future poses. Extensive experiments on the newly collected WearableMotionCapture dataset show the effectiveness of each module of our AttRNet and of the fusion of the two sensor modalities. Our AttRNet also outperforms the current best methods on two full-body pose datasets [4], [13].

Fig. 1. Illustration of our proposed approach: (a) Our experimental setup; (b) Our goal; and (c) Input and output of our proposed Attention-Oriented Recurrent Neural Network (AttRNet). Note that, in contrast to prior works that require vision data from third-person views for pose reconstruction, which are not always available for outpatients alone, we propose an on-body camera and IMU sensor solution to reconstruct and predict walking poses in a daily-living environment.

Fig. 2. Illustration of our proposed Attention-Oriented Recurrent Neural Network (AttRNet). The AttRNet contains an attention-oriented encoder-decoder and a reconstruction module. Our attention-oriented recurrent encoder contains Sensor-wise Attention Modules (SAM) to compute sensor-wise attention scores at each time step and bidirectional GRUs to embed attention-weighted features over different time steps of the observed sequences. The attention-oriented recurrent decoder is configured with Dynamic Temporal Attention Modules (D-TAM) that output a series of future poses by dynamically computing the relevant information from the encoded features. The reconstruction module reconstructs the three recent poses, which are also the initial inputs of the future pose prediction. The three recent poses are used to compute the pose velocity and acceleration.
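As an illustration of how a pose velocity and acceleration can be derived from the three most recent poses, the following minimal sketch uses backward finite differences. The finite-difference scheme, the function name `pose_velocity_acceleration`, the flat-list pose layout, and the time step `dt` are assumptions for this example; the caption does not specify the formula used in AttRNet.

```python
from typing import List, Sequence, Tuple


def pose_velocity_acceleration(
    p_prev2: Sequence[float],
    p_prev1: Sequence[float],
    p_curr: Sequence[float],
    dt: float = 1.0,
) -> Tuple[List[float], List[float]]:
    """Backward finite differences over three consecutive poses.

    Each pose is assumed to be a flat sequence of joint values
    (e.g., joint angles) sampled `dt` apart, ordered oldest to newest.
    """
    if not (len(p_prev2) == len(p_prev1) == len(p_curr)):
        raise ValueError("all poses must have the same length")
    if dt <= 0:
        raise ValueError("dt must be positive")
    # First difference: v_T = (P_T - P_{T-1}) / dt
    velocity = [(c - b) / dt for b, c in zip(p_prev1, p_curr)]
    # Second difference: a_T = (P_T - 2 P_{T-1} + P_{T-2}) / dt^2
    acceleration = [
        (c - 2.0 * b + a) / (dt * dt)
        for a, b, c in zip(p_prev2, p_prev1, p_curr)
    ]
    return velocity, acceleration
```

Three poses are the minimum needed for a second-order difference, which is consistent with the reconstruction module producing exactly three recent poses.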
where $h^{IMU}_{t}$ and $h^{IMU}_{t-1}$ are the hidden states for the IMU features at time $t$ and $t-1$, respectively. The hidden state at time $T$, $h^{IMU}_{T}$, encodes the observed IMU features. Similarly, for the attention-weighted video features, $h^{Video}_{t}$ and $h^{Video}_{t-1}$ are the hidden states for the video features at time $t$ and $t-1$, respectively. The hidden state at time $T$, $h^{Video}_{T}$, encodes the observed video features. Finally, we concatenate the last hidden states of the two modalities to get a single encoded vector, $h^{e}_{T} = [h^{IMU}_{T}, h^{Video}_{T}]$.

Fig. 3. Importance of the Dynamic Temporal Attention Module (D-TAM). The observed poses at different time steps are highly related to different future poses. The blue arrows indicate that the corresponding poses are highly related.
Joints are the ground truth poses at time step $T$ for the pose reconstruction, $P_{[T+1,\,T+\Delta_2]} = [P_{T+1}, \ldots, P_{T+\Delta_2}] \in \mathbb{R}^{\Delta_2 \times J \times D_{Joints}}$ are the ground truth poses from time step $T+1$ to $T+\Delta_2$ for the future pose prediction, and $\|\cdot\|_2$ denotes the $\ell_2$ norm.
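The definitions above can be made concrete with a small sketch that sums an $\ell_2$ reconstruction term at time step $T$ and an $\ell_2$ prediction term over the future time steps. The loss equation itself is not reproduced here, so the unweighted sum and the helper names `flatten`, `l2_error`, and `combined_loss` are assumptions for illustration rather than the training objective of AttRNet.

```python
import math
from typing import Iterable, Iterator


def flatten(nested) -> Iterator[float]:
    """Yield all numbers from arbitrarily nested lists or tuples."""
    if isinstance(nested, (list, tuple)):
        for item in nested:
            yield from flatten(item)
    else:
        yield float(nested)


def l2_norm(values: Iterable[float]) -> float:
    """Euclidean (l2) norm of a sequence of numbers."""
    return math.sqrt(sum(v * v for v in values))


def l2_error(pred, truth) -> float:
    """l2 norm of the element-wise difference between two pose arrays."""
    p, t = list(flatten(pred)), list(flatten(truth))
    if len(p) != len(t):
        raise ValueError("pred and truth must contain the same number of values")
    return l2_norm(a - b for a, b in zip(p, t))


def combined_loss(recon_pred, recon_truth, future_pred, future_truth) -> float:
    """Assumed objective: reconstruction error at T plus prediction error.

    recon_*  : poses at time step T, shaped J x D
    future_* : poses for T+1 .. T+delta2, shaped delta2 x J x D
    """
    return l2_error(recon_pred, recon_truth) + l2_error(future_pred, future_truth)
```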

Fig. 5. Visualization of our human pose reconstruction over time on test samples of the WearableMotionCapture dataset. (a) The subject walks on a treadmill; (b) The subject walks on stairs; (c) The subject walks on a slope; and (d) The subject walks on the ground and avoids obstacles. Our AttRNet reconstructs the 3D human poses well from the data sensed by wearable IMUs and wearable cameras.

Fig. 6. Visualization of our human pose reconstruction over time on test samples of the TotalCapture dataset. The four examples show full-body 'walking', 'acting', 'freestyle', and 'fighting' pose sequences, respectively. Our AttRNet reconstructs the full-body 3D human poses well from the data sensed by wearable IMUs.

TABLE III
EVALUATION OF POSE RECONSTRUCTION AND FUTURE POSE PREDICTION, AND THE ABLATION STUDY ON DIFFERENT MODULES OF THE PROPOSED APPROACH ON OUR WEARABLEMOTIONCAPTURE DATASET FOR BOTH HALF AND HALF, AND LEAVE-ONE-OUT (LOO) EVALUATION

TABLE IV
ABLATION STUDY OF POSE RECONSTRUCTION AND FUTURE POSE PREDICTION ON DIFFERENT WALKING SCENARIOS

TABLE V
PER JOINT RECONSTRUCTION AND PREDICTION PERFORMANCE EVALUATION

evaluations. Our AttRNet achieves the best performance from the fusion of IMU and video data, compared to the single-sensor approaches.

TABLE VI
PARAMETER ANALYSIS ON $\Delta_1$ AND $\Delta_2$. $(\Delta_1, \Delta_2)$ MEANS THAT THE PREDICTION MODELS ARE EVALUATED IN THE ONLINE MODE USING $\Delta_1$ OBSERVED TIME STEPS TO PREDICT $\Delta_2$ FUTURE TIME STEPS

TABLE VIII
3D HUMAN POSE PREDICTION PERFORMANCE COMPARISON WITH OTHER STATE-OF-THE-ART METHODS ON THE DIP-IMU DATASET FOR BOTH THE OFFLINE AND ONLINE MODES