Estimation of Lower Extremity Joint Moments and 3D Ground Reaction Forces Using IMU Sensors in Multiple Walking Conditions: A Deep Learning Approach

Human kinetics, specifically joint moments and ground reaction forces (GRFs), can provide important clinical information and can be used to control assistive devices. Traditionally, kinetics collection is mostly limited to the laboratory environment because it relies on data measured by a motion capture system and floor-embedded force plates to calculate the dynamics via musculoskeletal models. This spatially constrained method makes it extremely challenging to measure kinetics outside the laboratory in a variety of walking conditions due to the expensive equipment and large space required. Recently, machine learning with IMU sensors has been suggested as an alternative method for biomechanical analyses. Although these methods enable estimating human kinetics outside the laboratory by linking IMU sensor data with kinetics datasets, they produce inaccurate kinetic estimates even in highly repeatable single walking conditions because they employ generic deep learning algorithms. Thus, this paper proposes a novel deep learning model, Kinetics-FM-DLR-Ensemble-Net, for single-limb prediction of hip, knee, and ankle joint moments and 3-dimensional GRFs using three IMU sensors on the thigh, shank, and foot under several representative walking conditions in daily living: treadmill, level-ground, stair, and ramp. This is the first study that estimates both joint moments and GRFs in multiple walking conditions using IMU sensors via deep learning. Our deep learning model is versatile and accurate in identifying human kinetics across diverse subjects and walking conditions and outperforms state-of-the-art deep learning models for kinetics estimation by a large margin.


I. INTRODUCTION
ESTIMATION of human kinetics, specifically ground reaction forces (GRFs) and joint moments, plays an important role in providing insights and fundamental information for clinical decisions and the control of exoskeleton devices. For example, joint moments can be used to understand the impact of arthritis in different joints on walking [1], [2], [3] and to provide assistive torque to a powered exoskeleton to minimize muscle effort [4]. Moreover, the knee abduction moment (KAM) and knee flexion moment (KFM) are critical biomechanical measurements for assessing knee joint loading [5], which is considered a contributor to knee osteoarthritis [6]. GRFs have also been used for the evaluation of pathological gait patterns [7], [8], [9] and for analyzing the gait of amputees [10], [11].
Traditionally, joint moments are calculated using infrared-based motion capture cameras and ground reaction force plates. Joint kinematics collected from motion capture cameras and GRFs from floor-embedded force plates are then fed into computational musculoskeletal modeling software such as OpenSim [12], Visual3D (C-Motion, MD), Nexus (Vicon, U.K.), or AnyBody (AnyBody Technology, Denmark) to calculate joint moments. Although experimental data and musculoskeletal modeling software can provide reliable joint moment data, this method requires extensive manual post-processing of motion and GRF data, hindering prompt evaluation. In addition, there are major technical hurdles in collecting human kinetics in walking conditions outside the laboratory, such as on ramps and stairs, due to the bulky and heavy equipment setups required. Thus, this method constrains estimation outside the lab, especially in the varied walking environments encountered in daily living.
To overcome these limitations of traditional kinetics estimation methods, there is a trend toward adopting wearable sensors with computational human dynamic models [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23], neuromusculoskeletal models [22], [24], or wearable force plates [18], [19], [20], [21]. However, kinetics estimation based on wearable sensors with human dynamic or neuromusculoskeletal models requires a large number of sensors (e.g., 7 IMUs in [14], 17 IMUs in [16], and 15 IMUs in [17]). Wearable electromyography (EMG) sensors on specific muscle groups can also be used with neuromusculoskeletal modeling to calculate joint moments [22], [24]. However, EMG signals are sensitive to skin impedance and the location of the muscle belly, making it challenging to acquire consistent and repeatable signals. EMG signals are also prone to additional noise from motion artifacts and electrical fields in the environment. Thus, neuromusculoskeletal-model-based estimation of kinetics with wearable and EMG sensors is subject to multiple limitations, including a large number of sensors, subject-specific anthropometric information, and the variability of EMG signals. Moreover, all of these studies are limited to the level-ground condition, which limits their applicability to walking conditions such as stairs and ramps that are commonly encountered in daily living. There is also widespread research on obtaining GRF data with portable force plates: GRFs can be measured using strain gauge transducers [25], piezoelectric sensors [26], and fiber-optic force sensors [27]. However, these sensors are vulnerable to hysteresis, sensitive to temperature, and constrained by force ranges and the deformation of the sensors' materials [18]. Wearable GRF plates are also heavy and stiff, which hinders the wearer's natural dynamic movement.
Moreover, these shoe-embedded force plates need to be custom fabricated to fit a specific subject's foot size and shape.
To address the limitations imposed by musculoskeletal or biomechanical modeling, shoe-embedded force plates, simulated IMU data, multi-modal sensor systems, large numbers of wearable sensors, conventional deep learning models, the lack of proper validation (leave-one-subject-out), and estimation of kinetics only in simple repetitive level-ground or treadmill motion, we aim to predict the single-limb sagittal plane joint moments of the hip, knee, and ankle, the frontal plane joint moments of the hip and knee, and 3D GRFs (anterior-posterior, vertical, and medio-lateral) using three IMU sensors on the foot, shank, and thigh with a novel deep learning model, Kinetics-FM-DLR-Ensemble-Net, in multiple walking environments. To implement the algorithm, we utilize two publicly available datasets: Dataset A [42] (sagittal plane hip, knee, and ankle moments, hip abduction moment, and 3D GRFs) and Dataset B [43] (KAM, KFM, and 3D GRFs). To the best of our knowledge, this is the first study that estimates both fundamental kinetics parameters, 3D GRFs and lower extremity joint moments, in multiple walking conditions using IMU sensors via deep learning.
Our contribution is seven-fold: (i) proposing an end-to-end trained model, Kinetics-Net, leveraging different deep learning layers to increase human kinetics prediction performance; (ii) presenting a Fusion Module (FM) to integrate the outputs of the three primary models in Kinetics-Net, creating Kinetics-FM-Net to further improve prediction performance; (iii) introducing a novel technique utilizing two loss functions, which outperforms conventional loss designs in deep learning models; (iv) further applying an existing ensemble technique, bagging, to improve prediction accuracy; (v) conducting an extensive evaluation with ablation studies to show the effectiveness of our deep learning model; (vi) conducting an experimental comparison with state-of-the-art deep learning models for human kinetics estimation, which our proposed method outperforms by a large margin; and (vii) demonstrating the generalization capability of the model on two publicly available datasets.
The rest of this paper is organized as follows: Section II discusses the related work for IMU based kinetics estimation using musculoskeletal modeling and data-driven methods. The problem statement and the detailed structure of Kinetics-FM-DLR-Ensemble-Net are discussed in Section III. Section IV describes the protocol of the dataset, dataset pre-processing, validation method, and implementation details of the deep learning model. Section V demonstrates the results. In Section VI, implications of these results, limitations and future works are discussed. We conclude our paper in Section VII.

II. RELATED WORK
In this section, we will discuss related work to estimate kinetics using IMU sensors with musculoskeletal modeling or machine learning methods. First, we will discuss IMU and musculoskeletal or analytical model-based kinetics estimation methods. Later, we will discuss data-driven methods for kinetics estimation and their limitations.
Yang et al. [14] used seven IMU sensors to estimate GRFs and moments during walking with a three-dimensional analytical model. Karatsidis et al. [16] estimated GRFs during walking using kinematic data from 17 IMU sensors with a biomechanical model; they also predicted joint moments along with GRFs using 17 IMU sensors and a musculoskeletal-model-based inverse dynamics method for three level-ground walking speeds. Aurbach et al. [17] used a musculoskeletal model to compute GRFs from the kinematic data of 15 IMU sensors during level-ground gait. All of these methods use a large number of wearable sensors for kinetics estimation.
Kinetics have also been estimated using data-driven methods with IMU sensors [35], [37]. Dorschky et al. [37] used four IMU sensors to estimate hip, knee, and ankle joint moments, anterior-posterior GRF, and vertical GRF during level-ground and treadmill walking and running, utilizing a 2D Convolutional Neural Network (CNN) based deep learning model with data augmentation from musculoskeletal model simulation. Their approach is limited to the highly repeatable level-ground and treadmill walking conditions. Leporace et al. [38] used a single accelerometer on the shank to predict 3D GRFs during walking using a Multilayer Perceptron (MLP). However, the model was trained on limited samples (only four gait cycles per participant), and data was collected in the level-ground walking condition only. Guo et al. [39] used only acceleration data from a single waist-mounted IMU to predict the vertical GRF at self-selected walking speeds in outdoor level-ground settings using an Orthogonal Forward Regression (OFR) algorithm. All of these studies are limited to simple repetitive motion such as level-ground or treadmill walking and to conventional machine learning models.
To utilize datasets lacking experimentally collected IMU data alongside motion capture and GRF data, Mundt et al. [31] placed virtual IMU sensors on models driven by retrospectively collected walking data to predict hip, knee, and ankle joint moments using feedforward and LSTM-based models. Molinaro et al. [28] also used a musculoskeletal model to generate virtual IMU sensor signals on the trunk and thigh to estimate hip joint moments using a TCN [40]. As these studies used virtual IMU data, it is unclear how reliable their methods are in real IMU applications, where noise is introduced by skin surface movement.
To determine which deep learning model better estimates joint moments, a comparison was made between current deep learning methods: LSTM, MLP, and a pre-trained convolutional neural network [34]. This study showed that the MLP had better performance in joint moment prediction than the other models, while the LSTM would be preferable for real-time estimation. More recently, Camargo et al. [29] estimated hip, knee, and ankle joint moments in multiple locomotion modes (treadmill, stair, ramp) using features extracted from a cluster of electrogoniometers, EMG, and IMU sensors; they performed feature selection and then used the selected features as input to an Artificial Neural Network (ANN) and XGBoost. However, the total of 18 sensors raises doubts about practicality in real-world deployment. In addition, they used handcrafted feature engineering, which adds complexity to their proposed method. To reduce the sensor count, Lim et al. [30] proposed a single sacrum-mounted IMU to predict joint moments and GRFs, with extracted features (acceleration, velocity, displacement, and time) as the input to an ANN, but they are still limited to a conventional deep learning model and to the treadmill walking condition.
Most of these studies are limited to simple repetitive motion on level ground and/or a treadmill, simulated IMU data, or multi-modal sensor systems, and they estimate only one kinetics component, either joint moments or GRFs. These limitations are compounded by the use of conventional deep learning models such as LSTM, CNN, TCN, and ANN, which limits more accurate multi-variable human kinetics estimation. To address the limitations of these works, we propose a novel deep learning method to estimate kinetics during gait using three IMU sensors in multiple walking conditions.

A. Problem Statement
This paper estimates three-dimensional GRFs and hip, knee, and ankle joint moments using three IMU sensors on the thigh, shank, and foot via a novel deep learning model, Kinetics-FM-DLR-Ensemble-Net. The model learns a mapping from an IMU input window of size ΔT × (N × D_IMU) to a kinetics output of size ΔT × D_k. Here, ΔT represents the window length of data input to the model, N is the number of IMU sensors, D_IMU is the number of channels per IMU sensor, and D_k is the number of kinetics parameters of the single limb (anterior-posterior GRF, vertical GRF, medio-lateral GRF, sagittal plane hip, knee, and ankle joint moments, and frontal plane hip joint moment for Dataset A; 3D GRFs, KAM, and KFM for Dataset B). In this problem, we use D_IMU = 6 and N = 3, so D_k = 7 for Dataset A and D_k = 5 for Dataset B.
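Under these definitions, the input and output shapes can be sketched as follows (ΔT = 100 samples and Dataset A's D_k = 7 are used here purely for illustration):

```python
import numpy as np

# DT: window length ΔT; N: number of IMUs; D_IMU: channels per IMU
# (3-axis accelerometer + 3-axis gyroscope); D_K: kinetics outputs.
DT, N, D_IMU, D_K = 100, 3, 6, 7   # D_K = 7 for Dataset A, 5 for Dataset B

x = np.zeros((DT, N * D_IMU))      # one input window of raw IMU channels
y = np.zeros((DT, D_K))            # per-time-step kinetics targets

print(x.shape, y.shape)  # (100, 18) (100, 7)
```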

B. Kinetics-Fm-Dlr-Ensemble-Net
Kinetics-FM-DLR-Ensemble-Net is built mainly from the model Kinetics-FM-DLR-Net (Fig. 1). We apply the bagging [44] technique to Kinetics-FM-DLR-Net to create Kinetics-FM-DLR-Ensemble-Net, which is our final proposed model. Kinetics-FM-DLR-Net mainly consists of two Kinetics-FM-Nets (Fig. 2), each trained with a different loss function and combined using a novel technique, Double Loss Regression (DLR). Kinetics-FM-Net is built from Kinetics-Net and an FM. We build Kinetics-Net using different deep learning layers: GRU, Conv1D, Conv2D, and fully connected dense layers. All the components of Kinetics-FM-DLR-Ensemble-Net are described in this section.
1) Kinetics-Net: Kinetics-Net mainly consists of three primary models: GRU-Net, GRU-Conv1D-Net, and GRU-Conv2D-Net. The predictions from these three models will differ because of the differences in architecture, so combining the primary models may increase prediction performance. In this paper, we combine the outputs of the three primary models by taking the average of their predictions. We train all three primary models simultaneously, along with the final prediction from the combined model, to create an end-to-end trained model, Kinetics-Net. We minimize the loss function over the four output values of Kinetics-Net, which results in good predictive performance of the final model. If the predictions from GRU-Net, GRU-Conv2D-Net, and GRU-Conv1D-Net are K_1, K_2, and K_3 respectively, then the output of Kinetics-Net is their average, (K_1 + K_2 + K_3)/3.
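The averaging and four-output loss described above can be sketched in NumPy as follows (the RMSE loss term and function names are illustrative; in practice the model is trained end-to-end in Keras):

```python
import numpy as np

def combine_primary(pred_gru, pred_conv2d, pred_conv1d):
    """Kinetics-Net output: simple average of the three primary models."""
    return (pred_gru + pred_conv2d + pred_conv1d) / 3.0

def four_output_rmse(y_true, primary_preds):
    """End-to-end training loss: the RMSE of each primary output plus
    the RMSE of the averaged output (four terms in total)."""
    outputs = list(primary_preds) + [combine_primary(*primary_preds)]
    return sum(np.sqrt(np.mean((y_true - p) ** 2)) for p in outputs)

# Sanity check: identical predictions and targets yield zero loss.
p = np.ones((4, 2))
print(four_output_rmse(p, [p, p, p]))  # 0.0
```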

a) Primary Models:
- GRU-Net: GRU-Net consists of an input layer followed by a Batch Normalization (BN) layer, a GRU block, a flatten layer, and an output layer. The BN layer is applied after the input signal to perform an operation similar to standard normalization of the input data [45]. After BN, a GRU block is added, and the output of the GRU block is flattened and connected to the output layer.
- GRU-Conv2D-Net: In GRU-Conv2D-Net, we have two branches using the features from the flatten layers of GRU-Net and Conv2D-Net. For the second branch, we use a Conv2D block followed by a Fully Connected (FC) block. The output of the FC block is flattened and then concatenated with the features from the flatten layer of GRU-Net. The concatenated features are connected to the Output-2 layer to make the prediction. The rationale for using two branches is to improve the prediction by integrating more diversified features from different types of deep learning layers.
- GRU-Conv1D-Net: In GRU-Conv1D-Net, we follow the same architecture as GRU-Conv2D-Net, replacing only the Conv2D block with a Conv1D block.
b) Fundamental Blocks: All the fundamental blocks (Fig. 3) used to create the primary models are described in this section:
- GRU Block: The GRU block uses two GRU layers. A dropout layer is added after the GRU layer to avoid overfitting during training.
- Conv2D Block: In the Conv2D block, a conv2D layer is followed by a BN layer, which helps reduce internal covariate shift. Then, a max-pooling2D layer is applied to reduce the feature space, which reduces the model's complexity and selects dominant features. A conv2D, BN, and max-pooling layer form the main unit of the Conv2D block; four such units are added sequentially to create the Conv2D block.
- Conv1D Block: In the Conv1D block, a conv1D layer is followed by a BN layer, and then a max-pooling1D layer is applied. A conv1D, BN, and max-pooling layer form the main unit; four such units are added sequentially to create the Conv1D block.

2) Fusion Module (FM):
In Kinetics-Net, we initially take the simple average of the outputs of the three primary models. However, this may not ensure optimal performance, as it assigns equal weights to all the models. Since the performance of each model will differ, proper weights need to be assigned to obtain the best performance gain from the three models. To do this, we design an FM, creating Kinetics-FM-Net, using two fully connected layers. Output-1, Output-2, and Output-3 (Fig. 1) from the three primary models are passed to two Fully Connected (FC) layers. All the FC layers are dense layers, where a rectified linear unit activation is applied between the first two layers and a sigmoid activation in the last layer constrains the output between 0 and 1. If the weights produced for each primary model's output after passing through the dense layers are W_1, W_2, and W_3 respectively, then the output of the FM is W_1 ⊙ K_1 + W_2 ⊙ K_2 + W_3 ⊙ K_3, where ⊙ is element-wise multiplication.
3) Double Loss Regression (DLR): Prior works used RMSE as the loss function [35], [37], while both Normalized RMSE (NRMSE), which is proportional to RMSE, and the Pearson Correlation Coefficient (PCC) were used for performance evaluation [36]. Typically, when multiple loss functions are available, a single loss function is derived from their weighted sum. This approach of combining multiple loss functions (RMSE and PCC in our case) may not ensure proper performance for kinetics estimation, as the optimizer minimizes the combined loss without exploiting the relationship between the individual losses. To use the different characteristics of the loss functions properly, we devise a novel strategy that employs two Kinetics-FM-Nets (Kinetics-FM-Net(RMSE) and Kinetics-FM-Net(PCC)), trained separately with the two loss functions (Fig. 1). In Kinetics-FM-Net(RMSE), the optimizer minimizes the RMSE between ground truth and prediction, whereas in Kinetics-FM-Net(PCC), the optimizer maximizes the PCC between the experimentally collected ground truth and the prediction.
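The Fusion Module's weighted combination described above can be sketched in NumPy; the hidden-layer size, random weights, and function names here are illustrative assumptions, not the trained network:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fusion_module(outputs, params):
    """Produce per-element weights in (0, 1) for each primary model's
    output via two dense layers (ReLU then sigmoid) and sum the
    element-wise products of weights and outputs."""
    fused = np.zeros_like(outputs[0])
    for K, (W1, W2) in zip(outputs, params):
        gate = sigmoid(relu(K @ W1) @ W2)  # element-wise weights
        fused += gate * K                  # element-wise multiplication
    return fused

D_K = 7
outputs = [rng.standard_normal((100, D_K)) for _ in range(3)]
# Two dense layers per branch; hidden size 16 is an assumption.
params = [(0.1 * rng.standard_normal((D_K, 16)),
           0.1 * rng.standard_normal((16, D_K))) for _ in range(3)]

fused = fusion_module(outputs, params)
print(fused.shape)  # (100, 7)
```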
As a result, Kinetics-FM-Net(PCC) will produce profiles of joint moments and GRFs similar to the ground truth but shifted and scaled (right Y-axis of Kinetics-FM-Net(PCC) in Fig. 4).
As Kinetics-FM-Net(PCC) is mainly focused on increasing PCC, it will achieve a higher PCC than Kinetics-FM-Net(RMSE). To maintain that high PCC while correcting the shifting and scaling of the Kinetics-FM-Net(PCC) prediction, we need to acquire the actual offset and range. In Fig. 4, the prediction of Kinetics-FM-Net(RMSE) has a closely similar range and offset to the ground truth. As a result, we can leverage the range and offset information from the prediction of Kinetics-FM-Net(RMSE) to correct the shifted and scaled prediction of Kinetics-FM-Net(PCC). If the prediction from Kinetics-FM-Net(PCC) is K^PCC_ΔT ∈ R^(ΔT×D_K), and the gain and offset corrections are B_1 = [B_11, ..., B_1D_K] ∈ R^(D_K) and B_0 = [B_01, ..., B_0D_K] ∈ R^(D_K) respectively, then the corrected prediction from Kinetics-FM-Net(PCC) is the element-wise multiplication by the gain plus the offset, B_1 ⊙ K^PCC_ΔT + B_0.
Here, K^RMSE_ΔT ∈ R^(ΔT×D_K) is the prediction from Kinetics-FM-Net(RMSE), which has a closely similar range and offset to the ground truth. As we can estimate K^RMSE_ΔT and K^PCC_ΔT from the two Kinetics-FM-Net models, we can calculate the coefficient matrices B_0 and B_1 using D_K linear regressions, one for each kinetics component. After calculating the gain and offset correction matrices B_0 and B_1, the final prediction is obtained after gain and offset correction. In Fig. 4, we demonstrate the qualitative and quantitative impact of our DLR loss design on performance improvement.
4) Ensemble (Bagging): Finally, we apply bagging on Kinetics-FM-DLR-Net to create Kinetics-FM-DLR-Ensemble-Net. First, we create bootstrap samples (random sampling with replacement) from the training dataset. Suppose we create K bootstrap samples from the whole training dataset. Each bootstrap sample is used to train one Kinetics-FM-DLR-Net, yielding outputs K^Bag_ΔT,1, K^Bag_ΔT,2, ..., K^Bag_ΔT,K. The final output of Kinetics-FM-DLR-Ensemble-Net is the average of these K outputs.
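The DLR gain/offset correction can be sketched with per-component least-squares regression in NumPy (variable names are illustrative):

```python
import numpy as np

def dlr_correction(k_pcc, k_rmse):
    """Fit gain B1 and offset B0 per kinetics component (column) so that
    B1 * k_pcc + B0 best matches k_rmse in the least-squares sense, then
    return the corrected PCC-model prediction."""
    T, D_K = k_pcc.shape
    B0, B1 = np.empty(D_K), np.empty(D_K)
    for d in range(D_K):
        # One linear regression per kinetics component.
        A = np.column_stack([k_pcc[:, d], np.ones(T)])
        (B1[d], B0[d]), *_ = np.linalg.lstsq(A, k_rmse[:, d], rcond=None)
    return B1 * k_pcc + B0  # element-wise gain, then offset

# Toy check: a PCC-model signal that is scaled and shifted relative to
# the RMSE-model prediction is recovered exactly by the regression.
t = np.linspace(0, 1, 50)
k_rmse = np.column_stack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)])
k_pcc = (k_rmse + 1.0) / 2.0          # shifted and scaled version
corrected = dlr_correction(k_pcc, k_rmse)
print(np.allclose(corrected, k_rmse))  # True
```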

5) State-of-The-Art (SOTA) Model Description:
For comparison with our proposed model Kinetics-FM-DLR-Ensemble-Net, we adopt multiple deep learning models from the literature [28], [29], [31], [34], [37], [38]. Descriptions of those models are given below.
a) FFN (HF): In the FeedForward Neural network with Handcrafted Features (FFN-HF), we extract 17 features from each axis (3D accelerometer and 3D gyroscope) of the input IMU data, over a window length of 100 samples (0.5 s) for Dataset A [42] and 50 samples (0.5 s) for Dataset B [43]. The extracted features are mean, RMS, max, min, mean absolute value, standard deviation, mean absolute difference, mean difference, median difference, median absolute difference, interquartile range, kurtosis, skewness, median, variance, median absolute deviation, and mean absolute deviation, giving a total of 102 features per IMU. These extracted features are used as the input to an FFN with multiple dense layers. Each layer is followed by a dropout layer to avoid overfitting during training. We flatten the output of the last dropout layer and connect it to the final layer for kinetics prediction.
b) TCN: For the TCN, we use a single stack of the residual block with weight normalization. The TCN layer is followed by two dense layers, each with a dropout layer after it. We flatten the features from the last dropout layer and connect them to the prediction layer.
c) FFN: For the FeedForward Neural network (FFN), we use raw IMU data as input. Two dense layers, each followed by a dropout layer, are used. Features from the last dropout layer are flattened and connected to the prediction layer.
d) Bi-LSTM: We use two bidirectional LSTM layers, each followed by a dropout layer. After these, two dense layers, each followed by a dropout layer, are used; the features are then flattened and connected to the prediction layer.
e) Conv2D: We derive the Conv2D architecture from GRU-Conv2D-Net by discarding the GRU-Net branch (Fig. 2).
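The 17 per-axis features of FFN-HF can be computed as follows; this is a sketch using standard moment definitions (the paper does not specify the exact estimators, e.g., for kurtosis):

```python
import numpy as np

def axis_features(x):
    """The 17 per-axis window features listed for FFN-HF."""
    d = np.diff(x)
    m, s = x.mean(), x.std()
    z = (x - m) / s if s > 0 else np.zeros_like(x)
    q75, q25 = np.percentile(x, [75, 25])
    med = np.median(x)
    return np.array([
        m,                              # mean
        np.sqrt(np.mean(x ** 2)),       # RMS
        x.max(), x.min(),               # max, min
        np.abs(x).mean(),               # mean absolute value
        s,                              # standard deviation
        np.abs(d).mean(),               # mean absolute difference
        d.mean(),                       # mean difference
        np.median(d),                   # median difference
        np.median(np.abs(d)),           # median absolute difference
        q75 - q25,                      # interquartile range
        np.mean(z ** 4) - 3,            # kurtosis (excess)
        np.mean(z ** 3),                # skewness
        med,                            # median
        x.var(),                        # variance
        np.median(np.abs(x - med)),     # median absolute deviation
        np.abs(x - m).mean(),           # mean absolute deviation
    ])

window = np.random.randn(100, 6)        # 0.5 s window, one 6-axis IMU
feats = np.concatenate([axis_features(window[:, i]) for i in range(6)])
print(feats.shape)  # (102,) -> 17 features x 6 axes per IMU
```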

A. Dataset Description
This paper uses two publicly available datasets [42], [43] to build Kinetics-FM-DLR-Ensemble-Net for joint moment and GRF estimation.
1) Dataset A [42]: Twenty subjects' data (8 females and 12 males, age: 21.7 ± 3.65 years, height: 1.70 ± 0.07 m, weight: 68.21 ± 11.52 kg) are used for model training, discarding two subjects: AB20 because of the absence of GRFs for the level-ground condition, and AB06 because of an IMU data mismatch compared to the rest of the subjects. This dataset comprises four locomotion modes: treadmill walking, level-ground walking, ramp ascent/descent, and stair ascent/descent. Treadmill walking was collected at 28 different speeds ranging from 0.5 to 1.85 m/s in 0.05 m/s increments across seven trials (four speeds per trial). Level-ground walking data was collected for 30 circuits (five clockwise and five counterclockwise at each of self-selected slow, normal, and fast speeds), including both straight walking and turning. In stair walking, subjects walked on a six-step staircase with four different stair heights (4 in, 5 in, 6 in, and 7 in), completing a total of 40 trials: for each stair height, a set of five trials starting with the instrumented leg (right leg) and a set of five trials starting with the non-instrumented leg. Four IMU sensors (torso, thigh, shank, and foot) were used, and three electronic goniometers (hip, knee, and ankle joints) and 11 EMG sensors were attached to the instrumented leg. Sixty ramp trials were performed at six inclination angles (5.2°, 7.8°, 9.2°, 11°, 12.4°, and 18°), with a set of five trials for each starting leg (left and right) at each inclination. Other details of the experimental protocol are described in [42].
Dataset Pre-processing: IMU data and marker trajectory data from the motion capture system are collected at 200 Hz. GRF data are measured at 1000 Hz and then re-sampled to 200 Hz to synchronize with the IMU and motion capture data. Joint moments are calculated using the OpenSim [12] inverse dynamics tool with motion capture and GRF data. For level-ground, stair, and ramp walking, the IMU, joint moment, and GRF data are segmented over a full gait cycle based on the availability of GRFs from the installed force plates for the instrumented right leg. Data segmentation is not performed for treadmill walking, as GRFs are present for entire gait cycles. We normalize joint moment data by the body weight and height of the participants, and GRF data by body weight. IMU data are collected from the foot, shank, thigh, and torso; data from the foot, shank, and thigh are used to build the deep learning model, and the torso data are discarded, as we are primarily focused on predicting lower-body joint kinetics. We use windows of 100 samples (0.5 s) for feature extraction in the deep learning model, as this provides satisfactory results for kinetics estimation.
Details of the force plate segmentation for the level-ground, ramp, and stair conditions are given below. Force plate numbers and their positions in the laboratory can be found in [42].
a) Level-ground: For clockwise level-ground walking, we segment the FP2, Combined, and FP5 force plate data; for counterclockwise level-ground walking, we segment the FP5 data.
b) Stair: For stair trials starting with the instrumented leg (right leg), we segment force plate data from FP1 and FP5 for the right leg. For trials starting with the non-instrumented leg (left leg), we segment FP2, FP3, and FP4.
c) Ramp: For ramp trials starting with the non-instrumented leg (left leg), we segment force plate data from FP2 and FP4 for the right leg. For trials starting with the instrumented leg (right leg), we segment FP1, FP3, and FP5.
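The resampling and normalization steps above can be sketched as follows; simple decimation for the 1000 → 200 Hz resampling and normalization by body weight in newtons (g = 9.81 m/s²) are assumptions, as the paper does not specify either convention:

```python
import numpy as np

GRF_HZ, TARGET_HZ = 1000, 200
G = 9.81  # m/s^2 (assumed for converting body mass to body weight)

def preprocess_grf(grf, body_mass_kg):
    """Downsample GRFs from 1000 Hz to 200 Hz by decimation and
    normalize by body weight (both conventions are assumptions)."""
    step = GRF_HZ // TARGET_HZ             # = 5
    return grf[::step] / (body_mass_kg * G)

def preprocess_moment(moment, body_mass_kg, height_m):
    """Normalize joint moments by body weight times height."""
    return moment / (body_mass_kg * G * height_m)

grf = np.random.randn(5000, 3)             # 5 s of 3D GRF at 1000 Hz
print(preprocess_grf(grf, 68.2).shape)     # (1000, 3) -> 5 s at 200 Hz
```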
2) Dataset B [43]: Seventeen subjects (all male; age: 23.2 ± 1.1 years; height: 1.76 ± 0.06 m; mass: 67.3 ± 8.3 kg) participated in the study. Different walking speeds, foot progression angles, step widths, and trunk sway angles were used as test conditions. Subjects initially performed a 2-minute normal walking trial at a self-selected speed (1.16 ± 0.04 m/s), which was used as a baseline to determine foot progression angle and step width. Three foot progression angles (baseline-15°, baseline, and baseline+15°) were combined with three speeds (self-selected-0.2 m/s, self-selected, and self-selected+0.2 m/s) to perform nine trials. In the same way, trials with three step widths (baseline-0.054 m, baseline, and baseline+0.070 m) and three trunk sway angles (4°, 8°, and 12°) at the three speeds were also performed.
Dataset Pre-processing: Infrared-based optical motion capture data was recorded at 100 Hz with synchronized ground reaction force data at 1000 Hz. Data from eight IMUs were collected at a sampling rate of 100 Hz. The pre-processed dataset provides 3D GRFs with the same number of samples as the IMU and joint moment data. As only KAM, KFM, and 3D GRFs are provided, we use these variables to build our algorithm. In the dataset, zeros are appended at the end of shorter steps to synchronize them with the longest step, which has a length of 152 samples. We remove those extra zeros from the shorter steps and use the rest of the dataset to build our algorithm.
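Removing the appended zeros from the padded steps can be sketched as follows (assuming, for illustration, that a valid sample has at least one non-zero channel):

```python
import numpy as np

def trim_padded_step(step):
    """Remove zero rows appended to a shorter step (padded to length
    152). Keeps everything up to the last row with any non-zero value."""
    nonzero_rows = np.flatnonzero(np.any(step != 0, axis=1))
    return step[: nonzero_rows[-1] + 1] if nonzero_rows.size else step[:0]

padded = np.zeros((152, 5))            # 3D GRFs + KAM + KFM, zero-padded
padded[:120] = np.random.randn(120, 5) # actual step occupies 120 samples
print(trim_padded_step(padded).shape)  # (120, 5)
```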

B. Implementation Details
We train all of our models in Keras on a TITAN Xp GPU (NVIDIA, CA). Kinetics-FM-DLR-Ensemble-Net is trained with a run time of 20 hours (two hours per bootstrap sample) for Dataset A and 4 hours (24 minutes per bootstrap sample) for Dataset B. Adam [46] is used as the optimizer. All models with the RMSE loss are run for 40 epochs, and models with the PCC loss are run for 55 epochs, with early stopping callbacks (patience of 15 epochs). The best model weights are restored based on the minimum validation loss within the patience window. All models are trained with a batch size of 64. We mainly use two loss functions, L_RMSE and L_PCC, for training. The Joint Loss (L_JL) is derived from these two loss functions as L_JL = L_RMSE + W * L_PCC, where W weights the PCC term.
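The losses can be sketched in NumPy as follows; taking the PCC loss as negative PCC, so that minimizing it maximizes the correlation, is an assumption about the sign convention:

```python
import numpy as np

def l_rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def l_pcc(y_true, y_pred):
    # Negative Pearson correlation, averaged over kinetics components,
    # so that minimizing this loss maximizes PCC (assumed convention).
    pccs = [np.corrcoef(y_true[:, d], y_pred[:, d])[0, 1]
            for d in range(y_true.shape[1])]
    return -np.mean(pccs)

def l_joint(y_true, y_pred, w=1.0):
    # Joint Loss: RMSE term plus a weight W on the PCC term.
    return l_rmse(y_true, y_pred) + w * l_pcc(y_true, y_pred)

y = np.random.randn(100, 7)
print(l_rmse(y, y))  # 0.0
print(l_pcc(y, y))   # -1.0 (perfect correlation)
```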
In Tables I and III, we present the hyperparameters of the different layers of Kinetics-FM-DLR-Net and the SOTA models, respectively.

C. Evaluation Procedures
Leave-one-subject-out cross-validation is implemented to assess the performance of all the models. Excluding the test subject from the training data ensures proper validation, as including the test subject's data in training would inflate the estimated performance.
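Leave-one-subject-out cross-validation can be sketched as:

```python
import numpy as np

def leave_one_subject_out(subject_ids):
    """Yield (subject, train_idx, test_idx), holding out one subject at
    a time so the test subject never appears in the training data."""
    subject_ids = np.asarray(subject_ids)
    for s in np.unique(subject_ids):
        test = np.flatnonzero(subject_ids == s)
        train = np.flatnonzero(subject_ids != s)
        yield s, train, test

ids = ["AB01", "AB01", "AB02", "AB03", "AB03"]  # illustrative subject IDs
for s, tr, te in leave_one_subject_out(ids):
    assert not set(tr) & set(te)   # no overlap between train and test
    print(s, len(tr), len(te))
```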

A. Model Ablation
In Tables II and IV, we show the ablation study of our model and report the NRMSE and PCC values for kinetics estimation with different loss designs. Eleven models trained with three loss designs result in a total of 33 models. We perform statistical analysis to validate whether our proposed model, Kinetics-FM-DLR-Net (Kinetics-FM-Net with the DLR loss design), significantly improves kinetics estimation accuracy compared to the remaining 32 models (Tables II and IV). For both datasets and both evaluation metrics (NRMSE, PCC), Kinetics-FM-DLR-Net significantly outperforms the rest of the models overall.

B. Fusion Module (FM)
Our designed Fusion Module, which assigns proper weights to the predictions of the primary models (GRU-Net, GRU-Conv2D-Net, and GRU-Conv1D-Net), improves joint moment and GRF prediction in all walking conditions. Tables II and IV demonstrate that integrating the FM into Kinetics-Net decreases the mean NRMSE and increases the mean PCC for both datasets.

C. Double Loss Regression (DLR)
We find that our proposed DLR outperforms the RMSE and JL designs for all the models (Tables II and IV). Although we initially designed DLR to increase the PCC of kinetics prediction, we obtain a 'double reward' with DLR, as it improves both NRMSE and PCC.

D. Sensor Combination
We use different combinations of sensors and apply Kinetics-FM-DLR-Net to see how performance changes with sensor location. We perform statistical analysis to determine whether the combination of all three sensors (Shank-Foot-Thigh) significantly improves the results compared to other sensor combinations. Compared to a single sensor, the Shank-Foot-Thigh combination significantly improves kinetics prediction. There is a marginal improvement over two-sensor combinations, which is statistically insignificant. However, as the Shank-Foot-Thigh combination has the lowest NRMSE and highest PCC, we proceed with it for the implementation of bagging.

E. Bagging (Ensemble Learning)
Kinetics-FM-DLR-Ensemble-Net, created by applying bagging techniques to Kinetics-FM-DLR-Net, reduces the mean NRMSE from 4.49 to 4.32 for Dataset A and from 7.54 to 7.43 for Dataset B, and increases the mean PCC from 0.923 to 0.929 for Dataset A and from 0.884 to 0.886 for Dataset B, compared with Kinetics-FM-DLR-Net. Table V demonstrates the performance of adding different numbers of bootstrap samples to Kinetics-FM-DLR-Net. Statistical analysis is performed to determine at which bootstrap sample number we achieve a significant performance improvement over Kinetics-FM-DLR-Net. After adding four bootstrap samples, there are significant improvements in both datasets. We continue adding samples up to ten and then stop the experiment due to increasing complexity and saturation in performance.
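A minimal sketch of the bagging procedure (bootstrap-resample the training data, train one model per resample, average the predictions) is shown below, with a toy least-squares model standing in for Kinetics-FM-DLR-Net:

```python
import numpy as np

rng = np.random.default_rng(0)

def bagged_predict(train_x, train_y, test_x, n_bags, fit_predict):
    """Bagging: fit one model per bootstrap resample, average the predictions."""
    preds = []
    for _ in range(n_bags):
        idx = rng.integers(0, len(train_x), len(train_x))  # sample with replacement
        preds.append(fit_predict(train_x[idx], train_y[idx], test_x))
    return np.mean(preds, axis=0)

# Toy stand-in model: least-squares slope through the origin
def ols_fit_predict(x, y, x_query):
    slope = (x * y).sum() / (x * x).sum()
    return slope * x_query

x = np.arange(1.0, 11.0)
y = 2.0 * x                                     # noiseless toy target
pred = bagged_predict(x, y, np.array([5.0]), n_bags=4,
                      fit_predict=ols_fit_predict)
```

The averaging step is what trades extra training cost (one model per bootstrap sample) for reduced prediction variance.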

F. Parameter Sweep (W)
Typically, when we combine two loss functions to create a joint loss, the weight of the loss is varied to find the best performance. In Tables II and IV, we only show the results of JL with W = 1 (Eq. (6)), which may not give the best performance from the JL. To demonstrate how varying W impacts performance compared with our DLR loss design, we change W in L_JL (Eq. (6)) and summarize the results in Table VII. The rationale for choosing the specific values of W is to make the actual value of L_PCC one-fifth, one-fourth, one-third, half, equal to, two times, three times, four times, and five times L_RMSE, ensuring a wide enough range of weights on L_PCC for a valid comparison of the joint loss. We perform statistical analysis to determine whether our model Kinetics-FM-DLR-Net performs significantly better than Kinetics-FM-Net with the weighted joint loss. For Dataset A, we do not achieve a significant improvement in PCC when W = 8 and W = 4. For the remaining cases of Dataset A and all cases of Dataset B, our model significantly outperforms the models with the different joint losses.
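Under one reading of the sweep rationale, W is chosen so that the weighted term W * L_PCC equals a fixed fraction or multiple of L_RMSE; a hypothetical helper for computing such weights from observed loss magnitudes:

```python
def weight_for_ratio(l_rmse, l_pcc, ratio):
    """Return W such that W * l_pcc == ratio * l_rmse (hypothetical helper;
    the paper's exact W-selection procedure is not reproduced here)."""
    return ratio * l_rmse / l_pcc

# The nine sweep settings: one-fifth ... five times L_RMSE
ratios = [1/5, 1/4, 1/3, 1/2, 1, 2, 3, 4, 5]
weights = [weight_for_ratio(0.5, 0.1, r) for r in ratios]  # toy loss magnitudes
```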

G. Comparison With State-of-The-Art
In Table VIII, we compare our results with state-of-the-art deep learning algorithms for kinetics estimation. For Dataset A, we observe an improvement of up to 24.74% in NRMSE and 8% in PCC, and for Dataset B, an improvement of up to 11.54% in NRMSE and 3.38% in PCC.
Tables IX and X show the prediction outcomes of Kinetics-FM-DLR-Ensemble-Net in each walking condition for Dataset A and Dataset B, respectively. As an example, we plot one gait cycle of each kinetics component for Dataset A and the stance phase of each kinetics component for Dataset B in each walking condition to provide a sample qualitative comparison of the ground truth and the model's predictions (Fig. 5).

VI. DISCUSSION
This study estimates lower extremity joint moments as well as anterior-posterior, vertical, and mediolateral GRFs in treadmill, level-ground, stair, and ramp walking conditions using three IMU sensors on the thigh, shank, and foot with our Kinetics-FM-DLR-Ensemble-Net. This is the first study to estimate both GRFs and joint moments in multiple walking conditions using IMU sensors via deep learning. We apply our algorithm to two datasets: Dataset A [42], with a large number of subjects, multiple walking environments, multiple walking speeds, multiple ramp angles, and multiple stair heights, and Dataset B [43], with different treadmill conditions. We also validate our model on an unseen subject (leave-one-subject-out cross-validation), which ensures that our model is not simply memorizing a specific subject's IMU-kinetics relation. The extensive training sets, the generalized and versatile capacity of our novel algorithm, and the rigorous validations enable the most accurate estimates of joint moments and GRFs for new testing subjects compared with the other state-of-the-art deep learning algorithms. In addition, because we develop and evaluate our model on public datasets, other researchers can validate our model against their algorithms and further build on this contribution to the field.
By leveraging different conventional deep learning layers (i.e., 1D and 2D convolutional, GRU, and dense layers), this paper proposes an end-to-end model, Kinetics-Net. Tables II and IV show the step-by-step development of our model on two publicly available datasets. First, we build three simple models, GRU-Net, Conv1D-Net, and Conv2D-Net, by utilizing convolutional, GRU, and dense layers. Then, we concatenate the features of pairs of these three models to build GRU-Conv2D-Net, GRU-Conv1D-Net, and Conv2D-Conv1D-Net. From the results, we observe an improvement in kinetics estimation performance when the features of two models are combined, which validates our approach of merging features from two models to improve the prediction. From the first six models in Tables II and IV, we dismiss Conv1D-Net, Conv2D-Net, and Conv2D-Conv1D-Net from further model development due to their poor performance. Next, we use GRU-Net, GRU-Conv2D-Net, and GRU-Conv1D-Net to build three Kinetics-Sub-Nets using the average prediction of pairs of these three models in an end-to-end manner, and we see a further improvement in results. Finally, we average GRU-Net, GRU-Conv2D-Net, and GRU-Conv1D-Net to create the end-to-end model Kinetics-Net, which outperforms all the previous models. We further add the FM and DLR modules, which significantly improve the kinetics estimation compared with the other models. This validates our approach of adding different layers to create a more complex model for performance improvement. Although many studies have applied deep learning algorithms to estimate kinetics [28], [29], [30], [31], [32], [33], [34], [35], [36], [37], [38], [39], a direct comparison of their results with ours is not valid due to the different sensor modalities, numbers of sensors, walking environments, and numbers of subjects.
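The branch-combination step described above (extract features from two branches, concatenate them, and map them to the kinetics targets through a dense head) can be sketched with stand-in feature extractors; the pooling extractors and shapes below are illustrative assumptions, not the trained GRU/Conv branches:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in feature extractors for the two branches (pooling in place of
# trained GRU / Conv layers)
def branch_a_features(x):
    return x.mean(axis=1)                 # (batch, channels)

def branch_b_features(x):
    return x.max(axis=1)                  # (batch, channels)

x = rng.normal(size=(8, 100, 6))          # 8 windows, 100 time steps, 6 IMU channels

# Concatenate the two branches' features, then a dense head maps them to
# the kinetics targets (3 joint moments + 3 GRF components)
features = np.concatenate([branch_a_features(x), branch_b_features(x)], axis=1)
w = rng.normal(size=(features.shape[1], 6))
y_hat = features @ w                      # (batch, 6) kinetics estimates
```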
Moreover, those datasets are not publicly available, which makes it challenging to apply our algorithm for comparison. To address this issue, we adopt their machine learning architectures and apply them to Dataset A and Dataset B. However, directly applying their specific machine learning architectures to these datasets may not be a valid method, as those models were built specifically for their own datasets and inputs (variable numbers of sensors and multi-modal inputs). For this reason, we use the same types of layers, such as LSTM, TCN, Conv2D, and FFN, and manually optimize the hyperparameters of those models for our datasets to achieve high performance and ensure a valid comparison. In Table VIII, we compare the performance of the state-of-the-art deep learning algorithms for kinetics with ours. Our algorithm, Kinetics-FM-DLR-Ensemble-Net, significantly outperforms those algorithms by a large margin, which demonstrates the effectiveness of our model over conventional deep learning methods for kinetics estimation.
From Table IX, the performance in the level-ground condition is inferior to the other walking conditions. The main reason is the lack of training data for the level-ground condition: when we segment the level-ground data, we obtain a limited number of right-leg strikes from the walking circuit, which results in less data (4% of the total dataset) than the other walking environments. Predictions for the treadmill trials are highly correlated with the ground truth, with a mean PCC of 0.945, compared with the other conditions, because these data are collected in a repetitive manner with strictly controlled walking speeds for each participant.
We use different sensor placement configurations in our model to identify how kinetics estimation performance varies. We find that, when a single sensor is used, the IMU on the shank outperforms the sensors at the thigh and foot locations on both datasets. One plausible explanation for the inaccuracy when using only the foot sensor is the lack of foot movement (and hence IMU signal) during the stance phase, when the foot is firmly constrained to the ground. In addition, the thigh segment may have greater IMU signal noise due to its compliant surface and muscle movement, whereas the frontal area of the shank may have less noise due to the tibia's bony surface. Moreover, we achieve more accurate results when the shank IMU is combined with the thigh and foot sensors for both datasets, which implies that our model performs effectively with multiple sensor configurations. Overall, there are no significant differences between Shank-Foot-Thigh and the Foot-Shank and Shank-Thigh combinations, which can help reduce the number of sensors without sacrificing kinetics estimation performance.
Although we provide the most accurate predictions of joint moments and GRFs across extensive walking conditions and speeds, with a large number of subjects and two independent public datasets, this study has several limitations. We use a relatively large number of IMU sensors (three) to estimate single-limb kinetics. However, to ensure practicality and user comfort, the number of IMU sensors should be minimized. More specifically, if we could use only shoe- or foot-mounted sensors, similar to the kinematics estimation in [47], or insoles as in [48], the sensors would be easier to maintain. Since single-limb joint moments and GRFs are affected by the contralateral limb during the early and terminal stance phases, incorporating IMU information from both limbs would capture more meaningful knowledge of walking dynamics and could improve the prediction further. Additionally, adding six sensors across both limbs would help acquire both limbs' kinetics information. Ensemble learning (bagging) is added to improve the prediction; however, it increases the computational complexity, as the model is repeated ten times. For a trade-off between accuracy and computational complexity, the bagging technique can be omitted from our model. As the level-ground condition has the highest error compared with the other conditions, a direction for future improvement may be to augment the level-ground dataset by repetition to make it roughly equal in size to the other walking environments. Although we achieve good accuracy for healthy individuals, patients with musculoskeletal issues may not obtain accurate outcomes from the algorithm due to the absence of patient training data. Moreover, our proposed approach relies only on IMU data, ignoring static and anthropometric properties of the individual.
As such information can be essential for representing the dynamic model of human kinetics, fusing it into the model has the potential to improve prediction performance. Thus, a probable future direction to address this limitation is to fuse deep learning-based features of static variables, such as segment lengths, the mass matrix of the segments, etc., into the model using dense fully connected layers. IMU placement errors, such as orientation and position errors, can affect the accuracy of moment estimation [43] or significantly reduce GRF accuracy [49] when machine learning-based data-driven methods are used. Our model may suffer the same performance degradation due to sensor placement errors. A future research direction to tackle this problem is to train the model with a simulated dataset of different IMU placements and orientations with the corresponding kinetics information.
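The static-variable fusion suggested above (embedding anthropometric values through a dense layer and concatenating them with the dynamic IMU features) could be sketched as follows; the variable choices, shapes, and toy weights are assumptions for illustration:

```python
import numpy as np

def fuse_static(dynamic_feat, static_vars, w_static):
    """Embed static variables with a dense layer + ReLU, then concatenate
    them with the dynamic IMU-branch features."""
    static_emb = np.maximum(static_vars @ w_static, 0.0)   # dense + ReLU
    return np.concatenate([dynamic_feat, static_emb], axis=1)

dynamic_feat = np.ones((4, 16))                 # per-window IMU-branch features
static_vars = np.array([[1.75, 70.0]] * 4)      # e.g., height (m), mass (kg)
w_static = np.full((2, 8), 0.01)                # toy dense-layer weights
fused = fuse_static(dynamic_feat, static_vars, w_static)
```

The fused vector would then feed the existing dense head, letting the network condition its kinetics estimates on subject anthropometry.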
As we use extensive datasets with our novel algorithm, our outcomes provide reliable estimates of joint moments and ground reaction forces compared with state-of-the-art studies. Thus, our algorithm can be used to measure joint kinetics and 3D ground reaction forces in clinics or research labs that cannot accommodate prohibitive measurement modalities such as motion capture cameras and overground force plates. To guarantee reliable performance, users need to set up the IMUs following the protocols described in [42], [43]. Our model also has the potential to serve as a platform that can be updated or modified via retraining to accommodate kinetics estimation for different subject populations whose kinetics deviate due to musculoskeletal issues or aging. Our model can be found on GitHub (https://github.com/Sanzid-Priam/Estimation-of-Kinetics-using-Kinetics-FM-DLR-Ensemble-Net).

VII. CONCLUSION
This study proposes a novel deep learning model to estimate kinetics in multiple walking conditions and speeds. Through extensive evaluation of the developed model, we justify our design choices. This accurate estimation will enable tracking of kinetics parameters outside the lab, removing the limitations of kinetics estimation based on traditional motion capture cameras and floor-embedded force plates.