LLIO: Lightweight Learned Inertial Odometer

The 3-D position estimation of pedestrians is a vital module for building connections between persons and things. Traditional gait-model-based methods cannot handle the wide variety of motion patterns, and most data-driven inertial odometry solutions focus on 2-D trajectory estimation in the ground plane, which is not sufficient for augmented reality (AR) applications. Tight learned inertial odometry (TLIO) proposed an inertial-based 3-D motion estimator that achieves very low position drift by fusing raw inertial measurement unit (IMU) measurements with displacement predictions from a neural network to provide low-drift pedestrian dead reckoning. However, TLIO is unsuitable for mobile devices because it is computationally expensive. In this article, a lightweight learned inertial odometry network (LLIO-Net) is designed for mobile devices. By replacing the network in TLIO with LLIO-Net, the proposed system shows a similar level of accuracy with a remarkable efficiency improvement. Specifically, the proposed LLIO algorithm was implemented on mobile devices and its computational efficiency was compared with that of TLIO. The inference efficiency of the proposed system is up to 12 times higher than that of TLIO. Source code can be found at https://github.com/i2Nav-WHU/LightweightLearnedInertialOdometer.


I. INTRODUCTION
AUGMENTED reality (AR) exhibits tremendous potential for improving the quality of life. To support AR, a pedestrian positioning system can provide high-accuracy 3-D trajectories in indoor and outdoor environments and play a crucial role in connecting AR devices to the Internet of Things (IoT) network [1].
Various techniques have been adopted to achieve indoor navigation in recent years. Positioning systems based on Bluetooth low energy (BLE) [2] and Wi-Fi [3], [4], [5] can only achieve low positioning accuracy. Ultra-wideband (UWB) systems [6] can provide decimeter-level positioning accuracy in theory, but their performance degrades significantly in non-line-of-sight (NLOS) environments. Furthermore, both techniques rely on preinstalled infrastructure. Vision-based systems, such as the visual-inertial navigation system (VINS), have seen tremendous success. A VINS [7], [8] can achieve high-accuracy positioning over a long period by combining visual and inertial measurements, and the hardware cost of VINS has become acceptable for consumer-level devices. These advantages have made VINS one of the best options for indoor positioning, especially for AR. However, despite the impressive performance of state-of-the-art VINS solutions, applying these methods in product scenarios remains challenging. For example, vision-based systems rely heavily on consistent feature association, which cannot be achieved in certain challenging scenarios (such as positioning in a dark room or when the camera is blocked by obstacles). Thus, a system that can provide consistent pose estimation independent of the external environment is necessary.
An inertial measurement unit (IMU) collects linear acceleration (strictly speaking, specific force) and angular rate data, which are used in an inertial navigation system (INS) to estimate 3-D motion relative to the first instance. The INS is a fully self-contained positioning system; in other words, it estimates the trajectory without any dependence on the external environment. This feature means that the INS complements visual-based systems well in AR, and indeed the INS is widely utilized in mobile devices that need indoor positioning. However, the MEMS IMUs embedded in mobile devices, such as mobile phones and AR headsets, cannot provide long-term motion estimation alone. This is because the noise of low-end IMUs is strong, and the position accumulation error of a strapdown INS grows with the square of time.
Pedestrian dead reckoning (PDR) uses sensors in mobile devices to detect gait information and form a dead-reckoning model. Some approaches [9] used prior knowledge of human motion to eliminate the accumulated velocity error. One way of applying this prior knowledge is to detect gait cycles and use this information to estimate trajectories. However, this approach consists of several submodules: step detection, step length estimation, and step orientation estimation. Each submodule requires either hand-designed rules or machine learning. For hand-designed rules, it is difficult to determine rules suitable for every scenario and every user. For machine learning, it is not easy to collect massive data sets with ground truth for certain submodules, e.g., step detection.
Recent research has shown that data-driven inertial odometers can directly provide trajectories by integrating the average velocity estimated through machine learning. Many approaches have focused on 2-D positioning [10], [11], [12], [13]. IONet [12] first proposed an LSTM-based architecture to estimate relative displacement in the ground plane. RoNIN [11] assumes that the global orientation is obtained by fusing linear acceleration, angular rate, and magnetometer data; the velocity is then estimated by a neural network (ResNet, LSTM, or TCN) from accelerometer and gyroscope data represented in a gravity-aligned frame. IDOL [13] uses acceleration, angular rate, and magnetometer data to estimate the global orientation with a network, rather than using a conventionally estimated global orientation, and achieves the best accuracy in terms of both orientation and position. Compared to traditional PDR, these learning-based methods exhibit higher accuracy and are more robust to various motion patterns.
However, for AR headsets in complex environments, a 3-D pose estimator is necessary. Tight learned inertial odometry (TLIO) [14] achieved learned inertial odometry for AR headsets and can estimate 3-D poses accurately in complex scenarios. It adopts a ResNet to estimate the 3-D displacement over short periods and uses a Kalman filter to fuse it with IMU measurements to achieve long-term dead reckoning. It exhibits the best performance in field testing but is computationally expensive. Compared with visual solutions, this approach only uses IMU observations, so it is not affected by the external environment and can provide more stable positioning performance.
However, computational efficiency is a vital metric for AR applications running on mobile devices because their computation power is limited. In reality, the efficiency bottleneck of TLIO is the ResNet-based neural network used to infer the 3-D displacement. More specifically, the ResNet-based architecture adopted in TLIO is computationally expensive and not friendly to mobile-device code implementations. Previous researchers replaced the LSTM-based architecture in IONet with a WaveNet-based one and significantly increased computational efficiency [15]. However, the tested data sets are relatively simple, and the performance of that approach needs to be verified on large data sets.
Recent multilayer perceptron (MLP) models [16], [17], [18] have shown potential to replace ResNet because their architectures can provide a better efficiency and accuracy tradeoff. For example, recent MLP-based models can improve efficiency while achieving similar accuracy in image classification. Moreover, the MLP architecture mainly uses matrix multiplication, which has been highly optimized in mobile devices. This fact indicates that the MLP architecture could be easily implemented on mobile devices.
In this article, we propose a lightweight learned inertial odometry for mobile devices whose primary goal is to improve computational efficiency while ensuring that the accuracy is not significantly reduced. This article makes two major contributions. 1) We propose a lightweight MLP-based network to regress both the 3-D displacement and the corresponding covariance. Specifically, we use this network to replace the ResNet architecture in TLIO and evaluate the system performance. The proposed networks provide similar performance and are 1.9-12.0× faster than the ResNet-based method when implemented on mobile devices. 2) We conduct systematic research into the relationship between the computational efficiency and the positioning performance of neural network models on mobile devices. The remainder of this article is organized as follows. Section II gives a brief description of the entire system. Section III describes the whole solution in detail. Section IV uses real test data sets to show that the proposed network achieves similar accuracy while significantly improving efficiency. Section V summarizes the study.
For the remainder of this article, we denote the proposed system as LLIO and the lightweight MLP-based network as the lightweight learned inertial odometry network (LLIO-Net).

II. SYSTEM OVERVIEW
The proposed system uses raw IMU measurements (linear acceleration and angular velocity) and performs 3-D motion estimation relative to the first instance. As shown in Fig. 1, the system consists of two components: 1) a stochastic-cloning extended Kalman filter (SCEKF) [19] and 2) a lightweight inertial odometry neural network (denoted as LLIO-Net).
The SCEKF estimates the 3-D motion (including position, orientation, and velocity) and the IMU biases. The IMU mechanization block is the propagation step of the SCEKF: it predicts the system state through INS mechanization based on the raw IMU measurements. The input of the measurement update of the SCEKF is the 3-D displacement provided by LLIO-Net. In summary, the filter tightly couples the raw IMU measurements and the displacement provided by LLIO-Net to estimate the 3-D motion and the IMU biases.
The LLIO-Net takes a sequence of IMU measurements represented in a gravity-aligned frame and estimates the displacement between the first and last instances. In the IMU coordinate conversion block, the IMU measurements are converted from the IMU frame to the navigation frame using the rotation matrix estimated by the SCEKF. The network block estimates the displacement and the corresponding covariance from the converted IMU measurements. The LLIO-Net is run every 0.1 s and uses the previous 1 s of IMU measurements as input; thus, each IMU measurement is used ten times for inference.
The IMU measurements are utilized twice in the entire system. First, the raw IMU measurements are the input of the IMU mechanization, which estimates the prior distribution of the system state. Second, in the measurement update, the displacement estimated from the IMU measurements is used to mitigate the accumulation errors of the SCEKF. The primary information source of the measurement update is the human motion patterns memorized by LLIO-Net rather than the IMU measurements themselves.

A. Coordinate Definition
In this article, three coordinate frames, as illustrated in Fig. 2, are defined: 1) the navigation frame, denoted F_N; 2) the body frame at time t, denoted F_{B_t}; and 3) the local gravity-aligned frame at time t, denoted F_{L_t}. F_{B_t} is aligned with the IMU axes at time t. F_N is a gravity-aligned frame anchored at the IMU center at the initial moment. F_{L_t} is the gravity-aligned frame whose yaw equals that of F_{B_t}. The 3-D motion in the SCEKF is parameterized as the position t_{nb_t} of the tth body frame, the rotation R_{nb_t} from the tth body frame to the navigation frame, and the velocity of the tth body frame in the navigation frame. The raw IMU measurements at the tth moment are denoted a_t^{B_t} and ω_t^{B_t}; a_t^N and ω_t^N denote the same quantities represented in the navigation frame.

B. Lightweight Learned Inertial Odometry Network
In this section, we introduce the LLIO-Net. Fig. 3 shows the framework of LLIO-Net.
1) Network Architecture: The LLIO-Net uses a residual MLP (ResMLP) architecture [17] as a feature extractor to predict the displacement and the corresponding covariance. Compared with a traditional MLP, the ResMLP uses fewer parameters while still establishing interactions between any two positions in the feature matrix. Compared with the ResNet, the ResMLP achieves long-range interaction more easily and with lower inductive bias. The proposed LLIO-Net consists of three modules: the feature conversion module rearranges the raw input into a feature matrix; the ResMLP module extracts high-level features from the input feature embeddings; and the regression module regresses the displacement and the corresponding covariance.
The feature conversion module rearranges the IMU measurements in the navigation frame. Using the IMU measurements between moments t − L and t, the input is a 6 × L matrix. The input is split into N_patch patches, where each patch contains L_feature measurements (L = N_patch × L_feature). Then, each patch is flattened, and all features are combined; subsequently, we obtain N_patch (6 × L_feature)-dimensional embeddings. The resulting set of N_patch embeddings is fed to a sequence of ResMLP blocks.
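The feature conversion step can be sketched as follows (a minimal NumPy sketch; the function name and the choice N_patch = 4, L_feature = 25 follow the hyperparameters reported later, but the exact ordering of features inside each flattened patch is our assumption):

```python
import numpy as np

def to_patch_embeddings(imu_window, n_patch=4):
    """Rearrange a 6 x L window of gravity-aligned IMU samples into N_patch
    embeddings of dimension 6 * L_feature, where L = N_patch * L_feature."""
    channels, total_len = imu_window.shape
    assert channels == 6 and total_len % n_patch == 0
    l_feature = total_len // n_patch
    patches = imu_window.reshape(channels, n_patch, l_feature)  # 6 x N x Lf
    patches = patches.transpose(1, 0, 2)                        # N x 6 x Lf
    return patches.reshape(n_patch, channels * l_feature)       # N x (6 * Lf)

# Example: a 1 s window at 100 Hz becomes 4 patches of dimension 150.
window = np.random.randn(6, 100)
emb = to_patch_embeddings(window, n_patch=4)
```

Each embedding row then corresponds to one 0.25 s slice of the window, which is what the linear layer before the ResMLP blocks consumes.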
The ResMLP module consists of a sequence of ResMLP blocks that all have the same structure. Before the ResMLP blocks, a linear layer converts the N_patch (6 × L_feature)-dimensional embeddings into N_patch L_inner_feature-dimensional embeddings, where L_inner_feature is the feature dimension in the ResMLP block. Each ResMLP block is a combination of affine layers (AFF), linear layers, and Gaussian error linear unit (GELU) layers.
The affine layer performs a modified layer normalization: it simply rescales and shifts its input componentwise [17]. More specifically, the affine layer is defined as follows:

Aff_{α,β}(x) = diag(α) x + β

where α and β are learnable vectors. Note that Aff(·) applied to a matrix acts independently on each column of the matrix. Overall, the ResMLP block is a combination of a cross-patch interaction block and a cross-channel interaction block. The cross-patch interaction block is defined as follows:

Z = M + Aff((A Aff(M)^T)^T)

The cross-channel interaction block is defined as follows:

Y = Z + Aff(C GELU(B Aff(Z)))

where M is the input embedding matrix and A, B, and C are the main learnable parameters. The dimensions of the parameter matrix A are N_patch × N_patch, and the dimensions of B and C are (E × L_inner_feature) × L_inner_feature and L_inner_feature × (E × L_inner_feature), respectively. Thus, the dimensions of Z and Y are the same as those of M. Here, E is the expansion dimension. Note that, in contrast to the original ResMLP, we add a dropout layer after the GELU; this layer is not shown in Fig. 3.
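A single ResMLP block, including the extra dropout after the GELU, can be sketched in PyTorch as follows (a hedged sketch based on the ResMLP formulation [17]; class and parameter names are ours, not from the released code):

```python
import torch
import torch.nn as nn

class Affine(nn.Module):
    """Aff(x) = alpha * x + beta: a learned per-channel rescale and shift."""
    def __init__(self, dim):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        return self.alpha * x + self.beta

class ResMLPBlock(nn.Module):
    """One cross-patch + cross-channel block, with dropout after the GELU."""
    def __init__(self, n_patch, dim, expansion=4, p_drop=0.2):
        super().__init__()
        self.aff1, self.aff2 = Affine(dim), Affine(dim)
        self.aff3, self.aff4 = Affine(dim), Affine(dim)
        self.cross_patch = nn.Linear(n_patch, n_patch)   # matrix A
        self.fc1 = nn.Linear(dim, expansion * dim)       # matrix B
        self.fc2 = nn.Linear(expansion * dim, dim)       # matrix C
        self.drop = nn.Dropout(p_drop)

    def forward(self, x):  # x: (batch, n_patch, dim)
        # Cross-patch interaction: mix information across patches.
        z = x + self.aff2(self.cross_patch(
            self.aff1(x).transpose(1, 2)).transpose(1, 2))
        # Cross-channel interaction: per-patch MLP with GELU + dropout.
        y = z + self.aff4(self.fc2(self.drop(
            nn.functional.gelu(self.fc1(self.aff3(z))))))
        return y

block = ResMLPBlock(n_patch=4, dim=512)
out = block(torch.randn(2, 4, 512))
```

Both residual branches preserve the (N_patch × L_inner_feature) shape, so blocks can be stacked freely.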
The regression block uses the features extracted by the ResMLP module to estimate two 3-D vectors: 1) the displacement d̂_t^N and 2) the diagonal of its covariance matrix Σ̂_t^N.
We assume that the uncertainty of d̂_t^N on each axis is independent to simplify the problem; thus, the covariance matrix is diagonal. The network structure, shown in Fig. 3, consists of an average pooling layer, a linear layer, and a GELU layer. 2) Training Methodology: The LLIO-Net is trained with two loss functions: 1) the mean square error (MSE) loss for d̂_t^N and 2) the negative log-likelihood (NLL) loss for d̂_t^N and Σ̂_t^N together. The MSE loss is defined as follows:

L_MSE = ||d − d̂||²

where d is the ground-truth displacement and d̂ is the displacement estimated by the network. By minimizing L_MSE, the network learns to estimate the 3-D displacement. The NLL loss function is defined as follows:

L_NLL = (1/2) (log det Σ̂ + ||d − d̂||²_Σ̂)

where Σ̂ is the covariance matrix corresponding to d̂. Additionally, ||d − d̂||²_Σ̂ is defined as follows:

||d − d̂||²_Σ̂ = (d − d̂)^T Σ̂^{-1} (d − d̂)

By minimizing L_NLL, the network learns to estimate the covariance corresponding to the 3-D displacement.
During the training stage, we adopt the same training strategy as proposed in TLIO [14]: L_MSE is used first until the network converges; then, we switch to L_NLL only and train Σ̂ and d̂ together until the network converges again.
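The two training losses can be sketched as follows (a minimal PyTorch sketch; predicting log-variances to parameterize the diagonal covariance is our assumption, chosen for numerical stability):

```python
import torch

def mse_loss(d_hat, d):
    """Stage 1: mean square error on the 3-D displacement."""
    return ((d - d_hat) ** 2).sum(dim=-1).mean()

def nll_loss(d_hat, log_var, d):
    """Stage 2: Gaussian NLL with a diagonal covariance; log_var holds the
    per-axis log-variances (so exp(-log_var) is the inverse variance)."""
    return 0.5 * (log_var + (d - d_hat) ** 2 * torch.exp(-log_var)).sum(dim=-1).mean()

# Sanity check: an exact prediction with unit variances gives zero loss.
d = torch.zeros(8, 3)
stage1 = mse_loss(d, d)
stage2 = nll_loss(d, torch.zeros(8, 3), d)
```

In the two-stage schedule, only `mse_loss` drives the optimizer at first; once converged, training switches to `nll_loss` so the covariance head learns to calibrate its uncertainty.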

1) System State Definition:
The full system state at time t is defined as follows:

X_t = (s_t, η_1, ..., η_m)

where the η_i are past (cloned) system states and s_t is the current system state. More specifically,

s_t = (R_{nb_t}, v_{nb_t}, t_{nb_t}, b_a, b_g)

We express R_{nb_t} as the rotation from F_{B_t} to F_N, and t_{nb_t} and v_{nb_t} are the position and velocity of F_{B_t} in F_N, respectively. b_a and b_g are the IMU accelerometer and gyroscope biases. The IMU noise model is defined as follows:

a_t^{b_t} = â_t^{b_t} + b_a + n_a
ω_t^{b_t} = ω̂_t^{b_t} + b_g + n_g

where a_t^{b_t} and ω_t^{b_t} are the measured values of acceleration and angular rate, â_t^{b_t} and ω̂_t^{b_t} are the true values, and n_a and n_g are random noise variables following a zero-centered Gaussian distribution. Moreover, the evolution of b_a and b_g is modeled as a discrete random walk process.
The error-state-based indirect Kalman filter is utilized in the proposed system. The error state, which is what the SCEKF actually estimates, is the difference between the estimated and true values. It is defined as follows:

δX_t = (δs_t, δη_1, ..., δη_m)        (12)
δs_t = (δt_{nb_t}, φ_{nb_t}, δv_{nb_t}, δb_a, δb_g)        (13)

Hence, the dimension of the system is 15 + 6m, where m is the number of cloned system states and 15 is the dimension of δs_t.
Since rotations cannot be added directly, the rotation error φ_{nb_t} is defined as follows:

R_{nb_t} = R̂_{nb_t} exp_SO3(φ_{nb_t})

where R̂_{nb_t} and R_{nb_t} represent the estimated and true values of the rotation, respectively, and exp_SO3(·) denotes the SO(3) exponential map.
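The SO(3) exponential map can be implemented with Rodrigues' formula; the following NumPy sketch (function name ours) includes the small-angle guard needed in practice:

```python
import numpy as np

def exp_so3(phi):
    """Map a 3-vector rotation error to a rotation matrix (Rodrigues' formula)."""
    theta = np.linalg.norm(phi)
    K = np.array([[0.0, -phi[2], phi[1]],
                  [phi[2], 0.0, -phi[0]],
                  [-phi[1], phi[0], 0.0]])          # skew-symmetric matrix of phi
    if theta < 1e-8:                                 # first-order small-angle case
        return np.eye(3) + K
    return (np.eye(3)
            + np.sin(theta) / theta * K
            + (1.0 - np.cos(theta)) / theta ** 2 * K @ K)

# Example: a quarter turn about the z-axis.
R90 = exp_so3(np.array([0.0, 0.0, np.pi / 2]))
```

The small-angle branch avoids dividing by a near-zero angle, which matters because filter error states are typically tiny.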
2) State Propagation: The filter propagates the system state using the raw IMU measurements through IMU mechanization. Because the proposed system targets pedestrian motion estimation, the trajectory length is limited; thus, we ignore the Earth's curvature and assume the gravity g^N is identical everywhere in the navigation frame. The simplified strapdown IMU mechanization is defined as follows:

R_{nb_t} = R_{nb_{t−1}} exp_SO3((ω_t^{b_t} − b_g) Δt)
v_{nb_t} = v_{nb_{t−1}} + (R_{nb_{t−1}} (a_t^{b_t} − b_a) + g^N) Δt
t_{nb_t} = t_{nb_{t−1}} + v_{nb_{t−1}} Δt + (1/2) (R_{nb_{t−1}} (a_t^{b_t} − b_a) + g^N) Δt²

The cloned states need not be updated in the propagation stage. The error-state covariance propagation can be written as follows:

P_t = blkdiag(Φ_{s_t}, I_{6m}) P_{t−1} blkdiag(Φ_{s_t}, I_{6m})^T + blkdiag(G_{s_t} Q G_{s_t}^T, 0)

where Φ_{s_t} and G_{s_t} are the state propagation matrix linearized at the previous estimate ŝ_{t−1} and the noise input matrix (covering sensor noise and bias random-walk noise), respectively, and I_{6m} is a 6m-dimensional identity matrix.
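One propagation step can be sketched as follows (a minimal NumPy/SciPy sketch under the paper's assumptions of constant gravity and no Earth curvature; function and variable names are ours):

```python
import numpy as np
from scipy.spatial.transform import Rotation

GRAVITY_N = np.array([0.0, 0.0, -9.81])  # gravity in the navigation frame (assumed constant)

def propagate(R, v, p, acc_b, gyr_b, b_a, b_g, dt):
    """One step of the simplified strapdown mechanization.
    R, v, p: attitude matrix, velocity, and position in the navigation frame;
    acc_b, gyr_b: body-frame specific force and angular rate; b_a, b_g: biases."""
    a_n = R @ (acc_b - b_a) + GRAVITY_N                 # nav-frame acceleration
    p_new = p + v * dt + 0.5 * a_n * dt ** 2            # position update
    v_new = v + a_n * dt                                # velocity update
    R_new = R @ Rotation.from_rotvec((gyr_b - b_g) * dt).as_matrix()  # attitude
    return R_new, v_new, p_new

# Example: free fall for 1 s from rest (zero specific force, zero rotation rate).
R1, v1, p1 = propagate(np.eye(3), np.zeros(3), np.zeros(3),
                       np.zeros(3), np.zeros(3), np.zeros(3), np.zeros(3), 1.0)
```

Real implementations integrate at the IMU rate (100 Hz here), so dt is 0.01 s and the first-order attitude update is adequate.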
3) State Augmentation: The measurement update of the SCEKF uses the relative position between the current system state and a previous system state. Thus, previous system states must be maintained in the SCEKF through stochastic cloning. A cloned system state is a direct copy of the current system state (only t_{nb_t} and R_{nb_t}). The covariance propagation of the stochastic cloning step in the proposed system is defined as follows:

P ← [[P, P J^T], [J P, J P J^T]]

where J is the Jacobian that selects the cloned components (t_{nb_t} and R_{nb_t}) of the current state. 4) Measurement Update: LLIO-Net provides a pseudo-measurement using the acceleration and angular rate in a gravity-aligned coordinate frame; thus, the output of LLIO-Net is represented in these gravity-aligned frames. As described in Section II, each IMU measurement is used ten times in LLIO-Net. Converting the measurements at different moments to the same coordinate frame avoids redundant coordinate conversions; thus, all IMU measurements are converted to the navigation frame, and the output of LLIO-Net is represented in the navigation frame. However, a displacement expressed in the navigation frame imposes a constraint on the absolute heading. More specifically, it would make the absolute heading observable in the SCEKF, although the absolute heading is unobservable in theory. To mitigate this problem, the 3-D displacement is converted to the local gravity-aligned frame F_{L_t}, which is anchored to the tth body frame F_{B_t}.
The displacement and its covariance output by the LLIO-Net are represented in F_N and denoted d̂_t^N and Σ̂_t^N, respectively. The information used in the measurement update is defined as follows:

d̂_t^{L_t} = (R̂_t^{yaw})^T d̂_t^N,    Σ̂_t^{L_t} = (R̂_t^{yaw})^T Σ̂_t^N R̂_t^{yaw}

where R̂_t^{yaw} is the heading rotation matrix of R̂_{nb_t}. More specifically, R̂_{nb_t} can be decomposed into three rotation matrices (R̂_{nb_t} = R̂_t^{yaw} R̂_t^{pitch} R̂_t^{roll}), where R̂_t^{yaw}, R̂_t^{pitch}, and R̂_t^{roll} denote yaw, pitch, and roll, respectively. Hence, the measurement function can be written as follows:

h(X_t) = (R_t^{yaw})^T (t_{nb_t} − t_{nb_j}) + n_d^{L_t}

where t_{nb_j} represents the position at the jth moment; in this article, the jth moment is 1 s before the tth moment. n_d^{L_t} follows the normal distribution N(0, Σ̂_t^{L_t}).
In practice, some abnormal pseudo-observations are contained in the LLIO-Net outputs. To mitigate their effect, a χ²-test is employed. In detail, an observation is accepted only if it satisfies the following condition:

r^T (H P_t H^T + Σ̂_t^{L_t})^{-1} r < α

where r = d̂_t^{L_t} − h(X̂_t) is the innovation and H is the Jacobian matrix of h(X_t). α is the threshold of the χ²-test; we choose α = 11.345, corresponding to the 99% quantile of the χ² distribution with 3 degrees of freedom. Furthermore, to prevent continuous rejection of pseudo-observations from causing the SCEKF to diverge, we directly accept the LLIO-Net output if the previous three outputs were rejected.
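The gating logic, including the forced acceptance after three consecutive rejections, can be sketched as follows (a hedged NumPy sketch; the function signature and the handling of the rejection counter are our assumptions):

```python
import numpy as np

CHI2_3DOF_99 = 11.345   # 99% quantile of the chi-square distribution, 3 DoF

def chi2_gate(residual, H, P, R_meas, n_rejected, max_rejected=3):
    """Mahalanobis gating of a displacement pseudo-measurement.
    Force-accepts after `max_rejected` consecutive rejections, as in the text."""
    S = H @ P @ H.T + R_meas                       # innovation covariance
    m2 = float(residual @ np.linalg.solve(S, residual))
    accept = bool(m2 < CHI2_3DOF_99) or n_rejected >= max_rejected
    return accept, m2

# Example: a zero residual always passes the gate.
ok0, m0 = chi2_gate(np.zeros(3), np.eye(3), np.eye(3), np.eye(3), n_rejected=0)
```

Using `np.linalg.solve` instead of an explicit inverse keeps the 3 × 3 test numerically stable and cheap enough to run at the 10 Hz update rate.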

IV. EXPERIMENTS
In this section, we compare our proposed LLIO-Net with TLIO when used with a head-mounted AR device. We refer to the ResNet architecture in TLIO as ResNet for the remainder of this article. Furthermore, we compare three versions of LLIO-Net, denoted as ResMLP512, ResMLP256, and ResMLP128. All metrics are computed on the test set, which is never seen in the training stage. Note that all methods use the same setup, except for the hyperparameters that define and train the networks.
The remainder of this section is organized as follows. Section IV-A describes the test implementation details. Section IV-B describes the metrics used to evaluate the accuracy of the estimated trajectories. Section IV-C compares the accuracies of all methods. Section IV-D analyzes the inference efficiencies of the proposed methods. Section IV-E studies the impact of the different components of the proposed LLIO-Net.

1) Data Preparation:
The data set is collected using an Asus Tango phone, which is widely used in the domain of data-driven inertial odometry. The Asus Tango phone estimates 3-D motion through a fisheye global-shutter camera and an embedded IMU module based on visual-inertial odometry. In our experiments, we disable area learning to obtain a smooth trajectory. The trajectory output by the visual-inertial odometry function serves as the ground truth in both the training and evaluation stages. The full data set contains over 40 h of head-mounted pedestrian data, including various activities (walking, standing still, sitting down and standing up, and going down and up stairs). The data sets are captured by six people with multiple physical devices; thus, the data set contains various individual motion patterns and IMU systematic errors. In AR applications, such ground truth is easy to collect with the AR device itself. We follow the data split method of TLIO and randomly split the data set into 80% training, 10% validation, and 10% testing subsets.
2) Data Augmentation Strategy: As described before, we use L IMU samples as input for the network. We select L = 100 because we collect IMU data at 100 Hz; thus, we send 1 s of IMU samples to the network at each instance. In the training stage, we first convert all IMU measurements from the body frame to the navigation frame based on the ground-truth rotation matrix. Moreover, a data augmentation strategy is adopted to improve the generalization of the network: the gravity direction is randomly perturbed to make the network robust to gravity orientation perturbations.
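The gravity-orientation perturbation can be sketched as follows (a hedged NumPy sketch; the 5° tilt bound and the horizontal-axis sampling are our assumptions, as the paper does not state the perturbation magnitude):

```python
import numpy as np

def perturb_gravity(imu_window, max_angle_rad=np.deg2rad(5.0)):
    """Apply a small random tilt to a 6 x L gravity-aligned IMU window so the
    network becomes robust to errors in the estimated gravity direction."""
    heading = np.random.uniform(0.0, 2.0 * np.pi)
    axis = np.array([np.cos(heading), np.sin(heading), 0.0])  # horizontal axis
    angle = np.random.uniform(-max_angle_rad, max_angle_rad)
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    R = np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)  # Rodrigues
    acc, gyr = imu_window[:3], imu_window[3:]
    return np.vstack([R @ acc, R @ gyr])

win = np.random.randn(6, 100)
aug = perturb_gravity(win)
```

Since the perturbation is a pure rotation, it changes the apparent gravity direction without altering the magnitude of any sample, which keeps the augmented data physically plausible.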

3) SCEKF Setting:
In this study, the measurement update runs at 10 Hz, using 1 s of IMU samples to estimate each displacement, as described before. Thus, the SCEKF maintains nine cloned states in the filter, and it removes the oldest cloned state immediately after each measurement update. In the evaluation stage, the SCEKF needs to be initialized. We initialize the SCEKF with only the rotation matrix at the first moment to simulate the scenario in which the IMU system works on its own; this rotation could easily be provided by an AHRS. The IMU biases are initialized to zero and estimated online.

4) Training Details:
We implemented all models using the PyTorch 1.8 framework [20]. The ResNet uses the hyperparameters provided in TLIO [14], which also exhibit the best performance on our data set. The proposed ResMLP series uses six ResMLP blocks with a 0.2 dropout probability. The patch length L_feature is 25, and the total length L is 100 for 1 s of IMU measurements collected at 100 Hz. The inner feature dimension L_inner_feature of the ResMLP is 512, 256, and 128 for ResMLP512, ResMLP256, and ResMLP128, respectively. All models are trained with the ADAM optimizer [21]. The ResNet uses a learning rate of 1e-4, and the ResMLP series uses a learning rate of 5e-4. We trained each model configuration on an NVIDIA RTX 3090.

B. Evaluation Metric
To evaluate the positioning performance, we define the following metrics similar to TLIO.
1) ATE (m): sqrt((1/n) Σ_t ||t_{nb_t} − t̂_{nb_t}||²), where t_{nb_t} and t̂_{nb_t} are the ground-truth and estimated positions at the tth moment, respectively. The absolute translation error (ATE) is computed as the root-MSE (RMSE) between the estimated trajectory and the ground-truth trajectory.
2) RTE-t (m): the relative translation error (RTE) over windows of a fixed duration, computed from the displacement error represented in the local gravity-aligned frame at the start of each window, where R^yaw and R̂^yaw are the ground-truth and estimated heading rotation matrices used for this conversion. Hence, the RTE is not affected by the global yaw drift. In our implementation, the window duration is 1 min.
3) RTE-L (m): computed in the same manner as RTE-t, but over windows of a fixed trajectory length rather than a fixed duration.
4) AYE (deg): sqrt((1/n) Σ_t (γ_t − γ̂_t)²), where γ_t and γ̂_t are the heading angles of the ground-truth and estimated trajectories, respectively. The absolute yaw error (AYE) is the RMSE of the absolute heading drift. In practice, we calculate the rotation difference through the three-axis rotation matrix and decompose its yaw component to compute the AYE.
5) RYE-t (deg): the relative yaw error (RYE), calculated in the same manner as the AYE but over relative windows of 1 min.
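The absolute metrics can be sketched as follows (a minimal NumPy sketch; wrapping the yaw difference to (−π, π] before the RMSE is our implementation choice, reflecting the yaw decomposition described for the AYE):

```python
import numpy as np

def ate(p_gt, p_est):
    """Absolute translation error: RMSE of per-epoch position errors (meters).
    p_gt, p_est: (n, 3) arrays of ground-truth and estimated positions."""
    err = np.linalg.norm(p_gt - p_est, axis=1)
    return float(np.sqrt(np.mean(err ** 2)))

def aye(yaw_gt, yaw_est):
    """Absolute yaw error: RMSE of the wrapped heading difference (radians)."""
    d = np.arctan2(np.sin(yaw_gt - yaw_est), np.cos(yaw_gt - yaw_est))
    return float(np.sqrt(np.mean(d ** 2)))
```

The relative metrics (RTE, RYE) apply the same formulas to windowed displacement and heading differences rather than to absolute poses.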

C. System Performance
We describe a systematic comparison of the ResNet and ResMLP series in this section. Specifically, we compared the inference accuracy of the networks and the positioning accuracy of the entire system using the different networks. Table I gives an overview of the performance comparison. The distance error in Table I indicates the average error of the inference results, defined as (1/n) Σ_t ||d_t − d̂_t||₂, which is slightly different from the RMSE; this metric directly evaluates the neural network performance. Additionally, Fig. 4 depicts the cumulative distribution function (CDF) of the inference RMSE. The ResMLP series shows similar inference performance in terms of both the average distance error and the CDF of the inference RMSE. Specifically, ResMLP512 and ResMLP256 achieve slightly better inference accuracy than ResNet, and ResMLP128 is slightly worse than ResNet.
Table I also provides the ATE, RTE-t, RTE-L, AYE, and RYE-t of the trajectories using the different networks. Note that all the relative metrics are calculated with a sliding window whose step length is one-tenth of the window length (e.g., 6 s for RTE-t). Although ResMLP512 achieved slightly better RTE-t, RTE-L, and RYE-t than ResNet, ResNet showed better ATE and AYE. Since the trajectories are collected over 15 min, the AYE and ATE are easily affected by random perturbations. Fig. 5 shows the CDFs of RTE-t, RTE-L, and RYE-t. All methods exhibit the same level of accuracy except ResMLP128, which shows slightly worse performance than ResNet from the perspective of these stochastic metrics.
Meanwhile, we select a group of trajectories to compare the positioning performance. Fig. 6 shows a selection of trajectories with different contours to illustrate the performances. All methods work well when the pedestrian walks straight and degrade when the pedestrian stands still or walks around a small area for a long time. Fig. 7 depicts a 3-D trajectory; all methods correctly estimate the trajectory when a person goes up and down stairs.
This section fully compares the performances of the proposed networks and ResNet with respect to inference accuracy, stochastic absolute and relative metrics, and trajectory illustrations in 2-D and 3-D. In summary, ResMLP512 and ResMLP256 show performance similar to that of ResNet. ResMLP128 is slightly worse than ResNet in all metrics but still delivers the same level of positioning performance. As reported in TLIO [14], its performance has significant advantages over other algorithms; since the proposed method achieves accuracy similar to that of TLIO, we can consider that it also has an advantage in positioning accuracy over other data-driven inertial odometry algorithms.
Furthermore, because the ground truth is generated by visual-inertial odometry without loop closure, the ground-truth trajectory itself exhibits some cumulative positioning error, as shown in Fig. 6. The following section therefore compares the network performance based on inference accuracy and relative positioning accuracy to avoid the influence of this cumulative error.

D. Inference Efficiency on Mobile Devices
In the proposed system, the main computation cost comes from two modules: the SCEKF (including propagation, state cloning, and measurement update) and the network. We implemented a C++ version of the SCEKF to test the computational efficiency. The C++ SCEKF can process data 190× faster than real time (2.9 s of processing time for a 561 s data set). Meanwhile, the inference speed of the networks is significantly lower than that of the SCEKF; specifically, ResNet inference alone costs 33 s on the same data set. Thus, the efficiency bottleneck of this 3-D inertial odometry is the network. With the aim of achieving implementation on mobile devices, the efficiency of the network is systematically analyzed in this section.
At the same time, we compared the proposed method with the work in [15], which aims to achieve a lightweight version of IONet by replacing the LSTM architecture in IONet with a WaveNet-based architecture, reaching a significant performance improvement. However, the inputs and outputs of IONet are different from those of the method discussed in this article. Thus, for a fair comparison, we made some changes to those network structures so that they perform the same function: in detail, we add two two-layer MLPs to output the displacement and the corresponding covariance based on the output of the LSTM and WaveNet. The names of the compared methods reflect their configurations. For example, LSTM-2 is a model with two layers of bidirectional long short-term memory (Bi-LSTM) with 128 hidden states, and WaveNet-32 is a WaveNet-based model with eight layers and 32 channels, as described in [15].
To illustrate the computational efficiency, we compared the inference times on multiple devices. The inference time ratio relative to ResNet for each setup is also provided. The FLOPs of ResMLP256 and ResMLP128 are significantly lower than those of ResNet, and both show efficiency improvements whether the JIT model or the Mobile model is used. ResMLP512 has more FLOPs than ResNet but exhibits better inference efficiency when testing the Mobile model; this may benefit from the optimization strategy in the mobile optimizer of PyTorch. In detail, the ResMLP series shows a larger efficiency improvement with the Mobile model: the inference time of ResMLP256 is 4.7-7.2 times faster than that of ResNet, and ResMLP128 is 9.2-12 times faster but with slightly worse accuracy.
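A rough way to reproduce this kind of timing comparison on a desktop CPU is sketched below (our own harness, not the paper's benchmark; single-thread CPU timing only approximates on-device behavior, and the toy model is illustrative):

```python
import time
import torch
import torch.nn as nn

def benchmark_ms(model, input_shape=(1, 6, 100), warmup=10, iters=50):
    """Average single-thread CPU time of one forward pass, in milliseconds."""
    torch.set_num_threads(1)          # mimic a single mobile core
    model.eval()
    x = torch.randn(*input_shape)
    with torch.no_grad():
        for _ in range(warmup):       # warm up caches and lazy allocations
            model(x)
        t0 = time.perf_counter()
        for _ in range(iters):
            model(x)
    return (time.perf_counter() - t0) / iters * 1e3

# Example: time a toy regressor mapping a 1 s IMU window to a 3-D displacement.
toy = nn.Sequential(nn.Flatten(), nn.Linear(600, 3))
ms = benchmark_ms(toy)
```

For on-device numbers, the same model would first be traced with TorchScript and passed through PyTorch's mobile optimizer, as described above.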
Compared with the other algorithms, LSTM has obvious disadvantages in inference efficiency. WaveNet-64 can achieve inference accuracy similar to that of ResNet, ResMLP512, and ResMLP256; however, its inference efficiency has a significant disadvantage compared with the ResMLP series proposed in this article. On the other hand, WaveNet-32 shows inference accuracy similar to LSTM-1, consistent with the experimental results in [15], but worse than ResNet. Fig. 8 shows the inference performance and accuracy of ResNet, WaveNet, and ResMLP. Since the inference efficiency of LSTM is significantly worse than that of the other algorithms, it is not shown in this figure.
It is worth noting that we monitored the CPU usage during the test. In the inference process of all models, the CPU runs at full capacity. Therefore, the length of inference time can reflect the computational load required by the model.
To illustrate the relation between accuracy and efficiency of the whole method, Fig. 9 shows the relationship between the inference time and RTE-t of ResMLP and ResNet models on a Huawei Mate 30.
In summary, the ResMLP series shows a higher accuracy-efficiency ratio than the ResNet, LSTM, and WaveNet. Although all the models can run in real time on current mobile devices, the computational cost is still a key metric for evaluating the suitability of executing the algorithm on mobile devices. The positioning algorithm usually functions as a fundamental component of other applications and runs during the entire workflow. Thus, the improvement in the efficiency of ResMLP is vital in this scenario.

E. Ablation Study
This section analyzes the effect of several parameters of the proposed LLIO-Net. As illustrated in Table III, all ablation experiments are conducted based on ResMLP512, presented in Section IV-A. Each parameter that differs from ResMLP512 is marked in bold. The meaning of each parameter can be found in Section III-B. Furthermore, Table III provides the distance error and FLOPs of each experiment to compare accuracy and efficiency simultaneously.
The feature dimension (ResMLP512, ResMLP256, and ResMLP128) and the layer number (ResMLP512, C, and D) contribute significantly to both model performance and FLOPs. The expansion dimension (ResMLP512, A, and B) and the patch size (ResMLP512, E, and F) influence the FLOPs but do not significantly affect the prediction accuracy compared to the feature dimension and the layer number; nevertheless, appropriate values are necessary to achieve high efficiency. Therefore, the feature dimension and the layer number should be considered first in order to obtain a good tradeoff between accuracy and efficiency.

V. CONCLUSION
In this article, we proposed LLIO, a lightweight learned inertial odometry, which introduces LLIO-Net to replace the ResNet-based architecture. The LLIO-Net module estimates the 3-D displacement and the corresponding covariance, relying essentially on learned human motion patterns, to mitigate the accumulation error of the INS mechanization. The experiments proved that the proposed LLIO-Net achieves the same level of accuracy as TLIO while significantly improving computational efficiency (2× to 12× faster). The inference efficiency test on mobile devices shows that the proposed inertial odometry can be implemented on mobile devices and functions as a low-drift 3-D pedestrian motion estimator. Because of its low computational load and low drift, LLIO can be adopted as a backup for visual-inertial odometry in AR applications. Alternatively, it can function as an independent dead-reckoning module for fusing other sources of information.
Further work would focus on the generalization of the proposed model. For example, the performance for estimating the 3-D trajectories of pedestrians without or with a small scale of labeled IMU sequences could be improved.