On Performance Evaluation of Inertial Navigation Systems: The Case of Stochastic Calibration

In this work, we address the problem of rigorously evaluating the performances of an inertial navigation system (INS) during its design phase in presence of multiple alternative choices. We introduce a framework based on Monte Carlo simulations in which a standard extended Kalman filter is coupled with realistic and user-configurable noise generation mechanisms to recover a reference trajectory from noisy measurements. The evaluation of several statistical metrics of the solution, aggregated over hundreds of simulated realizations, provides reasonable estimates of the expected performances of the system in real-world conditions. This framework allows the user to make a choice between alternative setups. To show the generality of our approach, we consider an example application to the problem of stochastic calibration. Two competing stochastic modeling techniques, namely, the widely popular Allan variance linear regression and the emerging generalized method of wavelet moments, are rigorously compared in terms of the framework’s defined metrics and in multiple scenarios. We find that the latter provides substantial advantages for certain classes of inertial sensors. Our framework allows considering a wide range of problems related to the quantification of navigation system performances, such as the robustness of integrated navigation systems [such as INS/global navigation satellite system (GNSS)] with respect to outliers or other modeling imperfections. While real-world experiments are essential to assess to performance of new methods, they tend to be costly and are typically unable to lead to a sufficient number of replicates to provide suitable estimates of, for example, the correctness of the estimated uncertainty. Therefore, our method can contribute to bridging the gap between these experiments and pure statistical consideration as usually found in the stochastic calibration literature.


On Performance Evaluation of Inertial Navigation
Systems: The Case of Stochastic Calibration Davide A. Cucci , Lionel Voirol , Mehran Khaghani , and Stéphane Guerrier Abstract-In this work, we address the problem of rigorously evaluating the performances of an inertial navigation system (INS) during its design phase in presence of multiple alternative choices. We introduce a framework based on Monte Carlo simulations in which a standard extended Kalman filter is coupled with realistic and user-configurable noise generation mechanisms to recover a reference trajectory from noisy measurements. The evaluation of several statistical metrics of the solution, aggregated over hundreds of simulated realizations, provides reasonable estimates of the expected performances of the system in realworld conditions. This framework allows the user to make a choice between alternative setups. To show the generality of our approach, we consider an example application to the problem of stochastic calibration. Two competing stochastic modeling techniques, namely, the widely popular Allan variance linear regression and the emerging generalized method of wavelet moments, are rigorously compared in terms of the framework's defined metrics and in multiple scenarios. We find that the latter provides substantial advantages for certain classes of inertial sensors. Our framework allows considering a wide range of problems related to the quantification of navigation system performances, such as the robustness of integrated navigation systems [such as INS/global navigation satellite system (GNSS)] with respect to outliers or other modeling imperfections. While real-world experiments are essential to assess to performance of new methods, they tend to be costly and are typically unable to lead to a sufficient number of replicates to provide suitable estimates of, for example, the correctness of the estimated uncertainty. Therefore, our method can contribute to bridging the gap between these experiments and pure statistical consideration as usually found in the stochastic calibration literature. well-defined coordinate reference system. The inertial navigation system (INS) is found at the core of many modern navigation systems: inertial sensors keep track of the acceleration and the angular velocity (rotation rate) of the body, which can be integrated over time to estimate its position, velocity, and attitude [1]. INS is autonomous (largely unaffected by the surrounding environment), widely available and affordable, high rate of output (up to a few kHz), and is accurate in a short time. However, it suffers from an accumulation of random errors in time, eventually leading to unreliable position and orientation estimates. This limitation is magnified with low-grade inertial sensors, such as the ones typically employed in smartphones or drones, where position estimates can become practically unusable in as low as few tens of seconds.
To control this inevitable position and orientation drift, inertial measurements are typically fused with aiding information coming from other sensors. In outdoor applications, such as in planes, cars, ships, and drones, the aiding data typically come from global navigation satellite systems (GNSSs). These systems provide absolute position, velocity, and timing (PVT) data at lower rates compared to INS (typically 1-20 Hz) but are immune to error accumulation. Therefore, GNSS is complementary to INS, and the INS-GNSS integration is widely used in practice. In indoor applications, or whenever the system cannot depend on the reliable and continuous reception of uncorrupted GNSS signals, such aiding information can come for example from cameras, in visual-inertial systems [2], or ultrawideband beacons [3]. The fusion of data from INS and other aiding systems, such as GNSS, is typically performed via variants of the Kalman filter, such as the commonly used extended Kalman filter (EKF) [4].
One of the most critical steps in the design of a navigation system is the quantification of its performance. When several types of heterogeneous sensors, such as inertial and GNSS, are fused together, many factors can influence the properties of the estimation error. Indeed, their combined impact then tends to be difficult to predict. Once the system is assembled, the experimental assessment is complicated by the difficulty of acquiring a sufficiently accurate ground truth. Such experiments can be even impossible because, for example, the safe and proper functioning of the platform's guidance and control system (such as the autopilot in drones) requires the correct functioning of the navigation system. Furthermore, the validity of performance indicators, such as the consistency of position and orientation uncertainty estimates, requires the execution of hundreds or thousands of trajectories, which is unfeasible in practice.
In this work, we propose a framework for quantifying navigation system performances that rely on Monte Carlo simulations. We consider a classical EKF for INS/GNSS navigation that the user can configure according to different competing setups, such as sensor selection or modeling choices. Our framework allows performing thousands of Monte Carlo simulations using a user-specified trajectory that is representative of the chosen application scenario. In each run, we generate noise-free sensor readings from the user-supplied trajectory and corrupt those with realistic noise samples, either taken from static data (typically available and widely used in the case of inertial sensors) or sampled from a user-specified noise model. The results of all simulations are aggregated to compute several types of metrics and allow the user to compare the results quantitatively. Optionally, static data can be replaced with data collected in different dynamic conditions, e.g., using a rotation table, and noise samples can be chosen according to the dynamics of the user-specified trajectory. While this approach cannot entirely substitute the experimental evaluation of the final system, the use of realistic noise samples and user-defined noise generation mechanisms allows users to nearly reproduce real-world operation scenarios and substantially decrease the number of required experiments.
The idea of generating synthetic sensor measurements as a way to validate navigation systems has been already proposed in the literature. Jwo et al. [5] focused on generating synthetic specific force and angular velocity measurements starting from a user-defined trajectory for which the kinematic properties are known analytically. However, this work ignores stochastic or deterministic error generation. Similarly, in [6], a motion model of the platform to be studied (a ship in this case) is used to generate a realistic trajectory, also considering the effects of the environment on the motion, such as wind and waves. Young et al. [7] provided an open-source simulator for inertial sensors that works by interpolating a discrete time trajectory obtained from a motion capture system. Other works, e.g., [8] and [9], also consider sampling random errors by relying on common stochastic models and their power spectral density functions. In our work, we aim at moving one step further in the quantification of performances of INSs by introducing a sensor fusion step based on an EKF that aims at recovering the true trajectory from noisy data. This step is repeated hundreds of times in a Monte Carlo fashion to determine reliable statistics of the navigation performances. Furthermore, we circumvent the problem of simulating realistic random errors, which is hard to validate in practice, since the true noise models for a given inertial sensor are unknown, by using blocks (containing a suitable number of data points) of real noise samples available from static acquisitions. This approach relies on the idea of block bootstrap, which is a standard statistical technique allowing to resample the dependent data [10].
To show the benefits of our approach, we consider an example application focusing on the problem of establishing a suitable stochastic model for inertial sensors. Indeed, inertial sensors, like any other sensor, have errors, both deterministic and stochastic. Deterministic errors, such as the stable parts of scale factors and axis nonorthogonality, can be precalibrated and removed from the measurements directly (for example, see [11] and the references therein). The additive stochastic part of the error, for example, composed of white noise (WN), turn-on biases, and other time-correlated processes, can only be taken into account "on-flight" during navigation, provided that a suitable stochastic model for the sensor errors has been established beforehand. The latter process is often referred to as "stochastic calibration," and it is an important step in navigation system design and implementation. Indeed, an accurate stochastic calibration allows, among others, to maximize the estimation accuracy by enabling the correct estimation and removal of the maximum possible portion of the stochastic errors from sensor measurements and identifying misbehaving sensors or measurements via fault detection and exclusion mechanisms.
Stochastic calibration of inertial sensors has been widely studied in the last decades, and several methods are available, ranging from the Allan variance (AV) linear regression method [12], [13], which we refer to as AVLR, to the more recent Generalized Method of Wavelet Moments (GMWM) [14], [15], not to mention the methods based on the analysis of the power spectral density of the sensor errors [16], [17], correlation of filtered sensor outputs [18], and maximum-likelihood estimation [19]. The AVLR and the GMWM are among the most popular choices by practitioners and researchers, and can be employed to model stochastic errors in both static and dynamic conditions [20], [21].
In the context of stochastic calibration, each technique allows to consider a class of possible stochastic models and selects the one within this class that appears to be the most suitable according to some technique-specific "goodness-offit" metric. This means that different techniques typically yield different stochastic models, and many models can be practically equivalent given a metric. It is, in general, difficult to relate these goodness-of-fit metrics used in stochastic modeling to the actual navigation performances that users will obtain in their applications. Furthermore, the specific values determined for stochastic parameters, such as, for instance, the variance of the rate random walk (RW) innovations in a gyroscope, are very distant from navigation performance metrics meaningful for the final user, such as orientation error. Following our proposed approach, it is possible to compare different stochastic modeling techniques according to the criteria that are highly significant from the user's perspective, such as the obtained position and orientation estimation error statistics or the consistency of the derived confidence intervals, as obtained by the navigation filter in simulated, yet realistic, scenarios.
As an illustrative example, in this work, we show that, in general, the models obtained with GMWM perform better in navigation than the ones obtained with the AVLR. This conclusion is supported by extensive simulation analyses based on real-world trajectories and further backed up by comparison against data from a real sensor. The differences are significant when a low-cost microelectromechanical system (MEMS) inertial measurement units (IMUs) are considered, while less pronounced if the stochastic nature of the error processes is the one typical of tactical or navigation grade sensors. The proposed framework also allows us to easily answer other subtle questions, such as evaluating the impact of mismodeling in stochastic calibration, for example, if the value of one parameter has been wrongly estimated, or if the stochastic model has been wrongly chosen, but also in relation to sensor selection, comparison, and performance evaluation in general. To the best of our knowledge, it is difficult to rigorously evaluate these aspects of the design of an INS without repeating the process of realistic noise generation and sensor fusion in a Monte Carlo fashion, as proposed in this work.
The rest of this article is organized as follows. In Section II, we summarize the main contributions of this article. A brief review on stochastic calibration is given in Section III, specifically in relation to the two methods considered in this work, namely, the AVLR and the GMWM, discussed in Sections III-A and III-B, respectively. In Section IV, we introduce the proposed framework to evaluate the performances of navigation systems, and we discuss its components in detail. Such a framework is applied in Sections V and VI to compare the impact of the stochastic calibration technique, AVLR or GMWM, on the performances of the navigation filter, both based on simulated sensor errors and real-world ones. Finally, Section VII concludes this article.

II. MAIN CONTRIBUTIONS
The main contributions of this work are summarized as follows.
1) We introduce a framework based on Monte Carlo simulations and realistic, user-configurable noise generation mechanisms that allow us to closely replicate real-world navigation conditions. This framework can be used to assess and quantify the performances of an INS in different conditions, such as nominal, in the presence of GNSS outages, when sensor measurements are contaminated with outliers, or if the stochastic model for any sensor has been misspecified. The performance metrics evaluated by the framework also include measures of the correctness of the uncertainty estimates for the navigation states, which is often difficult to verify in practice. 2) To illustrate the generality of the proposed approach, we consider an example application targeting an open research question: we study the impact of different statistical procedures to determine a suitable stochastic calibration for inertial sensors, namely, the AVLR and the GMWM. We propose to compare different techniques not in terms of the usual metrics used in system identification, such as information criteria or the likelihood, but in terms of the actual navigation performances that can be expected once the determined model is employed in an INS-based navigation system. Due to its generality and flexibility, our framework provides a suitable environment to evaluate the performance of new stochastic calibration methods compared to the existing ones.
3) A comprehensive simulation study suggests that the AVLR, which is widely used in practice, in certain cases may lead to stochastic models characterized by worse performances in navigation, e.g., in terms of position and orientation errors or correctness of the estimated uncertainty. In contrast, this appears not to be the case if the GMWM is employed, and our study sheds new light on what the conditions are under which the differences in navigation performances are substantial. 4) Our framework is implemented in an open-source R package, making it available widely and allowing any user to easily perform a wide range of assessments of their own navigation system and reproduce our results.

III. STOCHASTIC CALIBRATION OF INERTIAL SENSORS
The measurements of inertial sensors are intrinsically affected by random errors that originate from the internal physics of the device and whose nature depends on the actual technology employed [22]. For instance, in a typical gyroscope, the error terms include WN, correlated random noise, bias instability, and RW. The sources of these errors differ from one gyroscope to another. Most of these error sources are correlated in time. Stochastic calibration refers to the modeling and characterization of such error sources for a specific device prior to its usage [11]. Indeed, any navigation filter performing information fusion between inertial, GNSS, and possibly other sensors needs correct stochastic models of the measurement errors. The quality of those models impacts directly the accuracy of the navigation solution and the correctness of its estimated uncertainty.
In general, it is extremely difficult, if possible at all, to model the physical processes behind inertial measurement error, and anyways, such a model would be specific to a single device or family. Therefore, stochastic modeling methods are typically employed instead. Typically, a mixture of simple stochastic processes is selected such that it provides a suitable approximation of the stochastic properties of the measurement error. For example, one common choice is the sum of WN and RW processes. The selected model is then integrated into the navigation filter of choice (e.g., an EKF), after its characterizing parameters have been estimated. In the remainder of the section, we discuss two of the most common methods for the calibration of inertial sections, namely, the AV-and GMWM-based methods, whose performances are compared later in Sections V and VI.

A. Allan Variance Linear Regression Method
For inertial sensors, the most widespread modeling technique among practitioners and device manufacturers is the AVLR [12], conceived for the characterization of phase and frequency instability of precision oscillators, and suggested for the stochastic characterization of interferometric fiber optic gyroscopes in [17]. This is mainly a graphical technique based on the manual inspection of AV plots. The AV plot is derived from a sufficiently long error time series: several hours of sensor data are acquired, while the device is not moving, and thus, no signal is observed other than the measurement error and constant quantities, such as the Earth rotation rate and the gravity. A stochastic model for the measurement error, along with its characterizing parameters, can be determined based on the assumption that the different stochastic processes composing the total error appear in seemingly different regions of the AV plot as distinctive patterns, such as linear regions, as illustrated in Fig. 1. Practitioners typically model the error as a sum of simple stochastic processes, e.g., quantization noise (QN), WN, and RW, based on which patterns can be qualitatively identified in the AV plot. After the model has been selected, its parameters can be determined from those, for example, by means of linear regression. This procedure is commonly referred to as the AV linear regression method, and it is summarized in the IEEE standard [17]: "a first approximation [. . .] can be estimated by sketching in the asymptotes to the charted data analysis and computing approximate model coefficients." A comprehensive discussion of this approach is presented in [13].
The stochastic modeling procedures based on the AV suffer from several limitations. First, it relies on the assumption that only one underlying process is completely determining the shape of the AV at a given region of the plot. This perfect separation of the processes is actually not true as all underlying stochastic processes have an impact on the entire range in the AV plot. Consequently, the AV-based method leads to inconsistent estimated parameters characterized in practice by large biases even in the case of simple stochastic models (e.g., the sum of a WN and an RW [23]). The second limitation is that no pragmatic rule is available to the users to solve the model selection problem, i.e., decide which stochastic processes compose the total measurement error. Indeed, in realistic scenarios where multiple underlying stochastic processes are present, it is difficult to approximate the observed AV with simple stochastic processes having a linear representation in the standard AV log-log graph. More often, the empirical AV has a complex shape, and the user needs to resort to their intuition and experience to select a suitable model and the relevant scales on which to perform the linear regression. Finally, the AV technique is only able to estimate the parameters of models having a linear representation in the AV log-log plot. In order to obtain reasonable point estimates, it is also needed that the parameters of the underlying stochastic processes are such that each process dominates in a distinct region of the AV, which is often not the case in practice. The presence of correlated noise, e.g., first-order Gauss-Markov processes, which are not linear in the AV log-log plot, can render the use of the AVLR method even more challenging.

B. Moment Matching Techniques
Many methods have been proposed, which aims to minimize the distance between an empirical quantity (e.g., the AV) and its model-based counterpart (e.g., the AV implied by a given stochastic model), which can be expressed as a function of its parameters. An example of such methods developed in the context of inertial sensor stochastic calibration is the GMWM, as proposed in [14] and [24]. This method uses wavelet variance (WV) instead of the AV. This choice is due to the statistical properties of the WV, which have been studied further than the ones of the AV. However, these two quantities are very similar, and the interpretation of the AV presented in Section III-A also applies to the WV. Other momentmatching methods, such as the autonomous regression method for AV [25], have been proposed for the estimation of inertial sensors' stochastic models. However, Guerrier et al. [15] demonstrated that these methods can be seen as special cases of the standard GMWM approach.
As previously mentioned, the underlying idea of the GMWM approach is to match the empirical and modelbased WVs. Indeed, the theoretical WV for several stochastic processes, such as autoregressive moving average (ARMA), QN, and WN, is known analytically as a function of the model parameters due to the results of Zhang [26]. Moreover, the WV of the sum of several stochastic processes is simply the sum of all the processes of WVs. Thus, the model parameters say that θ ∈ can be obtained as follows: where θ is the vector of model parameters,ν is the empirical WV of the time series, ν(θ ) is the theoretical WV implied by the parameter vector θ , and is a weight vector that depends on the uncertainty ofν. The optimization problem in (1) is typically solved iteratively using gradient-based or Gauss-Newton methods. If the WV of a given model is linear in the model parameters, which is the case for arbitrary sums of QN, WN, RW, and drift, the optimization problem can be solved in the closed form [15]. Additional information on this method can be found, for example, in [14] and [24].

IV. ASSESSMENT FRAMEWORK FOR STOCHASTIC MODEL IMPACT ON NAVIGATION
As previously mentioned, stochastic calibration is required in order to characterize the inertial sensor measurement uncertainty so that the chosen sensor fusion algorithm can properly integrate them with other sensor measurements and estimate a reliable navigation solution. While many techniques exist for such purposes, such as those reviewed in Section III, it remains difficult to quantify the actual impact of a given choice of stochastic models on the navigation performances of the system in a given application scenario. For instance, the stochastic calibration procedure may yield an accurate model of the long-term correlations of the measurement error, whereas a much simpler model could perform similarly for the user application where, for instance, a drone is flown only for a few minutes at a time. As another example, many alternative models may exist for a given sensor that performs comparably with respect to the metric employed in stochastic calibration (i.e., similar value for the objective function defined in (1) at the solution), making the choice difficult and potentially subjective for the users.
In this section, we propose an approach to quantify the impact on navigation performances of different, competing setups, such as sensor selection or modeling choices. The proposed approach is depicted in Fig. 2. To illustrate the generality of our framework, we will later consider the problem of comparing the stochastic calibrations obtained from different statistical procedures, namely, the AVLR method and the GMWM. More precisely, we consider a user-specified trajectory that is typical of the application scenario, available as samples of the body position and orientation. This trajectory will be used as a ground truth. The available samples are differentiated to obtain higher order kinematic states, such as velocities and accelerations, which are used to calculate noise-free readings, as they would be measured by perfect sensors mounted on the body. The noise-free readings are then corrupted with noise samples taken contiguously from random portions of static acquisition data (e.g., as collected during standard stochastic calibration procedures), thus consisting of samples of only the actual sensor noise. A sensor fusion algorithm is configured with the stochastic model under investigation, for example, determined with the AVLR method. The sensor fusion algorithm processes the noisy readings and estimates the final navigation solution. This procedure is repeated in a Monte Carlo fashion. The set of obtained navigation solutions is aggregated to compute a suite of statistics relevant to assess the expected performances of the system in the given scenario, for example, in terms of mean position and orientation error, or consistency of the estimated uncertainty. In order to evaluate the impact of different choices of sensor stochastic models, the user needs just to configure appropriately the sensor fusion algorithm and compare the different statistics obtained after the Monte Carlo simulations.
As previously mentioned, the proposed approach has been implemented in the form of an open-source R package. 1 This software allows to perform easily all the steps outlined in Fig. 2 and compute many relevant statistics to compare competing stochastic models. We use this software package to perform all the investigations presented in the rest of this work, and therefore, our results can be easily replicated. In the remainder of the section, we discuss the different components of the framework in further detail.

A. Noise-Free Measurements Generation
We assume in this section that a suitable trajectory is available. This trajectory can be chosen to resemble the user's use case, for example, in terms of the dynamics of the body motion and duration. The latter is available as a sequence of positions, r t , and orientations, R n b,t , of the body frame b at the target inertial sensor rate. These are expressed with respect to a fixed, Cartesian, nonrotating navigation frame n. The trajectory is assumed to be the ground truth in the following; therefore, neither its accuracy nor how it was originally determined is relevant to the following.
From the position and orientation samples, we derive higher order kinematic properties of the body frame as follows: where v n t and a n t are the body frame velocity and acceleration at time t expressed in n, respectively. Moreover, ω b nb,t is the body frame angular velocity of b with respect to n, expressed in b, log (·) is the logarithmic map in the 3-D rotation group SO(3), and t is the trajectory sampling time.
Next, noise-free inertial sensor readings are computed. The specific force and the angular velocity readings are given by where g n is the gravity acceleration expressed in n, R n e is the (constant) orientation of the Earth-centered Earth-fixed (ECEF) frame e with respect to the navigation frame n, and ω e ne is the Earth rotation rate. In the following, we will ignore effects such as the Earth's rotation or a position-dependent gravity vector. While they are important in real-world navigation systems, at least above certain accuracy requirements, they are deterministic; once they have been accounted for in the sensor fusion algorithm, they play a limited or no role when looking at the performances of inertial stochastic models. We also note that, in MEMS IMUs, these effects are often masked by the nonnegligible turn-on bias and bias instability.
For simplicity, position and orientation derivatives are evaluated in (2) by means of first-and second-order forward finite difference schemes. If desired, more sophisticated differentiators can be employed (see [27]). Alternatively, as done in [28] and [7], a continuous-time representation of the discrete input trajectory can be obtained using splines of sufficient order. From the splines, higher order kinematic quantities can be obtained analytically at any desired (continuous) time t. In this work, we have chosen the approach in 2 so that a Kalman filter employing the corresponding integration scheme (see Section IV-C) would recover the input trajectory exactly if noise-free measurements were provided as input. By doing so, no spurious integration noise is introduced by the filter process model, and the error sources under investigation, e.g., an imperfect stochastic model, can be investigated selectively. Also, note that, as shown, for example, in [29], differences between integration schemes tend to fade when nonnegligible random noise is considered.

B. Noisy Measurement Generation
We provide two methods for generating noisy measurements: using collected data from the sensor and using a reference model to simulate the noise data.
In the first (and preferred) method, we assume that the user has collected data from the inertial sensor under evaluation in static conditions for stochastic calibration purposes. Since the device is static, the collected data consist of actual noise samples that can be used to corrupt noise-free measurements.
Under the assumption that inertial sensor's performances are relatively independent of environmental conditions, the noise samples collected in static conditions have similar, if not the same, stochastic properties of the noise during real-world operation. Thus, we add contiguous chunks of the available static data to the noise-free inertial measurements to obtain the noisy ones. A different start time for each chunk is randomly selected for each Monte Carlo simulation. However, the assumed independence between IMU stochastic models and environmental conditions does not hold entirely, at least for low-cost IMUs. Indeed, the stochastic properties of the measurement error and/or the deterministic calibration (e.g., scale factors) may depend on temperature (see [30]) and motion dynamics [31]. The effect of the device temperatures can readily be investigated with the proposed framework simply collecting static data in a temperature chamber and varying the temperature according to the user application. However, the dependence on motion dynamics is more difficult both to model and then to compensate for in a sensor fusion system, and it is not considered in the following.
Alternatively, the reference stochastic model can be assumed for the inertial sensors and sampled to obtain noise data. This is useful if static data are not available or, for example, if one wants to quickly evaluate different sensors available on the market.
A priori stochastic model is also used to generate noise samples for the GNSS sensor. Indeed, differently from inertial sensors, it is well known that the noise properties of GNSS measurements are very much dependent on environmental conditions, such as receiver surroundings, satellite constellation, ionosphere and troposphere conditions, and so on, so that static data are probably not representative of realworld operations. Since this work focuses on inertial sensors, the GNSS measurement error is assumed to be white and uncorrelated. More realistic GNSS error samples could be obtained using a GNSS simulator and then used to corrupt noise-free position and velocity measurements.

C. Sensor Fusion Algorithm
Let N denote the number of Monte Carlo simulations. For simulation i, with i ∈ {1, . . . , N }, a sensor fusion algorithm is employed to fuse together noisy inertial and GNSS observations, and estimate the navigation solution, (i)r t and (i)Rn b t , as well as their estimated covariance (i) t . For the sensor fusion algorithm, we chose the EKF as this method is at the core of most of the real-time navigation systems, and its behavior is well-known and understood among practitioners. The choice of the sensor fusion algorithm does not affect the principle of operation of the approach proposed in Fig. 2, and any other method, such as unscented Kalman filters, particle filters, or even smoothing-based methods, such as dynamic networks [32], could replace the EKF and directly allow to compare the performances of different state estimation algorithms.
We consider a standard error-state formulation of the EKF, as presented, for example, in [1]. In such EKF, the inertial sensor process noise models can be configured as an arbitrary sum of the following stochastic processes: 1) random constant; 2) WN; 3) autoregressive process of order one (AR1), which is equivalent to a first-order Gauss-Markov process; 4) RW; 5) drift. Multiple AR1s (or Gauss-Markov processes) can be considered, as it will be demonstrated in Section VI-A2. By combining the previously mentioned processes, a wide class of models can be constructed which includes standard models for inertial sensor errors. Other Gaussian processes, such as ARMA models of order higher than one, can be considered (see [33]). The proposed framework could be easily extended to include such processes. However, this choice does not appear to be widespread in practice and is left for future research.
The implemented EKF employs a first-order Euler method to integrate the process model. This method is simplistic, and higher order integration methods are preferable in real-world applications. However, here, the goal is to evaluate the impact of the stochastic models on the sensor errors, not the fidelity of the filter mechanization. Thus, the latter has been chosen to match the noise-free measurement generation mechanism presented in Section IV-A so that it produces exactly the original position and orientation samples when integrating noise-free measurements. This choice prevents spurious integration noise to impact further analyses.

D. Navigation Performance Statistics
The navigation solutions are then aggregated to compute statistics that are useful to evaluate the performance of the navigation system. The details are given in the following.

1) Position and Orientation Error:
The original trajectory provided by the user gives the ground truth, and each estimated solution can be compared to the reference in terms of position and orientation error where the (·) ∨ operator gives the three elements composing the skew-symmetric matrix returned by log(·), thus being a 3-D representation of the difference between the estimated and reference orientations. In the following, for simplicity, we will average both position and orientation errors over the three axes.
The error samples computed with (4) can then be aggregated with respect to the different navigation solutions, with respect to time, or both, by means of any sample-based operation, such as the sample mean and sample covariance. Given the statistical properties of such sample-based estimators, and provided that N is sufficiently large (e.g., several hundreds), those quantities are expected to approach the true value of the underlying quantities.

E. Normalized Estimation Error Squared
The normalized estimation error squared (NEES) is a commonly employed metric to evaluate whether the error in position and orientation is consistent with their estimated covariance. An in-depth discussion can be found in [4,Ch. 3.7.4].
The NEES is defined as follows: In a Monte Carlo setup with N simulations, the average NEES at time t, which we denote as NEES t , follows (approximately) a chi-square distribution with 6N degrees of freedom (6 being the dimension of the stacked (i) r t and (i) R t vectors). Thus, a two-sided confidence interval for NEES t with level 1 − α can be expressed as [ϵ 1 , ϵ 2 ], where where χ 2 n (α) is the αth quantile of a chi-squared distribution with n degrees of freedom. A one-sided confidence interval can be constructed similarly.
The evolution of NEES t with respect to the bounds [ϵ 1 , ϵ 2 ] is an important metric used in practice to assess whether the sensor fusion algorithm is overconfident, meaning that the actual error is typically larger than the estimated uncertainty of the position and orientation estimates, or underconfident, the other way around, or none of the two.

F. Coverage
The consistency and the efficiency of the estimated position and orientation, and of their estimated covariance, can be evaluated in an alternative way, as commonly done in the statistical literature. We first define the one-sided confidence interval of level 1 − α for (i) NEES t as follows: Next, we introduce the following binary variable: If the estimated covariance (i)ˆ t is sufficiently well estimated, (i) c t follows approximately a Bernoulli distribution with parameter p = 1 − α, allowing to assess the simulation error. The coverage of the confidence intervals defined in (7) is given by the average of (i) c t over the N Monte Carlo runs, c t .
The coverage, c t , measures how often, in practice, the confidence intervals constructed from (i)ˆ t and centered at (i)r and (i)Rn b,t include, or cover, the true position and orientation. In other words, such confidence intervals are trustworthy if c t is close to 1−α. This measure provides an alternative approach to assess the quality of the computed navigation solution.

V. CASE STUDY: ALLAN VARIANCE LINEAR REGRESSION METHOD VERSUS GMWM
In this section, we apply the proposed framework to compare different statistical estimators, leading to different stochastic models for the inertial sensors in terms of navigation performance. More precisely, we compare the AVLR method and the GMWM, both briefly presented in Section III. This example is relevant since it allows investigating whether, and when, a more sophisticated statistical method, such as the GMWM, should be used instead of the commonly employed AVLR method. Indeed, we consider an experimental setup composed of two steps: 1) a calibration step, in which the stochastic models are determined from available noise data (for example, collected during static acquisitions) using the two previously mentioned estimators; 2) Monte Carlo navigation simulations, where the framework depicted in Fig. 2 is employed to quantify the navigation performances of the estimated models. In our experiment, the estimation of the parameters of the stochastic models taking place in the first step is performed based on several hours of data collected from a static device. Since the device is static, the collected measurements will be composed of noise samples only, and the selected stochastic modeling techniques are employed to determine the two sets of stochastic models (each set being composed of the gyroscope and accelerometer model) that we seek to compare. Next, we evaluate the determined stochastic models in a realistic scenario using the proposed Monte Carlo simulation framework. We consider a trajectory of a real fixed-wing unpiloted aerial vehicle (UAV) performing an aerial mapping mission lasting approximately 40 min. By performing hundreds of simulations, each time employing different noise realizations, we obtain reliable statistics of the navigation performances that can be obtained employing each of the two sets of models. From these statistics, the model that should be preferred for implementation in the final application can be picked.
As discussed in detail in Section VI, we consider two scenarios. In the first scenario, we consider synthetic inertial sensors for which the true noise model is assumed to be known. The chosen models aim to mimic the observed AV/WV shape of real-world low-to-mid-grade inertial sensors. These models are sampled to generate both synthetic noise data for the stochastic calibration and the generation of noisy inertial readings for the Monte Carlo simulations. In the second scenario, a real sensor is considered, and multiple static acquisitions are performed to collect data for both the stochastic calibration and the realistic noise samples to corrupt inertial readings during navigation.

A. Flexible Set of Stochastic Models
We define a general set of models that can well approximate the vast majority of models used in inertial sensors. Such set is defined as the sum of M AR1s, which we write as first-order Gauss-Markov processes because of the practical meaning of the parameters (correlation time and variance of the innovation) where e(t) is the measurement error affecting the inertial sensor at time t, ξ i (t) is a (continuous time) WN with power spectral density q i , and τ c,i is the ith process correlation time.
This class of models is very general and includes most of the processes typically considered in modeling inertial sensors, where some well-known special cases are given in the following. 1) WN, typically referred to as angular RW for gyroscopes and velocity RW for accelerometers, is obtained when τ c,i → 0.
2) RW, or rate RW, in the case of gyroscopes, when τ c,i → ∞ (any long-term correlation observed in practice can be modeled with a sufficiently high τ c,i ). 3) Bias instability, or flicker noise, is defined in terms of its power spectral density Since no state-space model can be derived for this process, it cannot be employed directly in state estimation algorithms, e.g., in an EKF. As suggested in [34,Sec. 4.3], or in [17], it is "sometimes approximated by a Markov model or a multiple stage ARMA model," such as the one in (9). 4) Turn-on bias, or random constant, the constant part of the sensor error that changes at every power cycle, can be modeled setting q i = 0 and τ c,i → ∞. Two other commonly employed processes do not fall in the proposed model family. Those are the QN and the drift or rate ramp in gyroscopes. Since both of these can be estimated with both the AVLR and the GMWM, they are not considered in this comparison study between those two.
In stochastic calibration, it is more customary to consider the equivalent, discrete-time formulation of (9) where the well-known relation between the correlation time and the innovation power spectral densities, and the discrete time φ i ∈ [0, 1] and σ 2 i > 0 are given in the following: where f is the sampling frequency. For a given Gauss-Markov process i, the two different parameterizations (φ i , σ 2 i ) and (τ c,i , q i ) are equivalent given f , and they will be used interchangeably depending on which one is more intuitive in the given context.
The GMWM provides estimates having suitable statistical properties for the set presented in (11) when the number of Gauss-Markov processes, M, is arbitrary but known and also to determine the optimal M for a given calibration data. On the contrary, the AVLR method would typically estimate the model, where M = 2 Gauss-Markov processes belonging to the WN (or angular or velocity RW) and RW (rate RW) special cases. The AVLR is also capable of estimating bias instability in terms of B and f , as in (10). However, it is difficult to relate those to a suitable state-space model that can be employed in an EKF.
Given the specific features of the two stochastic calibration methods presented before, their models may be different. To illustrate this difference, we considered approximately Fig. 3. Top: 200-Hz static data acquired from the X -axis of a KVH1750 accelerometer (only the first hour is shown). Bottom: the empirical WV of the data (in red) and the theoretical WV as implied by the model obtained with AVLR (green) and the GMWM (brown). The decomposition of the GMWM model in four first-order Gauss-Markov processes, one of which degenerates into a WN, is also shown for the GMWM model. Fig. 4. Data acquisition setup. The sample IMU is mounted on a high-precision single-axis rotation table from Actidyn. In the figure, the IMU was offset with respect to the center of the rotation table to obtain a nonzero signal on the xand y-axes of the accelerometer in dynamic acquisitions. Courtesy of Clausen [11].
2 h static data collected using the Novatel KVH 1750 IMU mounted on a high precision single-axis rotation table presented in Fig 4. In Fig 3, we compare the empirical WV (which can be interpreted similar to the AV) of the X -axis accelerometer with the models estimated with AVLR (in green) and GMWM (in brown). It can be observed that the AVLR model does not provide a close match to the empirical WV between scales 10 1 s and 10 3 . Indeed, the AVLR method does not allow estimating state-space stochastic models that would well approximate the flat region in the WV evident at those scales. On the contrary, the GMWM estimates appear to provide a better model fit considering a model composed of M = 4 Gauss-Markov processes, the first of which degenerates into a WN (τ c,1 = 0). The theoretical WV of the resulting model matches the observed one to an excellent degree. In particular, the GMWM allows obtaining a model for the long-term correlation of the error, for example, between scales 10 1 and 10 3 s. No such model is provided by the manufacturer [35] or can be estimated reliably with the available graphical methods based on the AV.
In Section VI, we will investigate in which cases the extra modeling power offered by GMWM would lead to superior performances in navigation.

VI. SIMULATION RESULTS
In this section, we compare the performance of stochastic calibration based on the AVLR and the GMWM, respectively, in two different scenarios. In the first scenario, we assume that the stochastic model behind noise generation in inertial sensors is known and belongs to the set of models introduced in Section V-A. Many different parameter values are considered. In the second scenario, we consider a real sensor, the Xsens MTI-G IMU [36], for which data from a static acquisition are available. This sensor is characterized by a significant bias instability, and the WV of the static acquisitions matches the ones of some of the models considered in the first scenario.

A. Scenario 1: Known Stochastic Models
We consider a simulation scenario in which the true stochastic model behind inertial sensor noise is known. In the following, we first consider the class of models introduced (11), with M = 2, in Section VI-A1. Next, in Section VI-A2, we consider a more complex model family, where M = 3. In both cases, the first Gauss-Markov process (or AR1) always degenerates to a WN, e.g., τ c,1 → 0.
1) Model 1: 1 WN + 1 AR1: The stochastic model for the gyroscope is assumed to be the sum of a WN and a Gauss-Markov process: 24 different instances from this model class, each one being characterized by different parameters, have been chosen as follows.
1) The standard deviation of the WN is fixed at σ ξ = 10 −6 rad s −1 . 2) For the first 12 instances (large set from now on), the variance of the AR1 process is large compared to the WN, and it is small for the remaining 12 (small set from now on). The variance of the AR1 process is given by σ 2 ξ /(1 − φ 2 ), and it is kept constant to 5 × 10 −8 rad 2 s −2 for the large set and to 5 × 10 −9 rad 2 s −2 for the small set.
3) The jth instances of both the large and small sets, with j ∈ {1, 2, . . . , 12}, have the same value of φ, and φ j spans [0, 1]. In other words, we consider processes with increasing correlation time τ C in the first-order Gauss-Markov parameterization. Moreover, we consider a fixed model for the accelerometer noise, being a WN with standard deviation σ ξ = 5 × 10 −5 m s −2 . The resulting stochastic processes are realistic for MEMS IMUs. The theoretical WVs of the considered models are depicted in Fig. 5, with the large set being on the left and the small set on the right.
For each of the model instances, we sample a time series, mimicking the static acquisition process that would provide the stochastic calibration data for a real inertial sensor. Next, we use the generated noise time series to estimate a stochastic model using the GMWM and the AVLR. The estimated model is used in an EKF to process noisy GNSS and inertial readings, corrupted by noise sampled the same way as for the synthetic static acquisition data. The performances of the navigation solution are measured according to the metrics presented in Sections IV-D-IV-F. This procedure is repeated 500 times for each model instance in a Monte Carlo fashion.
From Fig. 5, it is possible to see that the WVs of some of the model instances (e.g., the one in violet) have large parts that are linear in the scales τ and can be well approximated by the sum of a WN and an RW using the AVLR. However, this does not hold for most of the others (e.g., the ones highlighted in red, green, and light blue): for those cases, the model obtained by means of the AVLR is different from the true one, in terms of its WV. In contrast, by means of the GMWM, it is possible to recover a good approximation of the true model for all the model instances.
If the model employed in the EKF is substantially different from the true one, used to generate noisy inertial readings, the quality of the navigation solution may degrade. In the following, we quantify such degradation in terms of position and orientation errors within GNSS coverage, i.e., when a position fix is available and during a GNSS outage period of 60 s. The position and orientation errors for each model instance are presented in Figs. 6 and 7. These quantities correspond to (i) r t and (i) R t , as defined in Section IV, averaged over the Monte Carlo simulations and the three  axes. The results are expressed in relative units, in terms of the performances of the model estimated by means of the AVLR versus the one estimated with the GMWM. For example, a value of 110% implies that the model estimated with the AVLR achieves, on average, a position (or orientation) error 10% higher than the model estimated with the GMWM. In these figures, and the ones that will follow, the x-axis is the time relative to the beginning of the GNSS outage, at second 0, while, on the y-axis, we have the time constant τ C of the autoregressive process contained in that specific model instance.
From Figs. 6 and 7, we can observe the following. 1) When the WV of the true model can be well approximated by the sum of a WN and an RW, i.e., when φ → 1 or, equivalently, τ C is high, both the GMWM and the AVLR achieve similar performances both in terms of position and orientation error. This result is somewhat expected as, in this case, both the AVLR and the GMWM are able to provide suitable approximations of the underlying data-generating process. 2) When the WV of the true model presents a peak or, anyways, it is not linear in the scales τ (see again Fig. 5, e.g., the model highlighted in green), only GMWM can estimate an accurate model, and the performances degrade by up to 50% in position and 80% in orientation with AVLR method, as this method, unlike the GMWM, is unable to provide an accurate approximation of the underlying data-generating process.
3) The higher the variance of the AR1 process is (large set compared to small set, on the left and on the right in Figs. 6 and 7, respectively), the higher the degradation of the performances will be. This is intuitive since the AR1 process is the one that cannot be estimated with the AVLR technique. 4) The degradation in terms of position error is mostly visible during the GNSS outage period only. Indeed, the GNSS position fix corrects for any error accumulated because of poorly modeled inertial readings. If the AR1 has a large variance compared to the WN, a peak in the position error is visible also soon after the GNSS position fixes have become again available. A possible explanation for this effect is due to a wrong model for the inertial sensor, which will build a wrong covariance matrix in the EKF, requiring more time to converge after the recovery of GNSS measurements. Additional information on this effect is shown when discussing the estimated uncertainty of position and orientation. 5) An inaccurate stochastic model for the inertial sensors will have more impact on the orientation error than on the position error. In addition, this impact can be observed even when the GNSS position fix is available. This is expected because the orientation estimates are known to be more dependent on the quality of inertial sensor measurements and models.
Next, we analyze the quality of the confidence intervals for the position and orientation as estimated by the EKF. During navigation, the filter maintains a covariance matrix over the states, from which confidence intervals can be derived for those. If the stochastic model for the inertial measurement assumed in the EKF does not correspond to the one behind true noise generation, the covariance matrix will not be a good estimate of the probability distribution of the states. To quantify this, we evaluate the coverage as defined in Section IV-F with 1 − α = 70%: if the states' uncertainty estimated by the EKF is correct, we expect that 66% < c t < 74%, the interval accounting for the Monte Carlo simulation error. Indeed, assuming the filter to provide exact coverage, the random variable (i) c t follows a Bernoulli distribution with parameter 1 − α. Therefore, using the central limit theorem, an approximate 95% confidence interval for the averaged (i) c t over N Monte Carlo simulation is given by for α = 30%. The results are depicted in Fig. 8 for the large model set and in Fig. 9 for the small model set.
The results are expressed in terms of coverage error that we define as the difference between the assumed probability (i.e., 70%) and the empirical probability that the true states are within the computed 70% confidence interval, c t . Consequently, a negative value indicates an overconfident estimate of the position and orientation uncertainty, and vice versa. We can observe that the models obtained with the GMWM have the correct coverage (around 70% or around 0% coverage error) at all times, while the models obtained with the AVLR lead to a large underestimation of the position and orientation uncertainty when φ → 1 or, equivalently, τ C → ∞. This effect is more severe for the large model set, and less severe for the small one, as happened for the position and orientation relative error. This result suggests  that, in practice, if the user is particularly interested in having a reliable estimate of the navigation state's uncertainty, the stochastic models obtained with GMWM can provide substantial advantages over the commonly used AVLR, at least for the specific structure of the underlying noise generation mechanism.
2) Model 2: 1 WN + 2 AR1s: We consider a more complex case in the following. This is similar to the one discussed before, with the exception that, this time, the true gyroscope noise model is the sum of two AR1s and a WN (M = 3). The parameters of such autoregressive processes have been chosen so that a flat region in the high scales of the theoretical WV appears. In this case, 12 different instances have been considered, and their theoretical WVs have been plotted in Fig. 10(a). Such shapes are typical of low-to-medium grade inertial sensors, e.g., MEMS, exhibiting nonnegligible bias instability. In Fig. 10(b), we have plotted one of the chosen instances, in green in both figures, and compared it with the empirical WV of a static acquisition of a real sensor, the MTI-G MEMS IMU. It can be observed that the shape of the WV is very similar, suggesting that our experimental setup considers a realistic scenario.  Similar to the results obtained in Section VI-A1, it can be observed that a good estimate of the original model using the AVLR can only be obtained for some of the specific parameter values corresponding to correlation times of the two autoregressive processes that are either small or high. Instead, the GMWM allows estimating arbitrary mixtures of autoregressive processes, regardless of their correlation time, and, thus, recovers the underlying noise model with sufficient accuracy.
We apply the same experimental procedure as the one discussed in Section VI-A1. The relative position and orientation errors achieved employing the models estimated with the AVLR versus the ones obtained with GMWM are shown in Fig. 11. The coverage error, as defined in the previous section, is also presented in Fig. 12, where it is possible to see that the models obtained with the AVLR lead to a large underestimation of the position and orientation uncertainty in the EKF.

B. Scenario 2: Real Sensor Noise
In this scenario, we consider a real IMU sensor, the Xsens MTI-G, a MEMs IMU, for which real data are available from static acquisitions in a controlled environment. As anticipated in Section VI-A, and shown in Fig. 10(b), such sensor exhibits nonnegligible bias instability which appears as a flat region in the wavelet variance plot in correspondence of high scales τ . The AVLR does not allow estimating a suitable state-space model for this data. Instead, a model for this can be obtained approximating the flat region of the empirical WV as the sum of multiple autoregressive processes that can be effectively estimated by means of the GMWM. The empirical WV of the noise data collected during the static acquisition is presented in Fig. 13, along with the theoretical WV of the model estimated with GMWM and its decomposition. It is possible to see that the accelerometer noise model of the MTI-G presented in Fig. 13(a) presents a similar pattern in the WV as the model considered in our experimental setup and represented in Fig. 10(b). The empirical WV of the real noise data and the theoretical one estimated with the GMWM model match to a remarkable extent once three AR1s are considered for the accelerometer and two for the gyroscope. The real sensor data used in this work are taken from the open-source imudata R package. 2 We evaluate the performances of the models estimated by means of the AVLR against GMWM in an EKF. This time, instead of simulating noise samples from a stochastic model known a priori, which is not available in the case of a real sensor, we corrupt inertial readings using real noise samples obtained during a static acquisition. An example of the raw data used in this section is shown in Fig. 14. Several hours of data are available, and each time, different contiguous sets are chosen in a Monte Carlo fashion, as discussed in Section IV. This approach permits to evaluate the performances of a given stochastic model when realistic noise samples are provided, more closely to a real-world application scenario.
The obtained models are compared in terms of the relative position and orientation errors when configuring the EFK with the associated models. In Fig. 15, we show the position and orientation errors of the models estimated with the AVLR relative to the ones obtained with the GMWM. As anticipated, the models obtained by means of the AVLR fail to model accurately the stochastic nature of the error signal, and this translates into degraded navigation performances, up to 15% in position and 45% in orientation. The decomposition of the orientation error along the three axes is also shown to point out that, as expected, the maximum error is found on the yaw axis, and the AVLR model performs up to 50% worse than GMWM on that axis. These results show the potential benefit of performing stochastic calibration with GMWM on a real device and are in line with what is obtained in Section VI-A2: this can be seen by comparing Fig. 11, along the blue line.

VII. CONCLUSION
In this work, we propose a simulation-based framework allowing us to assess the performances of INS/GNSS navigation systems in realistic settings. Indeed, our approach allows considering flexible and user-configurable noise generation mechanisms and permits quantifying INS/GNSS navigation systems performances in different conditions (e.g., GNSS outages and inaccurate stochastic model for any sensor). Using this framework, we study the impact on navigation performances for different statistical procedures (AVLR and GMWM) used to obtain stochastic inertial stochastic models. By comparing various simulation settings, our results suggest that the commonly used AVLR method can lead to considerably worse performances compared with the GMWM in terms of position and orientation errors, as well as in terms of accuracy of the estimated states, especially for inertial sensors characterized by considerable bias instability. These results can have important practical implications, suggesting that the GMWM can provide significant improvements, in particular, when MEMS sensors are considered, where bias instability is often considerably high. Moreover, our framework provides a systematic method allowing researchers and practitioners to compare the performance of existing, and potentially new, stochastic calibration methods. In this work, we considered stochastic calibration as a proof of concept, but other problems related to the quantification of navigation system performances can be easily investigated with our frameworks, such as the robustness of an INS with respect to outliers or other modeling imperfections. Our method can bridge the gap between realworld experiments, which are always needed but often impractical, costly, and difficult to replicate a large number of times (e.g., to evaluate the correctness of estimated uncertainty) and pure statistical parameter estimation as done, for example, in the stochastic calibration literature.