Noise Covariance Estimation in Adaptive Kalman Filtering via Sequential Mini-Batch Stochastic Gradient Descent Algorithms

Estimation of unknown noise covariances in a Kalman filter is a problem of significant practical interest in a wide array of applications. Although this problem has a long history, reliable algorithms for their estimation were scant, and necessary and sufficient conditions for identifiability of the covariances were in dispute until recently. Recent work established necessary and sufficient conditions for covariance estimation and provided a batch estimation algorithm. This paper presents stochastic gradient descent (SGD) algorithms for noise covariance estimation in adaptive Kalman filters that are an order of magnitude faster than the batch method for similar or better root mean square error (RMSE) and are applicable to non-stationary systems where the noise covariances can occasionally jump up or down by an unknown magnitude. The computational efficiency of the new algorithms stems from adaptive thresholds for convergence, recursive fading memory estimation of the sample cross-correlations of the innovations, and accelerated SGD algorithms. The comparative evaluation of the proposed method on a number of test cases demonstrates its computational efficiency and accuracy.


I. INTRODUCTION
The Kalman filter (KF) [1] is the minimum mean square error (MMSE) state estimator for discrete-time linear dynamic systems under Gaussian white noise with known first and second order statistics. It is also the best linear estimation algorithm when the noise processes are non-Gaussian with known covariances. It has found successful applications in numerous fields, such as navigation, weather forecasting, signal processing, econometrics and structural health monitoring, to name a few [16]. However, in many practical situations, including the ones mentioned above, the noise covariances are often unknown or only partially known. Exploiting the whiteness property of the innovations in an optimal Kalman filter [3]-[6], previous work [4], [7] derived necessary and sufficient conditions for estimating the unknown process noise and measurement noise covariances from data. In [7], we presented a batch optimization algorithm that estimates the unknown covariances by minimizing the sum of the normalized temporal cross-correlations of the innovations. In this paper, we explore an enhanced covariance estimation method based on a sequential mini-batch stochastic gradient descent algorithm with adaptive step sizes and iteration-dependent convergence thresholds.

A. PRIOR WORK
The key to process noise and measurement noise covariance estimation is an expression for the covariance of the state estimation error and of the innovations of any stable suboptimal filter as a function of the noise covariances. This expression serves as a fundamental building block for the correlation-based methods for covariance estimation. Pioneering contributions using this approach were made by [3]-[6]. Relatively recently, Sarkka and Nummenmaa [2] proposed a variational Bayesian method for the joint recursive estimation of the dynamic state and measurement noise parameters in linear state space models. This method forms a separable variational approximation to the joint posterior distribution of states and noise parameters at each time step separately. In general, variational methods suffer from computational complexity and often converge to a local minimum.
Recently, Zhang et al. [7] proposed a new approach to estimating the process noise and measurement noise covariances of a system using the entire time series of innovation samples. For the unknown covariances, Zhang et al. [7] proved that the identifiability conditions are related to the rank of a matrix involving the auto- and cross-covariances of a weighted sum of innovations, where the weights are the coefficients of the minimal polynomial of the closed-loop system transition matrix of any stable, not necessarily optimal, Kalman filter. An optimization algorithm was developed to estimate the steady-state Kalman filter gain, the unknown noise covariance matrices, and the state prediction (and updated) error covariance matrix. A drawback of this approach is that it is a batch optimization method: it requires the entire observation sequence to compute the cost and the gradient, is computationally expensive, and is applicable only to stationary systems. The present paper seeks to overcome these limitations.

B. CONTRIBUTION AND ORGANIZATION OF THE PAPER
In this paper, we develop a sequential mini-batch estimation method with adaptive step-size rules to improve the computational efficiency of the algorithm in [7]. Our approach replaces the batch innovation covariance estimates with sequential fading memory mini-batch estimates when updating the Kalman filter gain. We also enhance the algorithm by investigating the performance of several accelerated stochastic gradient descent (SGD) step-size selection methods, viz., Constant, Bold driver, Subgradient, RMSProp, Adam, and Adadelta [8]-[11], and we improve their computational efficiency via iteration-dependent dynamic thresholds for convergence. The SGD methods, together with change detection algorithms, enable the estimation of noise covariances in non-stationary systems in a decision-directed manner. The jumps are assumed to occur occasionally and after the filter has reached steady state. The validation of the proposed method on several test cases from the literature and on non-stationary systems demonstrates its efficacy.
The paper is organized as follows. In Section II, we provide an overview of the batch gradient descent method for estimating the unknown noise covariances. Then, in Section III, we present the sequential mini-batch gradient descent method for optimizing the Kalman gain and consequently estimating the measurement and process noise covariances. We also discuss how we enhance our approach using fading-memory-filter-based innovation correlation estimation, stochastic gradient descent (SGD) updates of the Kalman gain, dynamic thresholding for algorithm convergence, and accelerated step-size rules for the SGD update. Section IV provides numerical evidence that the computational performance of the proposed methods is substantially better than that of the batch estimation algorithm, and demonstrates their applicability to non-stationary systems where the noise covariances occasionally jump abruptly by an unknown, but finite, magnitude. Lastly, we conclude the paper and discuss future work in Section V.

II. BATCH GRADIENT DESCENT METHOD FOR ESTIMATING Q AND R
Consider a linear discrete-time stochastic dynamic system

x(k + 1) = F x(k) + Γ v(k)
z(k) = H x(k) + w(k)

where x(k) is the state vector, z(k) is the measurement vector, F and H are the state transition matrix and the measurement matrix, respectively, and Γ is the noise gain matrix.
Here, the process noise v(k) and the measurement noise w(k) are assumed to be zero-mean white Gaussian noise processes with unknown process noise covariance Q and unknown measurement noise covariance R, respectively. We allow Q and R to change occasionally (see Section IV). The two noise sequences and the initial state are assumed to be mutually independent.
When Q and R are known, the Kalman filter involves consecutive prediction and update steps. The Kalman filter predicts the next state estimate at time index (k + 1), given the observations up to time index k in (3), and the concomitant predicted state estimation error covariance in (6), using the system dynamics, the updated state error covariance P(k|k) at time index k, and the process noise covariance Q. The updated state estimate at time (k + 1) in (5) incorporates the measurement at time (k + 1) via the Kalman gain matrix in (8), which depends on the innovation covariance S(k + 1) (which in turn depends on the measurement noise covariance R) and the predicted state error covariance P(k + 1|k). The updated state error covariance P(k + 1|k + 1) is computed via (9).
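As a concrete reference for the prediction and update steps above, the following is a minimal NumPy sketch of one Kalman filter cycle; the function and variable names are ours, not the paper's, and the commented steps correspond to the equations (3)-(9) referenced above.

```python
import numpy as np

def kf_step(x_upd, P_upd, z_next, F, H, Gamma, Q, R):
    """One predict/update cycle of the Kalman filter (illustrative sketch)."""
    # Prediction: state estimate and covariance at k+1 given data up to k
    x_pred = F @ x_upd
    P_pred = F @ P_upd @ F.T + Gamma @ Q @ Gamma.T
    # Innovation and its covariance (depends on R)
    nu = z_next - H @ x_pred
    S = H @ P_pred @ H.T + R
    # Kalman gain and update
    W = P_pred @ H.T @ np.linalg.inv(S)
    x_new = x_pred + W @ nu
    P_new = P_pred - W @ S @ W.T
    return x_new, P_new, nu, S
```

A suboptimal filter would simply replace the computed gain W with a fixed gain; the innovation ν returned here is the quantity whose correlations are exploited in the sequel.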

A. IDENTIFIABILITY CONDITIONS FOR ESTIMATING Q AND R
When Q and R are unknown, consider the innovations corresponding to a stable, suboptimal filter with closed-loop matrix F̄ = F(I_nx − WH), given by (10) [7], [15], in terms of the predicted state error at time (k − m). Given the innovation sequence (10), let us define a weighted sum of innovations, ξ(k), where the weights are the coefficients of the minimal polynomial of the closed-loop filter matrix F̄. It is easy to see that ξ(k) is the sum of two moving average processes, driven by the process noise and the measurement noise, respectively, with coefficient matrices B_l and G_l [7]. Then, defining the cross-covariance between ξ(k) and ξ(k − j) as L_j, and noting that the noise covariance matrices are positive definite and symmetric, Zhang et al. [7] show that the vectorized noise covariance matrices and the vectorized L_j matrices are related by the noise covariance identifiability matrix I as in (16).
As shown in [7], if matrix I has full column rank, then the unknown noise covariance matrices, Q and R are identifiable.

B. OBJECTIVE FUNCTION AND THE GRADIENT
Innovations of an optimal Kalman filter are white, meaning that they are uncorrelated over time [3]-[6]. In contrast, the innovation sequence {ν(k)}, k = 1, …, N_k, will be correlated if the Kalman filter gain is not optimal. The correlation methods exploit this non-whiteness property of the innovations of a suboptimal Kalman filter by minimizing a measure of the cross-correlations of the innovations over a lag window of length up to M ≥ n_x, in order to estimate the optimal Kalman filter gain W from the computed innovations. The M sample covariance matrices Ĉ(i) for each lag i = 0, 1, …, M − 1 are given by (17).
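The batch sample correlation estimates Ĉ(i) can be sketched as follows. This is a NumPy illustration only: the normalization by (N − i) is an assumption on our part, since (17) is not reproduced above.

```python
import numpy as np

def sample_innovation_correlations(nu, M):
    """Sample correlation matrices C_hat(i), i = 0..M-1, of an innovation
    sequence nu of shape (N, n_z): averages of the lagged outer products
    nu(k) nu(k-i)^T over the whole run (batch estimate)."""
    N, nz = nu.shape
    C = np.zeros((M, nz, nz))
    for i in range(M):
        for k in range(i, N):
            C[i] += np.outer(nu[k], nu[k - i])
        C[i] /= (N - i)   # assumed normalization
    return C
```

For a white sequence, Ĉ(0) approaches the innovation covariance and Ĉ(i), i > 0, approaches zero, which is the property the objective function below exploits.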
On the other hand, the ensemble cross-correlation of a steady-state suboptimal Kalman filter is related to the closed-loop filter matrix F̄, the transition matrix F, the measurement matrix H, the predicted covariance matrix P, the filter gain W, and the innovation covariance C(0) [5], [6]. The objective function J formulated in [7] involves minimizing the sum of the C(i), each normalized by the corresponding diagonal elements of C(0), for i > 0. Formally, we define the objective function J to be minimized with respect to W, where diag(C) denotes the diagonal matrix of C, or equivalently the Hadamard product of an identity matrix with C. The objective function can be rewritten by substituting (19) into (18). The gradient of the objective function, ∇_W J, can be computed as in [7]. The Z term in (24) is computed via a Lyapunov equation; it is often small and can be neglected for computational efficiency.
In computing the objective function and the gradient, we replace C(i) by their sample estimates Ĉ(i). Evidently, the covariance estimation is a stochastic optimization problem.

C. ESTIMATION OF R AND Q

1) Estimation of R
From the joint covariance of the innovation sequence ν(k) and the post-fit residual sequence µ(k), and from the Schur determinant identity [12], [13], one can show the relation in (27) [7], where S is the innovation covariance. Because (27) can be interpreted as a simultaneous diagonalization problem in linear algebra [13], or as a continuous-time algebraic Riccati equation, the measurement covariance R can be estimated by solving the simultaneous diagonalization problem via Cholesky and eigendecompositions, or by solving a continuous-time Riccati equation as in [7], [14].
2) Estimation of Q

Given the estimated R, we can compute the process noise covariance Q and the steady-state updated state covariance P. This requires an iterative process because Q and P are coupled in the general case, the Wiener process being an exception for which an explicit non-iterative solution Q = W S Wᵀ is possible [7]. Letting t and l denote the iteration indices starting with t = 0 and l = 0, and using an initial Q⁽⁰⁾ = W S Wᵀ, we initialize the steady-state updated covariance matrix P as the solution of the Lyapunov equation in (28), where F̄ = (I_nx − WH)F. We iteratively update P as in (29) until convergence. Given the converged P, Q is updated in the t-loop until the estimate of Q converges. Note that λ_Q is a regularization parameter used for ill-conditioned estimation problems.
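Since the exact recursions (28)-(29) are not reproduced above, the following sketch shows only the underlying fixed-point idea: for a fixed gain W, the steady-state updated covariance of the (possibly suboptimal) filter satisfies a Joseph-form recursion that can be iterated to convergence. The function name and the stopping rule are ours.

```python
import numpy as np

def steady_state_P(F, H, Gamma, W, Q, R, tol=1e-10, max_iter=2000):
    """Fixed-point iteration for the steady-state updated covariance of a
    filter with fixed gain W.  Each pass propagates P through one
    predict/update cycle in Joseph form; convergence requires the
    closed-loop matrix (I - WH)F to be stable."""
    n = F.shape[0]
    I = np.eye(n)
    A = I - W @ H                      # closed-loop factor (I - WH)
    P = np.eye(n)                      # arbitrary positive definite start
    for _ in range(max_iter):
        P_pred = F @ P @ F.T + Gamma @ Q @ Gamma.T
        P_new = A @ P_pred @ A.T + W @ R @ W.T
        if np.max(np.abs(P_new - P)) < tol:
            return P_new
        P = P_new
    return P
```

In the coupled estimation loop described above, this P computation would alternate with the Q update until both converge.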

D. BATCH ESTIMATION ALGORITHM
A sample pseudo code of the batch estimation algorithm is shown below. In the outer loop, the estimated Q and R of the previous iteration are used to refine the estimates. Typically, fewer than 10 outer-loop iterations are needed.
Algorithm 1 Pseudo code of the batch gradient descent-based adaptive Kalman filtering algorithm
1: input: W^0, α  (W^0: initial gain, for example obtained by solving a steady-state Riccati equation from an initial R^(0) and Q^(0); α: step size)
2: for t = 1 to N_t do  (N_t: max. num. outer-loop iter.)

III. SEQUENTIAL MINI-BATCH GRADIENT DESCENT METHOD FOR ESTIMATING Q AND R
In this section, we provide an overview of the sequential mini-batch SGD method for estimating the unknown Q and R. We introduce five enhancements to the batch estimation algorithm: recursive fading memory estimation of the sample cross-correlations of the innovations, mini-batch SGD, dynamic thresholds for inner-loop termination, accelerated SGD, and simplified gradient computation that neglects the Z term in (24). The recursive nature of our algorithm makes it amenable to the estimation of noise covariances in non-stationary systems. The non-stationarity is assumed to arise from occasional abrupt changes of unknown magnitude in the noise covariances after the filter reaches steady state.

A. RECURSIVE FADING MEMORY
We compute the sample correlation matrix Ĉ_seq^k(i) at sample k for time lag i as a weighted combination of the correlation matrix Ĉ_seq^{k−1}(i) at the previous sample (k − 1) and the innovation samples ν(k − i) and ν(k). The tuning parameter λ, a positive constant between 0 and 1, is the weight associated with the previous sample correlation matrix. The current M sample correlation matrices are used as the initial values for the next pair of samples.
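A minimal sketch of this recursive update follows; the (1 − λ) weight on the current outer product is an assumed normalization consistent with the description above.

```python
import numpy as np

def fading_memory_update(C_prev, nu_k, nu_k_minus_i, lam=0.95):
    """Recursive fading memory update of one lag-i sample correlation
    matrix: a convex combination of the previous estimate and the
    current outer product of innovations."""
    return lam * C_prev + (1.0 - lam) * np.outer(nu_k, nu_k_minus_i)
```

A value of λ close to 1 gives a long effective memory; smaller λ lets the estimate track abrupt covariance changes faster, which matters for the non-stationary case discussed later.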
B. MINI-BATCH SGD UPDATE OF THE KALMAN GAIN

Let B be the mini-batch size and let K = N_k/B be the number of mini-batches (we assume that N_k is divisible by B for simplicity). While the mini-batch gradient descent sequentially updates the M sample covariance matrices at every input sample, we update the Kalman filter gain W only when the sample index k is divisible by the mini-batch size B, using the gradient of the objective function at sample k. Several accelerated gradient methods, discussed in the next section, can be applied for updating the optimal Kalman filter gain. Sequential mini-batch gradient descent allows more opportunities to converge to a better local minimum through more frequent gain updates than the batch algorithm. This also enables us to use dynamic convergence thresholds, whereas the batch estimation algorithm performs poorly with such adaptations. The generic form of the gain update is given in (33). Equation (9) for the updated state error covariance P(k + 1|k + 1) can be replaced with the Joseph form in (34). Although (34) is more expensive computationally, it is less sensitive to round-off error because it guarantees that the updated state covariance matrix remains positive definite. Raghavan et al. [19] suggested another way to ensure positive definiteness of the updated state covariance matrix via square-root factorization algorithms.
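The Joseph-form covariance update (34) and the generic SGD gain update (33) can be sketched as follows (function names are ours):

```python
import numpy as np

def joseph_update(P_pred, W, H, R):
    """Joseph-form covariance update: (I - WH) P_pred (I - WH)^T + W R W^T.
    Symmetric and positive semi-definite even for a suboptimal gain W."""
    I = np.eye(P_pred.shape[0])
    A = I - W @ H
    return A @ P_pred @ A.T + W @ R @ W.T

def sgd_gain_step(W, grad_J, alpha):
    """Generic (mini-batch) SGD update of the filter gain, applied
    every B samples."""
    return W - alpha * grad_J
```

The accelerated methods of the next section differ only in how the step size `alpha` (and possibly a filtered gradient) is chosen at each update.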
C. ENHANCEMENT OF COMPUTATIONAL EFFICIENCY

1) Dynamic thresholds for the inner-loop termination

We have found experimentally that the termination threshold during the initial outer-loop iterations can be relaxed considerably and made progressively tighter as the iterations progress. Consequently, we apply an exponentially decaying dynamic threshold to terminate the inner-loop gain update. The outer-loop iteration-dependent dynamic threshold in (35) was found to work well in our computational experiments. It has a substantial impact on the speedup of the sequential algorithm, by up to 8 times, without loss of RMSE accuracy. However, the batch algorithm in [7] exhibits erratic behavior, with a significant increase in RMSE, when dynamic thresholds are used.
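A possible form of such a decaying threshold is sketched below; the constants and the exponential-with-floor shape are illustrative assumptions on our part, since only the qualitative behavior (loose early, tight late) is described above.

```python
import math

def dynamic_threshold(t, zeta0=1e-2, zeta_min=1e-6, decay=1.0):
    """Exponentially decaying inner-loop termination threshold as a
    function of the outer-loop iteration t, floored at zeta_min so the
    final iterations still converge tightly."""
    return max(zeta_min, zeta0 * math.exp(-decay * t))
```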
Note that the dynamic threshold applies to conditions 1 to 3 of the termination conditions below.

2) Simplified gradient computation

Computational testing has shown that the Z term in (25), needed for the gradient computation in (24), is typically small. Consequently, we simplify the gradient computation by neglecting the Z term. Although the resulting gradient is approximate, the simplification reduces the computational cost of the algorithm with little or no impact on the RMSE of the covariance estimates.

D. ACCELERATED GRADIENT METHODS
The incremental gradient algorithm in (33) can be sped up by adaptively selecting the step size α (r) .

1) Bold driver Method
The bold driver method used in [7], [11] is an adaptive step-size algorithm. Its initial step size involves a constant c > 0, a hyper-parameter N_s, the total number of observed samples N_k, and K = N_k/B. We use a smaller initial step size to prevent unstable gains when the mini-batch size is small.
The step size α(r) is determined automatically at each iteration by comparing the current objective function value J(r) to the previous value J(r−1), as shown in (38), where the maximum step size c_max is defined in (39). We set the maximum step size c_max to 0.2.
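A sketch of the bold driver rule follows; the growth and shrink factors 1.05 and 0.5 are common defaults for this heuristic, not values taken from the paper.

```python
def bold_driver_step(alpha, J_curr, J_prev, grow=1.05, shrink=0.5,
                     alpha_max=0.2):
    """Bold driver step-size rule: grow the step modestly when the cost
    decreases, cut it sharply when the cost increases, and cap the step
    at alpha_max (c_max in the text)."""
    if J_curr < J_prev:
        return min(grow * alpha, alpha_max)   # reward progress
    return shrink * alpha                     # punish an uphill step
```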

2) Constant Step Size Method
Compared to the bold driver method, the constant step size method uses a fixed step size at every iteration.The step size needs to be small to avoid unstable gains.

3) Subgradient
The subgradient method [20] is a simple algorithm for minimizing non-differentiable functions. Since zero is a lower bound on the cost function (recall that the innovations of an optimal Kalman filter are uncorrelated over time), we use the following step size for this method, where 0 ≤ α(0) ≤ 2; we used α(0) = c from (39). The subgradient method can be combined with the bold driver method, as discussed in [12]; we have not experimented with this combination.

4) RMSProp
RMSProp [10] keeps track of the moving average of the squared incremental gradients for each gain element for adapting the element-wise step size.
Here, γ = 0.9 is the default value, and ε = 10⁻⁸ prevents division by zero.
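A sketch of the element-wise RMSProp update with the stated defaults (the state variable name `v` for the moving average is ours):

```python
import numpy as np

def rmsprop_update(W, grad, v, alpha=0.001, gamma=0.9, eps=1e-8):
    """RMSProp update of the gain: v tracks a moving average of the
    squared gradients, and each element's step is scaled by the inverse
    root of its own average."""
    v = gamma * v + (1.0 - gamma) * grad ** 2
    W = W - alpha * grad / (np.sqrt(v) + eps)
    return W, v
```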

5) Adam
Adaptive Moment Estimation (Adam) [9] computes an adaptive learning rate from filtered estimates of the gradient and of the mean square value of the gradient. The method is appropriate for non-stationary systems and for problems with very noisy and/or sparse gradients. Adam keeps track of exponentially decaying averages of the past gradients, m_r, and of the past mean squared gradients, τ_r. Here, β₁ = 0.9, β₂ = 0.999, and ε = 10⁻⁸ are used as the default values in this paper.
Then, we compute the bias-corrected gradient and mean square estimates, m̂_r and τ̂_r. Finally, (33) is used with the filtered, bias-corrected gradient m̂_r instead of the incremental gradient ∇_W(r) J, and with the adaptive element-wise step size α(r)_ij.

6) Adadelta

Adadelta [8] dynamically adapts the learning rate over time using a fading memory average of all past squared gradients and squared gain (parameter) changes. The filtered squared gradient τ_r and the filtered squared gain increments ∆_r are computed recursively, with typical default decay rates δ = γ = 0.9, and determine the element-wise step size for the Adadelta algorithm. The pseudo code for the sequential mini-batch SGD estimation algorithm is included as Algorithm 2.
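A sketch of the Adam update with the stated defaults; bias correction uses the 1-based update index r, and the state variable names are ours.

```python
import numpy as np

def adam_update(W, grad, m, tau, r, alpha=0.001,
                beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam update of the gain: exponentially decaying averages of the
    gradient (m) and squared gradient (tau), bias-corrected by the
    update count r, then an element-wise scaled step."""
    m = beta1 * m + (1.0 - beta1) * grad
    tau = beta2 * tau + (1.0 - beta2) * grad ** 2
    m_hat = m / (1.0 - beta1 ** r)        # bias-corrected gradient
    tau_hat = tau / (1.0 - beta2 ** r)    # bias-corrected 2nd moment
    W = W - alpha * m_hat / (np.sqrt(tau_hat) + eps)
    return W, m, tau
```

The caller carries `m`, `tau`, and `r` across mini-batch updates; on the first update (r = 1) the bias correction makes the effective step approximately `alpha` regardless of the decay rates.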

E. SEQUENTIAL MINI-BATCH GRADIENT DESCENT METHOD IN NON-STATIONARY SYSTEMS
Killick et al. [17] proposed an algorithm for detecting multiple abrupt changes in a signal sequence. The changes can be in the mean of the signal, its root-mean-square (RMS) value, or its standard deviation (STD). The sequential algorithm of this paper is coupled with the change-point detection algorithm in [17] to detect changes in the statistical behavior of the innovations and restart the parameter estimation procedure at every change point. We used the RMS value of the innovations as the characteristic whose changes are detected.
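As a stand-in for the detector of [17] (whose PELT formulation is not reproduced here), the following naive sliding-window sketch illustrates flagging jumps in the innovation RMS; the window length and threshold factor are illustrative assumptions.

```python
import numpy as np

def detect_rms_changes(nu, window=100, factor=2.0):
    """Naive RMS change detector (illustrative only, not the method of
    [17]): flag sample k when the RMS over the window after k differs
    from the RMS over the window before k by more than `factor`."""
    n = len(nu)
    cps = []
    k = window
    while k + window <= n:
        prev = np.sqrt(np.mean(nu[k - window:k] ** 2))
        curr = np.sqrt(np.mean(nu[k:k + window] ** 2))
        if curr > factor * prev or prev > factor * curr:
            cps.append(k)
            k += window          # skip past the detected change
        else:
            k += 1
    return cps
```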
Our decision-directed estimation procedure works as follows. First, we invoke the estimation algorithm on all observation samples, assuming a stationary system. Then, the change-point detection algorithm of [17] is invoked to detect abrupt changes in the RMS value of the innovation sequence. Finally, the sequential estimation algorithm is invoked for the samples between each pair of consecutive change points; that is, the sequential estimation algorithm is restarted at every change point. Pseudo code of the sequential mini-batch SGD algorithm for a non-stationary system is shown below.

Algorithm 2 Pseudo code of the sequential mini-batch gradient descent-based adaptive Kalman filtering algorithm
1: input: W^0, α^0, B  (W^0: initial gain; α^0: initial step size; B: batch size)
2: for t = 1 to N_t do  (N_t: max. num. outer-loop iter.)
3:   for l = 1 to N_l do  (N_l: max. num. inner-loop iter.)
4:     initialize the updating index r
5:     for k = 1 to N_k do  (N_k: num. samples)
6:       compute the innovation correlations from ν(k)
7:       if k > N_b + M then  (N_b: num. burn-in samples)
         ⋮
         update the step size α(r)
       end for
     end for
     update R(t) and Q(t)
20: end for
Algorithm 3 Pseudo code of the sequential mini-batch SGD algorithm for a non-stationary system
1: input: W^0, α^0, B, N_k, N_p  (α^0: initial step size; B: batch size; N_k: num. of samples; N_p: num. of change points)
2: Obtain the innovation sequence {ν(k)}, k = 1, …, N_k  (call Algorithm 2 assuming a stationary system)
3: Detect the p abrupt change-points Ω_y, y = 1, 2, …, p  (call the change-point detection algorithm of Algorithm 4, given {ν(k)} and N_p)
4: for y = 1 to p do
5:   Select the samples from N_range = [Ω_{y−1} : Ω_y]
6:   Call Algorithm 2 with the samples from N_range
7: end for

IV. NUMERICAL EXAMPLES
In this section, we first investigate the performance of the accelerated stochastic gradient descent (SGD) methods using the five examples in [7]. These examples consist of: 1) a second-order kinematic system (a white noise acceleration or nearly constant velocity model); 2) a system described in [3]; 3) a five-state system with diagonal Q and R; 4) a detectable, but not completely observable, system; and 5) a three-state ill-conditioned system.
We perform 100 Monte Carlo simulations for each case with an assumed "patience" of 5, number of lags M = 5, the dynamic threshold ζ in (35), c_max = 0.2, c = 0.01, N_t = 20, N_l = 100, N_k = 1,000 and N_s = 1,000. A lag of M = 30 is applied to Case 1 to conform with [7], and we used 10,000 samples for Case 3. The ill-conditioned Case 5 was simulated with 200 Monte Carlo runs to be consistent with [4], [7]. We show that the SGD methods are considerably faster than the batch method used in [7], especially for the difficult Cases 3-5. We also show that there is a modest speedup of about 10% when the Z term in (24) is neglected, with no loss in RMSE accuracy.
For each case, the best gradient descent method and the optimal mini-batch size differed somewhat in terms of RMSE and processing time. However, in aggregate, the Adam update with a mini-batch size of 64 works well across all the example cases. Consequently, we show the computational speedup of the Adam version of the sequential mini-batch algorithm over the batch method. The SGD algorithms are particularly effective on Case 3 and on the ill-conditioned estimation problems represented by Cases 4 and 5.
We also demonstrate the utility of the sequential method in estimating Q and R in non-stationary systems by coupling it with a change detection method [17], [18] to obtain a decision-directed adaptive Kalman filter. For the non-stationary system, we apply our proposed method to the system used in [3]. In this paper, computational simulations were run on a computer with an Intel Core i7-8665U processor and 16 GB of RAM.

A. COMPARISON OF ACCELERATED SGD METHODS AND THE BATCH METHOD

1) Case 1
The system is described as follows. Fig. 1 shows the RMSE of the accelerated gradient descent methods for estimating Q and R for varying mini-batch sizes. In estimating Q, the RMSE values of all sequential algorithms are larger than that of the batch algorithm, but the RMSE itself is very small. In estimating R, RMSProp and Adam show good performance over all mini-batch sizes. The Constant, Bold driver and Subgradient methods have slightly higher RMSE for Case 1. All the update methods resulted in consistent state estimates, as measured by the normalized innovation squared (NIS) metric [16]. Table 1 shows the Monte Carlo simulation results for estimating the noise parameters using Adam for various mini-batch sizes. Adam with a mini-batch size of 128 shows a good trade-off between computational efficiency and RMSE for this example.
Fig. 2 shows the averaged NIS of the SGD (Adam; mini-batch size of 64) method and the 95% probability region. As with the batch method, the Adam SGD-based Kalman filter is consistent.

2) Case 2

The system simulated for Case 2 is assumed to be as follows.
In Case 2, RMSProp shows performance similar to or better than that of the batch method in estimating Q and R, as shown in Fig. 3. Adam provides very good estimates of Q and R when the mini-batch size is larger than 64. The Subgradient, Constant and Bold driver updates tend to have higher RMSE values than the batch method for the range of mini-batch sizes examined, as shown in Fig. 3, but the state estimates are consistent as measured by the NIS.
As one can see from Table 2, Adam shows improved computational efficiency over the batch method for all mini-batch sizes, but it is more effective when the batch size is larger than 64. Note that RMSProp shows the lowest RMSE in estimating the noise parameters among the accelerated SGD methods for Case 2. In particular, RMSProp with a mini-batch size of 64 provided the best parameter estimates with the least computation time of 114 seconds for all 100 MC runs, i.e., 1.14 seconds per run (not shown).

3) Case 3

Table 3a and Table 3b show the simulation results with 100 Monte Carlo runs for estimating the noise parameters based on Adam using 10,000 samples. Adam with a mini-batch size of 64 shows a good trade-off between computational efficiency and RMSE among all the accelerated SGD methods.

4) Case 4
The unobservable (but detectable) system for Case 4 is described as follows. Fig. 5 clearly shows that the SGD methods can estimate Q and R with lower RMSE than the batch method. In this example, the Constant and Bold driver methods have lower RMSE values for estimating Q and R for mini-batch sizes less than 32. The RMSE values in estimating the noise parameters are smaller with larger mini-batch sizes, as shown in Table 4. As in the case of the batch gradient descent method, better estimates are obtained by the addition of the regularization term λ_Q = 0.1.

5) Case 5

The system matrices for Case 5 are assumed to be as follows. As shown in Table 5, the sequential algorithm improves the computational efficiency by up to 8 times when compared to the batch method on Case 5. Note that the Adam update with a mini-batch size of 128 estimates the noise parameters closest to the corresponding true values. We can improve the estimation accuracy by using the regularization term λ_Q = 0.3.

B. COMPUTATIONAL SPEEDUP ANALYSIS

1) Dynamic thresholds
Fig. 7 shows the computational efficiency of the Adam version of the sequential algorithm with a mini-batch size of 64 over the batch algorithm in [7] for all five example cases. The proposed method improves the computational efficiency by a factor of 4.4 on Case 3 and by a factor of 8 on Cases 4 and 5 when compared to the batch gradient descent method. The dynamic threshold has a substantial impact on the speedup of the sequential algorithm, but it is not suitable for the batch estimation algorithm. Table 6 shows the estimation results of the batch algorithm with the dynamic threshold applied for Case 2 and Case 4, respectively. In each case, the batch estimation algorithm with a dynamic threshold estimates parameters that are far from their corresponding true values. Consequently, a fixed threshold of 10⁻⁶ is used for conditions 1 to 3 discussed in Section III-C1 for terminating the batch algorithm.

2) Simplified gradient computations

Fig. 8 shows the computational efficiency gained by simplifying the gradient computations when the Adam update method with a mini-batch size of 64 is applied to every example case. The processing time is reduced by less than 10%, with almost the same RMSE, by neglecting the Z term in the gradient computations (24).

C. ESTIMATION IN NON-STATIONARY SYSTEMS

We simulated 100 MC runs with 5,000 observation samples in each run, using the system matrices of Case 2 in (52) and varying Q and R. We considered three non-stationary scenarios: 1) only Q changes every 1,000 samples; 2) only R changes every 1,000 samples; and 3) both Q and R change every 1,000 samples. In the estimation procedure, we set the outer-loop iterations N_t = 20, the inner-loop iterations N_l = 100, the number of burn-in samples N_b = 50, c = 0.2, and the number of lags M = 5. The iteration-dependent dynamic thresholds for convergence were also applied.
For the non-stationary systems, both RMSProp and Adam showed the best performance among all the accelerated SGD algorithms. Here, we show the statistical results using the Adam update with a mini-batch size of 64. Given the innovation correlations, the change-point detection algorithm based on the Adam update with a mini-batch size of 64 is able to track the multiple abrupt changes accurately, as shown in Fig. 9.

1) Varying Q

In this scenario, the process noise covariance Q changes every 1,000 samples, while the measurement noise covariance R is constant over all 5,000 samples, as shown in Fig. 10. The decision-directed estimation algorithm can track Q and R quite well and provides a consistent state estimator. Table 7a shows the MC simulation results for the non-stationary system with varying Q; the parameter estimates are very close to their corresponding true values.

2) Varying R

In this scenario, the measurement noise covariance R changes every 1,000 observation samples, while the process noise covariance Q is kept constant over the 5,000 samples. Fig. 11 shows that the sequential mini-batch gradient descent algorithm can track Q and R correctly. As shown in Table 7b, the noise parameters are estimated quite accurately by the sequential algorithm.
3) Varying Q and R

The situation where both Q and R change is more difficult to track than when either Q or R is constant over the 5,000 samples. However, the sequential algorithm, when coupled with the change-point detection algorithm, can track Q and R accurately, as shown in Fig. 12. Fig. 12c shows the averaged NIS of the SGD (Adam; mini-batch size of 64) algorithm in this scenario. The trajectories of the Q and R estimates can be smoothed by a simple first-order fading memory filter. Fig. 13 shows the trajectory results with a smoothing weight of 0.9.
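The first-order fading memory smoother mentioned above can be sketched as follows (the function name is ours):

```python
import numpy as np

def fading_memory_smooth(x, weight=0.9):
    """First-order fading memory filter for smoothing an estimate
    trajectory: y(k) = weight * y(k-1) + (1 - weight) * x(k)."""
    x = np.asarray(x, dtype=float)
    y = np.empty_like(x)
    y[0] = x[0]
    for k in range(1, len(x)):
        y[k] = weight * y[k - 1] + (1.0 - weight) * x[k]
    return y
```

Applied element-wise to the per-sample Q and R estimates, a weight of 0.9 suppresses estimation jitter while still following the post-change-point level within a few tens of samples.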

V. CONCLUSION AND FUTURE WORK
In this paper, we presented stochastic gradient descent (SGD) algorithms for noise covariance estimation in adaptive Kalman filters that are an order of magnitude faster than the batch method for the same or better root mean square error (RMSE) and are applicable to non-stationary systems where the noise covariances can occasionally exhibit abrupt, but finite, changes. The computational efficiency of the new algorithms stems from adaptive thresholds for convergence, recursive fading memory estimation of the sample cross-correlations of the innovations, and accelerated SGD algorithms. In the future, we plan to pursue a number of research avenues, including 1) estimating Q and R using one-step lag smoothed residuals; 2) automatic model selection from a library of models; 3) formalizing decision-directed adaptive Kalman filtering with probabilistic change-point detection and state estimation; and 4) exploring the utility of the covariance estimation algorithm in adaptive interacting multiple model (IMM) filters.


FIGURE 1: Performance of accelerated gradient descent methods on Case 1

FIGURE 2: Averaged NIS of SGD method for Case 1

FIGURE 3: Performance of accelerated gradient descent methods on Case 2

FIGURE 5: Performance of accelerated gradient descent methods on Case 4

FIGURE 6: Performance of accelerated SGD methods (the RMSE values of the estimated Q and R decrease as the mini-batch size increases, and the RMSE of all SGD methods is smaller than that of the batch method)

FIGURE 7: The improvement of computational efficiency over the 5 example cases

FIGURE 8: Computational efficiency with simplified gradient computations

FIGURE 9: The change-points in innovation correlations in a non-stationary system

FIGURE 10: Trajectory of Q and R in the varying Q system

FIGURE 11: Trajectory of Q and R in the varying R system; averaged NIS of the SGD method in the non-stationary system

FIGURE 12: Trajectory of Q and R in the varying Q and R system

TABLE 7: Monte Carlo simulation for the non-stationary system (100 runs; 5,000 samples; M = 5): (a) the case of varying Q; (b) the case of varying R