IR-VIO: Illumination-Robust Visual-Inertial Odometry Based on Adaptive Weighting Algorithm With Two-Layer Confidence Maximization

Illumination change, image blur, and fast motion dramatically degrade the performance of visual-inertial navigation systems (VINS). This article presents a new illumination-robust visual-inertial odometry (IR-VIO) based on an adaptive weighting algorithm with two-layer confidence maximization. First, to prevent the VIO performance degradation caused by poor image quality in complex scenes and by ignoring the confidence differences among feature points, we develop a novel adaptive weighting algorithm on the multisensor layer and the visual feature layer to better fuse multisensor information and maximize the overall confidence of the VIO system. Second, to address the difficulty of image feature tracking and the excessive image noise in illumination-changing scenes, an image enhancement algorithm is introduced to enhance consecutive images to the same brightness level, and a block noise removal algorithm with a constraint protection mechanism is proposed to dynamically remove noise points. Finally, experimental results on a public dataset and in real-world environments demonstrate that IR-VIO has superior accuracy and robustness compared with state-of-the-art methods.

I. INTRODUCTION

State estimation has played an increasingly important role in robot navigation, autonomous driving, virtual reality, and augmented reality in recent years [1], [2], [3], [4], [5], [6]. Since a monocular VINS can achieve accurate state estimation using only a monocular camera and an inertial measurement unit (IMU), offering low cost, small size, and a simple hardware setup, it has received extensive attention from researchers in the fields of computer vision and robotics.
There are different schemes for VINS to fuse visual and inertial measurements, which can be divided into loosely coupled and tightly coupled approaches. Loosely coupled methods fuse the outputs of two independent systems: a visual motion estimation system and an inertial motion estimation system. In contrast, tightly coupled methods use the raw data from the camera and IMU to jointly estimate the pose and obtain more accurate results. Early works were generally filter-based tightly coupled methods because of limited computational resources. Nowadays, the growth of computational resources has inspired the rise of optimization-based tightly coupled methods, which jointly optimize multiframe poses by minimizing visual and IMU measurement residuals [7]. In this article, we focus on optimization-based tightly coupled methods.
In the past few years, many excellent optimization-based tightly coupled monocular VINS have made great breakthroughs in pose estimation [8], [9], [10]. However, the following challenges still hinder the performance of VINS. The first challenge is the degradation of VIO performance caused by poor image quality in complex scenes and by ignoring the confidence differences among feature points [8]. There is an urgent need for a method that ingeniously fuses multisensor information and maximizes the overall confidence of VINS. The second challenge is the failure of image feature matching or tracking in illumination-changing scenes [11], [12]. Drastic illumination change may cause VINS to collapse because too few geometric constraints participate in the optimization. The third challenge is that image noise degrades the performance of VINS. Excessive image noise may lead to more outliers in feature matching or tracking, resulting in performance degradation or failure of VINS. Next, we review some classic solutions to these three challenges.
For the first challenge, representative work can be traced back to [13], which uses a small number of feature points to estimate pose due to limited resources and computational performance. The system selects a subset from the available landmark points by constructing a demand matrix to minimize pose uncertainty. Recently, the approaches in [14] and [15] use the minimum eigenvalues and a log determinant (Max logDet) to maximize the confidence gain of pose estimation and then guide feature selection, and high accuracy can be maintained with low overhead and latency accordingly. Different from their work, this article aims to design appropriate adaptive weights to reasonably fuse multisensor information by measuring the confidence of the camera and IMU as well as the confidence of different feature points.
Many existing works focus on the second challenge. By combining the extrinsic transformation matrix and the reprojection model, Zhu et al. [11] proposed an initial optical flow prediction algorithm based on IMU preintegration, which greatly improved the success rate of feature point tracking in environments with drastic illumination change. The work in [16] designed a deep convolutional neural network (CNN) model to adaptively change the camera exposure time and gain, which increases the number of high-quality features. Unlike the above methods, this article aims to adopt an image enhancement algorithm to enhance consecutive images to the same brightness level, so that feature points can be easily and successfully matched or tracked.
Many works have been devoted to addressing the third challenge: image denoising. Classical methods use hand-crafted low-pass filters to remove image noise. Recently, FFDNet [17] is a flexible denoising CNN whose input is an adjustable noise level map, making the network suitable for images with different noise levels. However, the network has poor generalization due to limited training data and resources. Different from the above methods, this article aims to design a real-time and efficient block noise removal algorithm to eliminate feature points in large noise areas and reduce the impact of image noise on the VIO system.
To address the above challenges, this article proposes IR-VIO. The main contributions are as follows.
1) We develop a novel adaptive weighting algorithm with two-layer confidence maximization to better fuse multisensor information and maximize the overall confidence of the VIO system, which significantly prevents the VIO performance degradation caused in existing works by poor image quality in complex scenes and by ignoring the confidence differences among feature points.

2) We introduce an image enhancement algorithm to enhance consecutive images to the same brightness level and propose a block noise removal algorithm with a constraint protection mechanism to dynamically remove noise points, which effectively address the difficulty of image feature tracking and the excessive image noise in illumination-changing scenes.

3) Experimental results on a public dataset and in real-world environments demonstrate that the proposed method has superior accuracy and robustness compared with state-of-the-art methods.

The rest of this article is organized as follows. Section II introduces the system framework. Section III presents the novel adaptive weighting algorithm with two-layer confidence maximization in detail. Section IV presents the image enhancement algorithm and the block noise removal algorithm. Experimental results are shown in Section V. Finally, Section VI concludes this article.

II. SYSTEM FRAMEWORK
The system framework of IR-VIO is shown in Fig. 1. IR-VIO is developed based on VINS-Mono [8] and consists of measurement preprocessing and local visual-inertial odometry; the blue shaded blocks represent our contributions. The pipeline is as follows. First, the image from the camera is enhanced by the image enhancement algorithm and then sent to the feature point detection and feature point tracking modules. With the image enhancement algorithm, the tracking module can cope with drastic illumination change. Second, the block noise removal algorithm is used to obtain high-reliability feature point pairs and reduce the impact of noise on VIO. Because of the information fusion strategy proposed in Section IV-A for fusing enhanced and raw images, measurement preprocessing adds another thread that uses only raw images. Third, VIO is initialized with the feature point pairs and IMU measurements of adjacent image frames. After initialization, the IMU measurements and feature point pairs form the IMU and visual residuals, respectively, which are sent to the sliding window optimization of the local visual-inertial odometry to jointly solve for the pose. In the solution process, the weights of the different sensor constraints and of the different feature points are adjusted in real time by the adaptive weighting algorithm, which significantly improves the confidence of the system and prevents the VIO performance from being degraded by illumination change, image blur, and fast motion.

III. ADAPTIVE WEIGHTING ALGORITHM WITH TWO-LAYER CONFIDENCE MAXIMIZATION
Existing works suffer from VIO performance degradation caused by poor image quality in complex scenes and by ignoring the confidence differences among feature points, so there is an urgent need for a method that ingeniously fuses multisensor information and maximizes the overall confidence of the VIO system. In this section, we develop a novel adaptive weighting algorithm with two-layer confidence maximization. The first layer is the multisensor layer: to prevent VIO from trusting poor image information too much in complex scenes, the optimization weights of the different sensors are adjusted according to the confidence of the pose covariance matrix. The second layer is the visual feature layer: to prevent VIO from overly trusting low-reliability feature points, the weights of different feature points are adjusted according to their tracking counts and the number of frames across which their visual constraints are constructed.

A. Sliding Window Optimization
After initialization, IR-VIO performs sliding window optimization. We use visual-inertial bundle adjustment to minimize all measurement residuals in the sliding window: the visual residual r_C(ẑ^{c_j}_l, X) constructed from the observation of the l-th landmark point in the j-th image frame, the IMU residual r_B(ẑ^{b_k}_{b_{k+1}}, X) constructed from the IMU measurements between two consecutive images, and the marginalization (prior) residual r_p. Together with the adaptive weight ω^{ijn}_C, which is introduced in Section III-C and calculated by (9), a more accurate maximum a posteriori estimate is obtained by

min_X { ‖r_p − H_p X‖² + Σ_{k∈B} ‖r_B(ẑ^{b_k}_{b_{k+1}}, X)‖²_{P^{b_k}_{b_{k+1}}} + Σ_{(l,j)∈C} ω^{ijn}_C ρ(‖r_C(ẑ^{c_j}_l, X)‖²_{P^{c_j}_l}) }   (1)

where X is the state vector in the sliding window, B is the set of all IMU measurements, and C is the set of features observed at least twice in the current window. ρ(·) denotes the Huber norm, H_p is the Jacobian matrix of the prior information, P^{b_k}_{b_{k+1}} is the covariance of the IMU measurement residuals, and P^{c_j}_l is the covariance of the visual residuals, which is set to a constant.
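As an illustration of how the adaptive weight enters this objective, the following sketch (ours, not the authors' implementation; `window_cost`, `huber`, and their arguments are hypothetical names) evaluates a cost of the same shape, with IMU terms at fixed weight 1 and visual terms scaled by an adaptive weight:

```python
import numpy as np

def huber(s, delta=1.0):
    # Huber norm applied to a squared Mahalanobis distance s = ||r||^2_P
    return s if s <= delta**2 else 2.0 * delta * np.sqrt(s) - delta**2

def window_cost(r_prior, imu_terms, visual_terms):
    """Evaluate a weighted sliding-window objective (illustrative sketch).

    imu_terms    : list of (r, P) pairs, weight fixed at 1
    visual_terms : list of (r, P, w) triples, w = adaptive visual weight
    """
    cost = float(r_prior @ r_prior)               # prior/marginalization term
    for r, P in imu_terms:
        cost += float(r @ np.linalg.solve(P, r))  # Mahalanobis norm ||r||^2_P
    for r, P, w in visual_terms:
        cost += w * huber(float(r @ np.linalg.solve(P, r)))
    return cost
```

In a real solver these terms would be minimized over the window states X; here the sketch only shows where the per-residual weight multiplies the robustified visual term.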

B. First Layer: Multisensor Layer
Traditional methods usually fix the weight ratio between the visual and IMU constraints in the sliding window. If the image quality changes due to illumination change, image blur, and so on, the multisensor information cannot be fused properly, which reduces the accuracy of pose estimation.
In this subsection, we design the multisensor-layer adaptive weighting algorithm. When the image quality deteriorates, the number of successfully tracked feature points is very small, and most of them are heavily corrupted by noise. Although the Huber robust kernel used in VINS-Mono can reduce the impact of outliers on pose estimation, it still cannot handle such noisy data well. To reduce the interference of unreliable visual information with pose estimation, we lower the weight of the visual constraints and rely more on the IMU data during the short period between two image frames. Conversely, when the image quality is good, the pose covariance matrix constructed from vision has high confidence, so we increase the weight of the visual constraints and rely more on the visual information. We next describe how to calculate the confidence of the pose covariance matrix and design the weights of the different sensor constraints.
During feature detection and camera intrinsic parameter calibration, the measured positions of 2-D feature points contain errors. It is usually assumed that the noise in the u and v coordinate directions is independent and follows N(0, σ²_I). This noise propagates through the optimization process and perturbs the pose estimate.
The Jacobian matrix J_p of the p-th feature point is the partial derivative of the visual reprojection error with respect to the pose. The Jacobian matrix of q feature points is defined as

J = [J_1^T  J_2^T  ···  J_q^T]^T.   (2)

According to the derivation in [18], the first-order approximation of the propagation of the image measurement error to the pose parameters is given by the covariance matrix

Σ_θ = (J^T Σ_I^{-1} J)^{-1}   (3)

where Σ_I is the covariance matrix of the image measurements, which represents the error in the measured 2-D feature points. Since these errors are assumed to be independent of each other, Σ_I is a diagonal matrix with diagonal entries σ²_I. Therefore, the expression for the 6-DoF (degrees of freedom) pose covariance matrix simplifies to

Σ_θ = σ²_I (J^T J)^{-1}.   (4)

The matrix Σ_θ represents the 6-DoF confidence ellipsoid in pose space. The volume and average radius of the confidence ellipsoid are commonly used to evaluate the quality of the pose covariance matrix [19]. The average radius η of the confidence ellipsoid is calculated by

η = √α (det Σ_θ)^{1/(2τ)}   (5)

where α is the quantile of a chi-square distribution with τ degrees of freedom and upper tail probability ε, a constant that depends on τ and ε; det(·) is the determinant of a square matrix; and τ = 6 in this article. The works in [14] and [15] used the logarithm of the confidence ellipsoid volume to select the subset of feature points most favorable for the pose solution, reducing the latency of pose estimation. Unlike these works, this article uses the reciprocal of the average radius to estimate the confidence level of the pose covariance matrix and then adjusts the weights of the different sensor constraints to improve the accuracy of pose estimation. Since α is a constant, we only need to evaluate the confidence level through the reciprocal of (det Σ_θ)^{1/(2τ)}.
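With the diagonal-noise simplification Σ_θ = σ²_I (JᵀJ)⁻¹, the confidence measure reduces to a few lines of linear algebra. The sketch below is our illustration (the function name is an assumption); it drops the constant √α, which cancels when comparing frames:

```python
import numpy as np

def pose_confidence(J, sigma_I=1.0, tau=6):
    """Confidence of the pose covariance matrix (illustrative sketch).

    J       : (2q, tau) stacked Jacobians of q reprojection errors w.r.t. pose
    sigma_I : standard deviation of the 2-D feature measurement noise
    Returns the reciprocal of (det Sigma_theta)^(1/(2*tau)), i.e. the
    inverse average radius of the confidence ellipsoid up to the
    constant sqrt(alpha).
    """
    Sigma_theta = sigma_I**2 * np.linalg.inv(J.T @ J)  # 6-DoF pose covariance
    return 1.0 / np.linalg.det(Sigma_theta) ** (1.0 / (2 * tau))
```

More (or better-conditioned) feature observations shrink the covariance, so the returned confidence grows, which is the behavior the weighting scheme relies on.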
Since the average radius of the pose covariance matrix constructed from the IMU is unchanged, the weight of the IMU constraints is fixed at 1. Using the pose of the (i−1)-th frame in the sliding window and the IMU information between the (i−1)-th and i-th frames, the initial value of the pose of the i-th frame can be inferred. Based on this initial value and the feature point pairs matched between the (i−1)-th and i-th frames, the pose covariance matrix Σ^i_θ can be constructed, and the weight ω^i_α of the visual constraint can then be calculated by

ω^i_α = (1/η_i) / Λ   (6)

where η_i is the average radius of Σ^i_θ and Λ is the confidence value of the pose covariance matrix at medium image quality, which is set according to the quality of the camera.
Considering the errors in camera intrinsic parameter calibration and in feature point extraction and tracking, the visual information is not absolutely accurate, and blindly trusting it while ignoring the IMU may reduce localization accuracy. Therefore, we set an upper limit δ_α and a lower limit 1/δ_α on the visual constraint weight. If δ_α is too small, the effect of the weighting algorithm is not obvious; if it is too large, the limits have no restrictive effect.
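A minimal sketch of the multisensor-layer weight with these limits follows (our illustration: the clamping to [1/δ_α, δ_α] is from the text, while normalizing the confidence 1/η by Λ is an assumed form, since the paper's exact expression is only summarized here):

```python
def visual_weight(confidence, Lambda, delta_alpha):
    """Multisensor-layer weight for the visual constraint (illustrative).

    confidence  : 1/eta for the current frame (higher = better image quality)
    Lambda      : confidence at medium image quality (camera-dependent)
    delta_alpha : bound; the weight is clamped to [1/delta_alpha, delta_alpha]
    """
    w = confidence / Lambda          # assumed normalization, weight 1 at medium quality
    return min(max(w, 1.0 / delta_alpha), delta_alpha)
```

With this form, frames of medium quality get weight 1, poor frames fall toward 1/δ_α (so the IMU dominates), and good frames rise toward δ_α.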
From (6), the multisensor-layer adaptive weight ω^{ij}_α constructed from the feature point pairs matched between the i-th and j-th frames is obtained in (7). Fig. 2 illustrates the multisensor-layer adaptive weighting algorithm. The image frames, inertial data, and prior information jointly constrain the pose nodes in the sliding window. Since image quality usually deteriorates in complex scenes, we apply the adaptive weight ω^{ij}_α to the constraints of the different sensors, which adaptively fuses the multisensor information and maximizes the overall confidence of the VIO system.

C. Second Layer: Visual Feature Layer
Feature-based SLAM methods usually assume that all features are equally reliable and give them the same weight in the optimization. However, the results in [20] show that feature points that are easier to track are more reliable: their reprojection error is smaller and inversely proportional to the number of times they are tracked. In addition, each tracking step introduces an error. Therefore, the more image frames a visual reprojection residual spans in the sliding window optimization, the less reliable the visual constraint is. In summary, it is reasonable and necessary to set different weights for different feature points.
In this subsection, we propose the visual-feature-layer adaptive weighting algorithm, whose purpose is to weight feature points according to their tracking counts and the number of frames across which their visual constraints are constructed. When a feature point's tracking count is large and its constraint spans few frames, the weight of the visual constraint should be set higher to trust the feature point more; otherwise, the weight should be set smaller. The algorithm further improves the confidence and localization accuracy of the VIO system.
Motivated by this, the visual-feature-layer adaptive weight ω^{ijn}_β of the visual constraint constructed from a feature point pair in the i-th and j-th frames with tracking count n is given in (8). Considering that the reliability of a feature point cannot increase without bound with its tracking count, we set an upper limit δ_β on the adaptive weight. If δ_β is too small, the effect of the weighting algorithm is not obvious; if it is too large, it has no restrictive effect.

Fig. 3 illustrates the visual-feature-layer adaptive weighting algorithm. The 3-D landmark P is projected to the pixel position of the blue box in the C_0 frame image, and the optical flow tracking algorithm tracks the feature point between image frames; the larger the number of tracked frames, the greater the cumulative error. The pink circle points are the pose nodes to be solved, and the adaptive weight ω^{ijn}_β is applied to the visual constraints constructed from the observations of two image frames. The algorithm adaptively adjusts the weights of different feature points according to their reliability differences and further improves the confidence and localization accuracy of the VIO system.
According to (7) and (8), the visual hybrid adaptive weight ω^{ijn}_C constructed from a feature point with tracking count n observed in the i-th and j-th frames is

ω^{ijn}_C = ω^{ij}_α · ω^{ijn}_β.   (9)
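The two layers then combine into the hybrid weight that scales each visual residual. The sketch below is purely illustrative: the specific feature-layer formula (growing with tracking count n, shrinking with the frame span j − i, capped at δ_β) and the product form of the hybrid weight are our assumptions, not reproductions of the paper's (8) and (9):

```python
def feature_weight(n, i, j, delta_beta):
    """Visual-feature-layer weight (illustrative form, not the paper's Eq. (8)):
    grows with the tracking count n, shrinks with the frame span j - i,
    and is capped at delta_beta."""
    w = n / float(j - i)
    return min(w, delta_beta)

def hybrid_weight(w_alpha, w_beta):
    # Hybrid weight combining the multisensor-layer and feature-layer weights
    # (product form assumed)
    return w_alpha * w_beta
```

Any concrete formula with the same monotonicity (more trackings up, wider spans down, bounded above) would reproduce the qualitative behavior described in the text.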

IV. IMAGE ENHANCEMENT ALGORITHM AND BLOCK NOISE REMOVAL ALGORITHM
Difficult feature point tracking and excessive image noise in illumination-changing scenes significantly limit the performance of the VIO system. In this section, we introduce an image enhancement algorithm and propose a novel block noise removal algorithm with a constraint protection mechanism. The former enhances consecutive images to the same brightness level, making it easier for feature points to be tracked successfully; the latter dynamically removes feature points in strong-noise areas and reduces the impact of noise on VIO. The principles of the two algorithms are described below.

A. Image Enhancement Algorithm
VIO based on feature point methods, such as VINS-Mono [8] and Basalt [10], usually uses optical flow algorithms for feature point tracking. Compared with methods based on descriptor matching, these methods are fast but very sensitive to illumination change. Drastic illumination change may cause feature point tracking to fail, which in turn leads to the failure of VIO.
A deep network based on zero-shot learning for low-light image enhancement was proposed in [21]; it can be trained end-to-end with zero-reference images, without any paired or even unpaired data. Building on this method, we modify the network to improve its performance and computational efficiency and make it more suitable for VIO systems. Specifically, to match the image types common in VIO, we modify the network input and output: the input is a single-channel grayscale image and the output is a pixelwise curve parameter map. Furthermore, we propose an information fusion strategy to fuse the raw and enhanced images. First, we use the adaptive weighting algorithm proposed in Section III, namely (6)-(9), to calculate the weights of the visual reprojection residuals of the raw and enhanced images in real time. In illumination-changing scenes, thanks to the robustness of the image enhancement algorithm, the pose covariance matrix constructed from the enhanced image has higher confidence, so the residual weight of the enhanced image calculated by (9) is larger. Conversely, in scenes with good lighting, the residual weight of the raw image calculated by (9) is larger because of the noise introduced by the image enhancement algorithm. The pose is then estimated jointly with the IMU measurement residuals. This strategy further improves the localization accuracy and robustness of the VIO system in complex scenes.
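The zero-reference approach of [21] brightens an image by repeatedly applying a quadratic curve with a learned per-pixel parameter map. A minimal sketch of that curve application (our simplification: a single map A is reused every iteration, whereas the network in [21] predicts one map per iteration):

```python
import numpy as np

def apply_curves(img, A, iterations=8):
    """Apply the pixelwise light-enhancement curve of [21] (sketch).

    img : grayscale image normalized to [0, 1]
    A   : curve parameter map (or scalar), values in [-1, 1]
    The curve LE(x) = x + A * x * (1 - x) keeps values in [0, 1],
    brightens dark pixels for A > 0, and leaves 0 and 1 fixed.
    """
    x = img.astype(np.float64)
    for _ in range(iterations):
        x = x + A * x * (1.0 - x)
    return np.clip(x, 0.0, 1.0)
```

Because the curve is monotone and maps [0, 1] to itself, consecutive frames enhanced this way end up at a comparable brightness level, which is exactly what the tracking front end needs.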

B. Block Noise Removal Algorithm
To reduce the noise after image enhancement, we propose a new block noise removal algorithm with constraint protection mechanism. First, for image noise estimation, we refer to the filter-based method in [22] to estimate the image noise level. The advantages of this method are its low computational cost and high real-time performance.
Based on the designed noise estimation kernel k and the homogeneous region mask H, we estimate the noise level σ^I_noise of an image I by

σ^I_noise = √(π/2) · (1/(6 N_I)) · Σ_{(u,v)∈H} |(I × k)(u, v)|   (10)

where N_I is the effective number of pixels of the homogeneous region mask H, | · | denotes the absolute value operator, and × denotes the convolution operator. We estimate the noise level of the image I by the magnitude of σ^I_noise.

After the image is enhanced, the noise in some areas increases significantly, especially in black object areas. We divide the image into M × N blocks and calculate the noise level of each image block I_k by (10). The noise of each enhanced image block I^e_k is compared with the noise of the corresponding raw image block I^r_k. When the former is too large (ratio > μ), the feature points in I^e_k are removed. To avoid insufficient constraints for solving the 6-DoF pose caused by too few remaining feature points after denoising, we design a constraint protection mechanism. First, all image blocks are sorted in descending order of noise ratio, and feature points in blocks with large noise ratios are removed preferentially in this order. Then, it is guaranteed that the number of retained feature points is not less than the threshold μ_cp whenever the original number exceeds μ_cp. In this article, M = 6, N = 8, μ = 10.8, and μ_cp = 8.

Fig. 4 illustrates the block noise removal algorithm. The raw and enhanced images are divided into M × N image blocks, and the circled points are successfully tracked feature points. Noise regions are identified by the noise ratio, and strong and weak noise points in the enhanced image are distinguished by the descending order of the noise ratio. With the constraint protection mechanism, the orange weak noise points are preserved to ensure sufficient constraints when solving for the pose.
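The block-wise procedure above can be sketched as follows (our illustration, not the authors' code: the kernel is the standard fast Laplacian-based noise estimator of the kind referenced in [22], and `filter_features` with its block-index mapping is a hypothetical interface):

```python
import numpy as np

# 3x3 kernel of the fast Laplacian-difference noise estimator;
# the sqrt(pi/2)/6 scaling converts mean absolute response to a noise level.
K = np.array([[1, -2, 1], [-2, 4, -2], [1, -2, 1]], dtype=np.float64)

def block_noise(img):
    """Estimate the noise level of one image block (mask omitted for brevity)."""
    H, W = img.shape
    acc = np.zeros((H - 2, W - 2))
    for dy in range(3):                      # valid-region convolution with K
        for dx in range(3):
            acc += K[dy, dx] * img[dy:dy + H - 2, dx:dx + W - 2]
    return np.sqrt(np.pi / 2.0) * np.abs(acc).mean() / 6.0

def filter_features(features, raw_blocks, enh_blocks, mu=10.8, mu_cp=8):
    """Drop features in blocks whose enhanced/raw noise ratio exceeds mu,
    removing the noisiest blocks first, but never letting the retained
    feature count fall below mu_cp (constraint protection).
    `features` maps block index -> list of feature points in that block."""
    ratios = {i: block_noise(enh_blocks[i]) / max(block_noise(raw_blocks[i]), 1e-12)
              for i in features}
    kept = dict(features)
    total = sum(len(v) for v in kept.values())
    for i in sorted(ratios, key=ratios.get, reverse=True):
        if ratios[i] <= mu or total - len(kept[i]) < mu_cp:
            continue
        total -= len(kept.pop(i))
    return kept
```

Sorting by ratio before removal is what lets the protection threshold discard the worst blocks first while keeping the mildest ones when the feature budget runs low.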

V. EXPERIMENTS
In this section, the performance of IR-VIO is evaluated on a public dataset and in real-world experiments. First, tests are conducted on the EuRoC public dataset [23], a widely used visual-inertial dataset for micro aerial vehicles in the VIO field. It includes 11 sequences from three different scenes, consisting of stereo grayscale images (20 Hz) and synchronized inertial measurements (200 Hz). We provide qualitative and quantitative comparisons and analyses in six aspects; all methods use images from the left camera as visual input. Then, we perform experiments in real-world illumination-changing scenes to further demonstrate that IR-VIO is robust to illumination change and has superior localization performance. Since we focus on visual-inertial odometry, our method and all compared VIO methods turn off loop detection in all tests for a fair comparison. The algorithm parameter values, real-time analysis, and more detailed experimental results are provided in the supplementary material.

1) Image Enhancement Performance Test in Drastically Illumination-Changing Scenes: The 11 sequences of the EuRoC dataset cover the following challenges: illumination change, motion blur, and sparse texture. We choose the V1_03_difficult sequence, which has the most drastic illumination change, to test the performance of the image enhancement algorithm. As shown in Fig. 5, consecutive raw images with significantly different average brightness are enhanced to the same brightness level, which is attributed to the carefully designed enhancement network and exposure control loss. These results demonstrate that the algorithm can significantly reduce the interference of illumination change with the camera.

2) Comparison of Feature Tracking Performance and Success Rate: In this section, we compare feature tracking performance on the challenging V1_03_difficult sequence. Note that IR-VIO here is the VINS-Mono variant that only replaces raw images with enhanced images. We first examine a portion of this sequence and randomly pick four fragments with illumination change, as shown in Fig. 6. The feature tracking success rate (TSR) is dramatically improved. Outside the illumination-changing periods, during 15.0 ∼ 16.0 s the scene is in low light and the TSR of IR-VIO is higher than that of VINS-Mono, while during 16.6 ∼ 17.0 s, with good illumination, the TSR of IR-VIO is almost equal to that of VINS-Mono. The TSR comparison over the whole sequence is shown in Fig. 7; the overall TSR of IR-VIO is clearly higher than that of VINS-Mono. On this sequence, the average TSR of IR-VIO is 20.2% higher than that of VINS-Mono. Owing to the image enhancement algorithm, IR-VIO can adapt to drastic illumination change, which significantly improves tracking performance.

3) Comparison of Multisensor Layer Weights in Complex Scenes: To explore the effectiveness of the multisensor-layer adaptive weighting algorithm, we again select the challenging V1_03_difficult sequence. With the IMU confidence constant, the higher the quality and confidence of the image, the higher the visual weight. The comparison results are shown in Fig. 8. Note that IR-VIO here is the VINS-Mono variant that only replaces raw images with enhanced images. The weight of IR-VIO is clearly higher overall than that of VINS-Mono, and the overall trend is consistent with Fig. 7. When the scene illumination changes drastically or the camera moves rapidly, e.g., during 8.0 ∼ 12.0 s, 24.0 ∼ 27.0 s, 63.0 ∼ 67.0 s, and 97.0 ∼ 100.0 s, the TSR of feature points drops because of the deterioration of image quality; some images successfully track only six feature points, most of which are noisy. If the visual weights were not adjusted by this algorithm, the pose would likely be dominated by this noise. Our algorithm adaptively drops the visual weights below 1.0, making the VIO system rely more on the IMU data during the short period between two image frames. In feature-based visual SLAM, it is generally believed that the more features are successfully tracked, the more accurate the pose estimated by visual odometry is likely to be. On this sequence, the average TSR of IR-VIO is 20.2% higher than that of VINS-Mono, and the visual weight calculated by the proposed adaptive weighting algorithm is 24.2% higher on average, which demonstrates the effectiveness and rationality of the multisensor-layer adaptive weighting algorithm.

4) Impact of the Block Noise Removal Algorithm on Feature Point Tracking: We select several typical segments to qualitatively analyze the impact of the block noise removal algorithm on feature point tracking. The results are shown in Fig. 9. The feature points in the large-noise area marked by the yellow box are removed to avoid interfering with pose estimation, and the constraint protection mechanism prevents the algorithm from removing too many feature points. These results show the effectiveness of the algorithm; the quantitative analysis of its impact on VIO localization accuracy is presented in Section V-A6.

5) Comparison of Localization Accuracy of Different VIO Methods: The root-mean-square error (RMSE) of the absolute trajectory error (ATE) is used to evaluate the localization accuracy of the different methods [28]. For a fair comparison, IR-VIO, VINS-Mono, and R-VIO2 all turn off the histogram equalization algorithm. As shown in Table I, IR-VIO outperforms almost all other methods and is lower than R-VIO2 [27] only on the V2_03 sequence. The reason is that, to avoid the instability of other functions and to verify the advantages of the proposed algorithms more fairly, both VINS-Mono and IR-VIO turn off the online camera-IMU time calibration function, unlike R-VIO2. After turning off this function in R-VIO2 and retesting, IR-VIO is 53.2% more accurate than R-VIO2 on V2_03. The average localization accuracy of our method improves by 35.6% over VINS-Mono, and on V1_03, the sequence with the most drastic illumination change, the localization accuracy improves by 63.7% over VINS-Mono.

6) Ablation Study: To further demonstrate the role of the different algorithms proposed in this article, we perform several ablation studies; the results are shown in Table II. Among the 11 sequences, our method has the highest accuracy (red) in six sequences, especially on difficult sequences such as MH_04, MH_05, and V1_03, and is second best (green) in four sequences. On MH_01 and V2_02, the accuracy is similar to the best. The other two sequences are worse than IR-VIO without the block noise removal algorithm, and on the MH_02 sequence, which has no illumination changes, the accuracy is worse than that of the variant using only the adaptive weighting algorithm because of the weak noise introduced by the image enhancement algorithm. Nevertheless, to make the VIO system robust to illumination changes and strong noise, we still recommend using the three proposed algorithms together; the average accuracy also shows a clear advantage, 10.7% higher than the second best.
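For reference, the RMSE of the ATE reported in Tables I and II is the standard metric: the estimated positions are rigidly aligned (rotation and translation, no scale) to the ground truth, and the RMSE of the remaining position errors is computed. A minimal sketch (ours; `ate_rmse` is an assumed name):

```python
import numpy as np

def ate_rmse(est, gt):
    """RMSE of the absolute trajectory error after rigid alignment.

    est, gt : (N, 3) arrays of estimated and ground-truth positions,
              assumed time-associated row by row.
    """
    mu_e, mu_g = est.mean(0), gt.mean(0)
    H = (est - mu_e).T @ (gt - mu_g)           # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ S @ U.T                         # best-fit rotation (Kabsch)
    t = mu_g - R @ mu_e                        # best-fit translation
    err = est @ R.T + t - gt                   # residuals after alignment
    return float(np.sqrt((err ** 2).sum(axis=1).mean()))
```

Because the alignment removes any global rotation and translation, the metric measures only the drift of the estimated trajectory relative to the ground truth.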

B. Real-World Experiments
In this section, we carry out handheld and MAV flight experiments in different illumination-changing environments to evaluate the performance of IR-VIO. All experiments use the Intel RealSense D455 sensor, which consists of a hardware-synchronized camera (30 Hz) and an IMU (200 Hz).
1) Handheld Experiments: Two handheld sequences are recorded by walking with a handheld D455 sensor at a speed of about 0.5 m/s in both indoor and outdoor illumination-changing environments as shown in Fig. 10. Since there is no ground truth, we keep the start and end positions of each sequence consistent and evaluate the localization accuracy by computing the trajectory endpoint drift (TED) [29]. The results are shown in Table III, and the comparison of trajectories is shown in Fig. 11. Obviously, the starting and ending points of IR-VIO are closer than those of VINS-Mono. In the indoor experiment, the TED of IR-VIO is reduced by 79.0% over VINS-Mono. In the outdoor experiment, the TED is reduced by 66.6%. The three algorithms proposed in this article significantly improve the localization accuracy of VIO.
2) MAV Flight Experiment: The D455 sensor is mounted on the MAV, as shown in Fig. 12(a), and is used to record a flight sequence in the indoor illumination-changing environment shown in Fig. 12(b). Fig. 13(b) shows two consecutive image frames as the illumination changes. The ground truth is obtained by a Qualisys motion capture system. The results are shown in Table IV, and the comparison of the trajectories is shown in Fig. 13(a). The trajectory of IR-VIO is clearly closer to the ground truth than that of VINS-Mono, and the RMSE of the ATE is reduced by 73.5% compared with VINS-Mono.

VI. CONCLUSION
In this article, we propose a new illumination-robust visual-inertial odometry (IR-VIO) based on an adaptive weighting algorithm with two-layer confidence maximization. Specifically, we develop a novel adaptive weighting algorithm on the multisensor layer and the visual feature layer, an image enhancement algorithm, and a block noise removal algorithm with a constraint protection mechanism. The six-aspect analysis on the public dataset and the experiments in different real-world environments demonstrate that the proposed method has superior accuracy and robustness compared with state-of-the-art methods.
Youwei Wang received the B.Eng. degree in automation from Chang'an University, Xi'an, China, in 2022. He is currently working toward the master's degree in control science and engineering with the Institute of Robotics and Automatic Information System, Nankai University, Tianjin, China.
His current research interests include multisensor fusion and deep learning.
Jing Yuan (Member, IEEE) received the B.S. degree in automatic control, and the Ph.D. degree in control theory and control engineering from Nankai University, Tianjin, China, in 2002 and 2007, respectively.
Since 2007, he has been with the College of Computer and Control Engineering, Nankai University, where he is currently a Professor. His current research interests include robotic control, motion planning, and SLAM.