Targetless Extrinsic Calibration of Camera and Low-Resolution 3-D LiDAR

Autonomous driving heavily relies on light detection and ranging (LiDAR) and camera sensors, which can significantly improve the performance of perception and navigation tasks when fused together. The success of this cross-modality fusion hinges on accurate extrinsic calibration. In recent years, targetless LiDAR–camera calibration methods have gained increasing attention, thanks to their independence from external targets. Nevertheless, developing a targetless method for low-resolution LiDARs remains challenging due to the difficulty in extracting reliable features from point clouds with limited LiDAR beams. In this article, we propose a robust targetless method to address this challenging problem. It can automatically estimate accurate LiDAR and camera poses and solve the extrinsic matrix through hand–eye calibration. Moreover, we carefully analyze the pose estimation issues that arise with low-resolution LiDAR and present our solution. Real-world experiments are carried out on an unmanned ground vehicle (UGV)-mounted multisensor platform containing a charge-coupled device (CCD) camera and a VLP-16 LiDAR. For evaluation, we use a state-of-the-art target-based calibration approach to generate the ground truth extrinsic parameters. Experimental results demonstrate that our method achieves low calibration error in both translation (3 cm) and rotation (0.59°).


NOMENCLATURE

$X_{CL}$: Extrinsic calibration matrix from light detection and ranging (LiDAR) to camera.
$s$: Scale factor of monocular motions.
$\{\cdot\}$: Ordered list.
$\|\cdot\|_F$: Frobenius norm of a matrix.
$\|\cdot\|_2$: 2-norm of a vector.
$P_i$: $i$th frame of laser scan.
$T^L_k$: Pose of cluster $k$ in external registration.
$P^L$: Scene reconstructed from LiDAR data $\{P_i\}$.
$P^C$: Scene reconstructed from camera data $\{I_i\}$.
$\epsilon$: Restriction on the change in $s$ (Algorithm 1).

I. INTRODUCTION
Nowadays, an increasing number of autonomous driving and robotic systems are equipped with multiple sensors, among which LiDAR sensors and cameras are the most popular. The LiDAR measures the distance of surrounding points directly, irrespective of light conditions, while the camera provides dense texture and color information. The fusion of LiDAR and camera can exploit their complementary characteristics to improve performance in perception and navigation tasks such as road detection [1], simultaneous localization and mapping (SLAM) [2], and depth estimation [3].

Extrinsic calibration between LiDAR and the camera is the prerequisite to sensor fusion, as it estimates the transformation between their coordinate systems. The core of this task is to establish reliable correspondence across a sparse 3-D point cloud and a dense red green blue (RGB) image.

Recently, the resolution of LiDAR has been increasing with advancements in manufacturing processes. High-resolution mechanical LiDAR and solid-state LiDAR are advanced representatives but are prohibitively expensive. In some perception and navigation applications [4], [5], [6], low-resolution LiDAR can be a more attractive alternative due to its cost-performance tradeoff. However, calibrating low-resolution LiDAR with the camera is more challenging due to the vertical sparsity of its laser scans. The purpose of our work is to provide a feasible solution to this problem. In the remainder of this section, we introduce related works on LiDAR-camera calibration and outline our contributions.

A. Related Works
According to whether a specific target is required, existing calibration methods can be divided into two categories: target-based and targetless. Target-based calibration establishes geometric constraints or loss functions based on a target easily detectable by both LiDAR and the camera. Planar checkerboards [7], [8], [9] and other solids with sufficient corners [10], [11] are common targets for these approaches. Target-based methods have dominated the field of LiDAR-camera calibration for several decades owing to their robustness and accuracy, but their limitation is the heavy dependence on external targets. Moreover, the calibration accuracy of these methods is limited by the manufacturing and measurement accuracy of the targets used.
Instead of capturing and analyzing specific geometric details from known targets, targetless methods generally extract common features across the laser scan and the RGB image. Early targetless methods design a fusion loss between projected LiDAR points and the image, and edge alignment (EA) [12] is a typical one based on edge correspondence. The primary principle of this method is the hypothesis that depth-discontinuous LiDAR edges of the point cloud are likely to be associated with 2-D edges of the image. Thus, the extrinsic matrix can be optimized by minimizing the reprojection error between LiDAR and camera edges. However, Yuan et al. [13] found an inherent drawback of this method called foreground object inflation. It leads to a great loss of accuracy in EA, especially for high-resolution LiDARs. To resolve this problem, they replace depth-discontinuous edges with depth-continuous ones, resulting in higher calibration accuracy. Mutual information (MI) [14] is a similar target-free method proposed for calibration between an omnidirectional camera and LiDAR. It maximizes the MI between the reflectivity of projected LiDAR points and the gray intensity of image pixels. Besides these fusion-loss-based approaches, odometry fusion [15] provides another insight into this problem. Rather than finding common features in a single LiDAR-camera pair, it utilizes the geometric constraints called hand-eye calibration [16], [17] based on LiDAR and camera motions. Nevertheless, its accuracy is deeply dependent on the performance of motion estimation, which is challenging for LiDAR and visual odometry. In addition, the development of deep learning has also prompted self-supervised methods [18], [19], [20]. In these works, a deep neural network is trained on calibrated LiDAR-camera data pairs and used to predict the extrinsic matrix from uncalibrated pairs. They do not require extra operations or external targets, but their generalization ability across unseen scenes and sensors requires further validation.

B. Contribution
To the best of our knowledge, although many target-based methods [9], [11], [21] are compatible with low-resolution LiDARs, no targetless ones have succeeded in this task yet. The primary challenge derives from the vertical sparsity of laser scans, which leads to insufficient reliable features. Aiming to alleviate this problem, we consolidate multiple frames of point clouds into one to generate LiDAR features. Specifically, our main contributions are listed below.
1) We propose a novel targetless LiDAR-camera calibration approach compatible with low-resolution LiDARs. It leverages hand-eye calibration and scene registration (SR) to obtain accurate extrinsic parameters and addresses the scaleless problem of monocular motion estimation.
2) We develop a novel algorithm named cluster extraction and integration (CEI) for robust LiDAR pose estimation. With camera poses estimated by Structure from Motion (SfM), CEI can automatically correct erroneously registered LiDAR pairs without any prior knowledge of the extrinsic parameters, which is crucial to hand-eye calibration.
3) We carried out real-world experiments on a multisensor platform containing a VLP-16 LiDAR and a charge-coupled device (CCD) camera. For evaluation, we compare our results to a state-of-the-art target-based method [9] in terms of calibration error and EA. Furthermore, an ablation study is conducted to verify the effectiveness of each module in our framework.

A. Overview
As displayed in Fig. 1, our multisensor system, consisting of a VLP-16 LiDAR and a CCD camera, is fixed on a platform. The platform can rotate in the pitch direction to generate additional orientations, which is advantageous to hand-eye calibration. However, we did not install any sensor to directly measure this angle, so as to preserve the generality of our method. The whole platform is mounted on an unmanned ground vehicle (UGV). Fig. 2 shows the overview flowchart of our method, and the primary symbols used in Sections II-IV are defined in the Nomenclature. The main body of our method is hand-eye calibration, which requires the multisensor system to move through a variety of positions and orientations. Different from odometry-fusion methods such as [15], our approach works in an offline mode and includes back-end optimization: the LiDAR and visual odometry are replaced with multiway registration [22] and SfM [23], [24], respectively. In our framework, SfM and hand-eye calibration are combined to address the scaleless problem in monocular motion estimation. Furthermore, CEI is designed to estimate the motions of the low-resolution LiDAR accurately. This algorithm leverages camera poses estimated by SfM to correct misregistered LiDAR point cloud pairs without any prior knowledge of the extrinsic matrix $X_{CL}$. With the acquired LiDAR and camera poses, an initial solution ($X_{CL}^{(0)}$ and $s^{(0)}$) of the extrinsic parameters can be obtained by a hand-eye calibration estimator. Eventually, we fine-tune this solution using SR between the reconstructed scenes $P^L$ and $P^C$.
The remaining sections are organized as follows. The theory of hand-eye calibration and point cloud registration is reviewed in Sections II-B and II-C, respectively, as preliminaries to our method. Subsequently, Sections III-A and III-B jointly introduce our CEI algorithm. Finally, Section III-C describes the solution and tuning of the extrinsic parameters.

B. Hand-Eye Calibration
Hand-eye calibration aims to estimate $X_{CL}$ and $s$ from LiDAR and camera motions. For convenience, we suppose that these motions have already been estimated correctly. The hand-eye calibration equation can then be formulated as

$T^{C}_{i,j} X_{CL} = X_{CL} T^{L}_{i,j}, \quad 1 \le i < j \le N$    (1)

where $N$ is the number of poses and $T^{C}_{i,j}$ and $T^{L}_{i,j}$ denote the relative camera and LiDAR motions between poses $i$ and $j$. Equation (1) can be further decomposed into

$R^{C}_{i,j} R_{CL} = R_{CL} R^{L}_{i,j}$    (2)

$(R^{C}_{i,j} - I)\, t_{CL} + s\, t^{C}_{i,j} = R_{CL}\, t^{L}_{i,j}$    (3)

where $R^{C}_{i,j}$ and $R^{L}_{i,j}$ are the rotation matrices of $T^{C}_{i,j}$ and $T^{L}_{i,j}$, respectively, and $t^{C}_{i,j}$ and $t^{L}_{i,j}$ are the corresponding translation vectors. Notably, (1) is a general formulation of hand-eye calibration defined over a pose graph. An example pose graph comprising four nodes is shown in Fig. 3. Each node represents a pose, while each edge signifies the transformation between two nodes. Unless specified otherwise, all pose graphs in this article are fully connected and undirected.
As reported in [25], (2) can be simplified to

$\alpha_{i,j} = R_{CL}\, \beta_{i,j}$    (4)

where $\alpha_{i,j}$ and $\beta_{i,j}$ are the rotation vectors of $R^{C}_{i,j}$ and $R^{L}_{i,j}$, respectively, so that $R_{CL}$ can be straightforwardly solved by singular value decomposition (SVD). Once $R_{CL}$ has been determined, $t_{CL}$ and $s$ can be estimated with the least-squares method by stacking (3) over all motion pairs

$\begin{bmatrix} R^{C}_{i,j} - I & t^{C}_{i,j} \end{bmatrix} \begin{bmatrix} t_{CL} \\ s \end{bmatrix} = R_{CL}\, t^{L}_{i,j}$    (5)

These are all the steps needed to solve the extrinsic parameters $R_{CL}$, $t_{CL}$ and the scale factor $s$ via hand-eye calibration.
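For illustration, the following NumPy/SciPy sketch implements this two-step solver under the formulation above; the function name and the conversion of relative rotations to rotation vectors via SciPy are our own choices, not taken from the released code.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def solve_hand_eye(R_C, t_C, R_L, t_L):
    """R_C, R_L: lists of 3x3 relative rotations; t_C, t_L: lists of 3-vectors (camera t up to scale)."""
    # Step 1: solve (4) for R_CL from rotation vectors (orthogonal Procrustes via SVD).
    beta = np.stack([Rotation.from_matrix(R).as_rotvec() for R in R_L])
    alpha = np.stack([Rotation.from_matrix(R).as_rotvec() for R in R_C])
    U, _, Vt = np.linalg.svd(alpha.T @ beta)        # cross-covariance of the two vector sets
    D = np.diag([1.0, 1.0, np.linalg.det(U @ Vt)])  # guard against reflections
    R_CL = U @ D @ Vt                               # alpha_ij ~= R_CL @ beta_ij
    # Step 2: stack (3) over all motion pairs and solve [t_CL, s] by linear least squares, as in (5).
    M = np.vstack([np.hstack([R - np.eye(3), t.reshape(3, 1)]) for R, t in zip(R_C, t_C)])
    b = np.concatenate([R_CL @ t for t in t_L])
    x, *_ = np.linalg.lstsq(M, b, rcond=None)
    return R_CL, x[:3], x[3]                        # rotation, translation, scale factor
```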

C. Point Cloud Registration
Point cloud registration is a technique to compute the transformation (normally rigid) between two point clouds; it is also the foundation of LiDAR pose estimation and SR. Suppose that $P_i$ and $P_j$ denote two point clouds. The task of registration is to find a rotation matrix $R_{i,j} \in SO(3)$ and a translation vector $t_{i,j} \in \mathbb{R}^3$ that minimize

$\sum_{k} \| R_{i,j}\, x_k + t_{i,j} - y_k \|_2^2$    (6)

where $x_k \in P_i$ and $y_k \in P_j$ form one pair of corresponding points. Note that $P_i$ and $P_j$ usually overlap only partially, so $\{x_k\}$ and $\{y_k\}$ are merely subsets of the two clouds. The iterative closest point (ICP) method [26], [27] is one of the most popular registration methods; it provides a closed-form least-squares solution to this problem. To apply it, we first compute the centroids $\bar{x}$ and $\bar{y}$ of $\{x_k\}$ and $\{y_k\}$, along with the cross-covariance matrix

$H = \sum_{k} (x_k - \bar{x})(y_k - \bar{y})^{\top}$    (7)

Afterward, $H$ is decomposed by SVD as $H = U S V^{\top}$, where $U$ and $V$ are orthogonal while $S$ is diagonal. Finally, the optimal $R_{i,j}$ and $t_{i,j}$ are given by

$R_{i,j} = V U^{\top}, \quad t_{i,j} = \bar{y} - R_{i,j}\, \bar{x}$    (8)

However, it is not easy to find correct correspondences between $P_i$ and $P_j$. ICP searches for correspondences by the closest Euclidean distance and iterates, but it usually requires an initial guess. Point descriptors [28], [29], [30] are designed to build putative associations. They match points in the feature space, independent of the initial transformation: for each $x_k \in P_i$,

$y_k = \arg\min_{y \in P_j} \| F(x_k) - F(y) \|_2$    (9)

where $F(x_k)$ and $F(y)$ are the corresponding feature vectors. In addition, outlier rejection approaches [22], [31], [32] have been developed to improve the matching recall of these descriptors. They are normally robust enough to filter out incorrect correspondences and register point clouds in complex scenarios.
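As a concrete reference, a minimal NumPy sketch of this closed-form step is given below, assuming the correspondences $\{x_k\} \leftrightarrow \{y_k\}$ are already known; it follows the centroid, cross-covariance, and SVD recipe just described.

```python
import numpy as np

def rigid_transform_svd(X, Y):
    """X, Y: (N, 3) arrays of corresponding points; returns R, t such that y_k ~= R @ x_k + t."""
    x_bar, y_bar = X.mean(axis=0), Y.mean(axis=0)        # centroids of {x_k} and {y_k}
    H = (X - x_bar).T @ (Y - y_bar)                      # cross-covariance matrix, as in (7)
    U, S, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.linalg.det(Vt.T @ U.T)])   # guard against reflections
    R = Vt.T @ D @ U.T                                   # optimal rotation
    t = y_bar - R @ x_bar                                # optimal translation, as in (8)
    return R, t
```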
Besides these versatile point cloud registration methods, LiDAR SLAM is also a valuable tool for estimating continuous-time LiDAR poses, even for vertically sparse LiDARs. We test the performance of such methods in the ablation study discussed in Section IV.

A. LiDAR Pose Estimation (Cluster Extraction)
As mentioned in Section II-A, our method relies on 3-D reconstruction technologies for pose estimation. We qualitatively evaluate the accuracy of the estimated motions (transformations) by visually inspecting the reconstructed point cloud scenes $P^L$ and $P^C$. We found numerous misalignments in $P^L$ but none in $P^C$. To reconstruct $P^L$, we first tried ICP [26] for pairwise registration, but it failed completely [Fig. 4(a)], producing no discernible structure. We then replaced ICP with a modified random sample consensus (RANSAC) registration approach proposed in the supplementary material of [22], which applies a modified RANSAC criterion and the fast point feature histograms (FPFH) [28] descriptor for point cloud registration. We refer to this approach as RANReg to distinguish it from the ordinary RANSAC algorithm. Fig. 4(b) illustrates that a considerable proportion of point clouds are correctly aligned but broken into several clusters with many outliers. Through an analysis of its pairwise registration (not shown in figures), we found that the cause of this phenomenon is erroneous transformations between adjacent scans, i.e., erroneous odometry edges $T^L_{i,i+1}$. Once $T^L_{i,i+1}$ is incorrect, it corrupts all poses after frame $i+1$. Consequently, nodes can be classified into clusters (inlier nodes with different labels), together with ones that do not belong to any cluster (outlier nodes with no label). Typically, nodes sharing the same label are well aligned with each other, exhibiting a high degree of transformation congruence, which we denote as cluster consistency in this article. The process of cluster extraction can thus be viewed as a classification task; for clarity, we assign the label −1 to outlier nodes. Our algorithm capitalizes on cluster consistency to classify nodes into several clusters (extraction) and then merge them into a single graph (integration) through point cloud registration.
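To make the error propagation concrete, the toy sketch below chains relative odometry edges into absolute poses (one common convention, with hypothetical variable names): every pose after a wrong edge inherits its error, which is exactly why the graph fractures into internally consistent clusters.

```python
import numpy as np

def chain_odometry(edges):
    """edges[i]: 4x4 homogeneous transform T_{i,i+1}; returns absolute poses T_1..T_N (T_1 = I)."""
    poses = [np.eye(4)]
    for T in edges:
        poses.append(poses[-1] @ T)  # a single erroneous edge corrupts every subsequent pose
    return poses
```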
Concerning the extraction part, consider first a simple case in which there is only one such cluster in the graph. In this case, all inlier nodes belong to the same cluster, so edges between inlier nodes should satisfy (1) while other edges do not. We apply the RANSAC [33] algorithm to extract these edges (inlier edges) with a score function presented in (10). As the solution of (1) has been decomposed into (4) and (5), we decompose the RANSAC extraction into two steps accordingly: first, filter out putative inlier edges scored by the SO3 loss (11); then, pick out more reliable inliers from the putative ones scored by the SE3 loss (10). Fig. 5(a) displays the result of the first iteration in the form of a binary adjacency matrix, in which ones represent inlier edges while zeros denote other edges. Nonetheless, our final goal is to extract the constituent inlier nodes of each cluster rather than inlier edges, and we employ density-based spatial clustering of applications with noise (DBSCAN) [34], [35] to achieve it. Unlike its conventional usage, we re-define its distance matrix: the distance between every pair of nodes connected by an inlier edge is treated as zero, while that between any other pair is treated as infinity. In other words, we set the metric distance between $P_i$ and $P_j$ to zero when $T_{i,j}$ is an inlier edge, and to one (equivalent to infinity in this algorithm) otherwise. When there is only one cluster, the node labels predicted by DBSCAN provide all the information needed to extract the cluster and exclude the outliers.
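A minimal scikit-learn sketch of this re-purposed clustering step is shown below, assuming `inlier_edges` is the boolean adjacency matrix produced by the RANSAC stage (variable names are illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_nodes(inlier_edges):
    """inlier_edges: (N, N) boolean matrix; entry (i, j) is True if T_{i,j} is an inlier edge."""
    # Distance 0 between nodes joined by an inlier edge, 1 (treated as infinity) otherwise.
    dist = np.where(inlier_edges | inlier_edges.T, 0.0, 1.0)
    np.fill_diagonal(dist, 0.0)
    labels = DBSCAN(eps=0.5, min_samples=2, metric="precomputed").fit_predict(dist)
    return labels  # label -1 marks outlier nodes, as in the article
```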
To generalize this approach to graphs with multiple clusters, we can simply repeat the above procedures iteratively. Specifically, for each iteration, the input is the graph composed of the outlier nodes discarded in the previous iteration, with connected edges unchanged. Meanwhile, we establish a termination condition for this loop. As s is theoretically invariant across clusters, we define a threshold ϵ to restrict the difference in s between the first and the current iterations. The specific steps of this algorithm are summarized in Algorithm 1.
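The outer loop can then be read roughly as follows; `ransac_inlier_edges` stands in for the RANSAC stage above and `cluster_nodes` is the DBSCAN sketch above, so this is only one plausible reading of Algorithm 1 rather than its released implementation.

```python
def extract_clusters(nodes, edges, eps_scale):
    """nodes: frame indices; edges: dict of relative motions; eps_scale: threshold on the change in s."""
    clusters, remaining, s_first = [], list(nodes), None
    while len(remaining) > 2:
        sub = {(i, j): T for (i, j), T in edges.items() if i in remaining and j in remaining}
        inlier_mask, s_cur = ransac_inlier_edges(sub, remaining)  # hypothetical: RANSAC over (10)/(11)
        if s_first is None:
            s_first = s_cur
        elif abs(s_cur - s_first) > eps_scale:                    # s should agree across clusters
            break
        labels = cluster_nodes(inlier_mask)                       # DBSCAN sketch above
        clusters.append([n for n, l in zip(remaining, labels) if l >= 0])
        remaining = [n for n, l in zip(remaining, labels) if l == -1]  # outliers feed the next round
    return clusters
```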
Some intermediate results of Algorithm 1 are visualized in Fig. 5 in the form of adjacency matrices. Fig. 5(a) illustrates the inlier edges extracted by RANSAC in the first iteration. Fig. 5(b) and (c) jointly display the inlier poses clustered by DBSCAN in the first iteration, which constitute the first cluster. We also illustrate cluster consistency in Fig. 5(d) by re-predicting inlier edges within this cluster: most of the edges are classified as inliers, aligning with our theory. Fig. 5(e) and (f) shows the corresponding results for the second iteration. Ultimately, Fig. 4(c) demonstrates that several clusters are extracted by Algorithm 1, while some outlier laser scans have been excluded from the pruned graph.

B. LiDAR Pose Estimation (Cluster Integration)
Based on the preceding analysis, the next step is to consolidate these clusters into one. The overall solution is to regard each cluster as an ordinary point cloud and register the clusters in a pose graph. With this cluster-level registration, we can acquire the pose $T^L_k$ of each cluster and update the frame-level poses $T^L_i$ accordingly. Since this process (external registration) cannot diminish transformation errors within a cluster, we apply internal refinement to each cluster separately before external registration. In essence, internal refinement and external registration are both multiway registrations; the only difference between them is the input.
Regarding the internal refinement, given that point clouds within each cluster are aligned close enough to each other, we choose ICP for local pairwise registration. One example cluster and its refined result are visualized in Fig. 6. It is demonstrated that the integrated point cloud P k becomes denser and sharper after internal refinement.
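As a reference, a pairwise refinement step of this kind can be written with Open3D (the library credited for multiway registration in the implementation details); the point-to-plane estimator, normal radii, and distance threshold below are placeholder choices of ours, and the refined pairwise transforms would subsequently feed the pose graph of the multiway registration.

```python
import numpy as np
import open3d as o3d

def refine_pair(source, target, init=np.eye(4), max_dist=0.2):
    """Point-to-plane ICP between two roughly aligned scans of the same cluster."""
    for pcd in (source, target):
        pcd.estimate_normals(o3d.geometry.KDTreeSearchParamHybrid(radius=1.0, max_nn=30))
    result = o3d.pipelines.registration.registration_icp(
        source, target, max_dist, init,
        o3d.pipelines.registration.TransformationEstimationPointToPlane())
    return result.transformation  # 4x4 refined relative pose
```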
For external registration, we observed that RANReg [22] failed in pairwise registration. Although no ground truth correspondences are available, we empirically attribute the failure to low feature-matching recall. Based on this assumption, we replaced the FPFH point descriptor with fully convolutional geometric features (FCGF) [29], a deep learning descriptor pretrained on the KITTI [36] dataset. With this modification, the algorithm eventually succeeded in external registration. We do not include figures showing these differences, as they are not central to our contributions.
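For context, the generic feature-based RANSAC registration that RANReg builds on looks roughly like the Open3D snippet below, shown here with FPFH for brevity; the actual pipeline uses a modified RANSAC criterion and FCGF features, so this is only an approximation, and the voxel size and thresholds are placeholders.

```python
import open3d as o3d

def feature_ransac(source, target, voxel=0.3, dist=0.9):
    """Global registration of two scans from downsampled points and per-point descriptors."""
    clouds, feats = [], []
    for pcd in (source, target):
        down = pcd.voxel_down_sample(voxel)
        down.estimate_normals(o3d.geometry.KDTreeSearchParamHybrid(radius=2 * voxel, max_nn=30))
        feats.append(o3d.pipelines.registration.compute_fpfh_feature(
            down, o3d.geometry.KDTreeSearchParamHybrid(radius=5 * voxel, max_nn=100)))
        clouds.append(down)
    # Open3D >= 0.12 signature (includes the mutual_filter flag).
    return o3d.pipelines.registration.registration_ransac_based_on_feature_matching(
        clouds[0], clouds[1], feats[0], feats[1], True, dist,
        o3d.pipelines.registration.TransformationEstimationPointToPoint(False), 3,
        [o3d.pipelines.registration.CorrespondenceCheckerBasedOnDistance(dist)],
        o3d.pipelines.registration.RANSACConvergenceCriteria(100000, 0.999))
```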
The whole cluster integration algorithm is outlined in Algorithm 2, and the final integrated point cloud is shown in Fig. 4(d), with the same palette as Fig. 4(c). The above are all steps of our CEI algorithm, and its role in the whole calibration method is visualized in Fig. 2. With the CEI algorithm, we can obtain accurate LiDAR poses as well as the reconstruction P L .

C. Extrinsic Parameters Solution and Tuning
Considering that some LiDAR poses are classified as outliers and excluded by CEI, we exclude the corresponding camera poses accordingly in the following steps. With the remaining LiDAR and camera poses, we can generate motion pairs ($T^C_{i,j}$ and $T^L_{i,j}$) to solve $X_{CL}$ and $s$, as discussed in Section II-B.

The only remaining problem is how to select motion pairs as input. According to [37], the degeneration of motions can affect the uniqueness of the solution; in general, solution accuracy increases with pose diversity. A typical odometry-fusion approach [15] only utilizes odometry edges [$1 \le i < N$, $j = i + 1$ in (1)] to solve (1), aiming to evade the pose drift caused by long-term movement. Nonetheless, a large proportion of edges are then not involved in the solution, resulting in a significant loss of pose diversity. Instead, we include all edges [$1 \le i < j \le N$ in (1)] in the solution to enhance pose diversity, without concern for long-term drift, as multiway registration and SfM both contain back-end optimization to address this problem. We also qualitatively compare the solution accuracy between selecting all edges and only adjacent edges. Given the extrinsic matrix $X_{CL}$ and the scale factor $s$, we can register $P^C$ to $P^L$ through a similarity transformation and inspect their alignment. This similarity transformation maps each point $p^C \in P^C$ to the LiDAR scene as

$p^L = R_{CL}^{\top}\,(s\, p^C - t_{CL})$    (12)

and the registration results are visualized in Fig. 7. Compared with [15] [Fig. 7(a)], using all edges to solve (1) yields better alignment and a more accurate solution [Fig. 7(b)].

In addition, we use SR between $P^L$ and $P^C$ to further optimize the solution. It is implemented with a seven-DOF ICP method that handles point cloud registration together with scale tuning. Comparing Fig. 7(c) with Fig. 7(b), $P^L$ and $P^C$ are aligned more closely, indicating a more accurate solution. This completes the description of our method.
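Open3D's point-to-point estimator exposes a scale-aware (Umeyama-style) variant that can serve as such a seven-DOF refinement; a minimal sketch, assuming the ICP is initialized with the similarity transform built from the hand-eye solution as in (12) and using a placeholder distance threshold:

```python
import open3d as o3d

def scene_registration(P_C, P_L, init, max_dist=0.5):
    """Refine the alignment of the SfM scene P_C to the LiDAR scene P_L with scale tuning."""
    result = o3d.pipelines.registration.registration_icp(
        P_C, P_L, max_dist, init,
        o3d.pipelines.registration.TransformationEstimationPointToPoint(with_scaling=True))
    return result.transformation  # 4x4 similarity transform (rotation block scaled by s)
```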

D. Implementation Details
We take advantage of several open-source libraries in our implementation. Specifically, SfM and multiway registration are implemented with OpenMVG [38] and Open3D [38], respectively; the densification process in our flowchart (Fig. 2) is conducted with OpenMVS [39]; and the RANSAC and DBSCAN algorithms are implemented with the scikit-learn library [40]. It should be noted that we also tried other state-of-the-art pairwise registration methods such as TEASER++ [32] and fast global registration (FGR) [31] in Section III-A, but they performed even worse than RANReg [22]. To contribute to the community, we provide open-source code at https://github.com/gitouni/Targetless-LiDAR-cameracalibration.

IV. REAL-WORLD EXPERIMENT

A. Calibration Error
Our real-world experiments are carried out on the multisensor platform introduced in Section II-A. We collected 211 frames of synchronized LiDAR-camera data, and the pose of our platform varies among these frames. The backward 180° of each LiDAR scan ($x < 0$ in the LiDAR frame $O_L$) was discarded during preprocessing because objects on the UGV behind the LiDAR can disturb LiDAR scanning. An example picture of the calibration scene and our UGV is shown in Fig. 8(a).

To quantitatively evaluate our method, we generate the ground truth calibration matrix using a state-of-the-art target-based method [9], which has proven to work well for low-resolution 3-D LiDARs. Its LiDAR projection figures are displayed in Fig. 9. Concerning metrics, we evaluate the root mean square error (RMSE) in rotation and translation, respectively.
Moreover, we add ablation experiments to verify the effectiveness of CEI and SR. Comparing the last three rows of Table I shows that employing CEI dramatically improves the accuracy of hand-eye calibration. SR successfully reduces the translation error to a low level, with a slight loss of rotation performance. These results confirm that CEI and SR are both effective in our framework.
Ablation experiments are also conducted on RANReg, with slightly different data settings. The control group consists of three state-of-the-art LiDAR SLAM methods: A-LOAM [41] (LOAM without an IMU), fast LiDAR odometry and mapping (F-LOAM) [42], and CT-ICP [43]. In contrast to the previous settings, consecutive LiDAR scans were saved in rosbag files as the input of SLAM. $\{P_i\}$ and $\{I_i\}$ were collected synchronously at key timestamps, while $\{T^L_i\}$ were selected from the continuous-time poses estimated by SLAM.
Quantitative calibration errors of the aforementioned SLAM methods are presented in Table I. As shown in the first three rows, considerable calibration residual still exists when LiDAR poses estimated by these SLAM methods are used for hand-eye calibration. In contrast, the CEI module noticeably improves their performance, except for F-LOAM. When combined with CEI, CT-ICP achieves nearly satisfactory calibration performance, the best of all SLAM methods. However, SR is not effective in all cases: F-LOAM is the only method whose rotation and translation RMSE are both reduced by SR, but the translation part of its calibration result still needs improvement.

B. LiDAR Pose Estimation Error
To measure the accuracy of LiDAR poses, we also ran an advanced LiDAR-inertial SLAM named LIO-SAM [44] to generate ground truth LiDAR poses. An Xsens MTi-300 IMU is installed at the bottom of the platform [Fig. 8(b)], and its extrinsic matrix with respect to the LiDAR is obtained through hand-eye calibration. For trajectory error evaluation, we adopt the unitless relative rotation error (RRE) and relative translation error (RTE) introduced in the evo [45] tool. Table II illustrates that RANReg with CEI achieves the lowest RRE and RTE, and that CT-ICP is an excellent alternative to RANReg, provided consecutive LiDAR scans and edge computing devices are available. It is also recommended to acquire high-accuracy LiDAR poses if a near-infrared (NIR) camera is accessible [46].

C. EA Analysis
Inspired by a previous study [12], we conducted an EA analysis to compare our proposed method with the target-based one [9]. To start with, we extracted the depth-discontinuous LiDAR edge points using the technique introduced in [12]. Then, we manually marked the image edges on which the projected LiDAR points may have discontinuous depth. Finally, we projected the LiDAR edge points onto their corresponding images using different groups of extrinsic parameters for comparison. As the manual labeling process can be time-consuming, we selected only 11 representative images for the EA analysis.
We present both qualitative and quantitative results in this section. Regarding the qualitative comparison, we display two cases in Fig. 10, which indicate that our method offers more accurate EA results in certain areas, such as the edges of the street light, windows, and eaves.

[Fig. 10. Qualitative results of the EA analysis, showing the correlation between image edges (yellow) and projected LiDAR edge points (red). (a) Proposed. (b) Target-based [9].]

As for the quantitative evaluation, we use a KD-Tree to compute the shortest distance between each projected LiDAR edge point and the image edges. We define a pair of LiDAR point and image edge as a true correspondence, provided its distance is less than 5 pixels. Based on this concept, we propose two metrics: correspondence number and correspondence distance. The former denotes the number of such correspondences counted in an image, while the latter is the mean of their distances. The final results across the 11 samples are presented in Fig. 11. Overall, our method yields more correspondences than the target-based one (mean of 11 cases: 23.27 versus 22.00) while maintaining even lower correspondence distances (mean of 11 cases: 2.01 versus 2.18).
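A sketch of these two metrics with SciPy's KD-Tree is given below, assuming `proj_pts` holds the projected LiDAR edge points and `edge_px` the labeled image-edge pixel coordinates (both as pixel arrays; names are illustrative):

```python
import numpy as np
from scipy.spatial import cKDTree

def ea_metrics(proj_pts, edge_px, thresh=5.0):
    """Correspondence number and mean correspondence distance for one image."""
    dists, _ = cKDTree(edge_px).query(proj_pts)  # nearest labeled edge pixel for each LiDAR point
    matched = dists < thresh                     # true correspondences (< 5 pixels)
    mean_dist = float(dists[matched].mean()) if matched.any() else float("nan")
    return int(matched.sum()), mean_dist
```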

V. CONCLUSION
In this study, a novel targetless method is proposed for extrinsic calibration between the camera and low-resolution LiDAR. It utilizes hand-eye calibration for initialization and SR for refinement. The proposed method outperforms other state-of-the-art targetless methods and can be particularly useful in applications where accurate extrinsic parameters are required but external targets are unavailable. In addition, its CEI module offers a new solution to pose correction in multisensor systems.