Deep Learning-Based Image 3-D Object Detection for Autonomous Driving: Review

An accurate and robust perception system is key to understanding the driving environment in autonomous driving and robotics. Autonomous driving needs 3-D information about objects, including each object's location and pose, to understand the driving environment clearly. The camera sensor is widely used in autonomous driving because of its rich color and texture information and low price. The major problem with the camera is the lack of 3-D information, which is necessary to understand the 3-D driving environment. In addition, object scale changes and occlusions make 3-D object detection more challenging. Many deep learning-based methods, such as depth estimation, have been developed to compensate for the lack of 3-D information. This survey presents 3-D bounding box encoding techniques and evaluation metrics for image-based 3-D object detection. The image-based methods are categorized based on the technique used to estimate an image's depth information, and insights are added to each method. Then, state-of-the-art (SOTA) monocular and stereo camera-based methods are summarized. We also compare the performance of selected 3-D object detection models and present challenges and future directions in 3-D object detection.


I. INTRODUCTION
AUTONOMOUS driving and robot navigation should obtain 3D information about objects to understand the environment clearly. For fully autonomous driving, a perception system such as 3D object detection must be robust enough to work in adverse weather, accurate enough to give precise information about the driving environment, and fast enough to enable decision-making during high-speed driving [1]. Although 2D object detection has shown significant performance improvement in the computer vision community due to the rapid growth of deep learning (DL), 3D object detection is still a challenging problem because of the lack of 3D information from sensors, scale changes, occlusions, and other factors. A robust perception system, including 3D object detection, contributes to the development of fully autonomous driving and reduces fatalities caused by reckless human drivers. There are different 3D sensors available for 3D object detection, such as Light Detection and Ranging (LiDAR), radio detection and ranging (radar), and depth sensors (RGB-D cameras) [2]. The LiDAR sensor is a good choice for distance measurement and is more robust to inclement weather than a camera. However, LiDAR data is unstructured and sparse, which makes processing more challenging; LiDAR is also poor for color-based detection, and it is expensive. Radar is another 3D sensor for distance measurement and velocity estimation and is suitable for bad weather and night driving. However, its low resolution makes radar-based object detection poor. The camera sensor is inexpensive and rich in color and texture information. The major problem with a camera is the lack of high-accuracy depth information, and different DL-based methods have been developed to solve this problem. The monocular camera's lack of depth information can be partially addressed using a stereo camera [3], [4] or structure from motion. Predicting stereo instance segmentation is another technique to solve the monocular depth problem for 3D object detection [5]. Additionally, a few works convert the image into a pseudo-LiDAR representation to compensate for the missing depth information [6] (see details in Section IV).
The major contributions of the paper are summarized as follows:
1) We provide an in-depth analysis of monocular and stereo image 3D object detection methods.
2) We summarize 3D bounding box encoding techniques and object detection evaluation metrics. The bounding box encoding and evaluation techniques of each method are also provided in Section IV.
3) We categorize image 3D object detection methods based on depth estimation techniques.
4) We present SOTA image 3D object detection methods for autonomous driving.
The rest of the paper is organized as follows. Section II provides related work. Object detection, especially 3D object detection, including object detection categories, 3D bounding box encoding techniques, and 3D object detection evaluation metrics, is summarized in Section III. Section IV summarizes the image 3D object detection methods and compares the selected ones. The challenges and future directions are presented in Section V. The last section concludes the survey.

II. RELATED WORK
The rapid growth of DL enables feature learning from images rather than hand-crafted feature extractors, improving performance and facilitating the training of object detection models. This work reviews DL-based image 3D object detection models for autonomous driving. Most survey papers present image 3D object detection models together with other works, such as LiDAR 3D object detection methods, and a tremendous number of papers are published each year. In this work, we therefore present SOTA methods and a comprehensive, detailed analysis of image 3D object detection methods for autonomous driving. Kim and Hwang [7] presented a survey of DL-based monocular 3D object detection methods and datasets, but the works are not specifically for autonomous driving. Feng et al. [1] reviewed 2D and 3D object detection and semantic segmentation for autonomous driving, covering the commonly used datasets and 2D/3D methods. Jiao et al. [8] presented DL-based object detection methods, not limited to autonomous driving, with a focus on 2D object detection. Arnold et al. [2] briefly reviewed 3D object detection methods, including LiDAR- and image-based methods, for autonomous driving. Similarly, Rahman et al. [9] presented 3D object detection methods for autonomous driving. Li et al. [10] and Guo et al. [11] presented DL-based object detection, segmentation, and classification in autonomous driving. Fernandes et al. [12] also reviewed DL-based object detection and semantic segmentation for autonomous driving. Recently, Qian et al. [13] published a survey of 3D object detection methods for autonomous driving. In addition to the current SOTA methods, we include 3D bounding box encoding techniques and 3D object detection evaluation techniques not covered by those survey papers. Recently, Alaba et al. [14] reviewed multisensor fusion 3D object detection models, 3D datasets, and sensors for autonomous driving (refer to [14] for a detailed analysis of more than fifteen 3D datasets in autonomy, including stereo-based datasets).

III. OVERVIEW OF OBJECT DETECTION
This section presents object detection categories, evaluation metrics for object detection, and 3D bounding box encoding techniques.

A. Object Detection Categories
Image-based 3D object detection models use 2D object detection as a base model and apply different techniques, such as regression, to extend it to 3D. Thus, we briefly review 2D object detection models to understand 3D object detection fully. DL-based general object detection methods can be classified into two groups: two-stage and one-stage. A two-stage object detection network has a region of interest (ROI) network for region proposal generation and a subsequent network for bounding box regression and classification, as shown in Fig. 1. R-CNN [15], SPP-Net [16], Fast R-CNN [17], Faster R-CNN [18], RFCN [19], and Mask R-CNN [20] are examples of two-stage 2D object detection models. Girshick et al. [15] proposed R-CNN, a two-stage 2D object detection network, as shown in Fig. 2. A selective search algorithm [21] generates 2,000 region proposals (candidate boxes), and a CNN model is then employed for feature extraction. The extracted features feed into support vector machines (SVMs) to classify the object within each region proposal. The major limitation of R-CNN was the redundant generation of 2,000 bounding boxes from each image, which increases the network's computational burden. Mask R-CNN [20] combines Faster R-CNN and a Fully Convolutional Network (FCN) in one architecture with an additional binary mask that shows the pixels of the object in the bounding box. There are also many 3D object detection networks, such as Mono3D [24] (see Section IV for details on 3D object detection). On the other hand, one-stage object detection networks directly learn the class probabilities and bounding box coordinates in a single pass through the network, without generating region proposals for each image. The general one-stage architecture is shown in Fig. 3. Liu et al. [29] proposed the Single Shot MultiBox Detector (SSD), a one-stage detection network that improves on the accuracy bottlenecks and small-object detection problem of YOLO [25] by introducing aspect ratios and a multiscale feature map to detect objects at multiple scales. Lin et al. [30] then introduced RetinaNet to improve one-stage object detection with a focal loss (see the details in [30]) as the classification loss function; its accuracy is comparable to two-stage object detection while maintaining a high detection speed. Zhao et al. proposed M2Det [31], a multilevel feature pyramid network that constructs multiscale and multilevel features, which helps to detect objects of different scales. Zhang et al. introduced RefineDet [32] to further increase the accuracy of one-stage object detection. MoVi-3D [33], [34], and AutoShape [35] are one-stage image 3D object detection networks (see the details in Section IV). One-stage object detection networks are fast, but their detection accuracy is lower than that of two-stage detectors, partly due to the class imbalance problem. Two-stage detectors are slower but more accurate: the RPN filters out redundant candidates, whereas one-stage detectors predict class probabilities and bounding boxes in a single pass without an RPN, so the remaining redundancy reduces detection accuracy.
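As a concrete illustration of this single-pass design, the following minimal PyTorch sketch (our own simplification, not any specific cited detector; the layer sizes and anchor counts are illustrative assumptions) shows a head that outputs per-anchor class scores and box offsets directly from a feature map:

```python
import torch
import torch.nn as nn

class OneStageHead(nn.Module):
    """Minimal one-stage detection head: for each spatial location and anchor,
    predict class probabilities and 4 box offsets in a single pass (no RPN)."""
    def __init__(self, in_channels=256, num_anchors=9, num_classes=3):
        super().__init__()
        self.num_anchors, self.num_classes = num_anchors, num_classes
        # one 3x3 conv each for classification and box regression
        self.cls_conv = nn.Conv2d(in_channels, num_anchors * num_classes, 3, padding=1)
        self.reg_conv = nn.Conv2d(in_channels, num_anchors * 4, 3, padding=1)

    def forward(self, feat):                      # feat: (B, C, H, W)
        b, _, h, w = feat.shape
        cls = self.cls_conv(feat).view(b, self.num_anchors, self.num_classes, h, w)
        reg = self.reg_conv(feat).view(b, self.num_anchors, 4, h, w)
        return cls.sigmoid(), reg                 # per-anchor scores and offsets

head = OneStageHead()
scores, offsets = head(torch.randn(1, 256, 38, 38))
print(scores.shape, offsets.shape)  # (1, 9, 3, 38, 38) and (1, 9, 4, 38, 38)
```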
B. 3D Bounding Box Encoding Techniques

Although the eight-corner encoding method gives better results than the axis-aligned method, it does not consider the physical constraints of a 3D bounding box, such as the top corners being vertically aligned with the bottom corners [36]. The four-corners-and-two-heights encoding technique solves this problem by adding corner and height offsets from the ground plane between the proposed bounding boxes and the ground truth boxes, which forces the top corners to align with the bottom corners. Moreover, VoxelNet [39] and SECOND [40] adopted the seven-parameter 3D bounding box encoding technique. The seven parameters are (x, y, z, w, l, h, θ), where x, y, and z are the center coordinates; w, l, and h are the width, length, and height, respectively; and θ is the yaw rotation around the z-axis.
The elevation and roll angles are assumed to be zero. This encoding method is further adopted by PointPillars [41], WCNN3D [42], and monocular 3D detection [24], and is widely used in 3D object detection. The regression residuals between the ground truth and anchor boxes using the seven-parameter technique can be defined as:

$$\Delta x = \frac{x^{gt} - x^{a}}{d^{a}}, \quad \Delta y = \frac{y^{gt} - y^{a}}{d^{a}}, \quad \Delta z = \frac{z^{gt} - z^{a}}{h^{a}},$$
$$\Delta w = \log\frac{w^{gt}}{w^{a}}, \quad \Delta l = \log\frac{l^{gt}}{l^{a}}, \quad \Delta h = \log\frac{h^{gt}}{h^{a}}, \quad \Delta\theta = \sin(\theta^{gt} - \theta^{a}),$$

where the superscripts gt and a represent the ground truth and the anchor boxes, respectively, and $d^{a} = \sqrt{(w^{a})^{2} + (l^{a})^{2}}$ is the diagonal of the anchor box.

Fig. 5: A diagrammatic comparison between the eight-corner box encoding method [36], the four-corners-and-two-heights encoding method [37], the axis-aligned box encoding method [38], and the seven-parameter encoding method [39], [40].

Energy minimization methods use a different 3D bounding box encoding technique. For example, Mono3D [43], 3DOP [44], and DeepStereoOP [4] represent a 3D bounding box as (x, y, z, θ, c, t), where (x, y, z) and θ denote the center of the 3D bounding box and the azimuth angle, respectively; c represents the object class, such as Car or Pedestrian, and t denotes the set of 3D box templates learned from the training data, which captures the physical size variation of each class.
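Returning to the seven-parameter encoding, the residual computation defined above can be sketched in a few lines of NumPy (our own illustrative code; the function and example values are assumptions, not from any cited implementation):

```python
import numpy as np

def encode_box(gt, anchor):
    """Seven-parameter residuals between a ground-truth box and an anchor,
    each given as (x, y, z, w, l, h, theta), following the Delta definitions above."""
    xg, yg, zg, wg, lg, hg, tg = gt
    xa, ya, za, wa, la, ha, ta = anchor
    d = np.sqrt(wa**2 + la**2)            # anchor diagonal d^a
    return np.array([
        (xg - xa) / d, (yg - ya) / d, (zg - za) / ha,
        np.log(wg / wa), np.log(lg / la), np.log(hg / ha),
        np.sin(tg - ta),                  # sine-encoded yaw residual
    ])

gt     = (10.0, 2.0, -1.0, 1.60, 3.9, 1.50, 0.3)
anchor = ( 9.5, 2.2, -1.0, 1.60, 3.9, 1.56, 0.0)
print(encode_box(gt, anchor))
```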

C. Evaluation Metrics for Object Detection
One commonly used evaluation metric for object detection is average precision (AP) [45], the average detection precision under different recalls for each object category. The mean average precision (mAP) is used as the final evaluation metric to compare performance over all object categories. The intersection over union (IOU) threshold, a geometric overlap measure between the predicted and ground truth bounding boxes, is used to measure object localization accuracy. The graphical representation of IOU is shown in Fig. 6 (the yellow region represents the intersection of the predicted box and the ground truth bounding box, whereas the green region represents their union). Equation (1) shows the mathematical expression of IOU:

$$IOU = \frac{area(bbox_{pred} \cap bbox_{gt})}{area(bbox_{pred} \cup bbox_{gt})}, \quad (1)$$

where $bbox_{pred}$ is the predicted bounding box and $bbox_{gt}$ is the ground truth bounding box. The required threshold value may vary from object to object. For example, in the KITTI [45] dataset, a car's 3D bounding box requires an IOU of 0.7, whereas pedestrians and cyclists require an IOU of 0.5. Additionally, the F1 score and the Precision-Recall curve are used as evaluation metrics for classification. Precision is the ratio of true positives to all predicted positives, whereas recall is the ratio of true positives to all actual positives in the dataset. The balance of precision and recall is important for AP and mAP. AP approximates the shape of the Precision/Recall curve by averaging the precision over R equally spaced recall levels [46].
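A minimal sketch of the IOU computation in Equation (1) for axis-aligned 2D boxes (our own illustrative code, assuming boxes given as corner coordinates):

```python
def iou(box_a, box_b):
    """Axis-aligned IOU; each box is (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7 ~ 0.143
```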
For the KITTI dataset, AP is calculated at eleven equally spaced recall levels [46], [47], i.e., $R_{11} = \{0, 0.1, 0.2, \ldots, 1\}$:

$$AP = \frac{1}{11} \sum_{r \in R_{11}} \rho_{interp}(r).$$

At the zero-recall level, a single correctly matched prediction gives 100% precision at the bottom recall bin [46]. The interpolation function $\rho_{interp}(r)$ is defined as:

$$\rho_{interp}(r) = \max_{r' \geq r} \rho(r'),$$

where $\rho(r)$ is the precision at recall r. The maximum precision at recall greater than or equal to r is used, rather than the mean of all observed precision values at each point r.
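The 11-point interpolated AP described above can be sketched as follows (our own illustrative code; it assumes detections have already been matched to ground truth to produce recall/precision pairs):

```python
import numpy as np

def ap_11_point(recalls, precisions):
    """11-point interpolated AP: at each recall level r in {0, 0.1, ..., 1},
    take the maximum precision observed at recall >= r, then average."""
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        mask = recalls >= r
        p_interp = precisions[mask].max() if mask.any() else 0.0
        ap += p_interp / 11.0
    return ap

recalls    = np.array([0.1, 0.3, 0.5, 0.8])
precisions = np.array([0.9, 0.8, 0.6, 0.4])
print(ap_11_point(recalls, precisions))
```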
The mAP over all classes at the 11 recall points is used for the overall performance evaluation. Some works, such as MonoPair [48] and [46], calculate AP using 41 recall points instead of eleven, but average over only 40 points (1/40, 2/40, 3/40, ..., 1), excluding the zero recall point to eliminate the glitch at the lowest recall bin [46]. Other common performance evaluation metrics are the AP3D metric, the Average Orientation Similarity (AOS) metric [45], and the localization metric $AP_{BEV}$ [36] for bird's-eye view representation. AOS jointly measures detection and 3D orientation estimation performance by weighting the precision with the cosine similarity between the estimated and ground-truth orientations:
$$AOS = \frac{1}{11} \sum_{r \in \{0, 0.1, \ldots, 1\}} \max_{\tilde{r}: \tilde{r} \geq r} s(\tilde{r}),$$

where $r = \frac{TP}{TP + FN}$ is the recall, following the PASCAL [47] evaluation protocol; TP denotes true positives and FN false negatives. The orientation similarity $s \in [0, 1]$ at recall r is a normalized cosine similarity:

$$s(r) = \frac{1}{|D(r)|} \sum_{i \in D(r)} \frac{1 + \cos \Delta\theta^{(i)}}{2}\, \delta_{i},$$

where D(r) denotes the set of all object detections at recall rate r, $\Delta\theta^{(i)}$ is the angle difference between the estimated and ground truth orientation of detection i, and the term $\delta_{i}$ penalizes multiple detections of a single object ($\delta_{i} = 1$ if detection i has been assigned to a ground truth box, and $\delta_{i} = 0$ otherwise).
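A hedged sketch of the orientation-similarity term s(r) above (our own simplification; the matching of detections to ground truth, and hence the penalty flags, are assumed to be computed beforehand):

```python
import numpy as np

def orientation_similarity(theta_pred, theta_gt, penalized):
    """Cosine-based orientation similarity at one recall level: detections
    flagged as redundant receive delta = 0, as in the AOS definition."""
    sim = (1.0 + np.cos(theta_pred - theta_gt)) / 2.0   # in [0, 1]
    delta = np.where(penalized, 0.0, 1.0)
    return np.mean(sim * delta)

theta_pred = np.array([0.05, 1.0, 3.0])
theta_gt   = np.array([0.00, 1.2, 0.0])
penalized  = np.array([False, False, True])   # third box is a duplicate
print(orientation_similarity(theta_pred, theta_gt, penalized))
```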
On the other hand, the nuScenes [49] AP metric defines a match by thresholding the 2D center distance d on the ground plane rather than IOU, which decouples the effect of object size and orientation from detection. For the nuScenes dataset, a set of true positive (TP) error metrics (translation, scale, orientation, velocity, and attribute errors) is measured for each prediction matched with a ground truth box. Then, for each TP metric, the mean TP (mTP) over all classes is computed:

$$mTP_{k} = \frac{1}{|C|} \sum_{c \in C} TP_{k,c},$$

where C is the set of classes. Finally, the nuScenes detection score (NDS) is computed as a weighted combination of mAP and the mean TP metrics:

$$NDS = \frac{1}{10} \left[\, 5\, mAP + \sum_{mTP \in \mathbb{TP}} \big(1 - \min(1, mTP)\big) \right].$$
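Assuming the mAP and the five mean TP error metrics have already been computed, the NDS combination above can be sketched as (our own illustrative code):

```python
def nds(map_score, tp_errors):
    """nuScenes detection score: combines mAP with the five mean TP error
    metrics (translation, scale, orientation, velocity, attribute),
    each clipped to [0, 1]."""
    tp_terms = sum(1.0 - min(1.0, e) for e in tp_errors)
    return 0.1 * (5.0 * map_score + tp_terms)

print(nds(0.45, [0.5, 0.3, 0.4, 0.6, 0.2]))  # 0.525
```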
The nuScenes detection score is the primary evaluation metric for the nuScenes dataset. The Waymo Open Dataset [51] uses APH as a 3D object detection evaluation metric, incorporating heading information into the common AP metric:

$$APH = 100 \int_{0}^{1} h(r)\, dr,$$

where h(r) is computed analogously to the precision/recall curve p(r), but with each true positive weighted by its heading accuracy, defined as $1 - \min(|\tilde{\theta} - \theta|,\, 2\pi - |\tilde{\theta} - \theta|)/\pi$, where $\tilde{\theta}$ and $\theta$ are the predicted heading and the ground truth heading in radians within $[-\pi, \pi]$, respectively (refer to the Waymo Open Dataset [51] for details). Most autonomy datasets follow either the KITTI or nuScenes evaluation metrics.
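A small sketch of the APH-style heading-accuracy weight described above (our own code; angles are assumed to be already wrapped to [−π, π]):

```python
import numpy as np

def heading_weight(theta_pred, theta_gt):
    """Heading-accuracy weight for a true positive: 1 for a perfect heading,
    0 for an opposite (pi) heading error. Angles in radians in [-pi, pi]."""
    err = np.abs(theta_pred - theta_gt)
    err = np.minimum(err, 2.0 * np.pi - err)   # wrap-around angular error
    return 1.0 - err / np.pi

print(heading_weight(0.1, -0.1))  # ~0.936
```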

IV. IMAGE 3D OBJECT DETECTION METHODS AND COMPARISON OF VARIOUS METHODS
Image-based object detection methods use images as input.
In this section, we review monocular and stereo image-based methods. 2D object detection has been successfully applied in many domains, but it is not enough for autonomous driving: the vehicle must clearly understand the driving environment in 3D for reliable driving. Because of the lack of accurate depth information, 3D object detection is more challenging for image-based methods. Different methods have been proposed to estimate depth from 2D images so that objects can be detected in 3D using the estimated depth. Some of these methods follow the two-stage detection paradigm, first generating object proposals and then performing regression for 3D bounding box detection and classification. Classic object detection methods use hand-crafted features to generate 2D box proposals [52]-[55].
Others use the ability of deep neural networks to learn complex features from images to generate 2D box proposals [56], [57].
Image-based 3D object detection is thus more challenging due to the lack of depth information. Most depth estimation techniques can be categorized as pseudo-LiDAR, stereo-image, or geometric constraints-based, the last of which uses cues such as the object's shape and key-points to estimate depth. The pseudo-LiDAR methods generate point cloud data from images and apply 3D LiDAR-based methods for detection. Although these methods outperform image-only methods, their accuracy is still limited by the quality of the estimated depth.

A. Pseudo-LiDAR Method
Some works convert monocular or stereo images into a LiDAR representation called pseudo-LiDAR to compensate for the lack of depth information, such as [6], [24], [50], [59]-[61]. Pseudo-LiDAR is a LiDAR-like representation of an image obtained by predicting the depth of each image pixel (the depth map). Wang et al. [50] showed that the representation of the data plays a bigger role than the quality of the data in 3D object detection by converting images into a LiDAR representation (pseudo-LiDAR). Stereo depth estimation was done using the pyramid stereo matching network (PSMNet) [62], DispNet [63], and SPS-Stereo [64], while DORN [65] was used as the monocular depth estimator. The depth map is then back-projected into a 3D point cloud to produce pseudo-LiDAR that mimics the LiDAR signal, as shown in Fig. 7. LiDAR-based detectors can directly process the pseudo-LiDAR data; the AVOD [66] and Frustum PointNet [67] LiDAR-based models were used for the experiments. The experimental results on the KITTI [45] benchmark showed a significant improvement over previous image-only methods.
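The back-projection that produces pseudo-LiDAR is standard pinhole geometry; the following minimal sketch (our own, with illustrative KITTI-like intrinsics as assumptions) converts a predicted depth map into a camera-frame point cloud:

```python
import numpy as np

def depth_to_pseudo_lidar(depth, fx, fy, cx, cy):
    """Back-project a depth map (H, W) into an (N, 3) point cloud in the
    camera frame: x = (u - cx) z / fx, y = (v - cy) z / fy, z = depth."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

depth = np.full((375, 1242), 10.0)   # toy depth map, 10 m everywhere
cloud = depth_to_pseudo_lidar(depth, fx=721.5, fy=721.5, cx=609.6, cy=172.9)
print(cloud.shape)                   # (465750, 3)
```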

B. Stereo Images Method
These methods generate depth from stereo images [3]-[5], [43], [76]-[79]. 3DOP [44] uses stereo images to estimate depth and generates 3D bounding box proposals by encoding object size priors, ground planes, and a variety of depth-informed features, such as point cloud densities and distance to the ground. The problem is formulated as an energy minimization function, and a Markov Random Field (MRF) scores the 3D bounding boxes for proposal generation. Fast R-CNN [17] then predicts the class of each proposal, and object orientation is estimated from the top object candidates. Chen et al. [43] extended this work to the monocular setting (Mono3D), generating class-specific 3D object proposals with very high recall at various IOU thresholds by assuming that objects lie on the ground plane and using only a single monocular image. They use semantic and object instance segmentation, context, shape features, and location priors to score the 3D bounding boxes, as shown in Fig. 9. A limitation of these proposal pipelines is that they must run separately for each object class to achieve high recall, which increases processing time because of the many generated object proposals. To overcome this problem, Pham and Jeon [4] introduced a proposal reranking algorithm, DeepStereoOP. This algorithm achieves high recall and good localization using only a few candidate proposals: a two-stream CNN uses RGB features, depth features, disparity maps, and distance to the ground to rerank the top-ranked candidates.
The results show that the DeepStereoOP algorithm is superior to 3DOP [44] in achieving high recall with fewer proposals.
Chen et al. [3] presented a proposal generation algorithm using stereo imagery and contextual information.The 3D object proposals are generated using an energy minimization function that encodes object size priors, ground plane information, and depth-informed features, such as free space, point cloud densities, and distance to the ground.The CNN scoring network uses appearance, depth, and context information to predict 3D object proposals and object poses simultaneously.
The results outperform previous works, such as [4] and [44], on the KITTI dataset. Königshof et al. [80] put forth a stereo-based 3D object detection method that combines stereo vision with semantic information. The Triangulation Learning Network (TLNet) [76] uses 3D anchors to construct object-level geometric correlation between stereo images; the network then learns the correspondence between the stereo images to triangulate the target object near the anchor. A channel reweighting method is also proposed to enhance informative features and weaken noisy signals by measuring left-right coherence, which avoids the high computational burden of generating dense disparity maps. Stereo CenterNet [77] uses semantic and geometric information in stereo images for 3D object detection. It uses an anchor-free 2D box association method, detecting objects only in the left images and computing the left-right associations by predicting the distance between them. Gao et al. [87] put forth an efficient stereo geometry network (ESGN) for 3D object detection. A ResNet-34 [88] backbone extracts multi-scale feature maps, and the proposed efficient geometry-aware feature generation (EGFG) module constructs multi-scale stereo volumes in camera frustum space using stereo correlation and reprojection modules. The instance-level disparity estimation network (iDispNet) [96] estimates disparity only for regions that contain objects of interest, rather than the entire image, and learns a category-specific shape prior. This helps capture the smooth shapes and sharp edges of object boundaries for more accurate 3D object detection.
The lack of depth in image-based methods can thus be partially solved using stereo images. The 3D object proposals are generated from stereo images using different techniques. Some methods, such as TLNet [76], use anchor triangulation and channel reweighting to enhance features and weaken noise. Other methods formulate the object proposal task as an energy minimization problem. Works such as DeepStereoOP [4] propose a reranking algorithm to reduce redundant proposals and use only a few proposals. Additionally, contextual information can be used together with stereo images for proposal generation.
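Underlying all of these stereo methods is the standard disparity-to-depth relation z = fB/d for focal length f and baseline B; a minimal sketch with illustrative KITTI-like calibration values (our own assumptions):

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m, eps=1e-6):
    """Convert a disparity map (pixels) to metric depth via z = f * B / d."""
    return focal_px * baseline_m / np.maximum(disparity, eps)

disp = np.array([[40.0, 20.0], [10.0, 5.0]])
print(disparity_to_depth(disp, focal_px=721.5, baseline_m=0.54))
# larger disparity -> closer object
```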

C. Geometric Constraints Method
These works create 3D proposals by adding geometric constraints, including object shape, ground planes, and key-points [33], [38], [44], [58], [66], [70], [94], [97]-[104]. Mousavian et al. proposed Deep3DBox [38], a 3D object detection method that incorporates geometric constraints. A hybrid discrete-continuous loss estimates the 3D object orientation, and regression on the 2D bounding box combined with the estimated geometric constraints produces the object's 3D bounding box. M3D-RPN [58] is a single end-to-end region proposal network for 3D object detection that exploits the correlation between 2D scale and 3D depth; its depth-aware convolutional layer improves 3D parameter estimation and enhances 3D scene understanding. Likewise, Mono3D++ [97] jointly predicts a vehicle's shape and pose using a 3D bounding box and a morphable wireframe model from a single RGB image. Unsupervised monocular depth, a ground plane constraint, and vehicle shape priors optimize the loss functions, and the overall energy function integrates the loss with the vehicles' shape and pose to further improve detection. Tying the loss function to vehicle shape may, however, limit the model's performance because of shape differences between vehicles. Some methods use instance-level depth estimation with geometric reasoning, while others combine key-point and geometric information for depth estimation. For example, MonoGRNet [94] is a unified network for 3D object detection from monocular RGB images using geometric reasoning and instance-level depth estimation; the model consists of 2D detection, instance depth estimation, 3D location, and local corner estimation subnetworks. RTM3D [70] predicts the nine perspective key-points of a 3D bounding box and models the geometric relationship between the 3D and 2D points to detect 3D objects from monocular images. Similarly, MoVi-3D [33] is a one-stage deep architecture that leverages geometric information to generate virtual views, using prior geometric knowledge to control the scale variability of objects with depth. GS3D [98] is an efficient model that obtains a coarse cuboid for each predicted 2D box and determines the 3D bounding box by refinement; this improves 3D object detection and performs better than regression-based bounding box prediction. ROI-10D [99] is an end-to-end network for 3D object detection that lifts 2D boxes into 3D to predict six-degrees-of-freedom pose information (rotation and translation); the loss measures the metric misalignment of boxes and minimizes the error by comparison with the ground truth 3D boxes. Ding et al. proposed the Depth-guided Dynamic Depthwise Dilated local convolution (D4LCN) [106] network, where local filters learn specific geometry from each RGB image using a depth map applied locally to each pixel and channel. Some models, such as [34], avoid processing the image multiple times, reducing the computational bottleneck of deep neural networks by generating per-object canonical 3D bounding box parameters with NMS and a nonlinear least squares optimizer. Srivastava et al. [107] developed a 2D-to-3D lifting method for autonomous vehicles' 3D object detection: they generate BEV images from a single RGB image using Generative Adversarial Networks (GANs) for image-to-image translation [108] and then perform 3D object detection on the generated BEV images. Garanderie et al. [109] proposed a 3D object detection model for autonomous vehicles using 360° panoramic imagery, which is important to avoid blind spots in driving; the model was tested using the CARLA [110] urban driving simulator and the KITTI [45] dataset. Liu et al. [111] developed a deep fitting scoring network for monocular 3D object detection. The network generates 3D proposals using anchor-based dimension and orientation regression, and a fitting quality network (FQNet) then infers the spatial relationship between the 3D proposals and objects using only 2D images. Chen et al. [48] proposed a pair-wise spatial relationship-based 3D object detection method in which object locations are computed using uncertainty-aware predictions and 3D distances for adjacent object pairs, and a nonlinear least squares step jointly optimizes the system. By the same token, Bao et al. proposed the MonoFENet [105] network for 3D object detection, which estimates disparity from a monocular image; Fig. 12 shows a disparity image generated using the monocular disparity estimator. The estimated disparity is transformed into a dense 3D point cloud, fed into a point feature enhancement (PointFE) network, and fused with the image features for the final 3D bounding box regression. Bao et al. [113] proposed a two-stage object-aware 3D object detection model that uses both region-wise appearance attention and geometric projection distributions to vote for the 3D bounding boxes; the experimental results on KITTI [45] show that the model outperforms previous models, such as D4LCN [106].
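The 2D-scale-to-3D-depth correlation exploited by methods such as M3D-RPN can be made concrete with the pinhole relation z ≈ fH/h, where H is the object's real height and h its 2D box height in pixels; a hedged sketch (our own, with illustrative values, not any cited model's exact formulation):

```python
def depth_from_height(focal_px, h3d_m, h2d_px):
    """Coarse depth prior from the pinhole model: an object of real height
    H appearing h pixels tall lies at roughly z = f * H / h."""
    return focal_px * h3d_m / h2d_px

# a car ~1.5 m tall spanning 60 px, with an assumed ~721 px focal length
print(depth_from_height(721.5, 1.5, 60.0))  # ~18 m
```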
Some works follow different approaches from those mentioned above to solve the 3D detection problem from 2D image input. Liu et al. put forth RAR-Net [112], a reinforced axial refinement network for monocular 3D object detection. The model starts with an initial prediction and refines it gradually towards the ground truth, changing only one 3D parameter in each step. An ϵ-greedy policy, which maximizes the reward by selecting the action with the highest estimated reward, is used to obtain a reward after each action and to refine the 3D box of the monocular 3D detection network. At each step, information from the image and 3D space is fused, and the current detection is projected into the image space to preserve information. This reinforcement learning-based refinement can be used as a post-processing stage integrated into an existing monocular 3D detection model to improve performance at some extra computational cost. The model was trained on the KITTI dataset [45] and showed promising performance. Mehtab et al. [117] proposed a 3D vehicle detection model using LiDAR and camera sensors: the size and orientation of the 3D bounding boxes are estimated from the RGB images, whereas the LiDAR point cloud is used for distance estimation. The authors used MobileNetV2 [118] as the image feature extractor, and the model was trained and tested on the KITTI [45] and Waymo [51] datasets. Simonelli et al. [46] put forth a self-supervised loss disentangling transformation for monocular 3D object detection. The loss separates the contributions of groups of parameters into terms with the same structure as the original loss. The authors also applied an IOU-based loss for the 2D detection and 3D bounding box predictions and the detection confidence.
The model was trained on the KITTI [45] dataset. The three depth estimation techniques perform different operations to estimate depth from 2D images. The pseudo-LiDAR methods transform the image into a LiDAR representation and use LiDAR-based models to leverage the 3D information of that representation. The stereo-based models, in contrast, do not transform the image into another domain; instead, depth is generated from the left and right stereo images. The geometric constraints methods use additional geometric constraints, including object shape, ground planes, and key-points, to estimate depth information from 2D images. The 3D bounding box encoding technique, 3D object detection evaluation method, datasets used for the experiments, and year of publication of each method are presented in Table I. Table II shows the BEV and 3D performance comparison of the image-based 3D object detection methods on the KITTI [45] validation and test benchmarks.

V. CHALLENGES AND FUTURE DIRECTIONS
Camera images, especially monocular images, are rich in texture and color information, which are essential for color-related tasks such as object classification and lane detection. However, they do not provide high-accuracy depth information for a complete understanding of the surrounding environment. Autonomous driving needs to be robust enough to operate in different weather conditions, but cameras are affected by bad weather. Additionally, DL models evaluated on a domain different from the one they were trained on perform poorly. We present some challenges and future research directions in image-based 3D object detection for AVs.
1) Semisupervised Learning: One of the challenges of supervised learning is annotating and labeling data, which requires time and money. Data annotation and labeling problems can be mitigated using unsupervised learning; however, unsupervised models' detection and classification accuracies are lower than those of supervised models. A potential solution is to apply a semisupervised model that uses few labeled data and many unlabeled data, leveraging the abundance of freely available images. Some teacher-student models, such as Zhang et al. [119], are semi-supervised 3D object detection networks for autonomous driving: the teacher model generates pseudo-labels, the student model trains on the pseudo-labels and the labeled dataset, and the teacher model may then receive an update from the student model for better pseudo-label prediction. This paradigm is mainly used in 2D object detection, and the 3D equivalents are still limited.

2) Multitask Learning: The feature extractor part of DL networks can be common to multiple applications. Therefore, building a model with a common feature extractor (lower layers) and multiple decision heads to perform multiple tasks can save time, memory, and computational power. For example, [120] performs object detection and segmentation as multitask learning. We expect many multitask learning works for AVs.

3) Domain Adaptive Models: DL models should perform equivalently when tested on a domain different from the one they were trained on. However, most DL models perform poorly when the domain changes. Domain adaptive models are essential for autonomous driving to handle country-specific changes, such as traffic sign variability, and corner cases. Therefore, we need domain adaptive models that learn changes in the driving environment and respond quickly to them.

4) Lightweight Models: DL models in AVs should fulfill the following three criteria [1]:

1) Accurate, to give precise information about the surrounding environments.
2) Robust to work in different weather.
3) Real-time, to support high-speed driving.

To achieve the above criteria, DL models should be robust enough to work in different weather and lightweight enough to be deployed on low-power, low-memory embedded hardware. Most existing 3D object detection models are not as lightweight as their 2D equivalents; there are relatively lightweight 2D object detection models, such as YOLO [121] and SSD [29], with few 3D counterparts.

5) Multisensor Fusion: Cameras are suitable for color-related detection and are rich in texture. Although different methods have been developed to compensate for the lack of 3D information, 3D object detection using cameras remains challenging. Additionally, cameras are not robust to adverse weather, which makes robust driving in different weather conditions challenging. Other sensors can provide better 3D information (such as LiDAR) or are more robust to adverse weather (such as radar). Therefore, fusing camera images with LiDAR and/or radar can improve 3D object detection by exploiting the best of different sensors (refer to [14] for a detailed analysis of multisensor fusion methods and sensor fusion techniques in 3D object detection).

VI. CONCLUSIONS
This survey presented DL-based monocular and stereo camera image-based 3D object detection for autonomous driving. The 3D bounding box encoding methods and the corresponding evaluation metrics were summarized. The general object detection categories (one-stage and two-stage) and the depth estimation methods used in 3D object detection were also reviewed. The depth estimation methods were grouped by technique into pseudo-LiDAR, stereo image, and geometric constraints methods. Although 3D object detection using camera images has shown significant performance improvement due to the rapid growth of DL, issues remain to be solved for reliable and robust driving, such as driving in bad weather or at night. The camera sensor is rich in color and texture and inexpensive, but it cannot measure distance at long range, does not withstand bad weather, and does not give direct 3D information [14], [42]. 3D sensors, such as LiDAR and radar, provide 3D information about the driving environment and objects. LiDAR is more robust than a camera in inclement weather and a good choice for long-distance measurement, but it is not rich in color and texture. Similarly, radar is robust in inclement weather and the best choice for distance measurement and velocity estimation, but its low resolution makes radar-based detection difficult. Additionally, there is a possibility of sensor failure during autonomous driving. Thus, using multiple sensors is essential to exploit redundant data from different sensors for reliable and robust driving under bad weather or sensor failure conditions. Lightweight and accurate 3D object detection models are necessary to improve the speed and accuracy of real-time processing. Finally, challenges and possible research directions were presented.

Fig. 1: Two-stage object detection architectural representation. The first stage generates the region of interest (ROI), and the second stage predicts class probabilities and the bounding box for each object. The backbone network and RPN can be designed as one network.

Fig. 2: R-CNN object detection system [15]. The system (1) takes an input image, (2) extracts around 2,000 bottom-up region proposals using a selective search algorithm, (3) computes features for each proposal using a CNN, and (4) classifies each region with linear SVMs.

Fig. 3: One-stage object detection model architectural representation. The model learns the class probabilities and bounding box regression in a single pass through the network, instead of two passes as in the two-stage model.

Fig. 4: The YOLO model [25]. The model divides the image into an S×S grid and predicts bounding boxes, a confidence score for each grid cell, and class probabilities for those boxes.

Fig. 6: Pictorial representation of IOU, best viewed in color. The top is the intersection, and the bottom is the union.

Fig. 7: Generating a pseudo-LiDAR representation from given stereo or monocular images by predicting the depth map and projecting it into a 3D point cloud coordinate system [50].

Fig. 8: Monocular 3D object detection [59]. In the first phase, two backbones perform 2D detection and point cloud generation by estimating depth from RGB images. In the 3D box estimation step, the PointNet backbone network generates each ROI's 3D location, dimension, and orientation.

Fig. 9: Mono3D [43]. The 3D bounding boxes are sampled before being projected to the image representation. Scoring and non-maximal suppression (NMS) are done using multiple features: object shape, class semantics, location priors, instance semantics, and context.

The model was trained and tested on the KITTI [45] dataset. Barabanau et al. [100] also developed a combined key-point-based and geometric reasoning approach for 3D object detection from monocular images. Similarly, Liu et al. presented AutoShape [35], a one-stage real-time shape-aware monocular 3D object detection model. The model employs geometric constraints between 3D key-points and their 2D projections on images to enhance detection performance, and the proposed automatic annotation pipeline can auto-generate the shape-aware 2D/3D key-point correspondences for each object. The model was evaluated on the KITTI [45] car dataset. Likewise, Cai et al. [101] modeled the 3D object detection task as a combination of a structured polygon prediction task and a depth estimation task, where the object height is used as a prior for instance depth estimation.

Fig. 12: MonoFENet architecture [105]. The disparity map is generated using the monocular disparity estimator and concatenated with the associated front-view maps, such as distance, height, and depth. The ROI point clouds are generated using the ROI pooling layer before being fed to the proposed PointFE network. The point cloud features are fused with the RGB image features for 2D and 3D detection. Finally, the outputs of the PointFE network and the 3D detection head are fused to improve performance.
Yihunie Alaba is with the Department of Electrical and Computer Engineering, James Worth Bagley College of Engineering, Mississippi State University, Starkville, MS 39762, USA (e-mail: sa1724@msstate.edu). John E. Ball is with the Department of Electrical and Computer Engineering, James Worth Bagley College of Engineering, Mississippi State University, Starkville, MS 39762, USA (e-mail: jeball@ece.msstate.edu).

He et al. proposed Spatial Pyramid Pooling Networks (SPP-Net) [16] to overcome this problem by introducing a spatial pyramid pooling layer, which generates a fixed-length representation of a region of interest (ROI). R-CNN and SPP-Net train the feature extraction and bounding box regression networks separately, so training takes a long time.
Girshick et al. proposed the Fast R-CNN [17] detector to solve the multistage training problem by simultaneously training the feature extraction and bounding box regression networks. Fast R-CNN also uses a selective search algorithm for proposal generation, which increases the computational burden of the model because of the redundancy of proposal generation; as a result, Fast R-CNN's detection speed is too low for real-time applications. To solve this problem, Faster R-CNN [18] uses a region proposal network instead of the selective search algorithm to generate region proposals. Many improvements have been made based on Faster R-CNN, such as RFCN [19], Mask R-CNN [20], Light Head R-CNN [22], and Feature Pyramid Network [23].

TABLE I: Image 3D Object Detection Methods Comparison based on 3D Bounding Box (BBox) Encoding Techniques, Evaluation Metrics, Datasets Used, and Year of Publication. See Sections III-B and III-C for 3D bounding box encoding and evaluation techniques, respectively.

TABLE II: BEV and 3D performance comparison of selected image-based 3D object detection methods on the KITTI [45] validation benchmark (t indicates experiments on the test benchmark). r40 means the mAP is calculated for 40 recall points instead of 11. E stands for easy, M for moderate, and H for hard.

One method combines mathematical priors and uncertainty modeling and proposes an efficient Hierarchical Task Learning (HTL) strategy to reduce the instability caused by task dependency in geometry-based methods (error amplification), where errors in earlier tasks amplify the estimated depth error. The HTL strategy controls the overall training process by keeping each task idle until its pre-tasks are well trained. The experimental results on the KITTI dataset [45] outperform methods such as MoVi-3D [33] and RAR-Net [112].