Object Detection in Traffic Videos: A Survey

Hadi Ghahremannezhad, Member, IEEE, Hang Shi, and Chengjun Liu

Abstract—Traffic video analytics has become one of the core components in the evolution of transportation systems. Artificially intelligent traffic management systems apply computer vision techniques to alleviate the monotony of manually monitoring the video feeds from surveillance cameras. Object detection is the most important step in these systems, and much research has been done on identifying objects in traffic scenes. This paper reviews various algorithms used for object detection in traffic surveillance, in addition to the recent trends and future directions. Based on the approaches used in the related studies, the object detection methods are categorized into motion-based and appearance-based techniques. Each group of techniques is further classified into a number of subcategories, and the advantages and disadvantages of each method are finally analyzed. The major challenges, limitations, and potential solutions are also discussed along with the future directions.
Index Terms-Object detection, instance segmentation, foreground detection, deep learning, traffic surveillance.

I. INTRODUCTION
The functioning capacity of intelligent transportation systems (ITSs) relies heavily on the competence of traffic data collection via sensors and the performance of the algorithms designed for the automatic processing of the collected data. Traffic cameras are cost-effective and provide a rich source of visual data, as well as a vast area of coverage. Revolutionary breakthroughs in the field of computer vision have enabled modern traffic management systems to effectively process the footage obtained from traffic cameras. Many studies have designed intelligent techniques to automate traffic monitoring systems.
Among the core components of intelligent traffic video analytics [1], object detection is the most crucial step and researchers have attempted various approaches to use appearance and motion clues to locate objects of interest in traffic scenes. Appearance-based techniques extract useful features from each input image, whereas motion-based methods tend to use temporal information to identify the object locations. The appropriate approach is taken in accordance with the specific application at hand and with regard to the camera location, video quality, and computational limitations.
Although there are recent survey papers about traffic surveillance systems [1] and object detection [2], no study is particularly focused on object detection in traffic surveillance applications and its specific challenges. Considering that traffic surveillance is one of the main applications of computer vision and object detection is the most critical step in video analytics systems, there is a need to investigate the different approaches to object detection in traffic videos and the challenges commonly faced by current methods. In this paper, we focus on object detection in the applications of monocular traffic surveillance and outline the most notable studies and algorithms. The main contributions of this paper are three-fold:
• A comprehensive review of various techniques used for object detection in the applications of monocular traffic surveillance is presented together with the advantages and disadvantages of different methods.
• A list of the publicly available datasets that are constructed for the task of object detection in traffic surveillance applications is provided along with the most common performance evaluation metrics.
• The main challenges facing object detection methods are reviewed, accompanied by an outline of future research scope and directions.
The following sections contain an overview of the various motion-based and appearance-based object detection methods applied in traffic monitoring systems. Publicly available datasets and some useful video analytics frameworks are presented in Section V. Section VI outlines the future scope and main directions of research in designing object detection methods for traffic surveillance systems. Further discussions, perspectives, and conclusions are summarized in Section VII.

II. MOTION-BASED METHODS
Exploiting temporal features is very useful in many surveillance applications because the usual objects of interest exhibit significant motion relative to the stationary background. In highway and road traffic surveillance, the objects of interest can often be distinguished from the background solely by exploiting temporal features. Moreover, motion segmentation techniques have proven very practical in real-time applications due to their generalization ability and computational efficiency. Among the various motion segmentation approaches, frame differencing, optical flow, and statistical background modeling have been frequently applied in traffic video analytics.

A. Frame Differencing
Frame differencing is the simplest motion estimation method: the locations of moving objects are estimated by calculating the absolute intensity difference between adjacent frames and applying a threshold to the result. Several studies have applied frame differencing to detect moving traffic objects such as vehicles [12]. Although this method is simple and fast, it is prone to errors, and its performance suffers in many challenging scenarios, such as illumination changes.
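The thresholded difference described above can be sketched in a few lines of NumPy (a minimal illustration, not a production detector; the threshold value of 25 is an arbitrary choice):

```python
import numpy as np

def frame_difference_mask(prev_gray, curr_gray, threshold=25):
    """Binary motion mask: 1 where the absolute intensity difference
    between consecutive grayscale frames exceeds the threshold."""
    diff = np.abs(curr_gray.astype(np.int16) - prev_gray.astype(np.int16))
    return (diff > threshold).astype(np.uint8)
```

In practice, the raw mask is usually cleaned with morphological operations before connected components are extracted as object candidates.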

B. Optical Flow
Another way of detecting moving objects is to exploit the correlation between adjacent frames and find corresponding key points in order to calculate the optical flow vectors of the moving objects, which describe the instantaneous velocities of certain points in the image. Optical flow algorithms have been applied in traffic video analytics for various purposes, including motion-based object detection [13]. These methods are computationally expensive; therefore, limiting the calculations to a lower frequency or a smaller reference region helps achieve real-time performance.
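A full optical-flow estimator is beyond a short sketch, but the core idea of matching a region between adjacent frames can be illustrated with exhaustive block matching over a small search window (a crude stand-in for optical flow, not a method from the surveyed literature; block size and search radius are arbitrary choices):

```python
import numpy as np

def block_motion_vector(prev, curr, top, left, size=8, radius=4):
    """Estimate the (dy, dx) displacement of the size x size block at
    (top, left) in `prev` by exhaustive SSD search within `radius`
    pixels in `curr`."""
    block = prev[top:top + size, left:left + size].astype(np.float64)
    best, best_ssd = (0, 0), np.inf
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + size > curr.shape[0] or x + size > curr.shape[1]:
                continue  # candidate block would fall outside the frame
            cand = curr[y:y + size, x:x + size].astype(np.float64)
            ssd = np.sum((block - cand) ** 2)
            if ssd < best_ssd:
                best_ssd, best = ssd, (dy, dx)
    return best
```

Restricting `radius` (the reference region) is exactly the kind of trade-off mentioned above for reaching real-time performance.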

C. Background Modeling
In traffic video surveillance, background modeling is by far the most popular approach for detecting moving objects due to its compromise between efficiency and performance. Background subtraction methods have been applied to traffic videos in a large number of studies [14], [15], [16], [17]. In spite of recent advances in foreground detection research, most real-world systems tend to apply relatively older techniques due to limitations in computational capacity and the lack of collaboration between researchers and industry [18]. Some of the representative methods include MoG [19], AGMM [20], codebook [21], Multi-Cue [22], PAWCS [23], PBAS [24], and ViBe [25]. A number of representative studies that have applied motion segmentation for object detection in traffic videos are listed in Table I.
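The principle behind these methods can be illustrated with a far simpler model than MoG or ViBe: an exponentially weighted running average of the frames, thresholded to obtain the foreground (a didactic sketch only; the learning rate and threshold are arbitrary):

```python
import numpy as np

class RunningAverageBackground:
    """Simplified background model: an exponential running average of
    the frames, with thresholded differencing for foreground detection."""

    def __init__(self, first_frame, alpha=0.05, threshold=30):
        self.background = first_frame.astype(np.float64)
        self.alpha = alpha          # background adaptation rate
        self.threshold = threshold  # foreground decision threshold

    def apply(self, frame):
        diff = np.abs(frame.astype(np.float64) - self.background)
        mask = (diff > self.threshold).astype(np.uint8)
        # Update the model only at background pixels so foreground
        # objects are not absorbed into the background too quickly.
        bg = mask == 0
        self.background[bg] = ((1 - self.alpha) * self.background[bg]
                               + self.alpha * frame.astype(np.float64)[bg])
        return mask
```

Statistical models such as MoG maintain several Gaussian modes per pixel instead of a single average, which is what lets them cope with dynamic backgrounds.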

D. Challenging Scenarios Faced by Motion-Based Methods
Despite all the benefits in terms of generalization and computational efficiency, locating objects based on motion information comes with its own set of challenging problems. These problems stem from general issues such as non-stationary cameras and stopped objects, as well as challenges that are specifically common in traffic surveillance, such as moving cast shadows, illumination variations, and occlusions [14]. Among these challenging situations, only cast shadows can be handled successfully by current techniques; the other issues can hardly be solved with motion-based object detection algorithms alone. Here, we go over the most common challenges in using motion-based methods for object detection in traffic videos.
1) Moving Cast Shadows: Cast shadows are especially a problem for traffic surveillance videos, as they frequently occur during the day and can negatively impact tasks such as vehicle tracking and classification. Many studies have attempted to suppress the cast shadows in motion segmentation algorithms [26], [27], [28]. Ghahremannezhad et al. [27], [28] propose an illumination-invariant feature to suppress the cast shadows in outdoor environments. Although cast shadows can severely impact the performance of motion-based object detection methods, they can be effectively suppressed using current methods without incurring significant additional computation.
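The intuition behind many shadow-suppression features is that a cast shadow darkens the background roughly uniformly across color channels while preserving chromaticity. A minimal sketch of this classical heuristic (not the illumination-invariant feature of [27], [28]; the attenuation bounds and tolerance are illustrative):

```python
import numpy as np

def shadow_mask(frame, background, low=0.4, high=0.9, tol=0.1):
    """Label pixels as cast shadow when they darken the background
    roughly uniformly across the color channels."""
    f = frame.astype(np.float64)
    b = background.astype(np.float64) + 1e-6  # avoid division by zero
    ratio = f / b                             # per-channel attenuation
    mean_ratio = ratio.mean(axis=-1)
    darkened = (mean_ratio > low) & (mean_ratio < high)
    # Chromaticity is preserved when all channels attenuate similarly.
    uniform = (ratio.max(axis=-1) - ratio.min(axis=-1)) < tol
    return (darkened & uniform).astype(np.uint8)
```

Pixels flagged by such a mask are removed from the foreground before connected components are extracted.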
2) Non-Stationary Cameras: In modern surveillance systems, remote-control pan-tilt-zoom (PTZ) cameras enable the operators to direct the attention to a specific event or survey a different area. There are several studies that address motion segmentation in the case of a moving camera [29]. Ghahremannezhad et al. [30] apply optical flow to estimate the global motion and compensate for the camera movements by using adaptive statistical and sample-based models in order to extract the foreground objects in videos captured by non-stationary cameras. Despite these efforts, motion segmentation methods are more suited to applications with stationary cameras and sparse objects, such as highway and road surveillance.
3) Stopped Objects: Most foreground detection methods fail to keep detecting the moving objects after they stop. This is specifically problematic for traffic surveillance at intersections. However, in highway and road traffic, stopped vehicles can be detected and marked as anomalies by motion-based methods [17], [31]. Some studies have attempted a combination of motion-based and appearance-based methods [32], [33]. These studies assume the stopped vehicles are merged into the background model and they can be located by applying an appearance-based object detection method on the background image.
4) Weather and Illumination Variations: Traffic surveillance systems are required to work day and night under adverse weather conditions and illumination changes, in the presence of large shadows and reflections. These variations can lead to sizable drops in the performance of motion-based object detection methods. Although some studies have attempted to solve these issues [34], motion-based methods tend to fail under illumination changes.
5) Occlusions: Locating objects of interest solely based on motion information is prone to severe performance drops in the case of object occlusions. Every connected component in the foreground mask is considered to be an object instance; however, multiple nearby objects can easily fall into the same component. This is especially problematic for intersection surveillance and congested traffic. Despite limited efforts [35], this issue cannot be handled by pure motion-based techniques, and appearance-based methods are preferred in scenarios that are prone to occlusions.

III. APPEARANCE-BASED METHODS
Appearance-based object detection methods use techniques such as handcrafted features, CNNs, transformers, or a combination of these to detect and classify objects of interest. Traditional methods, which were popular before the rise of deep learning in 2014, use handcrafted features. Histogram of Oriented Gradients (HOG) [44] and Haar-like features [45] are considered the milestones of traditional object detection methods in traffic videos. Recent methods use CNNs and/or transformers to operate in an end-to-end manner without engineering meaningful features. CNN-based methods are widely used in traffic video analysis studies, and transformers have become more popular in recent years. Among the CNN-based methods, Faster R-CNN [46], Mask R-CNN [47], and different versions of You Only Look Once (YOLO) [48] are the most widely used in traffic surveillance applications and are considered to be the milestones.

A. Methods Based on Handcrafted Features
The early object detection algorithms revolved around extracting carefully engineered features that represent useful information in the images. The sophistication of the designed features kept improving throughout the years and different techniques were devised to speed up the computations. Table II lists some of the studies that have used handcrafted features to detect road users in traffic videos.
1) Low-Level Features: Low-level features, such as color, gradient, and symmetry have been applied for detecting traffic-related objects for many years. Despite the convenient feature extraction process, low-level features generally do not provide enough useful information, and many studies have attempted to combine a number of these features for better detection results [49].
2) Feature Descriptors: Local feature descriptors were introduced in later studies to create vector representations of local neighborhoods and increase the ability to handle scale variations, rotations, and occlusions. Various handcrafted feature descriptors, such as Histogram of Oriented Gradients (HOG) [44], local binary patterns (LBP) [50], scale-invariant feature transform (SIFT) [51], speeded up robust features (SURF) [52], Gabor features [53], Haar-like wavelets [45], and BDF [54], have been proposed to extract valuable information from the images.
This category of object detection approaches involves two stages: hypothesis generation (HG), or proposal generation, which generates proposals for potential object candidates, and hypothesis verification (HV), or proposal classification, which verifies and classifies the generated hypotheses. The verification is done by applying machine learning techniques, such as support vector machines (SVM) [55] or adaptive boosting (AdaBoost) [58].

B. Methods Based on CNN-Driven Features
The ability of deep convolutional neural networks (DCNNs) to learn high-level feature representations has helped improve the performance of various computer vision tasks, including object detection. Object detection approaches based on CNN-driven features can generally be grouped into two categories: two-stage and one-stage methods [2]. Table III compares the performance of CNN-based object detection methods when tested on the COCO 2017 validation dataset [59]. AP 0.5 indicates the average precision when the IoU between the predicted bounding box and the ground truth is over 0.5. AP [0.5:0.95] is the mean average precision for IoU thresholds in the range [0.5, 0.95] with a step size of 0.05. Note that the frame rate reported in frames per second (FPS) is hardware-dependent and may vary across platforms. For example, YOLOv5-nano is faster than YOLOv7-tiny when tested on TESLA P100 and RTX 4090 GPUs, while YOLOv7-tiny is faster than YOLOv5-nano on a TESLA V100 GPU. For traffic surveillance applications, the faster methods are usually preferred.
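The IoU criterion underlying these metrics can be sketched directly (a minimal illustration; the box format and threshold grid follow the common COCO convention):

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# COCO-style AP[0.5:0.95] averages AP over these ten IoU thresholds.
COCO_IOU_THRESHOLDS = np.arange(0.5, 1.0, 0.05).round(2)
```

A predicted box counts as a true positive at a given threshold only if its IoU with a ground-truth box exceeds that threshold, which is why AP [0.5:0.95] is a stricter metric than AP 0.5.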
CNN-based object detection methods have been widely applied to traffic videos. As opposed to object detection methods based on handcrafted features, CNN-based methods can be applied in multi-class object detection tasks due to the multi-class classification layer at the end of the network. Figure 2 shows the improvements in object detection on some of the traffic video datasets over the years. Some of the representative studies that have applied CNN models for object detection in traffic surveillance videos are listed in Table V.
1) Two-Stage Object Detection: The most representative two-stage object detection approaches are Regions with CNN features (R-CNN) [68], Spatial Pyramid Pooling Networks (SPP-Net) [79], Fast R-CNN [80], Faster R-CNN [46], and Cascade R-CNN [81]. This group of methods consists of an initial region proposal step that generates candidate rectangular regions for objects of interest, followed by a region verification step that trains classifiers to determine the presence of the targets in the candidate locations.
a) R-CNN family: The Region-based Convolutional Neural Network (R-CNN) [68] was a milestone in object detection studies, and it opened the doors to a new world of using CNNs for locating objects in images. R-CNN was barely used for video analytics systems due to the slow region proposal process. The speed was enhanced in the subsequent versions, such as Fast R-CNN [80], which improved upon R-CNN by using a shared convolutional feature map for the entire image rather than a separate forward pass for each region proposal, and Faster R-CNN [46], which introduced a fully convolutional region proposal network (RPN) that generates a number of candidate regions from an input image. Faster R-CNN replaced the use of image pyramids with anchor boxes, where bounding boxes of different aspect ratios are used to localize the objects. The feature maps generated in the convolutional layers are passed to the RPN to obtain bounding boxes with their classifications. The selected regions are then mapped to the feature maps of the previous CNN layer and eventually fed to the fully connected layers, classifiers, and bounding box regressors. Due to its high accuracy and efficiency, Faster R-CNN has been abundantly applied to detect objects in traffic video analytics to this day [33], [82], [83], [84]. Another major version of the R-CNN family is Mask R-CNN [47], which extends Faster R-CNN by adding a branch for predicting an object mask. Instance segmentation takes object detection one step further and represents each instance of a class by a segmentation mask.
Fig. 2. Object detection results over the years on the datasets of [61], UAVDT [62], Haze-Car [63], BIT-Vehicle [64], UTSD [65], and MIO-TCD [66]. The tested detectors depicted in the figure are DPM [67], R-CNN [68], Faster R-CNN [46], YOLO [48], SSD [69], R-FCN [70], YOLOv2 [71], CenterNet [72], EfficientDet [73], YOLOv4 [74], YOLOv5 [75], DETR [76], Deformable DETR [77], and Swin Transformer [78].
Figure 3 shows the general architecture of the Mask R-CNN method. Many studies have attempted to use Mask R-CNN to detect objects in traffic videos [85], [86], [87], [88], [89], [90], [91], [92]. For instance, in [90], a traffic surveillance system is constructed that applies an instance segmentation technique based on Mask R-CNN to extract traffic volume and comprehensive vehicle information, including type, 3D bounding box, number of axles, speed, length, and current driving lane.
b) SPP-Net: In order to handle images of varying sizes, He et al. [79] introduced the Spatial Pyramid Pooling (SPP) layer [93], which is inserted between the last convolutional layer and the fully connected layer. This layer performs pooling operations over different scales, subsampling the feature maps and allowing the network to handle images of arbitrary dimensions without the need for resizing. Additionally, SPP-Net is able to extract features at multiple scales, improving the detection of objects of different sizes. SPP-Net can be combined with other CNN-based object detection models, such as Faster R-CNN or YOLO, to improve their performance. Similar to R-CNN, a fast mode of the Selective Search algorithm [94] is applied to generate candidate regions, and features are extracted through the convolutional layers. The feature maps are converted to fixed-length vectors in the pyramid pooling layer, and the vectors are then passed to the fully connected layer. Figure 4 illustrates the overall architecture of the SPP-Net method in object detection. The main advantages of SPP-Net over R-CNN are the increased speed and the ability to process images of various aspect ratios. Given the frequent scale variations in traffic surveillance, researchers have utilized SPP-Net for improved object detection in many studies [95], [96].
However, the SPP layer increases the computational complexity of the network and the increased number of parameters in the network may lead to overfitting if not enough data is available. Also, SPP-Net is generally less efficient than more recent object detection models, such as RetinaNet [97] and CenterNet [72].
c) R-FCN: The Region-based Fully Convolutional Network (R-FCN) [70] proposed a fully convolutional architecture to reduce the computations of the R-CNN family of object detection methods, which repeatedly apply a costly sub-network to each region. For this purpose, R-FCN uses position-sensitive score maps to reconcile translation variance in object detection with translation invariance in classification. Relative spatial information of each object is kept by the position-sensitive score maps and pooled later to find the exact locations. Additionally, the likelihood scores of grid cells are averaged over each region of interest to predict the object categories. Generally speaking, R-FCN combines the accuracy of Faster R-CNN [46] with the improved speed of a fully convolutional design. Several studies have applied R-FCN to traffic surveillance videos in recent years [65]. Figure 5 presents the general architecture of the R-FCN method.
2) One-Stage Object Detection: In one-stage methods, object classification and bounding box regression are performed without using pre-generated region proposals. There have been several milestones in one-stage object detection, including different versions of You Only Look Once (YOLO) [48], [71], [74], [75], [98], [99], the Single Shot MultiBox Detector (SSD) [69], and RetinaNet [97]. In addition, researchers have introduced fast one-stage instance segmentation methods, such as YOLACT [100], YOLACT++ [101], and SOLO [102]. This group of methods requires only one pass through the neural network to predict the bounding boxes that contain the target objects. One-stage instance segmentation methods have also been applied in traffic-related applications [103], [104]. In [103], a novel neural network architecture, namely SOLACT, with a multi-resolution feature extraction backbone is proposed for instance segmentation in traffic surveillance videos with real-time performance on embedded devices.
a) YOLO family: You Only Look Once (YOLO) [48] treated the task of object detection as a regression problem. In this method, the input image is divided into a grid, where each cell of the grid is responsible for the objects whose centers fall into that cell. Multiple bounding boxes along with their confidence scores are predicted at each grid cell. A combination of the losses from the predicted components is used to optimize the model, and Non-Maximum Suppression (NMS) is applied to deal with multiple detections. YOLO was one of the major milestones in single-stage object detection due to its substantial improvements in accuracy and speed. However, the ability of this method is limited when it comes to the localization of small or cluttered objects and the number of objects detected per grid cell.
These limitations are addressed in the later versions, including YOLOv2 [71], YOLOv3 [98], YOLOv4 [74], YOLOv5 [75], YOLOX [105], YOLOR [106], PP-YOLO [107], YOLOv7 [99], and YOLOv8 [108].
Methods from the YOLO family have been frequently applied to the task of object detection in traffic videos [92], [109], [110], [111], [112], [113], [114], [115]. One of the most recent versions of this family is YOLOv7 [99], which has been used in a few traffic surveillance systems [116]. Due to its high efficiency and accuracy and its ability to extract object masks, YOLO is a very good candidate for traffic surveillance applications. Figure 6 displays the general architecture of the YOLO method.
b) SSD: The Single Shot MultiBox Detector (SSD) [69] was the first one-stage method to achieve the accuracy of representative two-stage detectors such as Faster R-CNN [46] while keeping the speed at the real-time level. With several layers added to the backbone network, SSD generates feature maps of different sizes. The added layers predict the offsets of default boxes with different aspect ratios. A weighted sum of the confidence loss and the localization loss is used to train the model. Although SSD outperforms YOLO [48] and Faster R-CNN [46] in both speed and accuracy, it struggles with detecting small objects. SSD has been one of the popular object detection methods in traffic surveillance applications [117], [118], [119], [120]. Figure 7 shows an overview of the SSD architecture.
c) CenterNet: Unlike the previous one-stage object detection methods, CenterNet [72] represents objects as points rather than bounding boxes. Each object is predicted as a single point located at the center of its bounding box. The input image is passed through the layers of a fully convolutional network, which outputs a heatmap indicating the object centers. CenterNet does not require non-maximum suppression (NMS) and has higher accuracy and speed compared to its predecessors. However, its markedly different approach and backbone architectures make it incompatible with other popular object detectors.
Therefore, it has not been used as commonly in traffic surveillance and other vision applications [121], [122]. Figure 8 presents an overview of the CenterNet architecture.
d) EfficientDet: EfficientDet [73] is another representative object detection model focused on improving the efficiency of object detection. It introduces key optimization techniques, such as a weighted bi-directional feature pyramid network (BiFPN) for fast fusion of input features at different scales, and a compound scaling coefficient for jointly scaling up the resolution, depth, and width of the backbone, feature network, and prediction networks. With EfficientNet [123] as the backbone network and multiple refinements, this method achieves better efficiency while maintaining high accuracy compared to previous detectors. EfficientDet has been applied in several recent computer vision studies, including real-time video analytics applications such as traffic surveillance [124], [125].
3) Lightweight Object Detection: In spite of all the efforts made toward increasing the efficiency of object detectors, the requirement for extensive computational resources still limits the applications of these methods. In recent years, interest in deploying deep learning applications on edge computing and other resource-constrained platforms has been growing, and traffic surveillance systems are no exception. Lightweight networks such as SqueezeNet [134], MobileNet [135], and ShuffleNet [136] can be applied to detect objects on edge devices. With the development of smart cities, several studies have attempted to take advantage of these more efficient models to bring advanced AI models to the edge for bandwidth and privacy optimization in traffic surveillance [103], [117], [137], [138].
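Most of the one-stage detectors above (CenterNet excepted) rely on greedy non-maximum suppression to prune overlapping predictions. A minimal NumPy sketch of the standard greedy algorithm:

```python
import numpy as np

def non_maximum_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, discard boxes that
    overlap it beyond iou_threshold, and repeat on the remainder.
    `boxes` is an (N, 4) array of (x1, y1, x2, y2)."""
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        rest = boxes[order[1:]]
        # Vectorized IoU between the kept box and all remaining boxes.
        ix1 = np.maximum(boxes[i, 0], rest[:, 0])
        iy1 = np.maximum(boxes[i, 1], rest[:, 1])
        ix2 = np.minimum(boxes[i, 2], rest[:, 2])
        iy2 = np.minimum(boxes[i, 3], rest[:, 3])
        inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (rest[:, 2] - rest[:, 0]) * (rest[:, 3] - rest[:, 1])
        overlap = inter / (area_i + areas - inter)
        order = order[1:][overlap <= iou_threshold]
    return keep
```

The IoU threshold controls the trade-off between duplicate suppression and losing genuinely adjacent objects, which is one reason NMS struggles in congested traffic scenes.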

C. Methods Based on Transformers
Transformers were introduced in the field of Natural Language Processing (NLP) as an improvement over Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) strategies [139]. They were initially designed to model text sequences by employing an attention mechanism to establish long dependencies among different elements.
Transformers have been shown to be good building blocks for other tasks, as they are good at extracting features automatically and can be pre-trained and later fine-tuned for each specific task. Transformers entered computer vision in 2020, when they were used for object detection and image classification with Detection Transformers (DETR) [76] and Vision Transformers (ViT) [140]. While DETR used a combination of convolutional neural networks and transformers for object detection, ViT sliced each image into a number of patches, which were fed to the transformer architecture to generate features for image classification.
Although pure transformers rely heavily on large datasets due to the lack of inductive bias [141], they have been shown to outperform methods that are based on CNN-driven features. In general, transformer-based object detection methods are categorized into three groups: (i) a CNN backbone for feature extraction with a transformer decoder as the detection head, (ii) a transformer backbone for feature extraction with a CNN head for detection, and (iii) pure transformers for end-to-end object detection [141], [142]. Object detection methods that operate on the basis of transformers have already found their way into traffic surveillance applications [143], [144], [145], [146]. As transformers are able to model spatial and temporal relationships, their potential in traffic surveillance applications should be realized through further investigation [147], [148].
1) CNN Backbone With Transformer Head: In DETR [76], the object detection task is treated as a set prediction problem, where a set of bounding boxes is predicted from a set of image features in an end-to-end manner. The spatial feature maps are first extracted by a CNN backbone from the input image. The encoder then flattens the feature maps and supplements them with fixed positional encodings. The decoder applies an attention mechanism to N learned object queries, producing N fixed-length output embeddings, where N is a predefined number that is much larger than the typical number of objects in an image. Finally, feed-forward neural networks (FFNs) are employed to predict the bounding box coordinates and the class of each object. A loss function based on the Hungarian algorithm [149] is used to achieve an optimal bipartite matching between the predicted and ground-truth objects. Figure 10 represents the overall architecture of the DETR method. One of the advantages of DETR over CNN-based methods is the elimination of region proposals and non-maximum suppression.
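The bipartite matching at the heart of DETR's set-prediction loss can be illustrated by brute force for small problem sizes (DETR itself uses the Hungarian algorithm, which scales polynomially; this exhaustive version is for illustration only):

```python
from itertools import permutations

def optimal_matching(cost):
    """Return the one-to-one assignment of predictions (rows) to
    ground-truth objects (columns) minimizing total cost, found by
    exhaustive search. `cost` is a square list of lists where
    cost[i][j] is the matching cost of prediction i and object j."""
    n = len(cost)
    best_perm, best_total = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best_total, best_perm = total, perm
    return list(best_perm), best_total
```

In DETR, the per-pair cost combines classification probability and box regression terms; unmatched predictions are assigned to a "no object" class.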
DETR or similar transformer-based methods with CNN backbones [77], [150], [151] have been used for object detection in a number of recent traffic-related studies [61], [144], [152], [153], [154]. For example, Zhang et al. [154] attempt to enhance the vehicle localization performance by adding a dynamic anchor box generation module between the transformer encoder and decoder. To improve the query design and speed up the training convergence, the anchor boxes are generated based on the regional and content information before the decoder stage. Their method achieves a higher average precision compared to other methods, including DETR [76], Anchor DETR [130], Conditional DETR [128], Deformable DETR [77], SMCA [127], DAB-DETR [131], and DN-DETR [155].
2) Transformer Backbone With CNN Head: Another group of studies [78], [156], [157], [158] uses transformers as the backbone with a CNN-based detection head. The input image is partitioned into a number of patches, which are fed to a vision transformer that generates features. These features are then passed to a CNN head for detection. For instance, the ViT-FRCNN method [159] feeds the feature maps produced by ViT [140] to a detection head that is designed based on Faster R-CNN [46] to perform the final detection and classification tasks. The Swin transformer [78] uses a multi-stage hierarchical architecture to compute self-attention within non-overlapping local windows. The input image is split into multiple non-overlapping patches, which are converted into embeddings. The number of patches is then reduced through Swin Transformer Blocks (STBs) in four hierarchical stages. Each Swin Transformer Block (STB) consists of local multi-headed self-attention (MSA) modules followed by two layers of Multilayer Perceptrons (MLPs). Each MSA module and each MLP is preceded by a LayerNorm (LN) layer and followed by a residual connection. The partitioning of the windows is gradually shifted to capture interactions from different locations. Figure 11 shows the overall architecture of the Swin transformer used as the backbone for object detection. Although object detection methods based on Swin transformers achieve some of the best results [142], they are computationally expensive and demand high GPU memory. Using transformers as the backbone with a CNN head for object detection has demonstrated promising results in traffic surveillance applications [143], [145], [146]. Deshmukh et al. [145] use a backbone based on a Swin transformer to extract multi-scale features from each image.
The low-resolution and high-resolution extracted features are combined by a bi-directional feature pyramid network (BIFPN) to generate discriminative multi-scale feature maps for robust vehicle detection in undisciplined traffic scenes.
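The window mechanism described above can be illustrated with a short sketch in pure Python (the toy 4x4 feature map, window size of 2, and single-pixel shift are illustrative assumptions; Swin uses larger windows and a half-window cyclic shift):

```python
def window_partition(feat, ws):
    """Split an H x W feature map (nested lists) into non-overlapping
    ws x ws windows; self-attention is computed within each window."""
    h, w = len(feat), len(feat[0])
    assert h % ws == 0 and w % ws == 0, "map must tile evenly"
    windows = []
    for top in range(0, h, ws):
        for left in range(0, w, ws):
            windows.append([row[left:left + ws] for row in feat[top:top + ws]])
    return windows

def shift_map(feat, offset):
    """Cyclically shift the map so the next partition captures
    cross-window interactions (the 'shifted window' trick)."""
    rolled_rows = feat[offset:] + feat[:offset]
    return [row[offset:] + row[:offset] for row in rolled_rows]

# A 4x4 toy map partitioned into four 2x2 windows.
feat = [[r * 4 + c for c in range(4)] for r in range(4)]
windows = window_partition(feat, 2)
shifted = window_partition(shift_map(feat, 1), 2)
```

Because attention is restricted to each window, its cost grows linearly with image size rather than quadratically, which is what makes the hierarchical stages affordable as a detection backbone.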
3) Pure Transformers: A number of recent studies have attempted to perform object detection with pure transformers [129], [157]. For example, the You Only Look at One Sequence (YOLOS) method [129], which is designed based on ViT [140], detects objects using only a transformer architecture. The classification tokens are replaced with multiple learnable object detection tokens. Then, as in DETR [76], a bipartite matching loss is applied instead of the image classification loss to perform detection in a set prediction manner. Similarly, the Pyramid Vision Transformer (PVT) [157] attempts to overcome the challenges of using pure transformers, such as ViT [140], in dense prediction tasks through convolution-free networks. This group of methods has not yet been used for object detection in traffic surveillance studies. However, they have the potential to gain attention in such studies due to their simplicity and comparable performance.
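The bipartite matching step shared by DETR and YOLOS can be sketched as follows; the brute-force assignment and the toy cost matrix stand in for the Hungarian algorithm and the real matching cost (class probability plus box distance), which is an assumption for illustration:

```python
from itertools import permutations

def bipartite_match(cost):
    """Find the one-to-one assignment of predictions (rows) to
    ground-truth objects (columns) with minimal total cost. Brute force
    suffices for the tiny example; DETR-style detectors use the
    Hungarian algorithm for the same optimum."""
    n_pred, n_gt = len(cost), len(cost[0])
    assert n_pred >= n_gt, "need at least as many predictions as objects"
    best, best_perm = float("inf"), None
    for perm in permutations(range(n_pred), n_gt):
        total = sum(cost[p][g] for g, p in enumerate(perm))
        if total < best:
            best, best_perm = total, perm
    # Pairs of (prediction index, ground-truth index); unmatched
    # predictions are trained toward the "no object" class.
    return [(p, g) for g, p in enumerate(best_perm)], best

# 3 predicted tokens, 2 ground-truth objects
cost = [[0.9, 0.1],
        [0.2, 0.8],
        [0.5, 0.5]]
pairs, total = bipartite_match(cost)
```

The set-prediction loss is then computed only over the matched pairs, which is what removes the need for NMS-style post-processing in this family of detectors.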

D. Challenging Scenarios Faced by Appearance-Based Methods
There is a variety of challenging scenarios where current appearance-based methods fail to deliver satisfactory performance. These scenarios involve general object detection issues, including high false positive rates and the need for large amounts of training data, as well as problems that are especially common in traffic surveillance applications, such as variations in object scales, time constraints, imbalanced classes, frequent occlusions, and unpredictable changes in weather and lighting conditions. Here we highlight some of the studies that have addressed these challenges.
1) High Number of False Positives: A high false positive rate (FPR) in object detection can degrade the performance of an intelligent surveillance system to the point of being impractical. One of the main approaches to avoiding unnecessary false positives is to determine a Region of Interest (ROI). Many studies have taken advantage of this approach to improve performance, both in terms of fewer false positives and faster computations [85], [87], [90], [111], [122], [166]. The sizes and aspect ratios of the objects can also be used to filter out some of the false positive samples [85]. Other remedies include increasing the quality and quantity of training data, using Non-Maximum Suppression (NMS), raising the threshold on the confidence score, using ensemble models [167], incorporating contextual information [33], [168], and fine-tuning pre-trained models.

Fig. 12. The system architecture of the YOLOS method [129]. The patch tokens (Pat-Tok) are the embeddings of the flattened image patches, the detection tokens (Det-Tok) are learnable embeddings for object detection, and PE refers to positional embedding. During training, an optimal bipartite matching between predictions from 100 detection tokens and ground-truth objects is produced. During inference, the final set of predictions is sent directly to the output in parallel.
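Among these remedies, greedy Non-Maximum Suppression is simple enough to sketch directly in pure Python (the boxes, scores, and IoU threshold below are illustrative):

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, then drop every
    remaining box that overlaps it by more than iou_thresh."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) <= iou_thresh]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # the near-duplicate of the first box is suppressed
```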
2) Low Processing Speed: Traffic surveillance video streams generally have moderate to high frame rates, typically 20-30 frames per second (fps) or more, to ensure that objects are captured with sufficient detail and at a high enough rate to accurately track their movements. The methods that are able to exceed such rates while detecting the objects in the incoming video frames are suitable for real-time applications. In general, single-stage detectors are more suitable for surveillance applications. For example, YOLO can process up to 60-120 fps on a high-end GPU, which leaves enough time for other video analytics tasks to be completed. Lightweight object detectors have also gained popularity in modern traffic surveillance systems [103], [117], [137]. Using smaller input sizes and frame sampling also helps improve the speed.
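As a rough sketch of frame sampling, a fixed stride can keep a slower detector real-time (the frame rates below are made-up examples, not benchmarks of any particular detector):

```python
import math

def frame_stride(stream_fps, detector_fps):
    """Process every n-th frame so the detector's throughput is never
    exceeded; the frames in between are skipped (frame sampling)."""
    if detector_fps >= stream_fps:
        return 1  # the detector keeps up with every frame
    return math.ceil(stream_fps / detector_fps)

# A 30 fps camera feed and a detector that manages roughly 12 fps:
stride = frame_stride(30, 12)      # process every 3rd frame
effective_fps = 30 / stride        # detections still arrive at 10 fps
```

In practice the skipped frames can still be covered by a cheap tracker, so object trajectories remain continuous between detector invocations.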
3) Various Object Scales and Aspect Ratios: In most traffic surveillance applications, the perspective view of the installed cameras imposes a wide variety of sizes and aspect ratios among the objects of interest. Studies have attempted different approaches to account for these variations. Yang et al. [84] replace the VGG-16 backbone of Faster R-CNN [46] with a multi-attention residual network (MA-ResNet) and use the different layers of this network for feature pyramid construction. Hu et al. [169] designed a scale-insensitive convolutional neural network to detect vehicles of various scales, achieved by using context-aware RoI pooling and a multi-branch decision network. Multi-scale training, pyramid feature maps, scale-aware architectures such as FPN or PANet, and anchor-free methods such as CenterNet are among the other techniques that help deal with varying scales.

Fig. 13. The class imbalance among common objects in four traffic surveillance datasets. The uneven distribution is mainly due to the nature of traffic scenes. This disproportion may lead to a bias towards the majority classes and a high false negative rate for the underrepresented categories.
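As an illustration of how anchor-based detectors cover different scales and aspect ratios, the following sketch generates anchor shapes of constant area per scale (the base size, scales, and ratios are illustrative assumptions, not the values used by any cited method):

```python
def make_anchors(base_size, scales, ratios):
    """Generate (w, h) anchor shapes: for each scale s the anchor area is
    fixed at (base_size * s) ** 2, while each ratio r sets w / h == r."""
    anchors = []
    for s in scales:
        area = (base_size * s) ** 2
        for r in ratios:
            w = (area * r) ** 0.5
            h = area / w  # keeps w * h == area and w / h == r
            anchors.append((w, h))
    return anchors

# 3 scales x 3 aspect ratios -> 9 anchor shapes per feature-map location
anchors = make_anchors(16, scales=(1, 2, 4), ratios=(0.5, 1.0, 2.0))
```

Tiling such a set at every location of a pyramid of feature maps is what lets a single detection head match both a distant motorcycle and a nearby bus.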
4) Data Limitation: Supervised object detection methods require a significant amount of training data to achieve good performance. Despite the availability of datasets that are well suited to traffic surveillance applications (see Section V), constructing a comprehensive dataset for appearance-based object detection methods to perform and generalize well across various traffic videos remains a challenge. On the other hand, unsupervised, semi-supervised, or active learning techniques are not yet able to achieve results that meet the needs of real-world systems. The data limitation issue is one of the main reasons why many real-world systems still rely on motion-based techniques to locate objects in surveillance videos.

5) Class Imbalance: In a typical traffic surveillance system, there is bias in the object categories. Compared to the majority classes, such as cars or trucks, the minority classes, such as bicycles or pedestrians, may be less frequently represented in the training data, leading to poor performance in detecting them. Figure 13 shows the class imbalance in some of the representative traffic video datasets. This could have serious consequences, such as missing critical events or misclassification in the minority categories. Oversampling the minority class or undersampling the majority class can help mitigate the impact of class imbalance to some degree. Some studies [177], [178] address the skewed dataset problem by data augmentation or introducing additional synthetic data to minority classes to reduce the bias.
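One simple mitigation can be sketched directly: weighting each class's contribution to the loss by its inverse frequency (the instance counts below are made up for illustration, not taken from any of the datasets in Figure 13):

```python
def inverse_frequency_weights(counts):
    """Give each class a loss weight inversely proportional to its
    frequency, normalized so the weights average to 1."""
    total = sum(counts.values())
    raw = {c: total / n for c, n in counts.items()}
    mean = sum(raw.values()) / len(raw)
    return {c: w / mean for c, w in raw.items()}

# Hypothetical instance counts from a skewed traffic dataset:
counts = {"car": 9000, "truck": 800, "pedestrian": 150, "bicycle": 50}
weights = inverse_frequency_weights(counts)
```

Errors on a rare bicycle instance are then penalized far more heavily than errors on an abundant car instance, nudging the detector away from majority-class bias.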
6) Occlusion and Diverse Viewing Perspectives: Occlusion can make it difficult for an object detection model to fully perceive an object and understand its shape, size, and features. Occlusion is a common problem in traffic videos, particularly during heavy traffic. While appearance-based methods are better at handling this issue than motion-based methods, they still have difficulty detecting objects properly when occluded. Several studies have attempted to solve the occlusion problem in traffic surveillance applications [65], [179], [180], [181]. Li et al. [182] fuse the prior information of the Kalman filter to address the occlusion problem during the tracking stage. Nguyen et al. [179] replace the non-maximum suppression (NMS) in the Faster R-CNN method with a soft NMS algorithm to diminish the effects of occlusion and truncation. Another study [180] attempts to improve the ability of Faster R-CNN to handle occlusions by integrating a part-aware region proposal network to account for local and global features among different vehicle attributes. Jin et al. [144] embed a deformable channel-wise column transformer between the backbone and the head of the YOLOv5 [75] network and use a novel asymmetric focal loss to incorporate column-wise occlusion information between vehicles and guide the network to focus more attention on the visible areas of partially occluded vehicles. Other studies have attempted to solve the occlusion problem by using 3D bounding boxes [90], [183], [184] or multi-camera tracking [181], [185]. A number of studies use different sensors, such as Light Detection and Ranging (LiDAR), to benefit from additional depth information [186], [187], [188]. However, these sensors are rarely deployed in the current transport infrastructure.
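The soft NMS idea mentioned above can be sketched with a linear score decay, a common variant in the literature (the boxes, scores, and thresholds are illustrative, not those of any cited method):

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    ua = ((a[2] - a[0]) * (a[3] - a[1])
          + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / ua if ua else 0.0

def soft_nms(boxes, scores, iou_thresh=0.3, score_thresh=0.001):
    """Linear soft NMS: instead of deleting a box that overlaps a
    higher-scoring one, scale its score by (1 - IoU), so heavily
    occluded but genuine objects can survive with a reduced score."""
    scores = list(scores)
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    result = []
    while order:
        i = order.pop(0)
        result.append((i, scores[i]))
        for j in order:
            ov = iou(boxes[i], boxes[j])
            if ov > iou_thresh:
                scores[j] *= 1.0 - ov
        order = [j for j in order if scores[j] > score_thresh]
        order.sort(key=lambda j: scores[j], reverse=True)
    return result

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
detections = soft_nms(boxes, scores)
```

Unlike hard NMS, the strongly overlapping second box is demoted rather than discarded, which is exactly the behavior that helps with partially occluded vehicles.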
7) Weather and Illumination Changes: Traffic videos are captured day and night under different weather conditions, and the object detectors are expected to be robust against these variations. Appearance-based methods are generally more robust to adverse weather conditions and illumination changes than motion-based techniques. However, the performance of these methods still suffers in challenging illumination situations or adverse weather conditions, such as rain, fog, snow, and sandstorms. Several approaches have been attempted to avoid the sizable performance drops when facing illumination variations [189], [190], [191], [192]. In [189], a traffic flow estimation system is proposed that deals with scenes where the cameras are tampered with by raindrops. Han and Du [190] designed a dual-input Faster R-CNN model that utilizes color and thermal images for detecting traffic objects in bad weather. Zhang et al. [191] designed a multi-camera vehicle representation that combines vehicle headlights and taillights to find vehicle contours and locate them in nighttime videos. Fu et al. [192] propose a detail-preserving unpaired domain transfer method using generative adversarial networks (GANs) to avoid the loss of accuracy in nighttime videos.

IV. COMMON PERFORMANCE MEASURES
Object detection techniques applied to traffic video analysis are typically assessed using four key metrics: precision, recall, F-measure, and frame rate. The first three metrics are used to evaluate the performance and accuracy of the object detection methods for each frame and are computed as follows:

PR = TP / (TP + FP)
RE = TP / (TP + FN)
F1 = 2 x PR x RE / (PR + RE)

where TP, TN, FP, and FN stand for True Positives, True Negatives, False Positives, and False Negatives, respectively, and PR, RE, and F1 denote precision, recall, and F1-score, correspondingly. Precision reports the fraction of predictions that are correct, while recall, a.k.a. sensitivity, measures the fraction of ground-truth objects that are correctly predicted. F-measure, also called F1-score, combines precision and recall by computing the harmonic mean of the two.
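A minimal sketch of these frame-level metrics (the counts are made up for illustration):

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall, and F1-score from raw detection counts."""
    pr = tp / (tp + fp) if tp + fp else 0.0   # correct among predictions
    re = tp / (tp + fn) if tp + fn else 0.0   # correct among ground truth
    f1 = 2 * pr * re / (pr + re) if pr + re else 0.0  # harmonic mean
    return pr, re, f1

# e.g. 80 correct detections, 20 false alarms, and 20 missed objects
pr, re, f1 = detection_metrics(tp=80, fp=20, fn=20)
```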
In motion segmentation methods, TP and FP are the numbers of pixels correctly and incorrectly reported as foreground, and TN and FN are the numbers of pixels correctly and incorrectly reported as background, respectively. These metrics are measured differently for appearance-based object detection methods. Precision is calculated from the Intersection over Union (IoU), which is the ratio of the overlapping area to the union area between the predicted and the ground-truth bounding boxes for each object. When the IoU ratio is above a predefined threshold, the detected object is counted as a TP; otherwise, it is counted as an FP. FN instances occur when there is an object present in the ground truth, but the model has failed to detect it. In order to compare the performance of different object detectors, the average precision is calculated over all images separately for each object category, and the mean average precision (mAP) of all categories is used as a single value to weigh methods against each other. For instance segmentation, IoU is calculated in a pixel-wise manner rather than using bounding boxes.
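As an illustration of how average precision aggregates ranked detections, the following computes non-interpolated AP for one category (the ranked TP/FP outcomes are made up; benchmark protocols such as PASCAL VOC and COCO additionally interpolate the precision envelope):

```python
def average_precision(is_tp, num_gt):
    """Non-interpolated AP: walk the detections in order of decreasing
    confidence and accumulate the precision at each recall step."""
    tp = fp = 0
    ap = 0.0
    for hit in is_tp:
        if hit:
            tp += 1
            ap += tp / (tp + fp) / num_gt  # precision at this TP's recall
        else:
            fp += 1
    return ap

# Ranked outcomes of five detections against three ground-truth objects:
ap = average_precision([True, False, True, True, False], num_gt=3)
```

The mAP reported in comparison tables is then simply the mean of such per-category AP values.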
Another important metric in evaluating the object detection methods in traffic surveillance systems is the frame rate, which reflects the system's ability to perform timely detection of objects. The rate at which the objects are detected in video frames is usually reported as frames per second (FPS) by computing the average number of video frames processed per second. A high frame rate is especially crucial in real-time surveillance systems where the video streams are monitored constantly.

V. AVAILABLE DATASETS AND FRAMEWORKS
Although some popular object detection datasets, such as PASCAL VOC [193] and COCO [59], can be used to detect various classes of common traffic-related objects, they are not specifically designed for surveillance applications. Table VI contains a list of publicly available datasets that are collected from traffic surveillance videos and can be used for the task of object detection. In addition, we have listed a number of popular video analytics frameworks that are good candidates for traffic surveillance systems.

A. The BGSLibrary
The BGSLibrary [216] is an open-source C++ framework that contains efficient implementations of state-of-the-art motion segmentation methods. It offers more than 43 background subtraction algorithms that can be used in video analytics applications, including traffic surveillance, and has been adopted for traffic surveillance in a number of studies [194], [217], [218], [219]. The source code is available under the GNU GPL v3 license, with a Java-based user interface that allows users to select the region of interest and the algorithm parameters. The library is built on OpenCV, is platform-independent, and is free of charge for academic and commercial use.

B. The LRSLibrary
The LRSLibrary [220] is another framework, implemented in MATLAB, for low-rank and sparse decomposition algorithms. The library is primarily designed for the task of moving object detection in videos and contains more than 100 state-of-the-art methods based on matrix and tensor factorization.

C. NVIDIA DeepStream SDK
The NVIDIA DeepStream Software Development Kit [221] is an accelerated framework that can be used to build real-time intelligent video analytics pipelines. This framework is scalable, multi-platform, encrypted by Transport Layer Security, and it can be deployed on-premises, in the cloud, or on edge servers. The increased efficiency in object detection, classification, and instance segmentation models enables the simultaneous processing of multiple video streams even on edge devices. Therefore, this framework is a good candidate for modern traffic surveillance systems as it supports state-of-the-art object detection and instance segmentation models, such as Faster R-CNN, YOLO, SSD, and Mask R-CNN, as well as advanced multiple object tracking techniques, including the Discriminative Correlation Filter (NvDCF), DeepSort, and the IOU tracker. Although this SDK is closed-source, developers have the option to add implementations in C, C++, or Python, integrate custom libraries, and deploy models for inference in native frameworks such as TensorFlow and PyTorch. This framework has been used for traffic surveillance applications in a number of recent studies [176], [222], [223], [224].

D. Intel DL Streamer
Intel Deep Learning Streamer [225] is an open-source framework with C++ and Python APIs for creating streaming media analytics pipelines on cloud or edge servers. This framework uses pre-trained models from the OpenVINO model zoo that are optimized for Intel hardware platforms for high efficiency. Object detection models, such as SSD, YOLO, Faster R-CNN, ResNet, and MobileNet, can be applied to input video streams for different use cases, including vehicle and pedestrian tracking. This framework is currently used by many leading solutions for various purposes. Intel provides an intelligent traffic management system [226] based on DL Streamer that is designed to detect and track vehicles and pedestrians, detect collisions, and estimate a safety metric for intersections.

E. Azure Percept
Azure Percept [227] is a comprehensive platform for building edge AI solutions to develop real-time video and audio analytics. By default, a Single Shot Detector (SSD) model [69] trained on the COCO dataset [59] is used for the task of object detection. However, there are a number of other pre-trained and customizable object detection models such as YOLO and Faster R-CNN that can also be used to efficiently locate the objects of interest such as vehicles and pedestrians in video streams. This platform is a good candidate for traffic video analytics due to its efficiency and easy-to-use components.

VI. FUTURE SCOPE
On the whole, an intelligent traffic monitoring system should concomitantly be accurate, responsive, and generalizable. The essential need for developing accurate, robust, and efficient algorithms for locating the objects of interest in traffic surveillance videos opens new horizons and prospects for future research studies. This section briefly discusses the main areas of focus for future research and developments in object detection for intelligent traffic video analytics systems.

A. Multi-Modal Data Fusion
The robustness and accuracy of object detection in different lighting and weather conditions, as well as in challenging scenarios such as occlusions can be improved by combining information from various sensors. In spite of the efforts to integrate other sensors, such as thermal cameras [190], [228] and LiDAR [186], [187], they are not commonly used in real-world systems due to the additional costs. Traffic monitoring systems based on the Internet of Things (IoT) [229] can use a combination of cameras, LiDAR, radar, and ultrasonic sensors to improve performance.

B. Edge Computing
There have been numerous studies attempting to increase the efficiency of appearance-based object detection methods for surveillance applications on edge computing platforms [9], [230], [231], [232], [233], [234], [235]. The deployment and maintenance of edge-cloud systems come with additional costs, such as cloud storage, data transmission, and data processing fees. Integrating edge GPU technologies on the surveillance cameras allows them to perform analytics on the device rather than having to send the footage to a remote server. Enabling lightweight object detection algorithms to achieve high performance on edge computing platforms is still an open challenge that requires substantial research effort.

C. Adapting to the Low-Resolution Videos
Deep learning techniques are primarily designed to work with high-resolution images and yet real-world surveillance footage consists mostly of low-resolution videos with low frame rates. To fully capitalize on the benefits of appearance-based techniques in current surveillance systems, there should be a greater focus on research and development in low-resolution video analysis.

D. Domain Adaptation and Transfer Learning
Given the high variety in traffic surveillance videos in terms of illumination condition, resolution, viewing angle, viewing distance, frame rate, and weather conditions, object detection methods should be adaptable to different scenarios by leveraging domain adaptation [236] and transfer learning [237] techniques.

E. Employing Both Temporal and Appearance Methods
Combining motion segmentation with appearance-based techniques can enhance detection performance and fully leverage the benefits of these methods in surveillance systems. Motion segmentation can be used in conjunction with appearance-based object detection methods to decrease the number of false positives and lower the required computations. Despite the potential benefits, relatively few studies have explored the combination of motion-based and appearance-based methods in surveillance systems [33], [137], [238], [239] and further investigations can yield better advancements.
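A minimal sketch of one such combination, assuming a binary motion mask produced by a background subtraction stage: appearance-based detections are kept only if they sufficiently overlap moving regions (the mask, boxes, and overlap threshold are illustrative assumptions):

```python
def motion_gate(boxes, motion_mask, min_overlap=0.2):
    """Keep a detection only if enough of its (x1, y1, x2, y2) box covers
    foreground (moving) pixels, filtering static false positives such as
    parked clutter or vehicle imagery on billboards."""
    kept = []
    for x1, y1, x2, y2 in boxes:
        area = (x2 - x1) * (y2 - y1)
        moving = sum(motion_mask[y][x]
                     for y in range(y1, y2) for x in range(x1, x2))
        if area and moving / area >= min_overlap:
            kept.append((x1, y1, x2, y2))
    return kept

# A 6x6 binary mask with a moving blob in the top-left corner:
mask = [[1 if x < 3 and y < 3 else 0 for x in range(6)] for y in range(6)]
kept = motion_gate([(0, 0, 3, 3), (3, 3, 6, 6)], mask)
```

The same gate can be applied in reverse, running the expensive detector only on regions flagged by the motion stage to cut computation.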

F. Dealing With Imbalanced Data
Due to the characteristics of traffic surveillance, a disproportionate number of instances may exist for certain object categories (see Figure 13). Developing new techniques to address class imbalance can enhance performance for underrepresented classes [177], [178]. Designing a unified comprehensive dataset for traffic surveillance that is balanced and contains sufficient instances of various resolutions, illuminations, and weather conditions would be beneficial.

G. Adapting to the New Techniques
Many current traffic surveillance systems still employ older methods for object detection. Due to aging infrastructure, computational limits, costly equipment, and insufficient training data for deep learning models, many of these systems rely solely on motion segmentation for object detection. The application of modern object detection techniques, including transformers, lightweight detectors, and video object detection, is expected to enhance the performance of traffic video analytics systems in the near future as research efforts continue.

H. Weakly Supervised/Unsupervised Detection
Unsupervised or few-shot object detection reduces the amount of labeled data required for training. This can save considerable time and resources needed for manual annotations [240]. In addition, these techniques make the object detection models more robust to changes in appearance and environmental conditions, which are common in traffic surveillance. Furthermore, unsupervised object detection helps with detecting new classes of objects.

I. 3D Object Detection
3D object detection can help traffic surveillance systems to have a better understanding of the traffic scene, which can be useful in GIS mapping, overcoming the occlusion and clutter issues, and providing a more comprehensive understanding of the scene. It can also be used in combination with other sensor modalities, such as thermal cameras [190], [228] or LiDAR [186], [187], to improve the performance in challenging environments, such as in poor visibility or adverse weather conditions.

VII. CONCLUSION
The survey presents an overview of object detection techniques used in monocular traffic surveillance applications. The methods are grouped into two categories, motion-based and appearance-based, and representative studies are summarized along with related applications. Popular datasets for training and validating these methods are also discussed. Furthermore, the main challenges in designing object detection algorithms for traffic surveillance applications are outlined. Object detection methods based on motion segmentation do not require extra computational resources and are well suited to highway and road traffic monitoring. On the other hand, appearance-based methods using deep learning are suitable for most surveillance applications including urban intersections but require large training data and extra computational resources. The evolution of deep learning has had a significant impact on modern traffic surveillance, but there are still unsolved challenging issues and further research efforts are required to improve the current algorithms in terms of performance, generalizability, and efficiency.

Chengjun Liu is currently a Professor of computer science and the Director of the Face Recognition and Video Processing Laboratory, New Jersey Institute of Technology, Newark, NJ, USA. He has developed the evolutionary pursuit method, the enhanced Fisher models, the Gabor Fisher classifier, the Bayesian discriminating features method, the kernel Fisher analysis method, new color models, new image descriptors, and new similarity measures. His current research interests include image and video analysis, pattern recognition, and machine learning.