Improving Small Object Detection Using Transformers

General artificial intelligence involves a trade-off between the inductive bias of an algorithm and its out-of-distribution generalization performance. The conspicuous impact of inductive bias is evident in the unceasing trend of improved predictions across computer vision problems such as object detection. Although a recently introduced transformer-based object detection technique (DETR) shows results competitive with conventional and modern object detection models, its accuracy deteriorates when detecting small-sized objects (in perspective). This study examines the inductive bias of DETR and proposes a normalized inductive bias for object detection using a transformer (SOF-DETR). It uses a lazy fusion of features to sustain deep contextual information about the objects present in the image. Features from multiple subsequent deep layers are fused by element-wise summation and input to a transformer network, whose object queries learn long- and short-distance spatial associations in the image through the attention mechanism. SOF-DETR uses global set-based prediction for object detection, which directly produces a set of bounding boxes. Experimental results on the MS COCO dataset show the effectiveness of the added normalized inductive bias and feature-fusion techniques: SOF-DETR detects more small-sized objects than DETR.


I. INTRODUCTION
Object detection is one of the long-standing research topics in computer vision (CV), and new challenges surface with technology updates and artificial intelligence (AI) ventures. New applications demand rigorous, accurate, and efficient detection of all objects of interest in an image. AI, propelled by advances in deep learning (DL), generally benefits from the hard inductive bias of convolutional neural networks (CNNs) with a region proposal network (RPN) for two-stage object detection [1]-[4]. Later, single-shot object detection with anchors efficiently produced competitive results [5]-[7]. Despite outstanding results, the RPN generates highly overlapping region proposals for the same set of objects, demanding handcrafted post-processing such as Non-Maximum Suppression (NMS) based on Intersection-over-Union (IoU) [4]-[6]. The recently introduced transformer architecture has shown outstanding performance in natural language processing [8] and has led researchers to apply it in several fields, such as image captioning [9], speech processing [10], and even biomedical imaging [11]. It has inspired researchers in computer vision as well [12]-[14], and recently a new architecture, detection using the transformer (DETR), was proposed [13]. This architecture efficiently outputs bounding boxes and classes per image in parallel using a self-attention mechanism. DETR eliminates the need for post-processing of detections by using bipartite matching, which removes redundancy in proposals. DETR shows competitive results among conventional and modern object detection models.
1 School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology (GIST), Gwangju, 61005, South Korea; 2 Threat Intelligence Team, Monitorapp, Seoul, South Korea; † represents the corresponding author.
Nevertheless, detecting small-sized objects remains challenging compared with medium- and large-sized objects, owing to factors such as the inadequate resolution of these objects and the fact that training datasets contain fewer images with small-sized objects than with objects of other sizes. This is reflected in the overall performance degradation of these detectors. The present work attempts to address such problems in the object detection task by using transformers and exploiting inductive biases. Generally, artificial neural networks are known for the inductive bias exposited through their composition, whether convolutional [12], [15] or sequential [16]. Convolutional neural networks exploit hard inductive bias through diverse compositions of network layers and the hyper-parameters of connected layers, which yields performance improvements. Therefore, in this work, we propose a model, SOF-DETR, that leverages such inductive biases in transformer-based detection to improve small-sized object detection. To provide extensive features of small-sized objects without losing information about other objects, a normalized inductive bias is included in our proposed model, SOF-DETR. The normalized inductive bias includes a fusion technique. Fusion techniques are commonly used to combine multi-sensor or multi-source data [17]; in our study, we use fusion to join different feature layers. Multiple normalized convolutional layers are fused to give the model a hard-encoded inductive bias that favors small-sized objects. These advanced deep features then progress through the transformer network as object queries and learn associations among the objects present in the image via the attention mechanism of the transformer. Furthermore, similar to previous studies, SOF-DETR predicts a unique bounding box for each object by utilizing a global set-based cost function with a bipartite matching technique.
Contributions: This study contributes to the object detection field as follows. First, we introduce a normalized inductive bias for detection using a transformer to obtain extensive features from small-sized objects. Second, this normalized inductive bias, utilizing a fusion of multiple layers, also supports generating more focused self-attention maps. Our extensive experiments demonstrate the effectiveness of all the modules. A detailed explanation of SOF-DETR, along with the experiments, is given in the following sections.

II. METHOD
The architecture of the proposed model, SOF-DETR, is depicted in Fig. 1. SOF-DETR introduces a new composition of neural networks in the transformer-based detection model (DETR). It channelizes the hard inductive bias of a CNN to the soft inductive bias of a transformer by interleaving a normalized inductive bias on a fused feature map from different convolutional blocks. The model consists of three modules: a convolutional backbone, normalized inductive bias, and transformer detection. For unique detection, this study follows set-based object detection [18] to associate a unique bounding box with each object. The composition of each module is given in the following sections.

A. Convolutional Backbone
Extracting valuable features of objects from a given image is one of the crucial tasks in object detection. The extracted features are carried through the transformer architecture as queries, keys, and values and learn context from the entire image. In this study, we utilize the standard convolutional model ResNet [19] as the backbone of SOF-DETR for extracting deep structural and contextual features of all objects in the image. The input image I has dimensions 3 × H × W, where H and W are the height and width of the image, respectively. We experiment with ResNet-50 as the backbone.

B. Normalized Inductive Bias
DETR employs the hard inductive bias of a CNN for feature extraction and the soft inductive bias of a transformer to learn relationships. Challenges remain in small-sized object detection: the inadequate resolution of such objects causes indispensable features to be lost before further processing, which leads to poor attention maps for these objects. Consequently, to overcome this challenge in transformer-based detection, our proposed model, SOF-DETR, channelizes the hard inductive bias and normalizes it before passing it to the soft inductive bias of the transformer encoder. The normalized hard inductive bias fuses the features extracted at different convolution blocks so that the essential features of small-sized objects remain intact for further processing. It accepts the output of the last two blocks (layers 3 and 4) of the backbone, down-projects the high-level (layer 4) features using a de-convolutional layer of size 1024, and transfers them through a convolutional layer of size 1024 (concatenating more layers decreases the overall performance). The features are fused after group normalization using element-wise summation, as given in equation (1):

F_c = F_L(I) + F_{L-1}(I)    (1)
where F stands for the features extracted from an image I at layer L or L − 1 (after group normalization), and F_c stands for the combined features. These combined features are then passed through a ReLU layer, followed by a convolutional layer of size 1024, to extract deep features from the combined high-level and low-level features. Concatenating these features did not improve the model; therefore, we use element-wise summation of the low-level and high-level features. These contextual features, carrying the inductive bias, are then transferred to the transformer module for further processing.
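The fusion described above can be sketched as a small PyTorch module. This is our reading of the text, not the authors' code: the kernel sizes and the group count for GroupNorm are assumptions, since the paper only specifies the channel width (1024) of the de-convolution and convolution layers.

```python
import torch
import torch.nn as nn

class NormalizedFusion(nn.Module):
    """Sketch of the lazy feature fusion: project layer-4 features up to the
    layer-3 resolution, group-normalize both, sum element-wise (Eq. (1)),
    then extract deep features with ReLU + a 1024-channel convolution."""
    def __init__(self):
        super().__init__()
        # High-level (layer 4) path: 2048 -> 1024 channels at 2x resolution.
        self.deconv = nn.ConvTranspose2d(2048, 1024, kernel_size=2, stride=2)
        self.proj = nn.Conv2d(1024, 1024, kernel_size=3, padding=1)
        self.gn_low = nn.GroupNorm(32, 1024)   # group count is an assumption
        self.gn_high = nn.GroupNorm(32, 1024)
        self.out_conv = nn.Conv2d(1024, 1024, kernel_size=3, padding=1)

    def forward(self, f_low, f_high):
        f_high = self.proj(self.deconv(f_high))            # match layer-3 shape
        fused = self.gn_low(f_low) + self.gn_high(f_high)  # Eq. (1)
        return self.out_conv(torch.relu(fused))

fusion = NormalizedFusion()
f3 = torch.randn(1, 1024, 16, 16)  # layer-3 features (stride 16)
f4 = torch.randn(1, 2048, 8, 8)    # layer-4 features (stride 32)
out = fusion(f3, f4)
print(out.shape)  # torch.Size([1, 1024, 16, 16])
```

The fused map keeps the layer-3 spatial resolution, which is what preserves the finer detail needed for small-sized objects.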

C. Transformer Detection
DETR [13] is a recently introduced deep neural network inspired by the transformer's accomplishments in the language domain. The transformer architecture for object detection consists of a simple encoder-decoder architecture and a feed-forward network (FFN) for the final predictions. Each encoder layer consists of standard multi-head self-attention and FFN modules [8] that correlate the queries with keys and values. Positional encoding is added to suit image data [20]. The encoded features of N objects of dimension n are passed through the decoder layers, where N is greater than the maximum number of objects present in an image. The decoder layers also follow the standard transformer architecture: the N embedded features generate attention maps using multi-head self-attention and encoder-decoder attention mechanisms. SOF-DETR follows the DETR decoder, where all N object embeddings are passed together as queries and the output for each object is generated simultaneously as decoder embeddings. Finally, each decoder embedding is independently decoded into bounding-box coordinates and a class label: an FFN consisting of three fully-connected layers with ReLU activations predicts N objects with normalized center coordinates, height, and width, while a linear layer provides the class of each object.
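The prediction heads at the end of this pipeline can be sketched as below. The dimensions (100 queries, 256-dimensional embeddings, 91 classes) are illustrative DETR-style defaults, not values confirmed by the paper.

```python
import torch
import torch.nn as nn

N_QUERIES, D_MODEL, N_CLASSES = 100, 256, 91  # illustrative assumptions

class PredictionHeads(nn.Module):
    """Three-layer FFN with ReLU for boxes, linear layer for classes."""
    def __init__(self):
        super().__init__()
        self.bbox_mlp = nn.Sequential(
            nn.Linear(D_MODEL, D_MODEL), nn.ReLU(),
            nn.Linear(D_MODEL, D_MODEL), nn.ReLU(),
            nn.Linear(D_MODEL, 4),
        )
        self.class_head = nn.Linear(D_MODEL, N_CLASSES + 1)  # +1 for "no object"

    def forward(self, decoder_out):
        # Sigmoid keeps (cx, cy, w, h) normalized to [0, 1].
        boxes = self.bbox_mlp(decoder_out).sigmoid()
        logits = self.class_head(decoder_out)
        return boxes, logits

heads = PredictionHeads()
dec = torch.randn(1, N_QUERIES, D_MODEL)  # embeddings for the N object queries
boxes, logits = heads(dec)
print(boxes.shape, logits.shape)
```

Every query produces one (box, class) pair in parallel; the "no object" class plays the role of φ in the set-based matching below.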
Following the studies [13], [18], we utilize a set-based loss function to train the network and to assign a unique bounding box to each object. A fixed-size set of N embedded outputs is generated by the decoder, where N is significantly larger than the maximum number of objects present in an image. If the ground truth for image I contains o objects, we pad the ground-truth set with φ (no object) to make it equal in size to the set of N detected objects, d = {d_i}_{i=1}^{N}. We then find the best permutation ϑ̂ over all permutations ϑ of N elements of the two sets o and d that gives the lowest total bipartite-matching cost:

ϑ̂ = arg min_ϑ Σ_{i=1}^{N} C_bm(o_i, d_{ϑ(i)})    (2)

where C_bm is the element-wise bipartite-matching cost, computed with the Hungarian algorithm as in [18], between elements of the ground-truth set o and the predictions d indexed by ϑ(i). The ground-truth set o has elements o_i = (c_i, bbox_i), where c_i and bbox_i are the class and bounding box of the i-th object. Similarly, the prediction for the i-th object is p_i = (ĉ_i, b̂box_i), with predicted class ĉ_i at index ϑ(i), probability p̂_{ϑ(i)}(c_i), and bounding box b̂box_i. As in [13], [18], this bipartite matching considers both the class predictions and the bounding-box predictions. The loss function, a Hungarian loss over all matched pairs of predicted classes and bounding boxes, is calculated as in [13], [18]; class imbalance is handled as in [4]. The bounding-box loss, C_bbox, combines the l_1 loss and the generalized IoU loss, L_IoU, similar to the work [21]. The loss function and a shared layer-norm are applied to each decoder layer.
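The matching step in equation (2) can be illustrated with a toy example: the Hungarian algorithm selects the permutation of predictions that minimizes the total cost C_bm against the ground-truth set. The cost values here are made up for the example.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# cost[i][j] = C_bm(o_i, d_j): cost of matching ground-truth i to prediction j.
cost = np.array([
    [0.1, 0.9, 0.8],
    [0.7, 0.2, 0.9],
    [0.8, 0.9, 0.3],
])

# linear_sum_assignment implements the Hungarian algorithm: it returns the
# row/column indices of the minimum-total-cost one-to-one assignment.
gt_idx, pred_idx = linear_sum_assignment(cost)
print(pred_idx, cost[gt_idx, pred_idx].sum())
```

Because the assignment is one-to-one, each ground-truth object is matched to exactly one prediction, which is what removes the need for NMS-style post-processing.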
Our model, SOF-DETR, is trained using this loss function and predicts a unique bounding box and class for each object present in the image.

III. EXPERIMENTS

A. Dataset
We have experimented with and evaluated SOF-DETR on the MS COCO object detection dataset (2017), which explicitly categorizes small-sized objects. This dataset comprises 118,000 training and 5,000 validation images and contains more small-sized objects than large- and medium-sized ones: almost 41% of the objects are small-sized, 34% are medium-sized, and 24% are large-sized. In this study, object sizes follow the MS COCO annotations: a labeled object whose bounding box has an area of less than 32^2 pixels is small-sized, an area between 32^2 and 96^2 is medium-sized, and an area greater than 96^2 is large-sized. A single training image contains at least 7 and at most 63 labeled objects. For occluded objects, the area is computed from the partially labeled bounding box. We use data-augmentation techniques such as scaling, random cropping with a probability of 0.5, and resizing all images to 800 × 800.
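The size convention above can be stated as a small helper. The function name is ours, but the thresholds (32^2 and 96^2 pixels) follow the MS COCO evaluation protocol as described.

```python
def coco_size_category(area: float) -> str:
    """Classify a bounding-box area (in pixels) by the MS COCO size bands."""
    if area < 32 ** 2:       # below 1024 px: small-sized
        return "small"
    if area < 96 ** 2:       # 1024 to 9216 px: medium-sized
        return "medium"
    return "large"           # above 9216 px: large-sized

print(coco_size_category(30 * 30),
      coco_size_category(64 * 64),
      coco_size_category(128 * 128))
```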

B. Evaluation Metrics
We have evaluated the performance of SOF-DETR using the standard evaluation metric mAP (mean average precision) of bounding boxes, averaged over IoU thresholds ∈ [0.5 : 0.05 : 0.95] for all detections. Moreover, we have also compared the proposed model's performance using conventional precision, recall, and F1 scores @0.5 for small-, medium-, and large-sized objects. Furthermore, the PR curve (precision-recall curve) is also utilized for the performance analysis.
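For concreteness, the IoU underlying these metrics and the threshold grid over which mAP is averaged can be sketched as below; helper names are ours for illustration.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# The grid [0.5 : 0.05 : 0.95]: AP is computed at each threshold and averaged.
thresholds = [0.5 + 0.05 * i for i in range(10)]

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # overlapping boxes -> 1/7
```

A detection counts as a true positive at a given threshold only if its IoU with a matched ground-truth box meets that threshold, so higher thresholds reward tighter localization.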

C. Implementation Details
The training of SOF-DETR is similar to that of DETR: it uses an AdamW optimizer with an initial learning rate of 10^(-4). The learning rate for the backbone (ResNet-50) is 10^(-5), and the weight decay is 10^(-4). SOF-DETR uses a dropout of 0.1 and Xavier initialization [22] for the starting weights. We experimented with SOF-DETR without any additional dilation layer [23]; future work can extend this. We report results after 500 epochs of training for the overall evaluation, with the learning rate dropped by a factor of 10 after 300 epochs. SOF-DETR is trained on eight NVIDIA V100 GPUs on a single node with a batch size of 3 per GPU (which performed better than batch sizes of 2, 4, and 6) and tested on a single GPU on a single node with a batch size of 2.
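The optimizer setup above (a lower learning rate for the backbone than for the rest of the model, with a step drop after 300 epochs) can be sketched as follows. The tiny modules here are stand-ins for the real ResNet-50 backbone and transformer, used only to make the snippet self-contained.

```python
import torch
import torch.nn as nn

backbone = nn.Linear(8, 8)     # stand-in for the ResNet-50 backbone
transformer = nn.Linear(8, 8)  # stand-in for the transformer detector

# Two parameter groups: 1e-4 for the detector, 1e-5 for the backbone,
# with weight decay 1e-4, as described in the implementation details.
optimizer = torch.optim.AdamW(
    [
        {"params": transformer.parameters(), "lr": 1e-4},
        {"params": backbone.parameters(), "lr": 1e-5},
    ],
    weight_decay=1e-4,
)

# Drop both learning rates by a factor of 10 after 300 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=300, gamma=0.1)

print([group["lr"] for group in optimizer.param_groups])
```

Keeping the backbone learning rate an order of magnitude lower is the standard DETR recipe for fine-tuning a pre-initialized CNN under a freshly initialized transformer.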
D. Performance Analysis of SOF-DETR
1) Quantitative Analysis: We have compared SOF-DETR with the DETR [13] and Faster R-CNN [4] algorithms. Table I shows the detailed quantitative results. We have also compared against the Faster R-CNN results reported in the DETR paper after fine-tuning [13], as well as pre-trained Faster R-CNN without any fine-tuning [4]. SOF-DETR outperforms DETR with a 0.7% improvement in overall mAP, and 1.2%, 0.1%, and 0.4% improvements in detecting small-, medium-, and large-sized objects, respectively.
The results indicate that the normalized inductive bias in SOF-DETR improves the detection of small-sized objects without affecting the detection of medium- and large-sized objects.
To emphasize the efficacy of our proposed model in small-sized object detection, we have also presented a quantitative analysis of SOF-DETR using the AP@0.5 metric and the PR curve, as shown in Fig. 4 and Fig. 5, respectively. In the case of class imbalance, the PR curve presents a more reliable comparison. These curves display the correlation between precision (positive predictive value) and recall (sensitivity) for every attainable cut-off value; a curve lying above another exhibits better performance, and Fig. 4 indicates that SOF-DETR outperforms the other algorithms in most cases.
2) Qualitative Analysis: We have shown qualitative results of the proposed technique in Fig. 2 and Fig. 3, which also depict DETR results for a comparative study. It is evident in Fig. 2 that SOF-DETR detects small-sized objects, with high confidence scores, that DETR misses. Furthermore, SOF-DETR produces higher confidence scores for other small-sized objects than DETR does. Fig. 2 also signifies higher confidence in predicting small-sized objects without affecting the performance on medium- and large-sized objects. Another perceptible trace of the normalized inductive bias introduced in this study is Fig. 3, which compares the self-attention activations of SOF-DETR with the self-attention maps of DETR. The SOF-DETR attention maps are evidently more vibrant for small-sized objects, as well as for medium- and large-sized objects.

IV. CONCLUSION
This study improves the hard inductive bias of DETR for small-sized object detection without affecting the performance on medium- and large-sized objects. The normalized inductive bias is introduced using a lazy fusion of feature maps before they are passed to the transformer layers of our proposed technique, SOF-DETR. The proposed technique shows higher confidence scores for detected small-sized objects and overall better performance than DETR. Future studies can explore other directions for improving inductive biases for small objects, such as introducing a penalty on object sizes in the loss function, which could be applied in parallel with the normalized inductive bias. Future work can also include the study of such inductive biases in segmentation, panoptic segmentation, and instance segmentation.