3D Bounding Box Detection in Volumetric Medical Image Data: A Systematic Literature Review

This paper discusses current methods and trends for 3D bounding box detection in volumetric medical image data. For this purpose, an overview of relevant papers from recent years is given. 2D and 3D implementations are discussed and compared. Multiple identified approaches for localizing anatomical structures are presented. The results show that most recent research focuses on Deep Learning methods, such as Convolutional Neural Networks, rather than on methods with manual feature engineering, e.g. Random Regression Forests. The presented overview of bounding box detection options helps researchers to select the most promising approach for their target objects.


I. INTRODUCTION
The extraction of a Volume of Interest (VOI) is an important pre-processing step in computer-based diagnosis. Tasks such as organ segmentation or classification of malignant tumors usually require a prior localization of the corresponding organ or structure. Especially the semantic segmentation of small organs benefits from a preceding localization. By limiting the data to be examined to a VOI, it is ensured that only relevant areas need to be processed, and the computing and memory effort is reduced. For instance, in the field of intervention training and planning, 4D Virtual Reality (VR) simulations require realistic 3D patient organ models in order to be an adequate preparation for training and planning medical procedures [1], [2]. The automatic reconstruction of such 3D organ models benefits from the localization of a VOI, as it excludes irrelevant regions and thereby makes the segmentation of the relevant structures easier and more efficient.
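As a minimal illustration of this pre-processing step (the helper name and margin parameter are our own, not taken from any cited work), cropping a volume to a VOI given a two-corner BB is plain array slicing:

```python
import numpy as np

def crop_to_voi(volume, p_min, p_max, margin=2):
    """Crop a 3D volume to the VOI given by a two-corner BB.

    p_min/p_max are (z, y, x) voxel coordinates; a small margin is
    added so the subsequent segmentation step retains some context.
    """
    lo = np.maximum(np.array(p_min) - margin, 0)
    hi = np.minimum(np.array(p_max) + margin + 1, volume.shape)
    return volume[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]]

volume = np.zeros((64, 64, 64))
voi = crop_to_voi(volume, (10, 12, 14), (20, 22, 24))
print(voi.shape)  # (15, 15, 15)
```

The clamping against the volume shape keeps the crop valid even when the BB (plus margin) touches the image border.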
In this paper we review 3D Bounding Box (BB) detection in volumetric medical image data. Such data is generated by imaging procedures such as CT (Computerized Tomography), MRI (Magnetic Resonance Imaging), PET (Positron Emission Tomography), US (Ultrasound) and HFU (High Frequency Ultrasound), to name just a few. We focus only on recently published papers (last five years) to capture current trends and developments.

II. METHODOLOGY
The papers of interest deal with methods to detect 3D BBs around targets in volumetric medical image data. We therefore used search terms containing "3D Bounding Box" AND "localization" AND medical -vehicle -"point cloud" (excluding the terms "vehicle" and "point cloud") to find relevant papers in public databases and digital libraries. The platforms searched were IEEE Xplore, ACM, Springer, Google Scholar and Web of Science (WoS). The search was limited to publications from 2015 to 2020. All papers selected for this review are written in English and have been published internationally. By abstract screening, a total of 31 papers were selected.

III. 3D BOUNDING BOX REPRESENTATIONS
A 3D BB describes a cuboid object in 3D space. 3D BBs can be represented in different ways. Two common kinds are the centroid and the two-corner representations, as seen in Fig. 1. The former defines the center coordinates and the height, width and length of the BB. In the latter case the BB is defined by two opposite corners, e.g. the minimum and the maximum coordinate points.

Fig. 1. Possible BB representations: using centroids (left) or two opposite corners (right).
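Converting between the two representations is straightforward; a minimal sketch (helper names are ours, for illustration only):

```python
import numpy as np

def corners_to_centroid(p_min, p_max):
    """Two opposite corners -> (center, size) representation."""
    p_min, p_max = np.asarray(p_min, float), np.asarray(p_max, float)
    size = p_max - p_min
    center = p_min + size / 2.0
    return center, size

def centroid_to_corners(center, size):
    """(center, size) -> minimum and maximum corner points."""
    center, size = np.asarray(center, float), np.asarray(size, float)
    half = size / 2.0
    return center - half, center + half

center, size = corners_to_centroid((0, 0, 0), (4, 2, 6))
print(center, size)  # [2. 1. 3.] [4. 2. 6.]
```

Since the mapping is a bijection, a model may be trained on whichever parameterization is more convenient and its output converted afterwards.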

IV. 2D VS. 3D IMPLEMENTATION
In the past, a popular approach was to train a model using handcrafted features. In 2010, Criminisi [3] proposed Random Regression Forests (RRF) to localize target structures in 3D volumes. Unlike traditional approaches, modern Deep Learning methods like Convolutional Neural Networks (CNN) do not have to rely on handcrafted features, but benefit from automated feature extraction. In recent years the focus has clearly shifted towards Deep Learning.
The implementation of solutions for finding 3D BBs for target structures in volumetric data can be performed in 2D or 3D. While a 3D implementation takes the whole volume into account, a 2D implementation distinguishes between three orthogonal image planes. These planes are shown in Fig. 2 as red (sagittal), blue (coronal) and green (axial) outlined rectangles. In Fig. 2, a 3D BB is then constructed by shifting the colored planes (plane/outside normals pointing away from the patient) around a structure of the human body, e.g. the head.

A. 3D Implementation
Although comparisons have shown that 3D approaches generally deliver better results [23], [24], [25], they still come at a cost. The processing in 3D manner requires far more computational resources. The advantage of capturing spatial information in all dimensions goes hand in hand with higher memory demand and required computing power. Furthermore, 3D training data is often not available to the same extent as 2D training data.

B. 2D and 2.5D Implementations
The 2D implementation approach treats 3D localization as a 2D problem. The volumetric data is therefore examined slice-wise in one of the three orthogonal image planes (i.e. sagittal, coronal and axial). The 3D image is thus treated as a stack of several 2D images. A common approach is to use a single 2D CNN or a combination (2.5D) of several (usually three) 2D CNNs for slice-wise detection in either one or all three orthogonal viewing plane directions. A single 2D CNN can be implemented to analyze exactly one of the three image plane stacks [27], [28], [29], [30]. Adjacent slices as additional channels [31] or dimensions [32] help to capture contextual information. Another possibility is to analyze all three image planes by using a single 2D CNN three times [33], [34], [26], [4] or three separate 2D CNNs, one per plane [35], [36], [37], [38], [39]. Adjacent slices and separate CNNs can also be used in combination [40].
After one or more 2D models have processed the data for multiple slicing directions, the results still have to be combined to create a 3D BB. This can be done by means of majority voting, as seen in Fig. 3. In the illustrated workflow, the 3D input image is sliced in all three viewing plane directions. A single 2D model processes the input for each direction separately. The outputs are three different BBs for the target structure. The coordinates of the BBs are evaluated together and a majority vote determines the final BB.
The advantage of a 2D compared to a 3D approach is the lower memory consumption and the larger amount of training data that results from splitting the 3D images into stacks of several slices. A disadvantage is that context information is usually lost. Furthermore, the results of all slices must be assembled to form a cuboid BB, which is further complicated by spatial discontinuities between the slices, as seen in Fig. 4. In a 3D detection the image is viewed as a whole. The resulting BB therefore seamlessly encloses the target structure. The problem with 2D detection is that the 3D image is broken down into individual sectional images and BBs are determined individually for each image.
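A minimal sketch of such a fusion step (the helper name and the per-coordinate median vote are our own simplification, not taken from any of the cited papers): three per-direction candidate BBs can be combined as follows:

```python
import numpy as np

def fuse_boxes(boxes):
    """Fuse candidate 3D BBs (one per viewing direction) into a final BB.

    Each box is (p_min, p_max) in voxel coordinates.  A per-coordinate
    median acts as a simple majority-style vote among the candidates,
    so a single outlier coordinate cannot shift the result.
    """
    mins = np.array([b[0] for b in boxes], float)
    maxs = np.array([b[1] for b in boxes], float)
    return np.median(mins, axis=0), np.median(maxs, axis=0)

candidates = [((10, 10, 10), (20, 20, 20)),
              ((11, 10, 9),  (21, 20, 19)),
              ((10, 12, 10), (20, 22, 20))]
p_min, p_max = fuse_boxes(candidates)
print(p_min, p_max)  # [10. 10. 10.] [20. 20. 20.]
```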

V. APPROACHES
The following approaches for 3D BB detection in volumetric medical image data have been identified amongst the investigated papers.

A. Slice Wise Box Detection
This approach simply detects the presence of the target structure in every slice. The results for each orthogonal image plane stack are combined to produce a 3D BB [35], [36], [37], [4]. The approach works regardless of whether the results were generated by a single 2D CNN or a combination of three 2D CNNs.
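A simplified sketch of this idea (function name and input format are hypothetical): given per-slice presence flags for each of the three stacks, the BB extent along each axis is the span between the first and last positive slice:

```python
import numpy as np

def box_from_slice_presence(presence_z, presence_y, presence_x):
    """Build a 3D BB from per-slice presence flags.

    presence_* are 1D boolean arrays: True where a (hypothetical) 2D
    detector reported the target in that slice of the axial, coronal
    or sagittal stack.  The BB extent per axis is the first and last
    positive slice index.
    """
    box = []
    for presence in (presence_z, presence_y, presence_x):
        idx = np.flatnonzero(presence)
        box.append((int(idx[0]), int(idx[-1])))
    return box  # [(z0, z1), (y0, y1), (x0, x1)]

pz = np.zeros(10, bool); pz[3:7] = True
py = np.zeros(10, bool); py[2:5] = True
px = np.zeros(10, bool); px[4:9] = True
print(box_from_slice_presence(pz, py, px))  # [(3, 6), (2, 4), (4, 8)]
```

In practice the presence flags would first be smoothed or filtered, since a single false-positive slice would otherwise inflate the box.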

B. Coarse Segmentation / Probability Maps
The coarse segmentation of target structures is often an intermediate step towards a subsequent refined segmentation. First, the entire image is viewed to roughly locate one or more targets. The resulting sub-optimal segmentation is then utilized to place a BB around the area of interest [32], [33], [34], [29], [5], [31], [19], [20], [39]. The authors of [38] implement a 2D pixel-wise probability detection in every image plane direction to obtain confidence heatmaps, which are then used to generate a 3D BB. By applying a threshold to the pixel probabilities, the largest connected component is found and a BB is simply put around it. The procedure is shown step by step in Fig. 5. R. Gauriau et al. (2015) [22] calculate voxel probabilities to obtain confidence maps in a 3D manner. They utilize RRFs and divide the localization into two steps: a first RRF performs a rough localization of all organs at once; a second, organ-specific RRF then focuses on each individual organ. In a similar fashion, Y. Zhang et al. (2017) [21] first take advantage of the knowledge about the relative positions of the target structures and their voxel intensity by using Haar-like features to narrow down the target area. An RRF is then trained on spatial and intensity features to predict a voxel-wise probability map within the target area. Using a threshold, a BB is placed around the target structure.
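The threshold-and-largest-component step can be sketched as follows (a simplified pure-NumPy version with a breadth-first connected-component search; names and details are our own, the cited papers use their own pipelines):

```python
import numpy as np
from collections import deque

def bb_from_probability_map(prob, threshold=0.5):
    """Threshold a voxel probability map, keep the largest
    6-connected component and return its tight bounding box."""
    mask = prob > threshold
    visited = np.zeros_like(mask)
    best = []
    offs = [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
            (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    for start in zip(*np.nonzero(mask)):
        if visited[start]:
            continue
        comp, queue = [], deque([start])
        visited[start] = True
        while queue:  # breadth-first flood fill of one component
            v = queue.popleft()
            comp.append(v)
            for o in offs:
                n = tuple(np.add(v, o))
                if all(0 <= n[i] < mask.shape[i] for i in range(3)) \
                        and mask[n] and not visited[n]:
                    visited[n] = True
                    queue.append(n)
        if len(comp) > len(best):
            best = comp
    pts = np.array(best)
    return pts.min(axis=0), pts.max(axis=0)

prob = np.zeros((8, 8, 8))
prob[2:5, 2:5, 2:5] = 0.9   # large blob (the target)
prob[6, 6, 6] = 0.9         # small spurious detection
p_min, p_max = bb_from_probability_map(prob)
print(p_min, p_max)  # [2 2 2] [4 4 4]
```

Keeping only the largest component suppresses isolated false-positive voxels before the box is fitted; a library routine such as `scipy.ndimage.label` would replace the hand-rolled flood fill in production code.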

C. Deep Reinforcement Learning
Deep Reinforcement Learning (DRL) combines Reinforcement Learning (RL) and Deep Learning. In RL an agent takes a sequence of actions in order to achieve a certain goal. In doing so, it receives feedback in the form of rewards and penalties. Through trial and error, the agent tries to maximize the accumulated reward and learns which actions to take. DRL incorporates Deep Neural Networks (DNN) into this task: the DNN analyzes the current state and decides which action to take. In the work of F. Navarro et al. (2020) [10], the CNN receives the current BB voxel values and those of the last four states as input for performing the task of finding the final BB. The actions consider the moving direction, translation and scaling of the 3D BB. S. Iyer et al. (2018, 2020) [7], [12] employ two 3D CNNs, one for learning the navigation in the coordinate directions and the other to predict the size of the BB dimensions.
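A toy sketch of such an agent's action space (the concrete actions, step size and scaling factors here are illustrative assumptions, not the exact spaces used in [10], [7], [12]):

```python
import numpy as np

# Discrete action set assumed for illustration: translate the BB
# centre along +/- each axis, or grow/shrink the box.  The trained
# DNN would pick one of these names per step; here we apply them by hand.
ACTIONS = {
    "x+": (np.array([1, 0, 0]), 1.0), "x-": (np.array([-1, 0, 0]), 1.0),
    "y+": (np.array([0, 1, 0]), 1.0), "y-": (np.array([0, -1, 0]), 1.0),
    "z+": (np.array([0, 0, 1]), 1.0), "z-": (np.array([0, 0, -1]), 1.0),
    "grow": (np.array([0, 0, 0]), 1.1), "shrink": (np.array([0, 0, 0]), 0.9),
}

def apply_action(center, size, name, step=2.0):
    """One environment step: move or rescale the current BB."""
    delta, scale = ACTIONS[name]
    return center + step * delta, size * scale

center, size = np.array([32., 32., 32.]), np.array([10., 10., 10.])
center, size = apply_action(center, size, "x+")
center, size = apply_action(center, size, "grow")
print(center, size)  # [34. 32. 32.] [11. 11. 11.]
```

The reward would then typically be the change in overlap (e.g. IoU) between the current BB and the ground-truth BB after each step.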

D. Anchor Based Approaches
Another frequently seen approach uses anchor boxes, which are predefined BB guesses of certain scales and aspect ratios. The authors of [14] scan the whole volume using a 3D sliding window that is large enough to fully contain the target structure. A 10-layer VGGNet [13] serves as the classifier.
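The sliding-window scan can be sketched as follows (the scoring function here is a stand-in for the VGGNet classifier of [14]; window size, stride and helper names are our assumptions):

```python
import numpy as np

def sliding_window_detect(volume, win, stride, score_fn):
    """Scan a volume with a fixed-size 3D window and return the
    highest-scoring window as a two-corner BB.  score_fn stands in
    for the CNN classifier that rates each candidate patch."""
    best_score, best_box = -np.inf, None
    for z in range(0, volume.shape[0] - win[0] + 1, stride):
        for y in range(0, volume.shape[1] - win[1] + 1, stride):
            for x in range(0, volume.shape[2] - win[2] + 1, stride):
                patch = volume[z:z+win[0], y:y+win[1], x:x+win[2]]
                s = score_fn(patch)
                if s > best_score:
                    best_score = s
                    best_box = ((z, y, x),
                                (z + win[0], y + win[1], x + win[2]))
    return best_box, best_score

volume = np.zeros((16, 16, 16))
volume[8:12, 8:12, 8:12] = 1.0                 # synthetic target
box, score = sliding_window_detect(volume, (4, 4, 4), 2, np.mean)
print(box)  # ((8, 8, 8), (12, 12, 12))
```

The exhaustive triple loop illustrates why this approach is computationally expensive in 3D; in practice the stride, pyramid scales and candidate pruning are tuned to keep the number of classifier evaluations manageable.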
X. Xu et al. (2019a) [8] binarize the predicted sagittal, axial and coronal presence probability curves of the target organs by applying a threshold. The 3D BBs are composed of the largest 1D nonzero component in each of these three binary curves.

VI. RESULTS
Table I gives an overview of the evaluated papers. Included are the author, the image modality, the approach to 3D BB detection, the target structure in the body and the evaluation results of the work. The "Results" column in Table I only summarizes the reported evaluations; [4], for instance, did extensive testing, and a more detailed evaluation can be found in their paper. Some results are also left blank, since no evaluation was performed where localization was only a less important intermediate step. The metric measured was mostly Intersection over Union (IoU) or the Dice coefficient.

VII. CONCLUSION AND FUTURE WORK
We provide a synopsis of recent works dealing with 3D BB detection in volumetric medical images. For this purpose, 31 papers from the last five years were evaluated. The review is intended to provide an overview of current trends as well as information on various options for BB detection in 3D data. 3D and 2D implementations were differentiated, processing the 3D input as a whole or splitting it into several 2D inputs. Various approaches were identified, Coarse Segmentation being the most commonly used. It was also found that Deep Learning methods have largely replaced traditional and other methods, e.g. RRFs. The overview of options presented in this review will help future researchers to select a promising approach that also reflects the state of research. Some of the presented techniques are also applicable to 2D imagery, e.g. detecting, learning and discerning face appearances in photographs [53]. Traditional techniques such as RRFs have been augmented by Deep Learning techniques, especially CNNs. The most promising and increasingly successful methods seem to be CNNs, as they combine traditional signal processing approaches (convolution filtering) with automatic learning from examples in Neural Networks. BB detection helps to save computational cost and to train models for the subsequent semantic segmentation of body areas more specifically, with better results in the end.
To assess the quality and relevance of BB detection for patient modelling in VR simulators [54], [1], [2] thoroughly, we plan studies in our lab to examine the influence of different imaging modalities [55], [56], [57], [58], [59] and of BB detection quality through VR visualization and interaction with detected BBs using haptic force feedback [60], [58], [61], [62], [63] for quality assurance. In the future, we will also address accurate and precise BB detection and content segmentation [64] using nD image data from various imaging sources. Additionally, the quality of organ models in the time-dynamic simulation of 4D medical needle [65], [66] interventions [67], [68] shall profit from the hierarchical and more specific approach.
ACKNOWLEDGMENT
German Research Foundation DFG MA 6791/1-1; EXPLOR-19AM funds granted by Foundation Kessler+Co. for Education and Research.