Comparison of Kidney Segmentation Under Attention U-Net Architectures

Abstract: One of the most prominent advantages of machine learning in the medical industry is the early detection of disease. Automatic kidney detection is of great importance for rapid diagnosis and treatment, as related diseases accounted for over 73,750 new cases in the US in 2020 [1]. Today, diagnosis is performed by highly trained radiologists. However, the complex structures contribute to speckle noise and inhomogeneous intensity profiles in the images. Thus, there is a need to automate segmentation of kidney ultrasounds using U-Net deep learning architectures, an innovative solution for medical imaging analysis. In this research, we compare Attention U-Net with different backbones: VGG19, ResNet152V2, and EfficientNetB7. This comparison is intended as a survey that helps future researchers decide more effectively which Attention U-Net architecture to use for their segmentation projects.


I. INTRODUCTION
In recent years, research on medical imaging and pattern recognition with performance equal to or even better than human inspection has grown rapidly, albeit not free of criticism and controversy. However, medical applications demand high accuracy in the detection of convoluted geometrical shapes. Traditionally, the architectures used to highlight these shapes were either non-standard or very complex to use. Accordingly, in 2015, U-Net was introduced to automate image segmentation for medical imaging. It is a deep learning architecture that resembles a "U": an encoding path followed by a decoding path, joined by skip connections.
The goal of this research is to use kidney detection as a proof of concept to compare the performance of different U-Net models. The hope is to help future researchers decide on the best U-Net segmentation algorithm to use. To the best of our knowledge, no paper compares Attention U-Net across the backbones VGG19, ResNet152V2, and EfficientNetB7 exclusively within the context of kidney segmentation. We provide recommendations on which architectures are best to use.

II. MOTIVATION
This contribution aims to demonstrate a segmentation system based on U-Net that addresses the challenges hindering complex deep networks in medical diagnosis. Comparisons of different backbone algorithms based on recent encoders are also scarce. Another question that sparked our curiosity was how a U-Net architecture behaves when tuned with the latest backbone algorithms, for which we found no previous comparison. A main characteristic of these backbones is that they aim to solve efficiency problems by reducing unnecessary computations.

A. Previous work
Related research has been conducted on kidney datasets using U-Net architectures, namely 3D U-Net [2]. Nevertheless, none of these works explores the potential advantages of backbones such as VGG19, ResNet152V2, and EfficientNetB7. These backbones are CNN (Convolutional Neural Network) architectures that achieved leading results on the ImageNet benchmark, in which millions of images are classified into around 1000 categories such as dogs and cats.
Moreover, Seum et al. [3] suggested incorporating segmentation as the first step of the COVID-19 diagnosis pipeline. The reason was to refine what is sent to the CNN for classification. For instance, kidney segmentation could be used to determine the kidney location before the image is sent to a CNN. The CNN would then indicate whether there is a disease, such as tumors or stones, within the annotated kidney region, thus improving the performance of the CNN.
Z. Wang et al. [4], on the other hand, proposed a new U-Net called "RAR-U-Net", which stands for "Residual encoder to Attention decoder by Residual connections framework for medical image segmentation under noisy labels". Our investigation deals with ordinary healthy kidney ultrasounds of relatively good quality; thus, noisy labels are not a concern for us. If we were to expand our project to accommodate more complex images, RAR-U-Net could be implemented.
Li et al. [5] introduced "ANU-Net", an attempt at a new U-Net that is more robust and annotates medical images more accurately by means of an attention mechanism. In this investigation, we only consider U-Nets that have already been formally designed and tested, in our case by being part of the Keras library. Their article is an excellent springboard for building a custom U-Net should we wish to further understand how U-Nets work and how to improve them.
Overall, we investigate exclusively kidney detection with regard to a comparison of Attention U-Net with the backbones VGG19, ResNet152V2, and EfficientNetB7. We believe that such a comparison has not been done, especially within the context of kidney detection.

B. Dataset
The dataset was obtained freely from Kaggle under the title "CT2USforKidneySeg - A Dataset synthesized US images from CT data with labels". In total, there were 200 samples, with separate segmented masks, resized to a 256 x 256 scale. The slices consist of kidney ultrasounds, whereas the masks contain the outline of the kidneys. Furthermore, the dataset was randomly shuffled and split into a training set (90%) and a validation set (10%) to evaluate the models under experiment. Below is an example of an ultrasound followed by the corresponding mask that annotates the kidney.
Our comparison is of the Attention U-Net model. In addition to this model, we also apply transfer learning from the following backbones: VGG19, ResNet152V2, and EfficientNetB7, with and without ImageNet weights. We seek to compare the performance of all these Attention U-Net and backbone combinations in order to provide a benchmark to facilitate future research.
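The random shuffle and 90%/10% split of the 200 samples can be sketched as follows. This is only an illustration: the function name `train_val_split`, the use of integer sample identifiers, and the seed value are assumptions, not details taken from the original experiments.

```python
import random

def train_val_split(sample_ids, val_fraction=0.10, seed=42):
    """Shuffle sample identifiers and split them into training and validation lists."""
    ids = list(sample_ids)
    rng = random.Random(seed)      # hypothetical seed, used only for reproducibility
    rng.shuffle(ids)
    n_val = round(len(ids) * val_fraction)
    # The first n_val shuffled ids form the validation set, the rest the training set.
    return ids[n_val:], ids[:n_val]

train_ids, val_ids = train_val_split(range(200))
print(len(train_ids), len(val_ids))  # 180 20
```

Splitting identifiers rather than image arrays keeps each ultrasound paired with its mask, since both can be loaded later from the same identifier.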

A. Segmentation models
Segmentation is the ability to separate an image into its semantic components, that is, regions that describe a specific object. Our example is to highlight the borders of a kidney in an ultrasound with a simple binary mask. The following describes U-Net and its successor, Attention U-Net.

a. U-Net
The U-Net architecture belongs to the FCN (Fully Convolutional Network) family, which differs from conventional CNNs by replacing the fully connected layers with convolutional ones, so that inputs of various sizes can be processed. U-Net was introduced in 2015 for biomedical image segmentation. Essentially, it is a "U"-shaped, autoencoder-like architecture in which the first half encodes (dimensionality reduction) and the second half decodes (dimensionality increase). The encoding is performed by a CNN-like structure that uses kernels and pooling to preserve important information while compressing it into a smaller representation. The decoding is the opposite: up-convolutions and upsampling increase the spatial size based on the features obtained in the encoder half. Additionally, skip connections link each encoder stage with the decoder stage at the same level. Training is accomplished by providing the original images as input and the corresponding masks as targets. Essentially, the main idea behind U-Net is that the original image is condensed into its semantic parts and then expanded again based on these parts, so that the output separates the image into its semantic regions.
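To make the encode, decode, and skip-connection flow concrete, the following NumPy sketch traces only the tensor shapes through a toy "U". It is a simplification under stated assumptions: the convolutions are omitted entirely, the depth of 3 is arbitrary, and the helper names (`max_pool_2x2`, `upsample_2x`, `tiny_unet_shapes`) are hypothetical rather than part of any library.

```python
import numpy as np

def max_pool_2x2(x):
    """Downsample an (H, W, C) feature map by taking the max over 2x2 blocks."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

def upsample_2x(x):
    """Upsample an (H, W, C) feature map by nearest-neighbour repetition."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def tiny_unet_shapes(image, depth=3):
    """Trace the encoder/decoder path; convolutions are omitted for brevity."""
    skips = []
    x = image
    for _ in range(depth):                        # encoder: store features, then shrink
        skips.append(x)
        x = max_pool_2x2(x)
    bottleneck_shape = x.shape
    for skip in reversed(skips):                  # decoder: grow, then merge the skip
        x = upsample_2x(x)
        x = np.concatenate([skip, x], axis=-1)    # skip connection at the same level
    return bottleneck_shape, x.shape

image = np.zeros((256, 256, 1))
bottleneck, output = tiny_unet_shapes(image)
print(bottleneck, output)  # (32, 32, 1) (256, 256, 4)
```

The output recovers the 256 x 256 input resolution, and its channel count grows because each decoder level concatenates the stored encoder features. A real U-Net would apply convolutions after each concatenation and finish with a 1x1 convolution and a sigmoid to produce the binary mask.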

b. Attention U-Net
In 2018, Attention U-Net was introduced as an enhancement of the classical U-Net. Essentially, it highlights target structures while suppressing irrelevant regions by means of an "attention mechanism", implemented as attention gates placed on the skip connections.
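As a rough illustration of how such a mechanism can weight the skip features, the sketch below implements an additive attention gate in NumPy. It rests on several assumptions: the gating signal `g` is taken to be already resampled to the spatial size of the skip features `x`, the 1x1 convolutions are written as plain matrix products over the channel axis, bias terms are dropped, and the weights are random rather than trained. The names `attention_gate`, `w_x`, `w_g`, and `psi` are chosen for this example only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_gate(x, g, w_x, w_g, psi):
    """Additive attention gate on (H, W, C) feature maps.

    x:   skip-connection features, shape (H, W, Cx)
    g:   gating features, shape (H, W, Cg), assumed already resampled to x's size
    w_x: (Cx, Ci), w_g: (Cg, Ci), psi: (Ci, 1), acting as 1x1 convolutions
    """
    q = np.maximum(x @ w_x + g @ w_g, 0.0)   # joint projection followed by ReLU
    alpha = sigmoid(q @ psi)                 # attention coefficients, shape (H, W, 1)
    return alpha * x, alpha                  # features scaled per pixel

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 4))
g = rng.normal(size=(8, 8, 6))
gated, alpha = attention_gate(
    x, g,
    rng.normal(size=(4, 3)), rng.normal(size=(6, 3)), rng.normal(size=(3, 1)),
)
print(gated.shape, alpha.shape)  # (8, 8, 4) (8, 8, 1)
```

Each coefficient in `alpha` lies between 0 and 1, so pixels judged irrelevant are scaled toward zero before the skip features reach the decoder.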

c. U-Net Loss Function (Binary Cross Entropy)
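For all U-Nets, the loss function is the Binary Cross Entropy. For $N$ pixels with target labels $y_i \in \{0, 1\}$ and predicted probabilities $p_i$, it is given by:

$$L_{BCE} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right] \qquad (1)$$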

B. CNN (Convolutional Neural Network) backbones
In addition to the U-Net architecture, we also train models that include a CNN backbone. These CNN backbones achieved leading results on ImageNet, a benchmark associated with a yearly competition to find the algorithm that best classifies millions of images into around 1000 categories. CNNs have been the deep learning structures able to accomplish such a massive task successfully. It is these well-trained CNN architectures that we seek to use in order to obtain better results in our project. Generally, CNNs are structured as shown in Fig. 3.
Fig. 3. Generic Convolutional Neural Network [7].
The CNNs investigated are VGG19, ResNet152V2, and EfficientNetB7. These models come with weights pre-trained on the ImageNet dataset. We investigated using these backbones with the Attention U-Net model. Essentially, this is transfer learning applied to a segmentation task.

a. VGG19
VGG was created by the Visual Geometry Group at Oxford in 2015. VGG19 is a variant of the VGG model family and consists of 19 weight layers (16 convolutional and 3 fully connected), together with 5 MaxPool layers and 1 SoftMax layer. A forward pass requires over 19.6 billion FLOPs (floating-point operations) [8].

b. ResNet152V2
ResNet152V2 is a variant of the Residual Network (ResNet) that won the ImageNet competition in 2015. It consists of 152 layers. The breakthrough with ResNet is the ability to train very deep networks. Additionally, it is the architecture that popularized "skip connections" in CNNs, whereby non-adjacent layers are connected directly [9].

c. EfficientNetB7
The first EfficientNet was introduced in 2019. It was considered one of the most "efficient" models. Overall, it reaches "state-of-the-art" accuracy on ImageNet and also on transfer learning tasks [10].

d. Loss Function for CNN
Generally, for all CNNs the loss function is as follows, with $p_c$ the predicted probability and $y_c$ the target for class $c$ out of $C$ classes. It is called the Cross Entropy, or the Binary Cross Entropy if only two classes are used.

$$L_{CE} = -\sum_{c=1}^{C} y_c \log(p_c) \qquad (2)$$

This is a deep learning segmentation project; accordingly, there are several widely recognized metrics for measuring its performance. These include the Confusion Matrix, Precision and Recall, Accuracy, Jaccard index / IoU (Intersection over Union), DICE, and Loss.

A. Confusion Matrix
This is a table that compares the true labels with the predicted labels. It is important because it shows how the algorithm is performing. In particular, it is very harmful to have positives reported as negatives (False Negative, FN). Such a result is especially worrisome in the medical field, where it can have drastic consequences for patients: they have a disease, but the algorithm fails to detect it.
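For binary segmentation, the four cells of this table can be counted per pixel. The sketch below is a minimal illustration on two small hand-made masks; the function name `confusion_counts` and the example arrays are assumptions made for this example.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) pixel counts for two binary masks of equal shape."""
    t = np.asarray(y_true).astype(bool)
    p = np.asarray(y_pred).astype(bool)
    tp = int(np.sum(t & p))      # kidney pixels predicted as kidney
    fp = int(np.sum(~t & p))     # background pixels predicted as kidney
    fn = int(np.sum(t & ~p))     # kidney pixels missed by the prediction
    tn = int(np.sum(~t & ~p))    # background pixels predicted as background
    return tp, fp, fn, tn

y_true = np.array([[0, 1, 1, 0],
                   [0, 1, 1, 0],
                   [0, 0, 0, 0]])
y_pred = np.array([[0, 1, 0, 0],
                   [0, 1, 1, 1],
                   [0, 0, 0, 0]])
print(confusion_counts(y_true, y_pred))  # (3, 1, 1, 7)
```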

B. Precision and Recall
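Precision is known as the "positive predictive value". It is the ratio of correct positive predictions to the total predicted positives. Recall, also known as sensitivity, is the ratio of correct positive predictions to the total actual positives.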
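In terms of confusion-matrix counts, precision is TP / (TP + FP) and recall is TP / (TP + FN). A small sketch follows; the function name and the example counts are hypothetical, and returning 0.0 for an empty denominator is a convention assumed here.

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall from true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # share of predicted positives that are correct
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # share of actual positives that are found
    return precision, recall

print(precision_recall(tp=30, fp=10, fn=20))  # (0.75, 0.6)
```

In this example the model is right about most of what it marks as kidney (precision 0.75) but misses 40% of the true kidney pixels (recall 0.6), which is the kind of false-negative behaviour that matters most in a medical setting.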

C. Accuracy
This is the number of correct predictions divided by the number of all predictions, expressed as a percentage. In our case, pixel accuracy is not a strong metric, because in semantic segmentation the background class typically dominates the object class, so a model can reach a high accuracy score without segmenting the object well. Accordingly, we ignore accuracy in our project.
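One way to see why pixel accuracy can mislead is to score a prediction that contains no kidney at all. The sketch below uses a hypothetical 256 x 256 mask in which the kidney covers a 40 x 64 rectangle; the region and its size are invented purely for illustration.

```python
import numpy as np

mask = np.zeros((256, 256), dtype=bool)
mask[100:140, 100:164] = True        # hypothetical kidney region: 40 x 64 = 2560 pixels
prediction = np.zeros_like(mask)     # a model that predicts background everywhere

accuracy = float(np.mean(prediction == mask))
intersection = np.sum(prediction & mask)
union = np.sum(prediction | mask)
iou = float(intersection / union)
print(f"accuracy={accuracy:.4f}, IoU={iou:.4f}")  # accuracy=0.9609, IoU=0.0000
```

The prediction is useless for segmentation, yet it scores about 96% accuracy because background pixels dominate, while the IoU correctly drops to zero.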

D. Jaccard index/IoU (Intersection over Union)
The Jaccard index is also known as the IoU (Intersection over Union). It is the area of overlap between the predicted and ground-truth masks divided by the area of their union. A value of zero means no overlap, whereas a value of one means complete overlap. The goal is to reach a value close to one, meaning that the masks are very similar.
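A minimal NumPy sketch of the IoU for two binary masks is given below. The function name `iou_score`, the example masks, and the convention of returning 1.0 when both masks are empty are assumptions made for this illustration.

```python
import numpy as np

def iou_score(y_true, y_pred):
    """Intersection over Union (Jaccard index) of two binary masks."""
    t = np.asarray(y_true).astype(bool)
    p = np.asarray(y_pred).astype(bool)
    union = np.sum(t | p)
    if union == 0:
        return 1.0   # convention assumed here: two empty masks count as a perfect match
    return float(np.sum(t & p) / union)

y_true = np.array([[0, 1, 1, 0],
                   [0, 1, 1, 0]])
y_pred = np.array([[0, 1, 0, 0],
                   [0, 1, 1, 1]])
print(iou_score(y_true, y_pred))  # 0.6
```

Here three pixels overlap and five pixels are covered by at least one mask, giving 3 / 5 = 0.6.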

E. DICE Loss (Sørensen-Dice coefficient)
This metric was developed in the 1940s to measure the similarity between two samples, just like the Jaccard index. Values range between zero and one, where zero means no spatial overlap and one indicates complete overlap. DICE is calculated as two times the area of overlap divided by the total number of pixels in both images.
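The calculation can be sketched as follows, together with the standard identity DICE = 2 IoU / (1 + IoU) that links it to the Jaccard index. The function names, the example masks, and the convention for two empty masks are assumptions made for this illustration.

```python
import numpy as np

def dice_score(y_true, y_pred):
    """Sørensen-Dice coefficient of two binary masks."""
    t = np.asarray(y_true).astype(bool)
    p = np.asarray(y_pred).astype(bool)
    total = t.sum() + p.sum()            # foreground pixels in both masks
    if total == 0:
        return 1.0                       # convention assumed here for two empty masks
    return float(2.0 * np.sum(t & p) / total)

def dice_from_iou(iou):
    """Convert an IoU value to the equivalent DICE value."""
    return 2.0 * iou / (1.0 + iou)

y_true = np.array([[0, 1, 1, 0],
                   [0, 1, 1, 0]])
y_pred = np.array([[0, 1, 0, 0],
                   [0, 1, 1, 1]])
print(dice_score(y_true, y_pred))  # 0.75
```

For these masks the overlap is 3 pixels and each mask has 4 foreground pixels, so DICE = 2 * 3 / 8 = 0.75, which matches the conversion from an IoU of 0.6. Because of this identity, DICE is never lower than IoU for the same pair of masks.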

F. Loss Function -Binary Cross Entropy
This is the quantity that is minimized during training in order to reduce the difference between the output produced by the segmentation engine and the desired output. The same loss function, Binary Cross Entropy, has been applied to both Attention U-Net and the corresponding CNN backbones.
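A minimal NumPy sketch of the Binary Cross Entropy, averaged over pixels, is shown below. The function name and the clipping constant `eps` are assumptions chosen for this example; clipping is a common way to avoid taking the logarithm of zero, not necessarily what the framework used in the experiments does internally.

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-7):
    """Mean binary cross entropy between target labels and predicted probabilities."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)  # keep log() finite
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

confident = binary_cross_entropy([1, 0], [0.9, 0.1])
uncertain = binary_cross_entropy([1, 0], [0.5, 0.5])
print(round(confident, 4), round(uncertain, 4))  # 0.1054 0.6931
```

Confident and correct predictions give a loss near zero, whereas a prediction of 0.5 everywhere gives log(2), roughly 0.6931, regardless of the target.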

V. RESULTS
The overall training of all models was as follows. Firstly, the U-Net took over 300 hours to finish due to our processing limitations, so we were forced to restrict our experiments to what was feasible on our hardware. The following demonstrates what our algorithm accomplished. As can be observed, the mask prediction, below, performed very well. Additionally, we have included a table that summarizes our results. We ran experiments with different weights that could deal with the negative background class. Overall, our results have interesting interpretations, which we discuss in Sections VI (Discussion) and VII (Conclusion).

VI. DISCUSSION
In this section we discuss the Limitations, Expectations, Interpretations, and Recommendations.

A. Limitations
Our major limitation was processing speed when using Google Colab. Accordingly, our results are based on a relatively small number of epochs (sixty) and samples (two hundred). Overall, it took an average of two hours to train a single model. Nevertheless, our experience is relevant because our results translate to practical settings, given that most researchers are similarly limited [13]. Also, because binary segmentation involves imbalanced classes, different weights should be calibrated to fit the needs of the model.

B. Expectations
This study produced some unanticipated results. Originally, we had assumed that using the latest CNN as the backbone, with ImageNet weights and a frozen backbone, would clearly produce the best overall results. This did not occur. Instead, we have learned that selecting an appropriate and effective Attention U-Net deep learning architecture is a much more nuanced task. We had assumed that the most recent ImageNet architecture, EfficientNetB7, would produce the best results. Instead, it tended to produce the worst results, most likely because of its huge architecture and, thus, its larger training requirements: it consists of over 800 layers and over 60 million parameters. Additionally, as we learned later, EfficientNet on TensorFlow accepts only raw images and does not work with masked samples. Thus, we ignore EfficientNet.

C. Interpretations
Highest Overall Scores: From our results, we considered the performance measurements (IoU %, DICE %, Precision %, and Recall %) shown in Table IV. They are the standard measures used to compare segmentation performance. Firstly, we learned that "no backbone" produces the best results; however, this may be a consequence of our rather simple and small dataset and classification task, which does not give the opportunity to exploit the benefits of a well-trained, award-winning, deep CNN.
In general, VGG19 outperformed ResNet152V2 in more instances in terms of accuracy across the 60 epochs. This is a result of both the limited dataset and the limited training time. ResNet152V2 is newer than VGG19; however, it consists of more layers and more parameters, and thus needs more processing time to function correctly and effectively.
If we ignore the "no backbone" case, the order of the best average is detailed in Table IV. When we initialize with (A), we are given an architecture that only needs to be tweaked at the classification layer. In contrast, "No ImageNet" (B and D) means that our models start with weights that are arbitrary and meaningless. Moreover, with (D), freezing means that we retain these insignificant weights. With (B), we are effectively training the whole architecture; however, convergence takes longer due to the large architectures of the CNNs. Finally, (C) performs poorly because we are not changing the weights at all to suit our project needs.
Additionally, with regard to training speed, the order of performance from best to worst in terms of DICE epoch convergence is displayed in Table V. This is logical, considering that starting from ImageNet weights accelerates convergence because the backbone is pre-trained. Freezing is not as important as using ImageNet weights. However, when freezing is combined with No ImageNet, it gives worse overall performance, because we are stuck with weights that were not trained for our current purpose, or for any other purpose.
Thus, we conclude that for complex and larger datasets, and given enough processing power, a recent ImageNet architecture should be leveraged with Attention U-Net. This backbone should be initialized with its corresponding ImageNet weights and trained without freezing these weights. Additionally, if training time is limited, the architecture should be picked based on the DICE epoch convergence.
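The combinations of backbone, weight initialization, and freezing discussed in this section form a small experiment grid, which can be enumerated as sketched below. The dictionary keys and the function name `experiment_grid` are hypothetical, and the sketch deliberately does not assign the labels (A) to (D), since it only illustrates how the configurations multiply.

```python
from itertools import product

BACKBONES = ["VGG19", "ResNet152V2", "EfficientNetB7"]

def experiment_grid(backbones=BACKBONES):
    """Enumerate the no-backbone baseline plus every backbone / initialization / freezing combination."""
    grid = [{"backbone": None, "imagenet_weights": False, "freeze_backbone": False}]
    for name, imagenet, freeze in product(backbones, [True, False], [True, False]):
        grid.append({"backbone": name, "imagenet_weights": imagenet, "freeze_backbone": freeze})
    return grid

configs = experiment_grid()
print(len(configs))  # 13
```

Three backbones with two initialization options and two freezing options give twelve configurations, plus the Attention U-Net baseline without a backbone. At roughly two hours per model, as reported in the Limitations, the size of this grid directly determines the total training budget.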

D. Recommendations
Arguably, every dataset requires specific hyperparameter tuning based on its complexity and robustness. Due to our resource limitations, we used a standard configuration: a batch size of 8, 60 epochs, Binary Cross-Entropy as the loss function, Adam as the optimizer, a learning rate of 1e-3, 200 samples, and a train-validation split of 90%-10%. From our results, the best overall model to select is Attention U-Net without any backbone. However, the other backbones did not achieve the best results because they were built for complex data and complex classification. In this project, we leveraged a small and simple dataset with binary colors and binary masks; thus, we were not able to fully appreciate the advantages of these CNN backbones (VGG19 and ResNet152V2).
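The hyperparameters listed above can be collected in a single configuration, from which the number of weight updates per model follows by simple arithmetic. In the sketch below the key names and the function `training_budget` are hypothetical, and it assumes that the final partial batch of each epoch is kept rather than dropped.

```python
import math

CONFIG = {
    "batch_size": 8,
    "epochs": 60,
    "loss": "binary_crossentropy",
    "optimizer": "adam",
    "learning_rate": 1e-3,
    "num_samples": 200,
    "train_fraction": 0.90,
}

def training_budget(cfg):
    """Return (training samples, steps per epoch, total optimizer steps)."""
    n_train = round(cfg["num_samples"] * cfg["train_fraction"])
    steps_per_epoch = math.ceil(n_train / cfg["batch_size"])   # last partial batch is kept
    return n_train, steps_per_epoch, steps_per_epoch * cfg["epochs"]

print(training_budget(CONFIG))  # (180, 23, 1380)
```

With 180 training samples and a batch size of 8, each epoch has 23 steps, so 60 epochs amount to 1380 weight updates per model, a small budget for backbones with tens of millions of parameters.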
If we consider the backbones alone, excluding the no-backbone model, the best configurations have the characteristics listed in Table IV. This behavior follows logically from the effects of ImageNet initialization and freezing.
Furthermore, we found that VGG19 performed better than ResNet152V2. This is because ResNet152V2 requires a longer training time given its more sophisticated and deeper architecture. However, with regard to faster convergence of the DICE score, ResNet152V2 consistently outperformed VGG19.
Training speed is also affected by how the CNN is initialized and whether it is frozen. Overall, initializing with ImageNet weights leads to faster convergence of DICE. Freezing also has an effect, giving the worst results in the combination "No ImageNet and Freeze". Thus, we recommend "ImageNet and No Freeze" because of its faster convergence.
We recommend studying the backbones closely before application. When deciding which Attention U-Net to pursue, one needs to consider the dataset size, its complexity, the number of classes, and the available processing power. Ideally, we recommend ResNet152V2 with "ImageNet and No Freeze", trained with more epochs, more samples, and more processing power.

VII. CONCLUSION
This research has generated many questions and exciting future areas of research to be addressed by Attention U-Nets in conjunction with other deep learning architectures. We have determined that for complex datasets, and with enough processing power, Attention U-Net works best under "ImageNet and No Freeze". One avenue of further research is working with 3D models to diagnose kidney disease [11]. Moreover, although this project focused on the detection of kidneys in ultrasounds, it could easily be extended to find tumors and other abnormalities by providing the appropriate annotations and using a pipeline of Attention U-Net for detection followed by CNN transfer learning for diagnosis. Additionally, we have learned that finding the best model to work with is not a trivial task and involves careful consideration of dataset size, complexity, the number of categories, and processing power. Overall, medical data is difficult to obtain online; thus, a future avenue would be to generate synthetic data with the help of a GAN (Generative Adversarial Network) [12]. Finally, the most important question, which has yet to be answered by researchers in any medical field, remains: is it possible to predict when a disease will develop and how it will progress with the use of deep learning?