COVID19 detection from Radiographs: Is Deep Learning able to handle the crisis?

COVID-19 is a highly contagious viral infection that has played havoc on everyone's life in many different ways. According to the World Health Organization and scientists, more testing potentially helps governments and disease control organizations contain the spread of the virus. The use of chest radiographs is one of the early screening tests for determining the onset of disease, as the infection severely affects the lungs. This study investigates and automates the testing process by using state-of-the-art CNN classifiers to detect COVID-19 infection. However, viral infections can be of many different types; therefore, we regard only COVID-19 as positive, while radiographs of other viral infections are treated as non-COVID-19. The classification task is challenging due to the limited number of scans available for COVID-19 and the minute variations among viral infections. We aim to employ current state-of-the-art CNN architectures, compare their results, and determine whether deep learning algorithms can handle the crisis appropriately. All trained models are available at https://github.com/saeed-anwar/COVID19-Baselines


Introduction
The COVID-19 infection is caused by a virus known as SARS-CoV-2, or the novel Coronavirus, which belongs to the Coronavirus family. The virus is highly contagious, as is evident from the exponential growth of positive cases throughout the world in a short period with limited testing. The infection causes severe damage to the lungs, leading to pneumonia with accompanying symptoms of sore throat, dry cough, sneezing, and high temperature.
Moreover, some patients do not show any symptoms and act as carriers, which is a worrying concern for health organizations. The remainder of the paper is organized as follows. Section 2 discusses related work on COVID-19 detection using computer vision. Section 3 presents the methodology, and experimental results are discussed in Section 4. The paper is concluded in Section 5.

Related Work
Chen et al. (2020) retrospectively collected 46,096 high-quality CT images from 106 admitted patients anonymously. There were 51 laboratory-confirmed COVID-19 patients, while the remaining 55 were patients with other diseases. Three expert radiologists, each with more than five years of experience, annotated the COVID-19 dataset by consensus. The problem is framed as a segmentation task, and UNet++ was trained to segment valid areas in the CT images. The trained model was deployed at the Renmin Hospital of Wuhan University and as a web API to assist the diagnosis of COVID-19 cases around the world.
COVNet uses a 3D deep learning framework to extract 2D local and 3D global features for the detection of COVID-19. ResNet (He et al., 2016) is employed as the backbone to extract features from the input CT slices. The extracted features are passed through a max-pooling operation.
The final feature map is fed to a fully-connected layer and then through a softmax function to obtain probabilities for each class. Previous studies have shown the successful application of deep learning methodologies to chest X-rays for the diagnosis of bacterial and viral infections (Rajaraman et al., 2018; Kermany et al., 2018).
A deep learning-based CT diagnosis system, termed DeepPneumonia, is proposed to detect and localize the lesions caused by COVID-19. Firstly, the lung region is extracted from each CT image and then fed to the Details Relation Extraction neural Network (DRE-Net), which produces the top-K details in a CT scan using a pre-trained ResNet with a Feature Pyramid Network (FPN) and an attention module. The attention module learns the importance of each detail. The predictions are combined by an aggregation module to produce the patient-level diagnosis. Deep learning requires a significant amount of annotated data for training the model. As radiologists are busy dealing with the pandemic, the annotation task is difficult and costly; therefore, a weakly-supervised technique is presented by Zheng et al. (2020) that utilizes weak patient-level labels for the rapid diagnosis of COVID-19 subjects. A 3D deep convolutional neural network called DeCoVNet takes CT volumes and 3D lung masks as input and outputs the probabilities of COVID-19 and non-COVID-19. A pre-trained model is used to generate the 3D lung mask. The first stage of the architecture consists of a vanilla 3D convolutional base, followed by batch normalization and max-pooling, to create a 3D feature map. In the second stage, the 3D feature map is passed through two 3D residual blocks with batch normalization. In the last stage, a Progressive Classifier (ProClf) progressively abstracts the information in the 3D volumes and applies a softmax function to output the probability of being COVID-19 or non-COVID-19.
Chest X-ray radiography (CXR) is widely used for the diagnosis of various infections due to its lower cost and broader availability. COVID-19 patients show lung consolidation over time; therefore, CXR could be used as a diagnostic tool in conjunction with CT scans for better radiological analysis (Ng et al., 2020). A two-step human-machine collaborative strategy is proposed to design a network architecture for the detection of COVID-19 cases from CXR images (Wang and Wong, 2020). In the first step, an initial network design prototype is constructed using human-driven principles. In the second step, the initial prototype and human-specified design requirements are used in a machine-driven exploration to find the optimal macro-architecture design.

We aim to employ the available state-of-the-art deep learning algorithms to identify COVID-19 and non-COVID-19 features. The purpose is two-fold: 1) this research will provide baselines, and 2) it will also establish the performance of current state-of-the-art deep learning algorithms.

Methodology
For the sake of completeness, we will discuss the basic building blocks of the networks employed in this study. The modern architectures are broadly grouped into the following categories.

Plain Networks
AlexNet (Krizhevsky et al., 2012) is the first architecture that sparked research interest in deep learning when it won the ImageNet challenge by a substantial margin. The architecture consists of eight layers: five convolutional layers with activation functions and three fully-connected layers. AlexNet, for the first time, used multiple GPUs to train bigger models and reduce training time.
Subsequently, Simonyan and Zisserman (2015) proposed the Visual Geometry Group (VGG) networks, which come in different variations; VGG16 and VGG19 are the most common, with 16 and 19 layers, respectively. The typical pattern among these architectures is the use of only 3×3 filters. The initial layers utilize a few filters and increase their number as the depth of the network increases, a pattern that can also be seen in other architectures. The earlier layers of VGG, and of other plain architectures, capture more spatial information with fewer filters, while the later layers use more filters to balance out the reduced spatial resolution. Initially, the VGG architecture was difficult to train from a random initialization of weights.
However, training became easier with the introduction of intelligent initialization techniques such as Xavier initialization (Glorot and Bengio, 2010) and He initialization (He et al., 2015). VGG19, the more accurate model, has a size of 574MB.
Contrary to the plain networks, the succeeding architectures share a common property, i.e., the use of shortcut paths from earlier layers to later layers, which addresses the vanishing gradient problem (Hochreiter, 1998).

GoogleNet Inception Networks
GoogleNet, introduced by Szegedy et al. (2015), was the winning architecture in the ImageNet challenge. The performance of GoogleNet was slightly better than that of VGG; however, the GoogleNet model is considerably smaller in size, only 28.12MB compared to the 574MB of the VGG model.
The basic building block of GoogleNet is termed the Inception module, which comes in different variations, making later versions more accurate than the original GoogleNet Inception implementation. The idea of the Inception module is to use filters of varying dimensions simultaneously and then let the network decide during optimization which weights are essential. In this way, the network learns multiscale features efficiently.
The 1×1 convolution is used to reduce the dimension of the feature map volume-wise before applying any other filter, thus decreasing the model size significantly. The Inception module, as shown in Figure 4, applies a 1×1 convolution at the start of each branch, with an additional max-pooling branch followed by a 1×1 convolution. The outputs of all the branches are concatenated volume-wise before being passed into the next layer of the network.
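To make the multiscale idea concrete, the following is a minimal sketch of an Inception-style module in PyTorch; the branch widths are illustrative and do not correspond to any particular GoogleNet stage.

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Simplified Inception-style block: parallel 1x1, 3x3, 5x5 and pooling
    branches whose outputs are concatenated along the channel dimension."""
    def __init__(self, in_ch):
        super().__init__()
        # 1x1 convolution branch
        self.branch1 = nn.Conv2d(in_ch, 32, kernel_size=1)
        # 1x1 reduction followed by 3x3 convolution
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_ch, 24, kernel_size=1),
            nn.Conv2d(24, 32, kernel_size=3, padding=1),
        )
        # 1x1 reduction followed by 5x5 convolution
        self.branch5 = nn.Sequential(
            nn.Conv2d(in_ch, 8, kernel_size=1),
            nn.Conv2d(8, 16, kernel_size=5, padding=2),
        )
        # max-pooling followed by 1x1 projection
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, 16, kernel_size=1),
        )

    def forward(self, x):
        branches = [self.branch1(x), self.branch3(x),
                    self.branch5(x), self.branch_pool(x)]
        # concatenate volume-wise (along channels) before the next layer
        return torch.cat(branches, dim=1)
```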

Residual Networks
Traditional networks suffer from the vanishing gradient problem (Hochreiter, 1998) during backpropagation. The gradient becomes very small and cannot update the parameters in the initial layers, causing network learning to be prolonged.
The Residual Network (ResNet) made it possible to train deeper networks (He et al., 2016).
The basic module of ResNet is called the residual block, as shown in Figure 5, which splits the input of the module into two branches. One branch takes the input through a series of convolutions, activations, and batch normalizations, while the other branch is a shortcut that skips all these operations and is added to the output of the first branch; this is known as identity mapping. The residual layer starts learning from the identity function and learns more sophisticated and robust features deeper in the architecture. In the recent version of ResNet, the order of operations in the first branch has been changed from convolution, batch normalization, activation (Conv-BN-ReLU) to batch normalization, activation, convolution (BN-ReLU-Conv). This ordering is called pre-activation.
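For illustration, a minimal pre-activation residual block in PyTorch could look as follows; the channel count is arbitrary, and the shortcut assumes the input and output dimensions already match.

```python
import torch.nn as nn

class PreActResidualBlock(nn.Module):
    """Pre-activation residual block: BN-ReLU-Conv applied twice, plus an
    identity shortcut added to the branch output (dimensions assumed to match)."""
    def __init__(self, channels):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                      # shortcut branch
        out = self.conv1(self.relu(self.bn1(x)))
        out = self.conv2(self.relu(self.bn2(out)))
        return out + identity             # identity mapping added back
```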

Dense Networks
In DenseNet (Huang et al., 2017), each layer concatenates the feature maps from all previous layers, using this collective knowledge in the computation of the current feature map. The current layer passes on its feature map to all subsequent layers, which ensures maximum information and gradient flow between the layers of the network.
On the contrary, ResNet adds the features from the module input to the output layer. Figure 6 shows the DenseNet module; a composition layer applies pre-activation to the concatenated feature maps of all the previous layers before computing the current layer's output. DenseNet has fewer parameters and can learn more complex features.
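A minimal sketch of this connectivity pattern in PyTorch is given below; the growth rate and number of layers are illustrative assumptions, not the settings of the DenseNet variants evaluated here.

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """Composition layer: pre-activation (BN-ReLU-Conv) applied to the
    concatenation of all previous feature maps."""
    def __init__(self, in_ch, growth_rate=32):
        super().__init__()
        self.bn = nn.BatchNorm2d(in_ch)
        self.relu = nn.ReLU(inplace=True)
        self.conv = nn.Conv2d(in_ch, growth_rate, kernel_size=3, padding=1, bias=False)

    def forward(self, features):
        # 'features' is the list of all previous layers' feature maps
        x = torch.cat(features, dim=1)
        return self.conv(self.relu(self.bn(x)))

class DenseBlock(nn.Module):
    def __init__(self, in_ch, num_layers=4, growth_rate=32):
        super().__init__()
        self.layers = nn.ModuleList(
            [DenseLayer(in_ch + i * growth_rate, growth_rate) for i in range(num_layers)]
        )

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            features.append(layer(features))   # pass the collective knowledge forward
        return torch.cat(features, dim=1)
```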

Efficient Networks
Generally, deeper ConvNets tend to obtain better top-1 accuracy on challenging tasks such as ImageNet detection and classification. However, the trained models are over-parameterized, and there has always been a tradeoff between accuracy and efficiency when selecting a model for a specific application. Traditionally, models achieved better accuracy by increasing the depth of the architecture (using more layers), the width of the architecture (via more channels), or the resolution of the input image. Tan and Le (2019) proposed compound scaling to size up all three critical parameters (width, depth, and input image resolution) to improve model performance. The proposed networks, called EfficientNets, are a family of highly scalable and efficient neural network architectures that use compound scaling to select models, EfficientNet-B0 to B7, keeping in view the resource requirements.
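Concretely, the compound scaling rule from the EfficientNet paper ties all three factors to a single coefficient φ:

depth d = α^φ,  width w = β^φ,  resolution r = γ^φ,  subject to α · β² · γ² ≈ 2 and α, β, γ ≥ 1,

where α, β, and γ are constants found by a small grid search on the baseline network, and φ is increased according to the available computational budget.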
The building block of EfficientNet is the mobile inverted bottleneck convolution, MBConv (Sandler et al., 2018), shown in Figure 9, combined with squeeze-and-excitation optimization (Hu et al., 2018).
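To illustrate the squeeze-and-excitation part, the following is a minimal PyTorch sketch of an SE block as it is typically placed inside an MBConv layer; the reduction ratio is an illustrative assumption.

```python
import torch.nn as nn

class SqueezeExcitation(nn.Module):
    """Squeeze-and-excitation: global-average-pool the feature map ('squeeze'),
    pass it through a small bottleneck, and rescale each channel ('excite')."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.SiLU(inplace=True),                       # Swish activation, as in EfficientNet
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        scale = self.fc(self.pool(x))   # per-channel weights in [0, 1]
        return x * scale
```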

Squeeze Networks
Most CNN architectures are resource-hungry in order to achieve good accuracy on a particular dataset. Smaller architectures with equivalent accuracy require less bandwidth and are easily deployable on hardware of limited capacity. SqueezeNet (Iandola et al., 2016) is one such compact architecture; its fire modules first squeeze the feature maps with 1×1 filters and then expand them with parallel 1×1 and 3×3 filters, achieving competitive accuracy with far fewer parameters.
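As an illustration of this design, a minimal SqueezeNet-style fire module in PyTorch is sketched below; the channel counts are illustrative only.

```python
import torch
import torch.nn as nn

class FireModule(nn.Module):
    """SqueezeNet-style fire module: a 1x1 'squeeze' layer reduces channels,
    then parallel 1x1 and 3x3 'expand' layers are concatenated."""
    def __init__(self, in_ch, squeeze_ch=16, expand_ch=64):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))
        return torch.cat([self.relu(self.expand1x1(x)),
                          self.relu(self.expand3x3(x))], dim=1)
```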

Experimental Setup
We used the default settings for each of the networks. The input size of the images is the same as specified by the authors of the networks. The input batch size is set to 32 with an initial learning rate of 0.0001. The models are fine-tuned for 500 epochs from ImageNet weights (Deng et al., 2009) due to the limited number of images available in the datasets. The last classification layer in all networks is changed to a binary output to differentiate between COVID-19 and non-COVID-19 radiographs or CT scans. PyTorch is used as the framework for training and testing the algorithms.
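A minimal sketch of this fine-tuning setup is given below, assuming a torchvision ResNet-18 backbone and a generic DataLoader built with a batch size of 32; the optimizer choice (Adam) is an assumption, since only the learning rate is specified above.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained backbone and replace the classifier with a
# binary head (COVID-19 vs. non-COVID-19), as described above.
model = models.resnet18(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 2)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # initial learning rate 0.0001

def train(model, loader, epochs=500, device="cuda"):
    """Fine-tune for the stated number of epochs; 'loader' yields batches of 32."""
    model.to(device)
    model.train()
    for epoch in range(epochs):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
```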

Datasets
Since there is no single sizeable dataset available for CXR images of COVID-19 patients, the dataset is curated from multiple sources to have sufficient data for training, testing, and validation.

Metrics
We take into account the following five most common metrics used in detection to evaluate each of the algorithms discussed earlier in the manuscript.
• Precision: The ratio of correctly predicted positive COVID-19 patients to the total positive predictions (i.e., true positives plus false positives). This metric reflects an algorithm's rate of false positives: the higher the precision, the lower the false positives.
• Recall: Also known as the sensitivity of the algorithm. It is the ratio of correctly predicted positive outcomes (i.e., true positives) to the actual positive observations (i.e., true positives plus false negatives).
• F1 Score: Accounts for both false positives and false negatives by taking the harmonic mean of precision and recall. The F1 score is useful in cases where the class distribution is uneven.
• Accuracy: The most used and intuitive measure in classification, defined as the ratio of correct predictions to the total number of samples. Although high accuracy may seem a good measure, it may not be the best in situations where the class distribution is not symmetric. Hence, we also use the other metrics to evaluate the performance of the algorithms.
• AUC: The area under the (ROC) curve, the second most used metric for classification. It represents the degree of separability, i.e., the capability of the network to distinguish between classes. The higher the AUC, the better the model is at predicting positives as positives and negatives as negatives.
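These five metrics can be computed from the model outputs with scikit-learn, as in the following sketch (assuming binary labels with 1 denoting COVID-19 positive):

```python
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             accuracy_score, roc_auc_score)

def evaluate(y_true, y_pred, y_score):
    """y_true/y_pred are 0/1 class labels; y_score is the predicted
    probability of the positive (COVID-19) class, used for AUC."""
    return {
        "precision": precision_score(y_true, y_pred),
        "recall":    recall_score(y_true, y_pred),
        "f1":        f1_score(y_true, y_pred),
        "accuracy":  accuracy_score(y_true, y_pred),
        "auc":       roc_auc_score(y_true, y_score),
    }

# Example: metrics = evaluate(labels, predictions, softmax_probs[:, 1])
```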

Evaluations
The quantitative results are reported in Table 1 and Table 2 for COVIDCT and COVIDx datasets, respectively.
The accuracy on the COVIDCT dataset varies from 70% to 81%.
Moreover, GoogleNet achieves the highest average recall of 94.29%, and DenseNet169 has the highest precision of 100%. The highest performance for area under the curve is 88.80%, achieved via ResNet101.
The accuracy of the deep learning models on the COVIDx dataset is higher than on the COVIDCT dataset, ranging from 78.23% to 87.1%. On average, the accuracy is more than 82%. On the other hand, the deep models yield lower results for recall and similar results for precision on the COVIDx dataset. The highest recall achieved is 47.62% by MNASNet1.0. Similarly, the best precision is 83.33%, achieved by both GoogleNet and EfficientNet-b3. The models also struggle to produce comparable results for the AUC metric, where DenseNet201 gives the top performance, achieving 78.59%.

Attention
Each model focuses on specific aspects of the image to detect an object or an artifact. Here, we present the attention of the models on infected and non-infected radiographs. In Figure 12, we present the CT images with feature attention, where the red color indicates the regions on which the models have focused. The first three rows contain COVID-19 infections, while the remaining two rows in Figure 12 are infection-free.
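The text above does not tie the visualization to a specific implementation; one common way to produce such attention heatmaps is Grad-CAM, sketched below for a ResNet-style backbone (the target layer and normalization are assumptions for illustration).

```python
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx=1):
    """Grad-CAM sketch: weight the target layer's activations by the
    spatially pooled gradients of the chosen class score (1 = COVID-19)."""
    activations, gradients = [], []
    fwd = target_layer.register_forward_hook(lambda m, i, o: activations.append(o))
    bwd = target_layer.register_full_backward_hook(lambda m, gi, go: gradients.append(go[0]))

    model.eval()
    score = model(image.unsqueeze(0))[0, class_idx]   # image: [C, H, W] tensor
    model.zero_grad()
    score.backward()
    fwd.remove(); bwd.remove()

    acts, grads = activations[0], gradients[0]
    weights = grads.mean(dim=(2, 3), keepdim=True)    # global-average-pool the gradients
    cam = F.relu((weights * acts).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    return (cam / cam.max()).squeeze().detach()       # heatmap in [0, 1]; high values = model focus

# Usage (hypothetical): heatmap = grad_cam(model, ct_tensor, model.layer4)
```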

Conclusion
In this work, we have tested the capacity of current state-of-the-art deep learning algorithms and provide baselines for future research comparisons on two publicly available datasets, COVIDCT and COVIDx. We aimed to differentiate between COVID-19 infected and non-infected scans and radiographs. We have shown the quantitative results and the attention of the models on sample images. We have employed several metrics to give a more comprehensive understanding of network performance. Although the results are promising, a more significant number of images would be helpful for further training and testing.