A mask-guided attention deep learning model for COVID-19 diagnosis based on an integrated CT scan images database

Abstract The global extent of COVID-19 mutations and the consequent depletion of hospital resources highlighted the necessity of effective computer-assisted medical diagnosis. COVID-19 detection mediated by deep learning models can help diagnose this highly contagious disease and lower infectivity and mortality rates. Computed tomography (CT) is the preferred imaging modality for building automatic COVID-19 screening and diagnosis models. It is well-known that the training set size significantly impacts the performance and generalization of deep learning models. However, accessing a large dataset of CT scan images from an emerging disease like COVID-19 is challenging. Therefore, data efficiency becomes a significant factor in choosing a learning model. To this end, we present a multi-task learning approach, namely, a mask-guided attention (MGA) classifier, to improve the generalization and data efficiency of COVID-19 classification on lung CT scan images. The novelty of this method is compensating for the scarcity of data by employing more supervision with lesion masks, increasing the sensitivity of the model to COVID-19 manifestations, and helping both generalization and classification performance. Our proposed model achieves better overall performance than the single-task (without MGA module) baseline and state-of-the-art models, as measured by various popular metrics.


Introduction
The coronavirus pandemic has struck the world since late 2019, causing a global crisis and countless deaths. As of January 11th, 2022, the World Health Organization had reported more than 308.46 million confirmed cases and 5.49 million deaths due to this virus. Furthermore, COVID-19 mutations have significantly constrained hospitals' capacity, resulting in delayed care and increased risks for patients suffering from other critical conditions. COVID-19's global reach has brought together experts from a wide range of fields to combat the disease. One of the ongoing research topics has been to improve COVID-19 diagnosis. Early diagnosis has two main benefits: (1) lowering the infectivity rate by isolating patients, and (2) reducing the fatality rate through early intervention.
While the reverse transcription-polymerase chain reaction (RT-PCR) test, which falls under the category of nucleic acid amplification tests (NAATs), has become the gold standard for detecting COVID-19, it has drawbacks such as limited sensitivity to the new variants, short supply of testing kits, and lengthy wait time for results (Ai et al., 2020; Tahan et al., 2021; Trivizakis et al., 2020; Xie et al., 2020). Alternatively, lung computed tomography (CT) has proven to be a rapid and relatively accurate method of detecting COVID-19 and assessing its severity (Ai et al., 2020; Fang et al., 2020; Trivizakis et al., 2020; Xie et al., 2020). Infected patients' lung CT scans may exhibit distinctive characteristics such as ground-glass opacification, bilateral involvement, and diffuse distributions (Misztal et al., 2020; Trivizakis et al., 2020; Xie et al., 2020). However, interpreting CT scans is a complex task requiring extensive radiology expertise. The number of radiologist experts is limited, and they face a heavy workload during an outbreak, increasing the risk of human errors. Therefore, transferring expert knowledge into intelligent models is valuable for improving healthcare accessibility and reducing medical specialists' workload and unnecessary exposure to the outbreak.
Deep learning has become one of the most extensively used approaches for building intelligent models, which can learn the underlying representation of images and classify them in a time-efficient manner. Notably, deep learning approaches have been successful for COVID-19 diagnosis in lung CT scans. Zhang et al. (2020) proposed an AI system that can identify COVID-19 markers and lesion properties using an extensive CT database of 3,777 patients. Zhao et al. (2020) used a mix of CT scans, lung masks, and lesion masks to train a COVID-19 diagnosis model leveraging multi-task and self-supervised learning. Rahimzadeh et al. (2021) presented a fast, accurate, and fully automated method for COVID-19 diagnosis from the patient's chest CT scan images. There have been several other studies on deep learning-based COVID-19 diagnosis (Maftouni et al., 2021; Polsinelli et al., 2020; Shamsi et al., 2021; Yazdani et al., 2020). Most of these works use a single-task approach and devote the learning model to only one task. On the other hand, jointly learning multiple related tasks, namely, multi-task learning (MTL), has been shown to overcome over-fitting and improve generalization through implicit data augmentation, attention focusing, and regularization (Ruder, 2017).
Despite the promising learning ability of deep models, the generalization power of the trained network depends on the size, distribution, and quality of the training dataset. Inadequate training datasets can easily lead to over-fitted deep learning models that cannot generalize well on a new dataset. Some COVID-19 datasets have been made publicly available (Afshar et al., 2021; Cohen et al., 2020; He et al., 2020; Jun et al., 2020; MedSeg, 2020; Morozov et al., 2020; Rahimzadeh et al., 2021; Zhao et al., 2020). Zhao et al. (2020) introduced the COVID-CT dataset, which includes 349 COVID-19 CT images from 216 patients and 463 non-COVID-19 images (a mix of normal cases and patients with other diseases). Misztal et al. (2020) reported improving classification performance by categorizing negative COVID-19 cases into specific groups and creating the COVID-19 CT Radiograph Image Data Stock dataset with a careful data split. Afshar et al. (2021) built an open-sourced dataset named COVID-CT-MD, comprising COVID-19, Normal, and community-acquired pneumonia (CAP) cases. COVID-CT-MD is accompanied by lobe-level, slice-level, and patient-level labels to aid in developing deep learning methods. Nonetheless, researchers continue to require more data for training deep learning models in order to provide better insights and generalization performance. To this end, our COVID-19 lung CT scan dataset is curated from seven open-source datasets.
Our proposed method applies a deep learning model with an attention module, which is a state-of-the-art technique in machine learning, to improve the performance of COVID-19 detection. For an image input, the attention module infers the attention map, a collection of pixel-level weights, to prioritize the image features by their level of importance for the task (Woo et al., 2018). It attempts to mimic human visual perception, which focuses on specific locations, objects, and attributes in the scene by filtering out irrelevant information. For example, an expert radiologist knows precisely where to focus in a CT scan to find a particular pathology. So, intuitively, the attention map learns which areas of the image are more relevant to the performed task, such as medical diagnosis. The use of attention modules in deep learning networks originated and proved successful in neural machine translation (Bahdanau et al., 2015; Vaswani et al., 2017). Motivated by this success and its consistency with human perception, visual attention modules were adopted in different computer vision applications such as image captioning (Xu et al., 2015), visual question answering, and image classification (Wang et al., 2017). The Residual Attention Network in Wang et al. (2017) achieved state-of-the-art object recognition performance on several benchmark datasets and showed improved robustness against noisy labels. Later, Woo et al. (2018) proposed a lightweight convolutional block attention module (CBAM) that could be integrated into any convolutional neural network (CNN) architecture to infer and refine attention. They showed that integrating CBAM inside various state-of-the-art CNN models improves classification and detection performance. Accordingly, CBAM is incorporated into our model for enhanced performance through attention map learning and feature refinement.
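As a concrete illustration, the spatial attention branch of CBAM can be sketched in PyTorch as follows. This is a minimal sketch under stated assumptions: the kernel size and tensor shapes are illustrative, and CBAM's channel attention branch is omitted for brevity.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """CBAM-style spatial attention: pool the feature map across channels
    with average and max pooling, then infer a 2-D attention map with a
    single convolution followed by a sigmoid."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                      # x: (B, C, H, W)
        avg_map = x.mean(dim=1, keepdim=True)  # (B, 1, H, W)
        max_map = x.max(dim=1, keepdim=True).values
        attn = self.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return x * attn, attn                  # refined features and the attention map

feats = torch.randn(2, 64, 56, 56)
refined, attn = SpatialAttention()(feats)
```

Because the module returns the inferred map alongside the refined features, the map itself can be supervised, which is the mechanism the MGA module exploits.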
To summarize, the objective of this article is to improve the generalization and performance of COVID-19 detection deep learning models. Specifically, the main contributions of our article are as follows. 1. A large and broadly representative lung CT scan dataset for COVID-19 detection is built by curating seven open-source datasets. To the best of our knowledge, this is the largest publicly available COVID-19 CT dataset accompanied by patient metadata. The dataset includes cases from 13 countries and has three classes: COVID-19, Normal, and CAP. It also contains COVID-19 frames with corresponding lesion masks merged from three of the datasets. 2. A novel mask-guided attention (MGA) classifier for COVID-19 diagnosis is developed that improves classification performance, data efficiency, and interpretability. Our experimental results demonstrate the proposed method's superior performance over the baseline and its improved focus on the COVID-19 lesions.
The remainder of this article is organized as follows. In Section 2, a brief review of related research work on COVID-19 diagnosis, lesion segmentation, MGA methods, and multi-task learning is provided. Next, the proposed research methodology is summarized in Section 3. Section 4 introduces our curated CT scan dataset. Our proposed MGA deep learning model for COVID-19 diagnosis is detailed in Section 5, followed by the experimental results and ablation studies in Section 6. Finally, the conclusions and future directions are discussed in Section 7.

Related work
Related work on deep learning-based COVID-19 diagnosis and lesion segmentation on CT scans is reviewed first in Section 2.1. Next, multi-task learning methods related to COVID-19 are introduced in Section 2.2. The research gap is identified in Section 2.3.

COVID-19 diagnosis and lesion segmentation based on CT scans
Deep learning has been the method of choice in most existing works on diagnosing COVID-19 infection from CT scans (Polsinelli et al., 2020; Rahimzadeh et al., 2021; Shamsi et al., 2021; Yazdani et al., 2020), owing to the success of deep learning methods in image classification. He et al. (2020) tested seven state-of-the-art deep classification models, including VGG-16 (Simonyan & Zisserman, 2015), ResNet-18 and ResNet-50 (He et al., 2016), DenseNet-121 and DenseNet-169 (Huang et al., 2017), and EfficientNet-b0 and EfficientNet-b1 (Tan & Le, 2019). They integrated contrastive self-supervision into the transfer learning process to further improve the performance of deep classification algorithms. In Rahimzadeh et al. (2021), a two-stage system was proposed for detecting COVID-19. The first stage filtered out CT frames in which the inside of the lung is not properly observable. In the second stage, they applied a new feature pyramid network designed for classification problems on a ResNet-50V2 baseline (He et al., 2016), allowing the model to investigate different resolutions of the image and preserve information from small objects. Polsinelli et al. (2020) proposed a light convolutional neural network, based on the SqueezeNet model (Iandola et al., 2016), for the efficient differential diagnosis of COVID-19 CT scans from other community-acquired pneumonia infections and healthy CT scans. Shamsi et al. (2021) proposed a novel transfer learning-based and uncertainty-aware framework for reliable detection of COVID-19 cases from X-ray and CT images. In Yazdani et al. (2020), the attentional convolution network of Wang et al. (2017) is used to focus on the infected areas of the chest so that the network can provide a more accurate prediction. Goel et al. (2021) proposed an optimized CNN model, named OptCoNet, for the automatic screening of COVID-19 patients based on X-ray images and used the Grey Wolf optimizer for CNN hyperparameter optimization. Narin et al. (2021) compared the performance of five pre-trained convolutional neural network models (ResNet50, ResNet101, ResNet152, InceptionV3, and Inception-ResNetV2) for COVID-19 classification on X-ray images and reported ResNet50 to have the best overall performance. In Wang et al. (2020), a human-machine collaborative strategy is applied to design a deep convolutional neural network tailored to detect COVID-19 on chest X-ray images with improved sensitivity. They also introduced COVIDx, a large public benchmark dataset of COVID-19 X-ray images.

Lesion segmentation is another task on CT scan images that is well suited for deep learning (Chaganti et al., 2020; Chassagnon et al., 2021; Gao et al., 2021; Wu et al., 2021; Yao et al., 2021). Generally, this task entails automatically predicting binary lesion masks, assigning the same label to all types of lesions. The problem can be expanded to the semantic segmentation of different lesion types, both within and outside the lung regions, if a sufficient number of lesion-specific ground truth masks is available. Nonetheless, binary lesion masks are adequate for assessing the extent of involvement and the manifestations of the disease in the lung of a confirmed or suspected COVID-19 patient (Tilborghs et al., 2020). Chaganti et al. (2020) proposed to automatically segment ground-glass opacities (GGO) and areas of consolidation together using a DenseUNet (Ronneberger et al., 2015). Chassagnon et al. (2021) proposed CovidENet, an ensemble of 2D and 3D CNNs based on AtlasNet (Vakalopoulou et al., 2018), for binary lesion segmentation and achieved human-level segmentation performance in terms of Dice score and Hausdorff distance. Yao et al. (2021) proposed NormNet, a voxel-level anomaly modeling network that recognizes normal voxels from possible anomalies. A decision boundary for the normal contexts of NormNet is learned by separating healthy tissues from diverse synthetic "lesions," which allows it to segment COVID-19 lesions without training on any labeled data. To focus more on the lesion areas, a novel lesion attention module was developed to integrate the intermediate segmentation results.

Multi-task learning (MTL)
In general, MTL is known as a machine learning approach that assimilates information from correlated tasks to improve the generalization capability of the overall learning model (Zhang & Yang, 2021). There are two approaches in multi-task learning: hard parameter sharing and soft parameter sharing of hidden layers (Ruder, 2017). Hard parameter sharing is commonly found in the literature, in which multiple tasks (networks) share some hidden layers while keeping their separate output layers. On the other hand, soft parameter sharing is achieved when each task has its own model and respective parameters, but the parameters from different tasks are jointly regularized.
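Hard parameter sharing can be sketched as follows; the layer sizes and the choice of auxiliary task here are illustrative assumptions, not a model from the literature.

```python
import torch
import torch.nn as nn

class HardSharedMTL(nn.Module):
    """Hard parameter sharing: both tasks reuse one trunk of hidden layers
    and keep separate output heads."""
    def __init__(self, in_dim=512, n_classes=3):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                   nn.Linear(256, 128), nn.ReLU())
        self.cls_head = nn.Linear(128, n_classes)  # task 1: classification
        self.aux_head = nn.Linear(128, 1)          # task 2: e.g., a severity score

    def forward(self, x):
        h = self.trunk(x)                          # shared hidden representation
        return self.cls_head(h), self.aux_head(h)  # task-specific outputs

logits, score = HardSharedMTL()(torch.randn(4, 512))
```

Gradients from both heads flow into the shared trunk, which is how the correlated tasks regularize each other; soft sharing would instead keep two trunks and penalize the distance between their parameters.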
MTL has been adopted for COVID-19 diagnosis improvement. Bao et al. (2022) proposed end-to-end multitask learning to detect and assess the severity of COVID-19 cases with improved performance using only a relatively small dataset of 1329 CT scans. Amyar et al. (2020) developed a multi-task deep learning model with three tasks of classification, segmentation, and reconstruction from chest CT images. Goncharov et al. (2021) deployed a two-task deep learning model to identify COVID-19 cases and quantify the disease severity. Wu et al. (2021) developed a novel Joint Classification and Segmentation (JCS) system to perform real-time and explainable COVID-19 chest CT diagnosis. Gao et al. (2021) developed a dual-branch combination network (DCN) for COVID-19 diagnosis to simultaneously achieve individual-level classification and lesion segmentation. These papers reported an improvement over the single-task benchmark models. Furthermore, multi-task learning has improved the performance of smaller datasets more significantly (Crichton et al., 2017;Gong et al., 2019).
Another form of MTL is MGA models, which extend attention convolutional neural networks. The attention weights that the model assigns to each input element are generally learned without dedicated supervision; therefore, they might converge to parts of the image that are irrelevant to the task. For example, in classifying lung CT scans, the main focus should be on the inside-lung manifestations, and assigning high attention weights to outside-lung pixels is unhelpful. Accordingly, recent research adopts extra supervision on attention map training. For instance, Song et al. (2018) designed a contrastive attention model guided by binary masks. It can generate a pair of body-aware and background-aware attention maps, which produce features of the body and background for person re-identification. Pang et al. (2019) introduced a novel MGA network that fits into popular pedestrian detection pipelines. The attention network emphasizes visible pedestrian regions while suppressing the occluded parts by modulating full-body features. Wang et al. (2021) proposed an MGA model that provides auxiliary supervision from masks predicted by a pre-trained segmentation model for discriminative and patchy representation learning.

Research gap
The research gaps in the COVID-19 diagnosis approaches listed in Section 2.1 and Section 2.2 are identified as follows: (1) most proposed COVID-19 diagnosis methods are single-task, which may make them more susceptible to over-fitting; (2) training an accurate COVID-19 diagnosis model requires a large amount of broadly representative sample data, which many existing research efforts lack; and (3) some COVID-19 diagnosis applications using a multi-task approach demonstrated improved performance over single-task models; however, their models lack explainable diagnosis results. In summary, there is a need for a novel approach to improve the generalization, interpretability, and data efficiency of deep learning models for COVID-19 diagnosis applications. Therefore, in this work, a multi-task COVID-19 detection model, jointly supervising the attention maps and class labels, is developed to fill the aforementioned research gaps. The multi-task learning approach is implemented through an MGA module integrated inside the COVID-19 classifier to supervise its attention map with segmented lesions. Additionally, one of the strengths of our model is that it is trained and tested on a more broadly representative dataset, which promotes its generalizability.

Proposed research methodology
This work aims to develop a data-efficient deep learning model for COVID-19 diagnosis based on chest CT scan slices, with good generalization and interpretability. The performance of a deep learning model is highly dependent on the training data. However, a comprehensive CT scan dataset for COVID-19 is not publicly available to researchers in the current literature. To fill this gap, a new CT scan dataset for COVID-19 is created and introduced in Section 4.
A CT scan cross-section, or slice, is reconstructed from measurements of the attenuation coefficients (intensity reduction) of x-ray beams as they pass through the tissues. Tissues with higher attenuation (such as bone) appear bright, whereas tissues with little attenuation (such as air and water) appear dark. Since a normal lung looks dark in a CT scan, an abnormal increase in attenuation in an inside-lung area points to lesions related to different diseases (e.g., COVID-19 or CAP). Radiologists have characterized the key lung lesion patterns, or lesion types, for COVID-19 diagnosis. Our dataset contains the marking of these patterns as follows: 1. ground-glass opacities (hazy gray opacities that do not obscure the underlying vessels), 2. consolidation (areas of increased attenuation that obscure the underlying vessels), and 3. pleural effusion (excess fluid build-up between the lung and chest cavity).
These patterns are revealed through the lesion annotations manually marked by radiologist experts. As depicted in Figure 1, the lesion annotations are employed to derive the binary lesion masks of each image. Namely, black pixels are non-lesion while white ones are lesions. Therefore, in this article, all different COVID-19 lesion types are combined as one type used in the classification analysis for COVID-19 diagnosis.
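The collapse from per-type lesion annotations to a binary mask can be sketched as follows; the integer encoding of lesion types is an assumed convention for illustration.

```python
import numpy as np

def to_binary_mask(semantic_mask: np.ndarray) -> np.ndarray:
    """Collapse a semantic lesion mask (0 = non-lesion; 1..K = lesion types
    such as GGO, consolidation, pleural effusion) into a binary mask:
    0 (black) for non-lesion pixels, 255 (white) for any lesion type."""
    return np.where(semantic_mask > 0, 255, 0).astype(np.uint8)

sem = np.array([[0, 1], [2, 3]])
assert to_binary_mask(sem).tolist() == [[0, 255], [255, 255]]
```

This is the sense in which all lesion types are treated as a single category for the classification analysis.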
Our idea is to fully utilize the available domain knowledge through the COVID-19 lesion patterns to improve the deep learning model performance while lowering its data requirement. The overall proposed model architecture, depicted in Figure 2, is a two-step approach as follows.
Step 1 (Section 5.1): A lesion segmentation model based on the Hierarchical Multi-scale Attention Network (HMSANet) is implemented to automatically create lesion masks for the images that radiologists did not mark with lesion masks, since manual marking is costly and time-consuming. The lesion masks are then used in the MGA module to supervise the spatial attention map (created by CBAM in Step 2), assigning higher attention weights to the pixels resembling lesions.
Step 2 (Section 5.2): The deep learning classification model is applied to classify the input CT image, guided with the lesion mask generated in Step 1, and provides the diagnosis result, namely, Normal, COVID-19, or CAP case. Our classification model uses CBAM and MGA modules to enhance the model's focus on lesion locations. Particularly, the spatial attention map created by CBAM is guided toward the lesions through the MGA module during training.
The two steps introduced above are integrated through a hard parameter sharing multi-task learning model (namely, the sharing of some hidden layers). The first task, accomplished through the MGA module, directly supervises the network's attention map using the lesion masks predicted in Step 1. The second task, implemented in the second step, applies supervision on the class predictions with the ground-truth class labels. This multi-task learning model has the following advantages.
1. First, the increased focus on the lesion regions, which are the COVID-19 manifestations, improves the accuracy of COVID-19 diagnosis and alleviates over-fitting by lowering the effective dimensionality of the data. 2. Second, it lowers the training data requirement: our experiments show that the proposed model requires fewer training samples by utilizing additional supervision through the lesion data. 3. Third, the model prediction is more interpretable and reliable when focusing on the lesions instead of the entire image, much of which is irrelevant to the illness.
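The joint supervision described above can be sketched as a two-term objective. This is a minimal sketch under stated assumptions: the binary cross-entropy form of the attention term, the weighting `lam`, and all tensor shapes are illustrative, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def multitask_loss(logits, labels, attn_map, lesion_mask, lam=0.5):
    """Joint objective: cross-entropy on the class predictions plus a
    pixel-wise term pushing the spatial attention map toward the lesion
    mask (the MGA supervision)."""
    cls_loss = F.cross_entropy(logits, labels)
    # bring the mask down to the attention map's resolution before comparing
    mask = F.interpolate(lesion_mask, size=attn_map.shape[-2:], mode="nearest")
    attn_loss = F.binary_cross_entropy(attn_map, mask)
    return cls_loss + lam * attn_loss

torch.manual_seed(0)
logits = torch.randn(2, 3)                        # 3 classes: Normal, COVID-19, CAP
labels = torch.tensor([1, 0])
attn_map = torch.sigmoid(torch.randn(2, 1, 56, 56))
lesion_mask = (torch.rand(2, 1, 224, 224) > 0.8).float()
loss = multitask_loss(logits, labels, attn_map, lesion_mask)
```

Because both terms back-propagate through the shared backbone, the classifier is trained to attend to lesion pixels even on images whose masks were only machine-generated in Step 1.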

Dataset Creation
CT scans show promise in providing accurate and rapid COVID-19 screening and testing (Zhao et al., 2020). We created a large lung CT scan dataset for COVID-19 to aid in developing diagnosis models. The dataset includes curated data from seven open-source datasets, including Afshar et al. (2021) and Zhao et al. (2020). Each of the seven datasets is illustrated by an example image in Figure 3. These datasets have been utilized publicly in the COVID-19 diagnosis literature and have proven effective in deep learning applications. As a result, the combined dataset is expected to increase the generalization capacity of deep learning models by learning from all of these resources together.
Our objective is to provide a large dataset of axial chest CT scan slices with three labels, namely, (1) COVID-19, (2) Normal, and (3) CAP, together with their corresponding metadata and lesion masks if available.
Our study integrates seven public datasets of CT images from different sources across multiple countries. In this regard, the datasets are quite heterogeneous in terms of the operational parameters of generation, resolution, and formatting (e.g., NIfTI, DICOM, TIFF, PNG, and JPG). Some datasets consist of class-labeled CT slices (CT scan cross-sections, also referred to as frames or images). In contrast, other datasets include 3D CT scan volumes (slices stacked on top of each other) with slice-level annotations. Section 4.1 details the steps to preprocess these heterogeneous datasets into our unified dataset of consistent format.

Figure 1. (b) Red, yellow, and green colors indicate ground-glass opacities, consolidation, and pleural effusion lesion types, respectively. (c) Semantic segmentation mask that maps non-lesion pixels to black and assigns each lesion type to a different class (level of gray). (d) Binary (black and white) lesion mask, after mapping all lesion types to the general category of lesions; black represents non-lesion, and white represents lesion.
It should be noted that not all the 3D CT volumes in the dataset were annotated with class labels at the slice level, and we worked with our radiologist to annotate the remaining CT images. To ensure the dataset quality, we excluded the chest slices that do not carry information about inside-lung manifestations, as well as adjacent slices with almost identical appearances. Additionally, we removed images lacking clear class labels or patient information. In total, we have collected 7,593 COVID-19 images from 466 patients, 6,893 normal images from 604 patients, and 2,618 CAP images from 60 patients. Our CAP images are all from the dataset of Afshar et al. (2021), in which 25 cases are already annotated; our radiologist has annotated the remaining 35 CT scan volumes. Table 1 summarizes the number of frames from the COVID-19 and normal classes, the availability of specific metadata and masks, and the initial data format of each of the seven datasets. As previously stated, all of the cases have a patient ID, which is necessary for data splitting. As listed in the table, three of the datasets have lesion masks (Jun et al., 2020; MedSeg, 2020; Morozov et al., 2020), providing us with 2,729 COVID-19 lesion masks (36% of the COVID-19 cases) to be used to train the mask segmentation model explained in Section 5.1. The distinct categories of lesions in MedSeg (2020) are mapped to a binary lesion mask for consistency across datasets.

Figure 4(a) indicates that the cases come from 13 countries, with Iran, Russia, and China ranking first through third. According to Figure 4(b), most of the cases are male, and this male dominance holds for all Normal, COVID-19, and CAP classes. Figure 4(c) compares the age distribution of the three classes and shows that all age groups are represented in the dataset. The median ages of the Normal, COVID-19, and CAP classes are 50, 49, and 59, respectively.
Figure 4(d) compares the prevalence of distinctive CT characteristics in the 796 COVID-19 cases with CT scan reports, highlighting that ground-glass opacities, bilateral involvement, and consolidation have frequently been reported. Patterns attributed to higher severity, such as diffuse distribution (Lei et al., 2021), are also present. These statistics indicate that the dataset population is broad and representative, with cases from various age, gender, nationality, and severity groups.

Data Preprocessing
Data preprocessing aims to make our COVID-19 CT dataset less heterogeneous, since it combines datasets from diverse sources. Specifically, the data preprocessing has the following two steps.
Convert all of the CT volume data into labeled frames. For the CT volumes, we used their slice-level annotations to extract the label of each frame. All extracted frames are then converted to the 8-bit PNG file format for better uniformity and accessibility for the deep learning analysis.
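The conversion of a raw CT slice to 8-bit grayscale can be sketched as a windowing step; the (-1000, 400) Hounsfield-unit window used here is an illustrative lung-window choice, not the paper's stated setting.

```python
import numpy as np

def slice_to_uint8(ct_slice: np.ndarray, lo: float = -1000.0, hi: float = 400.0) -> np.ndarray:
    """Window a raw CT slice (in Hounsfield units) into 8-bit grayscale so
    it can be written out as a PNG frame."""
    clipped = np.clip(ct_slice.astype(np.float64), lo, hi)
    return np.round((clipped - lo) / (hi - lo) * 255.0).astype(np.uint8)

frame = slice_to_uint8(np.array([[-1000.0, 400.0]]))  # extremes map to 0 and 255
```

The resulting array can then be saved with any PNG writer, giving every dataset the same 8-bit single-channel representation regardless of its original NIfTI, DICOM, or TIFF encoding.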
A few examples of extracted frames are shown in Figure 3. A series of image transformations is then applied, including background removal, cropping of the lung area, image normalization, and resizing to 224 by 224 pixels. The background is removed to prevent artifacts and appearance differences between datasets from impacting the COVID-19 diagnosis. The lung area is cropped out since it is our region of interest. Figures 3(a) and 3(b) show a CT scan before and after transformation, respectively. Image normalization and resizing are applied to create a uniform style across different images.
This article uses a random resized crop with scale (0.5, 1), random image rotation with a maximum of 10 degrees, and horizontal flipping with a probability of 0.5 for training-set augmentation.

COVID-19 diagnosis using deep learning with the MGA model
This section presents the multi-task learning model using MGA in detail, consisting of two steps (see Figure 2).
Step 1: In Section 5.1, a lesion mask prediction model is implemented based on the 2,729 available COVID-19 lesion masks and then applied to generate lesion masks for all the images that were not annotated with lesions.
Step 2: In Section 5.2, a classification model is developed to classify whether the input image is Normal, COVID-19, or CAP. Additionally, the significance and method of interpreting the model predictions are introduced in Section 5.3.

Segmentation model for lesion mask prediction
Semantic segmentation classifies every pixel in the image into one of the classes of interest. The problem in this article is simplified to binary segmentation, since the aim is to separate out a single class, namely, lesions. Segmentation may be thought of as a pixel-wise classification that requires object localization and boundary detection at the same time. Localization and boundary detection require different image resolutions and network receptive fields (the extent of an image exposed to a single neuron within the model). Predicting object location is better handled at a scaled-down image size because the network's receptive field can observe more of the image context. In contrast, detecting fine edges and thin structures is better handled at a scaled-up image size, leading to a smaller receptive field. Therefore, multi-scale inference is an effective means to address both of these underpinning segmentation requirements. The challenge is how to combine the multiple-scale predictions effectively. The simplest way is to combine the results with averaging or max pooling. A more effective approach is to compute a weighted average of the scale-level predictions based on pixel-level weight maps learned within the model. HMSANet uses the second approach and hierarchically combines the multiple scale predictions using the learned weight map, also called an attention map. This model can learn the relative weighting between adjacent scales during training and enables the inclusion of other scales during inference on the test images.
HMSANet is adopted in this study for lesion segmentation because its multi-scale, high-resolution learning facilitates lesion localization and accurate boundary detection, especially as lesions appear in different sizes and shapes. Additionally, our results presented in Section 6.1 show that HMSANet outperforms other segmentation methods on the lesion segmentation task.
The way the HMSANet model is adopted for lesion segmentation is shown in Figure 2 (upper part), which is the first step of the proposed methodology. The HMSANet model structure is depicted in Figure 5, in which the lesion mask is inferred using three frame scales. These image scales pass through a network trunk for both scale-level lesion mask and attention map inference. The High-Resolution Network with Object-Contextual Representations (HRNet-OCR) model with a ResNet-101 baseline (Yuan et al., 2020) is the best-performing scale-level trunk for the HMSANet model, showing competitive performance on several semantic segmentation benchmarks. As shown in Figure 5, the scale-level mask predictions are combined to generate the final lesion mask by applying a chain of element-wise multiplications between the attention maps (a_n) and the mask predictions (M_n), followed by element-wise additions across the scales. The chain starts at the lowest scale of the image, namely Scale 1 in Figure 5, which captures the most global features, and is further refined for details at the following higher scales in order (Scales 2 and 3). Since lower scales take precedence, they take out their contribution share (0 < a_n(i, j) < 1), higher (whiter) at pixels of increased confidence, and pass the remaining attention (1 - a_n(i, j)) to the following higher scales. Specifically, the final predicted mask (M) is calculated by Equation (1), in which U is bilinear upsampling and D is downsampling.
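Based on the description above, the three-scale hierarchical combination can be sketched as follows, where the exact grouping and upsampling placement are assumptions consistent with the chain described in the text, and ⊙ denotes element-wise multiplication:

```latex
M \;=\; U\!\left(a_1 \odot M_1\right)
   \;+\; \bigl(1 - U(a_1)\bigr) \odot
   \Bigl[\, U\!\left(a_2 \odot M_2\right)
   \;+\; \bigl(1 - U(a_2)\bigr) \odot M_3 \Bigr]
```

Here $M_1$ and $a_1$ come from the lowest scale (computed on the downsampled input $D(x)$), $U$ brings lower-scale maps up to full resolution, and each scale contributes its share $a_n$ while passing the remaining attention $1 - a_n$ to the next higher scale.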
The HMSANet model is trained using the cross-entropy loss function, a batch size of 1 per GPU, image scales of 0.5 (scaled down to half the size) and 1.0, and a stochastic gradient descent optimizer with a learning rate of 0.01, a momentum of 0.9, and a weight decay of $5\times10^{-4}$. These segmentation hyperparameters are taken from the base paper, in which they achieved state-of-the-art performance, and they also showed the best performance on our validation set. We used four NVIDIA GeForce RTX 2080 Ti GPUs and the PyTorch library to train the model. The 2729 COVID-19 frames and their ground truth lesion masks are split into training, validation, and test sets of sizes 2329, 200, and 200, respectively. After evaluating the segmentation performance, the trained segmentation model is employed to predict COVID-19 lesion masks for all the images without lesion masks, regardless of their class. Then, all the images are paired with their corresponding masks to be used as the ground truth for the network's attention map in the MGA module, as laid out in the next section.

Classification Model for COVID-19 diagnosis
The lightweight Residual Network (He et al., 2016) with 18 layers (ResNet18) is selected in this work to serve as the backbone of our COVID-19 classification architecture. Residual networks resolve the vanishing gradient and performance degradation problems of deep networks through skip connections, also known as residual connections. Specifically, ResNet18 is chosen for its lightweight architecture, computational efficiency, and competitive performance in COVID-19 diagnosis (Helwan et al., 2021; Pham, 2020). The ResNet18 architecture is our baseline model but without attention. We have embedded CBAM (Woo et al., 2018) as the attention module in the ResNet18 architecture to enhance the activation of discriminative parts of the input image. For the second step of the proposed methodology, namely, the lower part of Figure 2, the more detailed structure is shown in Figure 6. Our classification model's network structure consists of the following components:
1. a convolutional layer (7×7 filter size, 64 filters, and a stride of 2) to learn 64 filters,
2. a max pooling layer (3×3 filter size and a stride of 2) to reduce the input spatial size,
3. four residual stages (four successive convolutional layers with two residual connections and the same number of filters, distinguished by color in Figure 6) to allow information flow between layers while gradually reducing the spatial size and learning more filters; CBAM is embedded only in the first three residual stages to save computation,
4. an average pooling layer to spatially down-sample the feature map into a vector, and
5. a fully connected layer at the end for classification.
Each convolutional layer outputs a 3D tensor called a feature map, with (height, width) as the spatial axes and multiple output channels (C) based on the number of filters. The feature maps of the convolutional layers in each residual stage have the same dimension. From one residual stage to the next, the feature maps' height and width are halved (noted by /2 in Figure 6) by convolution stride, and the output channels are doubled (64, 128, 256, and 512, respectively). The attention module's role is to reweight the feature map. Since the feature maps are 3D tensors, the reweighting can be performed spatially (by a spatial attention module) or over the channels (by a channel attention module). The spatial attention module assigns higher weights to more informative parts of the input, while the channel attention module weights the channels based on their relevance and importance by multiplying the channel weights with the feature map. CBAM applies consecutive channel and spatial attention (sub)modules, which is shown to be the best-performing combination.
The ResNet18 model with embedded CBAM is our baseline with attention but without direct supervision of attention map learning. In addition to applying attention reweighting, our proposed model uses an MGA module to directly supervise the spatial attention map of one of the three CBAMs by the predicted masks. This extra supervision makes our method multi-task learning because we are jointly optimizing the two tasks of classification and attention to lesions through two distinct loss functions specified in the following paragraphs. Figure 6 shows our classification model's network structure when the MGA module is placed at the third residual stage. The optimal placement of the MGA module has been studied in Section 6.3.
In order to create the spatial attention map, the spatial attention module average-pools and max-pools the channel-attended feature map of dimension (H, W, C) to aggregate and squeeze its channel information into two (H, W, 1)-dimensional tensors. Then, these two poolings are concatenated along the channel dimension (H, W, 2) and transformed into the spatial attention map via a convolutional layer with a single output channel, a padding of 3, a filter size of 7×7 ($f^{7\times 7}$), and a sigmoid activation function ($\sigma$), as formulated in Equation (2). Therefore, the spatial attention map is a one-channel tensor with the same height and width as its corresponding feature map (H, W, 1), in which all the values are between zero and one.
As indicated by Equation (3), the image features extracted at the j-th residual stage, denoted by $f_j \in \mathbb{R}^{H\times W\times C}$, are spatially multiplied by the spatial attention map $SA \in \mathbb{R}^{H\times W}$ to construct the attended features $f_j^{att}$. H, W, and C denote height, width, and the number of channels, respectively. In the element-wise multiplication of the broadcast (copied) one-channel spatial attention map with the multi-channel features, i signifies the channel index.
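The spatial attention computation of Equation (2) and the reweighting of Equation (3) can be sketched roughly as follows. This is a NumPy illustration only: a fixed 7×7 mean filter over the summed poolings stands in for the learned 7×7 convolution over the concatenated [avg-pool, max-pool] channels, which is an assumption of this sketch, not the paper's trained layer.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(f):
    # f: (H, W, C) channel-attended feature map.
    avg_pool = f.mean(axis=2)   # (H, W) average over channels
    max_pool = f.max(axis=2)    # (H, W) maximum over channels
    # A fixed 7x7 mean filter over the summed poolings stands in for the
    # learned 7x7 convolution over the concatenated poolings (assumption).
    sa = sigmoid(uniform_filter(avg_pool + max_pool, size=7))  # values in (0, 1)
    f_att = sa[..., None] * f   # Equation (3): broadcast over channels
    return sa, f_att

rng = np.random.default_rng(0)
f = rng.random((8, 8, 4))       # toy feature map
sa, f_att = spatial_attention(f)
```

The broadcasting in the last multiplication is the element-wise operation of Equation (3): the one-channel map is copied across all C channels.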
We directly supervise one of the spatial attention maps (SA) with the same-sized predicted lesion mask (M) from Step 1 (Section 5.1) by minimizing the pixel-wise mean squared error loss function $L_{att}$. The MGA module is intended to direct the spatial attention map's emphasis to manifestations inside the lungs and give extra attention to lesions and lung parts that resemble lesions. Since the predicted masks might not completely match the ground truth lesion masks, it is critical that our model's performance is not overly sensitive to them. The residual connection right after the CBAM module (see Figure 6) facilitates the flow of unattended features via the skip connections and helps prevent error propagation from inaccurate masks. The sensitivity of the classification performance to the predicted masks is further studied in Section 6.3.
The classification task is supervised with the cross-entropy loss between the predicted class probabilities ($\hat{y}$) and the one-hot encoded ground truth class labels ($y$) of the three classes, as stated in Equation (5).
Our proposed classifier therefore applies supervision over two tasks: the attention map, using the attention mean squared error loss ($L_{att}$), and the class label predictions, using the classification cross-entropy loss ($L_{ce}$). As represented in Equation (6), we adopted learning with uncertainty loss weighting (Kendall et al., 2018) between $L_{ce}$ and $L_{att}$ because it has shown superior performance over using fixed weights (Gong et al., 2019). This weighting scheme lets the model adjust the weight of each loss by learning the observation noise parameters $\sigma_1$ and $\sigma_2$ alongside the model weights ($W$). Smaller values of an observation noise parameter increase the contribution of its associated loss function. These noise parameters are regularized to avoid very large values, which would diminish the contribution of each task.
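The uncertainty weighting described above can be sketched as a minimal NumPy function. Parameterizing $\log\sigma$ rather than $\sigma$ is our numerical-stability choice for this sketch, not something stated in the paper.

```python
import numpy as np

def uncertainty_weighted_loss(l_ce, l_att, log_sigma1, log_sigma2):
    # Each task loss is scaled by 1 / (2 * sigma^2), and the log(sigma)
    # terms regularize the noise parameters so they do not grow without
    # bound. Learning log(sigma) instead of sigma is our own choice.
    s1, s2 = np.exp(log_sigma1), np.exp(log_sigma2)
    return l_ce / (2 * s1**2) + l_att / (2 * s2**2) + log_sigma1 + log_sigma2

# With sigma1 = sigma2 = 1 the two losses contribute equally:
total = uncertainty_weighted_loss(1.0, 1.0, log_sigma1=0.0, log_sigma2=0.0)  # = 1.0
```

In training, `log_sigma1` and `log_sigma2` would be learnable parameters optimized jointly with the network weights; a smaller noise value amplifies its task's gradient contribution.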
$$L(W, \sigma_1, \sigma_2) = \frac{1}{2\sigma_1^2} L_{ce}(W) + \frac{1}{2\sigma_2^2} L_{att}(W) + \log \sigma_1 + \log \sigma_2 \qquad (6)$$

The model is trained using an Adam optimizer with a learning rate of 0.0001, a cosine annealing scheduler, a batch size of 32, and 100 epochs with early stopping (patience of 10). These hyperparameters are tuned using Bayesian optimization (Nogueira, 2014). We used four NVIDIA GeForce RTX 2080 Ti GPUs and the PyTorch library to train our models. Table 2 specifies one example dataset split between training, validation, and testing. Since the images from a single patient are naturally dependent, all the data splits are made in a patient-aware manner to avoid performance overestimation from data leakage (Shin et al., 2016; Yang et al., 2018). Patient-aware splitting keeps the images from each unique patient together in exactly one of the train, validation, or test splits. On the other hand, having multiple slices from each patient in the training set is not problematic because it can have an effect similar to data augmentation. We also applied stratification in our splitting, which means that the splits have roughly the same proportion of each class. Patient-aware splitting is strictly adhered to; within that constraint, stratification is performed as far as feasible.
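A patient-aware split as described above can be sketched in plain Python. This is illustrative only: the function name, grouping scheme, and toy samples are ours, and class stratification is omitted for brevity.

```python
import random
from collections import defaultdict

def patient_aware_split(samples, test_frac=0.2, seed=0):
    # samples: list of (patient_id, class_label) pairs, one per CT slice.
    # All slices of a patient go to exactly one partition, so no patient's
    # images leak across the train/test boundary.
    by_patient = defaultdict(list)
    for pid, label in samples:
        by_patient[pid].append((pid, label))
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)
    n_test = max(1, int(len(patients) * test_frac))
    test_patients = patients[:n_test]
    train_patients = patients[n_test:]
    train = [s for p in train_patients for s in by_patient[p]]
    test = [s for p in test_patients for s in by_patient[p]]
    return train, test

samples = [("p1", "covid"), ("p1", "covid"), ("p2", "normal"),
           ("p3", "cap"), ("p4", "covid"), ("p5", "normal")]
train, test = patient_aware_split(samples)
```

Splitting at the patient level rather than the slice level is what prevents near-duplicate slices of one patient from appearing in both training and evaluation.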

Interpreting the model's prediction
So far, we have introduced our proposed classification model, which provides the COVID-19 diagnosis prediction but without interpretability. Achieving highly accurate but uninterpretable decisions makes deep learning models less trustworthy and has an adverse impact on their clinical applications. Although deep learning has a black-box nature, much recent work has investigated the flow of information and input-output connections in deep neural networks to shed light on how they predict. Such explanation methods help increase trust in the model when it predicts correctly and identify the failure modes (such as data corruption and learning wrong patterns) when it predicts incorrectly. Gradient-based attribution methods (Shrikumar et al., 2017; Simonyan et al., 2013; Sundararajan et al., 2017; Zeiler & Fergus, 2014) provide input-specific explanations of deep learning predictions by assigning an attribution value to each input feature. Each gradient-based attribution method has a slightly different formulation for identifying the contribution of each feature to the model's output, backpropagating the output prediction and decomposing it over the input image. The result is an attribution map: an image of the same size as the input containing pixel-level contribution scores.
Attribution maps are often shown as heatmaps, representing the attribution map with colors. For instance, red indicates features that contribute positively to the activation of the target output; the blue color distinguishes features that have a suppressing effect on it; and the white color indicates the insignificance for the derived output. In this work, we use two prominent attribution methods called Integrated Gradient (Sundararajan et al., 2017) and DeepLIFT (Shrikumar et al., 2017) methods to highlight disease features in the CT images. The Integrated Gradient method calculates the integral of gradients of each feature along the path from a baseline (such as a black image) to input, while DeepLIFT is its faster approximation.
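As a concrete illustration of the Integrated Gradients idea, the following sketch applies a midpoint Riemann-sum approximation to a toy logistic model. The model, names, and step count are illustrative only, not the paper's network; a useful sanity check is the completeness property, i.e., the attributions sum approximately to f(x) − f(baseline).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def integrated_gradients(x, baseline, w, steps=200):
    # Midpoint Riemann-sum approximation of Integrated Gradients for a
    # toy logistic model f(v) = sigmoid(w . v); illustrative only.
    alphas = (np.arange(steps) + 0.5) / steps   # midpoints along the path
    total = np.zeros_like(x)
    for a in alphas:
        point = baseline + a * (x - baseline)
        s = sigmoid(w @ point)
        total += s * (1.0 - s) * w              # gradient of f at the path point
    # Per-feature attribution; by completeness, the attributions sum
    # approximately to f(x) - f(baseline).
    return (x - baseline) * total / steps

x = np.array([1.0, 2.0, -1.0])
baseline = np.zeros(3)                          # black-image analogue
w = np.array([0.5, -0.3, 0.8])
attr = integrated_gradients(x, baseline, w)
```

In practice, one would use a library implementation against the trained network rather than this hand-rolled loop; the sketch only makes the path-integral construction concrete.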

Results and discussion
This section presents the performance of the lesion mask segmentation method (Section 6.1) and the proposed classification model with attention (Section 6.2). Additionally, Section 6.3 covers the ablation studies to determine the placement of the MGA module, the effectiveness of MGA classification with different training set sizes, and sensitivity of the classification performance to the predicted masks. Next, Section 6.4 presents the interpretability of the decisions of our deep learning model. Finally, Section 6.5 discusses our work from the physician's perspective.

Segmentation performance
The HMSANet architecture, presented in Section 5.1, is employed as the mask prediction method because of its state-of-the-art segmentation performance. We compared HMSANet's performance with the UNet (Ronneberger et al., 2015), SegNet (Badrinarayanan et al., 2017), and DeepLabV3 (Chen et al., 2017) architectures, which are among the most widely used segmentation methods in the literature. As reported in Table 3, HMSANet achieves the highest intersection over union (IOU), Dice coefficient, precision, and recall on the test set (consisting of 200 COVID-19 frames). The segmentation models' complexity is assessed by their number of trainable parameters and floating-point operations (FLOPs). These columns point to the tradeoff between the models' computational cost and accuracy. Figure 7 provides a qualitative comparison of the predicted masks on five sample test images. According to this figure, HMSANet's predicted masks most closely resemble the ground truth masks, making it our best choice for lesion prediction.

Classification performance

Table 4 compares our proposed classification model (Section 5.2) with two baseline models without and with CBAM attention modules (ResNet18 and ResNet18 + CBAM), and four state-of-the-art benchmark models (ResNet50, MobileNetV2, VGG16, and DenseNet121). For the sake of consistency, all the models are trained from scratch. The performance metrics are averaged over three random patient-aware stratified test splits (one of which is reported in Table 2). The four benchmark architectures are among the most used deep learning architectures and show the best performance on our dataset. ResNet50, VGG16, and DenseNet121 are also reported to achieve high COVID-19 diagnosis accuracy from CT scans by He et al. (2020). MobileNetV2 (Sandler et al., 2018) is included in the comparison for its competitive speed-accuracy tradeoff, which is useful for mobile applications.
For our medical application, achieving the highest diagnosis accuracy and F1 score takes precedence over training speed. According to Table 4, our proposed approach achieves the highest accuracy, F1 score, and recall at a reasonable speed. In particular, the recall, also known as sensitivity, shows significant improvement, since the better focus on the lesions has boosted the detection of COVID-19 cases. Regarding average recall, our model is the best and outperforms the second and third best-performing methods by 2.06% and 3.34%, respectively. The ROC curve measures the tradeoff between the true-positive rate (sensitivity) and the false-positive rate (1 − specificity), and its area under the curve (ROC AUC, also referred to as AUC) has a meaningful interpretation for disease classification and is extensively used in medical diagnosis. The F1 score, which is the harmonic mean of recall and precision, is another reported metric. Our model's enhanced AUC and F1 score indicate that the increased sensitivity did not come at the expense of more false positives. The proposed multi-task learning improves generalization by leveraging the domain-specific knowledge contained in the training data, making the model capable of learning a more meaningful representation. Table 4 also reports the measured minibatch training time using four NVIDIA GeForce RTX 2080 Ti GPUs, the number of parameters, and the FLOPs of a single forward pass. While the time column compares the speed of the models during training, FLOPs measure the computational overhead at inference time. According to this table, the proposed model notably improves the classification performance while keeping the number of parameters and FLOPs at the same level as the plain ResNet18 model. Moreover, despite training two tasks, our approach is 35% quicker than the DenseNet121 model, which is less memory efficient due to its dense concatenation operations.

Ablation studies
Since the attention supervision can be applied inside any residual stage (Figure 6), the first ablation study determines the best placement of the MGA module.
The results in Table 5 indicate that the best performance is achieved by placing the MGA module at the third residual stage. One possible explanation, from the perspective of multi-task learning, is that increasing the number of shared hidden layers between highly related tasks helps performance (Caruana, 1997) and representation learning. Since the number of parameters and forward-pass FLOPs are not impacted by the MGA module placement, all the reported models in Table 5 have the same values for these columns as the proposed row of Table 4. However, the backward-pass FLOPs slightly increase as the MGA module moves to deeper residual stages. Consequently, the training time of the third-residual-stage model is slightly higher. Additionally, comparing the results in Table 5 with the other models' performance in Table 4 shows that using the MGA module at any location improves the overall classification performance. The second experiment investigates the effectiveness of MGA classification for different training set sizes. We simplified the model and ran this experiment on a ResNet18 base model with only one embedded CBAM at the first residual stage to save computation. Specifically, the single-task and multi-task models are identical, except that the spatial attention map of the CBAM is supervised with the predicted lesion masks in the multi-task case. Figures 8(a) and 8(b) show the test classification performance comparison, and Figure 8(c) shows the IOU between the lesion masks and their binarized attention maps for the single- and multi-task classification at different training set sizes. While the test set is separated and fixed, the remaining data is split between training and validation according to the training data size. It is worth noting that the percentages are not exact, since the data should be divided in a patient-aware and stratified manner. Consistent with the findings of Crichton et al. (2017) and Gong et al.
(2019), the results in subfigures (a) and (b) show that multi-task learning improves performance, especially when the training data are small yet sufficient for learning to happen. The improvement is largest between 20% and 60% of the training data; 10% is too small for learning, and for large training sets there is little difference in performance, yet the generalization and interpretability advantages remain. In other words, the proposed multi-task learning stands out when the training set size is sufficient but relatively small. Moreover, a 70-30 data split between training and validation gave the best performance; therefore, it is the ratio we used for comparing all the models. Figure 8(c) employs intersection over union as a measure to quantify and compare the focus on the lesions. Each point is calculated by averaging the IOU results of the test images with non-zero masks. The IOU of the proposed method is significantly better than the baseline's. Additionally, increasing the training data improves the focus of both learners. The same patterns can be observed from the attention map visualizations in Figure 9. This figure corroborates that, as the training data increase, the attention maps of both the single-task learner and the multi-task learner converge to the lesions. However, the latter starts to converge using only 30% of the data, while the improved focus emerges in the former only after using 80% of the data. Therefore, attention supervision helps the attention map converge quickly with a smaller required training set. Our last experiment interrogates the impact of error propagation from the segmentation step into the downstream classification task.
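The IOU measure behind Figure 8(c) can be sketched as a minimal NumPy function; the 0.5 binarization threshold and all names here are our illustrative choices.

```python
import numpy as np

def attention_iou(attention_map, lesion_mask, threshold=0.5):
    # Binarize the attention map and compare it with the binary lesion
    # mask via intersection over union.
    att = attention_map >= threshold
    mask = lesion_mask.astype(bool)
    union = np.logical_or(att, mask).sum()
    if union == 0:
        return 1.0   # both empty: treat as perfect agreement
    return np.logical_and(att, mask).sum() / union

att = np.array([[0.9, 0.2], [0.7, 0.1]])
mask = np.array([[1, 0], [0, 0]])
iou = attention_iou(att, mask)   # intersection 1, union 2 -> 0.5
```

Averaging this value over test images with non-zero masks yields one point of the curve.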
We test this impact by applying the following transformations to the predicted lesion masks:
Erosion: shrinking the lesion mask by removing pixels on its boundaries (of size 9).
Dilation: expanding the lesion mask by adding pixels to the lesion boundaries (of size 9).
Shifting: displacing the lesion mask (by 9 pixels downwards).
Our overall results, reported in Table 6, show that the model is not sensitive to the exact boundaries of the lesions, and the mask guidance yields a performance gain as long as the masks point to the overall location of the lesion. According to Table 6, the performance gain is higher for the dilated predicted masks. On the other hand, a comparison between the results in Table 4 and Table 6 indicates that high levels of mask erosion result in only slightly better performance than the single-task model, and mask shifting in the same level of performance. This robustness to mask changes is attributed to the residual connections right after the CBAM module, which facilitate the flow of unattended features (features before the attention weights are applied). These skip connections may prevent error propagation from inaccurate masks. Also, using uncertainty loss weighting between the classification cross-entropy and attention mean squared error losses gives the network enough flexibility to adjust the weight of each loss function and de-emphasize the attention loss if it is not reinforcing the classification task. Therefore, using an imperfect mask does not result in performance below that of single-task learning, and the only downside of an unhelpful mask is delayed convergence.
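The three mask perturbations above can be sketched with SciPy's binary morphology operations. This is a minimal illustration; the toy lesion, zero-filling at the shifted edge, and function names are ours.

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def perturb_mask(mask, mode, size=9):
    # Erosion and dilation use a size x size structuring element; shifting
    # moves the mask `size` pixels downwards (zero-filled at the top edge).
    if mode == "erode":
        return binary_erosion(mask, structure=np.ones((size, size)))
    if mode == "dilate":
        return binary_dilation(mask, structure=np.ones((size, size)))
    if mode == "shift":
        shifted = np.zeros_like(mask)
        shifted[size:, :] = mask[:-size, :]
        return shifted
    raise ValueError(mode)

mask = np.zeros((32, 32), dtype=bool)
mask[10:22, 10:22] = True            # a 12x12 toy lesion
eroded = perturb_mask(mask, "erode")
dilated = perturb_mask(mask, "dilate")
shifted = perturb_mask(mask, "shift")
```

Feeding such perturbed masks to the MGA supervision, as in Table 6, probes how much the classifier depends on exact lesion boundaries versus overall lesion location.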
Interpretability using attribution maps

Figure 10 compares the attribution maps of our model with those of the other models for five COVID-19 frames, using (a) the DeepLIFT (Rescale) (Shrikumar et al., 2017) and (b) the Integrated Gradients (Sundararajan et al., 2017) attribution methods. The red and blue regions (pointing to influential features) of our model's attribution maps highly overlap with the lesion regions (represented in red in the lesion mask). In other words, the lesion regions contribute strongly to our model's decision, while the other models are less focused on the lesions. This visualization further emphasizes the effectiveness of our multi-task learning approach in improving the model's attention to the relevant regions. This is because, compared to the single task (classification), the two integrated tasks (namely, attention supervision and classification) can provide evidence for the relevance or irrelevance of specific features.
Moreover, DeepLIFT (Rescale) and Integrated Gradients generated highly correlated attribution maps, consistent with past work, while DeepLIFT is considerably faster to execute. Current attribution methods score features independently and do not explain how the network combines them to produce the answer, but the DeepLIFT (RevealCancel) method takes such dependencies into account. For future exploration, it would be interesting to derive and compare DeepLIFT (RevealCancel) attribution maps, which are claimed to outperform the other two techniques, when PyTorch support is available.

Discussion from the physician's perspective

CT imaging plays an important role in diagnosing COVID-19 and guiding its management (Pan et al., 2020). The understanding of COVID-19-related abnormalities in CT images has evolved since the onset of the pandemic. Employing intelligent systems that can accumulate and share knowledge about emerging diseases like COVID-19 across the globe may expedite understanding of the disease and facilitate its diagnosis and management. The current work showcases the possibility of accumulating knowledge about CT scan findings in intelligent machines and using them to make interpretable diagnoses by focusing on the abnormalities.
Even though CT scans can differentiate between most cases of CAP, COVID-19, and Normal, differential diagnosis of a broader range of disease classes necessitates the inclusion of clinical and paraclinical examination results (Parekh et al., 2020). Deep learning models can distinguish between many class labels and learn from various data formats (e.g., images, text, and tabular data) if an adequate dataset is available. Therefore, building a comprehensive and integrated database of patients' information (CT scans, clinical and paraclinical results, etc.) is a requirement for creating more intelligent and practical systems that can address the following challenges:
Cases with non-typical CT findings: when the signs in the CT scan are non-typical or nonspecific, accompanying clinical and paraclinical symptoms are required.
Cases with multiple medical conditions: Usually, the high-risk patients simultaneously present various medical conditions (e.g., diabetes, cardiovascular disorders, immunosuppressive therapy, etc.). Therefore, such cases require identifying more than one complication.

Conclusion and future direction
This article presented the MGA-based classification model, a novel multi-task learner for COVID-19 diagnosis based on CT scan images. Specifically, the proposed model leveraged the predicted lesion masks to impose extra supervision on the classifier's attention module. Since attention supervision and classification are complementary tasks, their multi-task learning yielded a significant performance improvement over the single-task baseline (i.e., the baseline model without the MGA module) and state-of-the-art deep learning methods in image classification. Our experiments also showed that the proposed method benefits from improved data efficiency and interpretability, which are especially valuable in the medical domain, where data is often limited and reliability is paramount. Additionally, in this work, a large, nationally diverse, and broadly representative COVID-19 CT slice classification dataset has been curated for conducting experiments and serving as a benchmark dataset for the research community. The quality of our dataset is ensured by using slices with patient identification and precise labels.
This research could be extended with an MGA module that segments both the lungs and the lesions to improve learning inside the lungs overall, especially for normal cases. Additionally, as most of the literature examines only the two groups of COVID-19 and non-COVID-19, the effect of more precisely categorized disease classes on COVID-19 detection could be further investigated.