PCXRNet: Condense attention block and Multiconvolution spatial attention block for Pneumonia Chest X-Ray detection

Abstract—Pneumonia is one of the leading causes of child deaths worldwide. Many recognition approaches based on convolutional neural networks have been proposed for children's chest X-ray images in recent years. However, few of them make good use of the potential inter- and intra-relationships of feature maps. Considering this challenge, this paper proposes an attention-based convolutional neural network, named PCXRNet, for pneumonia chest X-ray diagnosis. To utilize the information in the channels of feature maps, we add a novel condense attention module (CDSE) containing two steps: a condense step and a squeeze-excitation step. Unlike traditional channel attention modules, CDSE first downsamples the feature map channel by channel to condense the information before the squeeze-excitation step, in which channel weights are calculated. In order to make the model pay more attention to informative spatial parts of every feature map, we propose a multi-convolution spatial attention module (MCSA), which reduces parameters and introduces more nonlinearity. CDSE and MCSA can work in series to complementarily dig up useful information within and between feature maps. We conduct extensive experiments on ChestXray2017, and compared with existing methods, PCXRNet reaches remarkable performance with an accuracy of 97.436%, recall of 96.838%, precision of 97.705%, and F1-score of 97.241%. We also use two further pneumonia datasets, COVID-19 and tuberculosis, on which PCXRNet achieves state-of-the-art results. A gradient-weighted class activation mapping algorithm is applied to further demonstrate the effectiveness of our method.

Fig. 1. Examples of chest X-rays in ChestXRay2017. Normal, viral pneumonia and bacterial pneumonia are shown on the left, middle and right, respectively.

I. INTRODUCTION
In 2016, more than 880,000 children died of pneumonia [1], and the number of child deaths it caused exceeded those from malaria, AIDS, and yellow fever combined. The incidence of childhood pneumonia shows a significant difference between developed and underdeveloped countries [2], [3]. Childhood pneumonia may cause respiratory failure and can easily become life-threatening if not treated quickly enough. However, most patients can be cured through early detection and treatment [4], [5].
Chest X-ray images can reveal lung lesions, which helps doctors judge the severity of lung infections [6]. Relying solely on radiologists to diagnose childhood pneumonia has the following problems. Firstly, the ratio of doctors to patients is severely imbalanced in low-income countries, where childhood pneumonia is most severe. Secondly, diagnostic accuracy is highly influenced by the radiologist's experience and carefulness. Finally, the increasing number of infected patients places a heavy burden on radiologists, and fatigue may impact diagnostic accuracy [7]. Therefore, it is indispensable to use computer-aided diagnosis technology to help radiologists increase detection speed and relieve this burden.
Classical Machine Learning (ML) methods cannot extract the high-dimensional features of medical images well and have poor accuracy and generalization ability [8], [9]. Deep learning, particularly with convolutional neural networks (CNNs), has achieved great success in computer vision. Compared to traditional image processing algorithms, CNNs require relatively little pre-processing and can be trained in an end-to-end way. Therefore, CNNs are usually viewed as general feature extractors with no need for prior knowledge and human intervention. CNNs have achieved remarkable performance on many tasks, such as classification [10], object detection [11] and semantic segmentation [12]. Deep learning is also used to detect lung diseases with the release of medical imaging datasets. Lessmann et al. [13] proposed a two-step method to detect lung disease in low-dose chest CT: the first CNN identifies potential lesion locations using dilated convolution, and the second CNN identifies true calcifications. Aledhari et al. [14] explored the effects of three basic CNNs, VGG16, Inception v3 and ResNet-50, on pneumonia detection; the fine-tuned VGG16 achieved the highest test accuracy, 75%.
Although computer-aided diagnosis technology has made some progress in distinguishing childhood pneumonia, it is still challenging to apply deep learning to clinical diagnosis due to the complexity of pulmonary lesions. First, the potential relationship between feature maps is ignored by traditional CNN algorithms, which do not suppress the useless information in the feature maps. Second, unlike adults, children receive lower radiation doses and show poorer treatment compliance [15], which makes the lesions in chest X-ray images blurred and deformed. Third, previous CNN-based pneumonia detection methods generalize poorly due to overfitting.
Considering the above-mentioned problems, we propose an attention-based convolutional neural network for the rapid detection of childhood pneumonia. For the first problem, this paper introduces a novel channel attention module, named the condense attention block (CDSE), to generate more accurate channel weights for feature maps. Compared with traditional channel attention blocks, CDSE downsamples each channel of the input feature map to an appropriate size using depthwise convolution before the squeeze-excitation block. A residual structure is adopted by CDSE to enhance the representation ability of the network. For the second problem, we propose a multi-convolution spatial attention module (MCSA), a lightweight module that uses three small convolutions instead of one large convolution. MCSA has fewer parameters and more nonlinearity, suppressing overfitting and enhancing the ability to learn spatial information from feature maps. We embed the two proposed attention modules into a baseline network for pneumonia chest X-ray recognition, called PCXRNet, which has better generalization performance and can more effectively identify the lesions in chest X-ray images. As shown in Fig. 1, the chest X-ray examples include normal, viral, and bacterial pneumonia. The main contributions of this paper include the following.
(1) A novel channel attention module (CDSE) is proposed to explore appropriate channel weights for input feature maps. CDSE first condenses information using depthwise convolution, gradually downsampling the feature maps of the baseline network's different stages to a fixed size with a dynamic adaptive strategy, and then feeds the condensed information to a squeeze-excitation block. Experimental results show that embedding CDSE into the baseline enhances its generalization ability and brings a significant improvement in pneumonia recognition.
(2) A lightweight multi-convolution spatial attention module (MCSA) is proposed to localize lesions on feature maps while suppressing useless spatial information. CDSE and MCSA are two complementary attention modules. Extensive ablation results show that integrating the two attention modules in a sequential manner makes the network more stable and suppresses the transmission of useless information.
(3) Additional experiments are conducted on a COVID-19 dataset with 18,479 X-ray images and a tuberculosis dataset with 7000 images in order to verify the effectiveness of the proposed PCXRNet. As a result, the network not only performs best in childhood pneumonia but also achieves state-of-the-art results on the COVID-19 dataset and tuberculosis dataset.

II. RELATED WORKS
In this section, we present the related works on pneumonia recognition and attention mechanisms in deep learning.

A. Automatic pneumonia diagnosis in deep learning
Deep learning applications in medical diagnosis continue to be catalyzed and expanded by the increasing release of medical image datasets [16], [17], [18], [19], [20]. Trained with a large amount of data, CNNs can achieve the same accuracy as radiologists but with higher diagnostic efficiency [21]. Deep learning algorithms can alleviate the radiologist shortage in areas with insufficient medical resources. Kermany et al. [22] collected and labeled 5856 chest X-ray images, including normal, viral, and bacterial pneumonia; an Inception V3 pretrained on the ImageNet dataset was used to recognize pneumonia. Hu et al. [23] proposed the multi-kernel depthwise convolution (MD-Conv), which contains different filter sizes in one convolution layer. Experiments on the pneumonia dataset were conducted to verify its generalization ability, and the proposed network achieved an accuracy of 93.4%. Masud et al. [24] identified the presence of pneumonia by analyzing chest X-ray images. Augmentation was used to balance the number of pneumonia radiographs. The chest X-ray images, resized to 256×256, were used to extract features with a statistical and a deep learning algorithm; these features were combined and sent to a random forest for classification. Previous works have proved that deep learning networks can effectively diagnose childhood pneumonia from chest X-ray images. However, they do not make full use of the information in feature maps and hence are difficult to promote in clinical diagnosis.
Corona Virus Disease 2019 (COVID-19) rapidly became a global epidemic because SARS-CoV-2 is highly contagious. It is therefore necessary to apply deep learning technology to diagnose COVID-19 from chest radiographs. Wang et al. [25] used 1136 images, including 723 COVID-19 positives, to train a deep learning model. The dataset was collected from five hospitals, and several senior physicians were organized to choose the required data. The proposed system first extracted the lung lesion regions using 3D U-Net++, and the segmentation results were then sent into ResNet50 for classification. Ghoshal et al. [26] proposed a multi-classification network for diagnosing COVID-19, pneumonia, and lung cancer from a joint dataset including chest X-ray and CT images. The proposed network consists of four architectures: VGG19-CNN, ResNet152V2, ResNet152V2 + GRU, and ResNet152V2 + Bi-GRU. The VGG19-CNN network achieved an accuracy of 98.05%. Wang et al. [27] proposed COVID-Net, which was tailored to diagnose COVID-19 from chest X-ray images. They chose a lightweight network design strategy, which reduced the redundancy of the network and accelerated inference. Zhang et al. [28] applied ResNet-18 pretrained on ImageNet as the backbone to extract high-level features, which were fed to a classification head and an anomaly detection head respectively to detect COVID-19. The proposed method achieved a sensitivity of 96.0%.
Tuberculosis (TB) is one of the primary infectious diseases that seriously threaten human health. There were more than 10 million cases of active TB in 2018, resulting in 1.5 million deaths. Lakhani et al. [30] applied AlexNet and GoogLeNet to classify chest X-ray images as tuberculosis or normal, demonstrating that pretrained models have better classification efficiency than randomly initialized models. Kim et al. [31] used a pretrained ResNet50 to check whether chest radiographs show possible TB. The proposed method reached a sensitivity of 85% and a specificity of 76%. They visualized the results using class activation mapping to show the regions of interest.

B. Attention mechanisms in deep learning
In recent years, the attention mechanism has been widely applied in computer vision tasks and achieved remarkable results [33], [34], [35]. Given that different channels of feature maps in traditional CNNs contain much redundant information, an attention module assigns lower weights to useless information and higher weights to helpful information, enhancing representation capacity and improving classification performance without increasing the depth of the network. Hu et al. [36] proposed SENet to extract channel weights of feature maps, which proved that weakening useless information can increase the representation power of a deep learning network. SENet significantly reduced the classification error at little additional computational cost.
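For reference, the squeeze-excitation scheme of [36] can be sketched in a few lines of PyTorch; the layer sizes below are illustrative, not taken from the original paper:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Minimal squeeze-and-excitation channel attention (after Hu et al. [36])."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: global average pooling
        self.fc = nn.Sequential(             # excitation: bottleneck FC layers
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)  # per-channel weights
        return x * w                                           # reweight channels

x = torch.randn(2, 64, 56, 56)
assert SEBlock(64)(x).shape == x.shape
```

The sigmoid keeps each channel weight in (0, 1), so uninformative channels are attenuated rather than hard-pruned.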
Zhang et al. [37] proposed a novel Shuffle Attention (SA) module to decrease the computational overhead of the attention module, applying both spatial attention and channel attention to the input feature maps. The performance boost of SANet was more than 1.34% in accuracy on ImageNet-1k. The attention mechanism is helpful for capturing small lesion information in medical radiographs. Li et al. [38] proposed a two-stage network for diagnosing pneumonia: a pretrained U-Net extracts the lung ROI in the first stage, and SE-ResNet34 is then adopted as the backbone, with a designed 1×1 convolution to fuse features. Xi et al. [39] collected 2186 CT samples from 1588 patients for identifying COVID-19 and proposed a novel online attention module with a 3D CNN. The proposed method reached an accuracy of 87.5% and a sensitivity of 86.9%. Accordingly, it is practical to embed channel attention and spatial attention in deep learning networks for diagnosing pneumonia.

III. PROPOSED METHOD

A. Framework of PCXRNet
PCXRNet uses ResNet-34 as the backbone, which has an input stem, four subsequent stages containing several residual blocks each, and a classifier. As shown in Fig. 2, in the input stem, the chest X-ray image of size 224 × 224 × 3 is first transformed by a convolution operation into feature maps F1 ∈ R^(112×112×64), with reduced width and height and an increased number of channels. Then, to reduce the memory footprint and training time, these feature maps are downsampled by max pooling to obtain the final stem output Fstem_out ∈ R^(56×56×64).
Subsequently, the stem output Fstem_out is fed into the four stages of the backbone network to extract semantic features from low level to high level. Each stage has a specific number of residual blocks, each consisting of two types of connections, a residual connection and a skip connection. We let Fres ∈ R^(H×W×C) denote the output of the residual connection in every residual block, where H, W and C are the height, width and number of channels of the feature maps, respectively. Taking Fres as input, the CDSE attention module explores the inter-channel relationship of features by learning channel-wise attention, generating weighted feature maps Fcdse ∈ R^(H×W×C). Specifically, unlike traditional channel attention mechanisms, CDSE is designed with two components, i.e. the Channel Feature Extract (CFE) block and the mSE block: CFE first applies depthwise convolution to condense information channel by channel, and mSE then computes each channel weight with global average pooling and 1 × 1 convolution layers to select more useful features and suppress less informative ones. Next, given the feature maps Fcdse, the multi-convolution spatial attention (MCSA) module indicates the importance of each spatial position by learning the inter-spatial relationship of features, multiplying the extracted spatial attention weights with the input feature maps and producing feature maps Fmcsa ∈ R^(H×W×C). We aggregate the two attention modules, CDSE and MCSA, in sequential order to complementarily exploit the inter-channel and inter-spatial relationships of features, which brings a significant performance improvement in the recognition of chest X-ray images, as supported by the experimental results.
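To make the data flow concrete, the modified residual block can be sketched as follows; the attention modules are left as pluggable callables, and the basic-block layout follows standard ResNet-34 rather than any detail specific to PCXRNet:

```python
import torch
import torch.nn as nn

class AttnResidualBlock(nn.Module):
    """ResNet basic block with CDSE and MCSA applied in series to the
    residual branch before the skip connection is added (structural sketch;
    the attention modules are stand-in callables here)."""
    def __init__(self, channels: int, cdse=None, mcsa=None):
        super().__init__()
        self.res = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.cdse = cdse or nn.Identity()  # channel attention plugs in here
        self.mcsa = mcsa or nn.Identity()  # then spatial attention
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        f_res = self.res(x)                  # residual-connection output Fres
        f_att = self.mcsa(self.cdse(f_res))  # CDSE -> MCSA in series
        return self.relu(f_att + x)          # add the skip connection

x = torch.randn(1, 64, 56, 56)
assert AttnResidualBlock(64)(x).shape == x.shape
```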

B. Condense channel attention module
(1) Structure: Constructed from several CFE blocks and one mSE block, the CDSE module is designed to select useful information and suppress useless information in chest X-ray images more effectively, improving the classification performance of the network. Each CFE block applies depthwise convolution, IEBN [40] and the Mish function [41] to condense the information of every channel of the feature maps. We replace the ReLU activation in the original SE block with the Mish function to obtain the mSE block, considering Mish's better gradient transmission, representation ability and generalization. The mSE block then performs the squeeze and excitation steps to produce channel-wise attention weights. Note that we adopt a dynamic adaptive strategy to determine the number of CFE blocks, making the smallest output of the CFE blocks 7 × 7 in every residual block of the four stages, which is consistent with the smallest output of the last stage without the CDSE module. Therefore, 3, 2 and 1 CFE blocks are used in the first three stages, respectively, and none in the last stage.
Given an intermediate feature map Fres from the residual connection as input Fin, the CFE block produces output Fout ∈ R^((H/2)×(W/2)×C) as follows:

Fout = CFE_res(Fin) + CFE_shortcut(Fin)

where CFE_res and CFE_shortcut represent the residual connection and the skip connection of the CFE block, respectively. CFE_res uses two depthwise convolutions with a kernel size of 3 × 3 to extract features, the former with a stride of 2. The CFE block adopts IEBN [40] to normalize the convolution output, which can refine the noise produced by traditional BN, and uses Mish [41] as the activation function to introduce nonlinearity. Compared with ReLU, Mish has a smoother gradient and relieves the dying ReLU problem, so it provides better gradient transmission during training. The formula of Mish is:

Mish(x) = x · tanh(softplus(x)) = x · tanh(ln(1 + e^x))

With the input Fcfe ∈ R^(7×7×C) obtained by stacking several CFE blocks, the mSE block first squeezes the feature maps to feature vectors of size 1 × 1 × C:
z_i = (1 / (H × W)) Σ_{h=1}^{H} Σ_{w=1}^{W} f_i(h, w)

where H and W denote the height and width of the feature maps, respectively, and f_i (i = 1, 2, …, C) represents the i-th channel of the feature map. Then, mSE uses 1 × 1 convolutions to produce the channel-wise attention weights. Note that we also set a reduction ratio r to vary the number of channels between the two convolution layers and choose the best ratio through experiments. Finally, the output of the CDSE module Fcdse is calculated as follows:

Fcdse = σ(conv2(GAP(Fcfe))) ⊗ Fres

where GAP denotes global average pooling, conv2 indicates two 1 × 1 convolutional layers, σ is the Sigmoid activation function, and ⊗ denotes element-wise multiplication.
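As an illustrative PyTorch sketch (not the authors' released code), the CFE stacking and mSE excitation described above can be put together as follows; IEBN is approximated here by plain BatchNorm2d for brevity, and the shortcut uses a strided 1 × 1 convolution:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Mish(nn.Module):
    # Mish(x) = x * tanh(softplus(x)) = x * tanh(ln(1 + e^x))
    def forward(self, x):
        return x * torch.tanh(F.softplus(x))

class CFEBlock(nn.Module):
    """Channel Feature Extract block: depthwise 3x3 convs halve H and W.
    IEBN is approximated by plain BatchNorm2d in this sketch."""
    def __init__(self, c: int):
        super().__init__()
        self.res = nn.Sequential(
            nn.Conv2d(c, c, 3, stride=2, padding=1, groups=c, bias=False),
            nn.BatchNorm2d(c), Mish(),
            nn.Conv2d(c, c, 3, stride=1, padding=1, groups=c, bias=False),
            nn.BatchNorm2d(c), Mish(),
        )
        # shortcut downsample: 1x1 convolution with stride 2
        self.shortcut = nn.Conv2d(c, c, 1, stride=2, bias=False)

    def forward(self, x):
        return self.res(x) + self.shortcut(x)

class CDSE(nn.Module):
    """Stack CFE blocks down to 7x7, then mSE: GAP + two 1x1 convs with Mish."""
    def __init__(self, c: int, n_cfe: int, r: int = 16):
        super().__init__()
        self.cfe = nn.Sequential(*[CFEBlock(c) for _ in range(n_cfe)])
        self.gap = nn.AdaptiveAvgPool2d(1)  # squeeze to 1x1xC
        self.conv2 = nn.Sequential(         # excitation with reduction ratio r
            nn.Conv2d(c, c // r, 1), Mish(),
            nn.Conv2d(c // r, c, 1), nn.Sigmoid(),
        )

    def forward(self, f_res):
        w = self.conv2(self.gap(self.cfe(f_res)))  # channel-wise weights
        return f_res * w                           # reweight the residual output

f_res = torch.randn(1, 64, 56, 56)  # stage-1 feature map needs 3 CFE blocks
assert CDSE(64, n_cfe=3)(f_res).shape == f_res.shape
```

Note that the weights are computed from the condensed 7 × 7 maps but applied to the full-resolution input, matching the formula above.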
(2) Analysis: First, we analyze two methods to condense the information of feature maps before the squeeze operation. The first is to use a fixed number of CFE blocks; the other is to adopt the proposed dynamic adaptive strategy to determine the number of CFE blocks in each residual block. The second method condenses the feature maps to a fixed size of 7 × 7 for every residual block before they are fed into the mSE block. Compared with the former, the dynamic adaptive strategy reduces extra parameters. For example, using a fixed number of 2 CFE blocks in each CDSE module introduces 68 additional convolution layers, while the adaptive method adds only 46, a reduction of nearly one third. In addition, the adaptive strategy is beneficial for extracting channel-wise weights, since it makes the input size of the mSE block the same across stages.
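The dynamic adaptive strategy amounts to counting how many halvings bring a stage's spatial size down to the 7 × 7 target; a small sketch (the helper name is ours):

```python
def num_cfe_blocks(feature_size: int, target: int = 7) -> int:
    """Each CFE block halves the spatial size; stop once the target is reached."""
    n = 0
    while feature_size > target:
        feature_size //= 2
        n += 1
    return n

# ResNet-34 stage outputs are 56, 28, 14 and 7 pixels wide, giving 3, 2, 1, 0 blocks:
assert [num_cfe_blocks(s) for s in (56, 28, 14, 7)] == [3, 2, 1, 0]
```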
Second, we adopt a new operation combination of Conv-IEBN-Mish to replace Conv-BN-ReLU in the CDSE module, which has a better capability of generalization and improves the classification performance significantly with the small additional computational cost.
Third, we improve the traditional SE module by using Mish as the activation function. The CDSE module aggregates several stacked CFE blocks and the mSE block to condense global feature maps and learn channel-wise attention weights, which select important disease-related information and suppress useless information in the feature maps.

C. Multi-convolution spatial attention block
(1) Structure: The multi-convolution spatial attention module (MCSA) is designed for extracting spatial information from feature maps; its structure is shown in Fig. 3. Inspired by CBAM [42], MCSA adopts a two-branch spatial attention mechanism. Instead of using a 7 × 7 convolution, three consecutive 3 × 3 convolutions are stacked, which introduce more nonlinearity with the same receptive field but fewer parameters and can further boost the model accuracy. We also use Mish as the activation function.
Given the input feature map Fcdse ∈ R^(H×W×C), MCSA uses max pooling and average pooling along the channel dimension to extract spatial information, outputting feature maps Fmax ∈ R^(H×W×1) and Favg ∈ R^(H×W×1), respectively. The process can be described as:

Fmax(h, w) = max_i f_i(h, w)
Favg(h, w) = (1 / C) Σ_{i=1}^{C} f_i(h, w)

where h ∈ {1, 2, …, H}, w ∈ {1, 2, …, W}, and f_i (i = 1, 2, …, C) is the i-th channel of the input feature maps. Max pooling and average pooling gather spatial information from different aspects. Then, the output before convolution Fcat ∈ R^(H×W×2) is obtained as follows:

Fcat = [Favg; Fmax]

After concatenating the average-pooled and max-pooled features, MCSA uses convolution layers to output the spatial mask, which finally converts the input feature map Fcdse to Fmcsa ∈ R^(H×W×C) by the following formula:

Fmcsa = σ(Conv(ConvM2(Fcat))) ⊗ Fcdse

where σ is the sigmoid function and ⊗ denotes element-wise multiplication. ConvM2 and Conv represent two layers of 3 × 3 convolution with the Mish function and one 3 × 3 convolution layer without it, respectively.
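A minimal PyTorch sketch of MCSA under these formulas follows; the channel widths of the first two 3 × 3 convolutions are our assumption, since the text only fixes the 2-channel input and 1-channel mask:

```python
import torch
import torch.nn as nn

class MCSA(nn.Module):
    """Multi-convolution spatial attention: three 3x3 convs replace CBAM's 7x7."""
    def __init__(self):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(2, 2, 3, padding=1), nn.Mish(),  # ConvM2: two 3x3 convs + Mish
            nn.Conv2d(2, 2, 3, padding=1), nn.Mish(),
            nn.Conv2d(2, 1, 3, padding=1),             # Conv: final 3x3, no activation
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        f_max, _ = x.max(dim=1, keepdim=True)   # channel-wise max pooling, HxWx1
        f_avg = x.mean(dim=1, keepdim=True)     # channel-wise average pooling
        mask = self.sigmoid(self.convs(torch.cat([f_avg, f_max], dim=1)))
        return x * mask                          # broadcast the spatial mask

x = torch.randn(2, 64, 28, 28)
assert MCSA()(x).shape == x.shape
```

Stacking three 3 × 3 layers yields the same 7 × 7 receptive field as one 7 × 7 layer while inserting two extra activations.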
(2) Analysis: MCSA adopts a two-branch structure to learn spatial attention from feature maps. It uses three consecutive convolutions with a smaller kernel size of 3 × 3 to replace the original 7 × 7 convolution, reducing parameters while keeping the same receptive field. As for the Mish function, our experimental results show that it indeed works better than ReLU. CDSE and MCSA are two different attention modules: CDSE focuses on extracting channel-wise attention weights, while MCSA mainly extracts spatial-wise attention weights to capture disease-specific information from feature maps. Aggregating the two attention modules in sequential order brings a significant performance improvement for chest X-ray classification.

IV. EXPERIMENT
In this section, we first introduce the pneumonia chest X-ray datasets and the common evaluation metrics, then present the experimental results on three pneumonia datasets, and finally interpret the effectiveness of the proposed attention mechanism through visualization.

A. Datasets
ChestXRay2017: This dataset was selected from a retrospective study of pediatric patients aged one to five years old at Guangzhou Women and Children's Medical Center, Guangzhou [22]. It contains a total of 5856 chest X-ray images. The training set contains 3883 pneumonia images (2538 bacterial and 1345 viral) and 1349 normal images, and the test set contains 390 pneumonia images (242 bacterial and 148 viral) and 234 normal images.
COVID-19 Radiography Database V4: This dataset was created by a team of researchers from Qatar University, the University of Dhaka, etc. [43]. It is currently one of the largest chest X-ray datasets for COVID-19. We used 80% of the chest X-ray images for training and 20% for testing, and 20% of the training images were held out as a validation subset. Table 1 shows the number of training, validation and test images used in this paper.

Table 1:
             COVID-19   Lung Opacity   Normal   Total
Train        2314       3847           5664     11825
Validation   578        962            1416     2956
Test         724        1203           1771     3698
Total        3616       6012           8851     18479

Tuberculosis Chest X-ray Database: This dataset is composed of four public datasets [44]: the NLM dataset [45], the Belarus dataset [46], the NIAID TB dataset [47] and the RSNA CXR dataset [48]. It is one of the largest tuberculosis datasets to date. Our experiments use 80% of the images for training, 20% for testing, and 20% of the training set for validation. Table 2 shows the details of the tuberculosis dataset.

Table 2:
             Tuberculosis   Normal   Total
Train        2240           2240     4480
Validation   560            560      1120
Test         700            700      1400
Total        3500           3500     7000

B. Evaluation criteria
In order to comprehensively measure the effectiveness of the proposed network, we employ accuracy, recall, precision and F1-score as evaluation indicators. The calculation formulas are defined as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Recall = TP / (TP + FN)
Precision = TP / (TP + FP)
F1-score = (2 × Precision × Recall) / (Precision + Recall)

where TP is True Positive, TN is True Negative, FP is False Positive and FN is False Negative.
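These four indicators follow directly from the confusion-matrix counts, e.g.:

```python
def metrics(tp: int, tn: int, fp: int, fn: int):
    """Accuracy, recall, precision and F1-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)          # sensitivity: fraction of positives found
    precision = tp / (tp + fp)       # fraction of positive calls that are right
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, recall, precision, f1

acc, rec, prec, f1 = metrics(tp=90, tn=85, fp=5, fn=10)
assert abs(rec - 0.9) < 1e-9 and abs(prec - 90 / 95) < 1e-9
```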

C. Ablation Experiments on ChestXRay2017
(1) Implementation Details: In the ablation experiments, the input images of PCXRNet were resized to 224×224 pixels. The initial learning rate was 0.01 and the maximum number of epochs was 200. At the 15th, 65th and 150th epochs, the learning rate was reduced by a factor of 10. The network was trained with a batch size of 16 using the SGD optimizer and the cross-entropy loss function. Our framework was implemented with the PyTorch library and all experiments were conducted on an NVIDIA RTX 2080 Ti.
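The schedule above corresponds to a standard PyTorch MultiStepLR setup; the sketch below uses a stand-in model, since the full PCXRNet definition is out of scope here:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 3)  # stand-in for PCXRNet
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
# divide the learning rate by 10 at the 15th, 65th and 150th epochs
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[15, 65, 150], gamma=0.1)

for epoch in range(200):
    # ... iterate over batches of size 16, compute criterion, backpropagate ...
    optimizer.step()   # placeholder step; real training updates weights here
    scheduler.step()

# after 200 epochs the rate has decayed three times: 0.01 -> 1e-5
assert abs(optimizer.param_groups[0]["lr"] - 1e-5) < 1e-9
```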

(2) The results of ablation experiments
In order to verify the effectiveness of the attention modules proposed in this paper, we show the results of ablation experiments on ChestXRay2017 in Table 3. Note that none of the experiments in this paper use pre-trained models, for a fair comparison.
In Table 3, model 1 and model 2 refer to ResNet34 [10] and SE-ResNet34 [36], respectively. Model 3 introduces the CDSE attention module into ResNet34. The proposed channel attention module improves the accuracy, recall, precision and F1-score by 1.28%, 1.54%, 1.19% and 1.40%, respectively, compared with the baseline.
Model 4 introduces the spatial attention module (SA) of CBAM [42] into the baseline. It first applies average pooling and max pooling along the channel dimension, concatenates the outputs, and then generates the spatial mask. Model 5 instead adds the MCSA module. The MCSA improves the accuracy, recall, precision and F1-score by 0.48%, 0.47%, 0.56% and 0.52%, respectively, compared with model 4. As can be seen, the MCSA module is more stable thanks to its 3×3 convolutions and additional activation layers.
Model 6 introduces the CBAM module, a lightweight attention module, into the baseline. Model 7 is our proposed PCXRNet, which integrates the CDSE and MCSA modules in series at the end of each residual block. Results show that the classification performance of PCXRNet is better than that of CBAM or of using either the CDSE or the MCSA module alone. PCXRNet achieves an overall accuracy of 97.436%, recall of 96.838%, precision of 97.705%, and F1-score of 97.241%.
It can thus be seen that the CDSE and MCSA modules are complementary. With the two modules combined, our network presents good classification performance even without pretraining. The training and testing curves in Fig. 4 show that the training accuracy increases rapidly from epoch 0 to epoch 40 and then gradually stabilizes through epoch 200. The confusion matrices of PCXRNet and the other networks are presented in Fig. 5 and Fig. 6, respectively. As observed from the confusion matrices, PCXRNet has the highest accuracy on both normal and pneumonia images.

D. Ablation Experiments on the CDSE Module
In this section, we explore the internal structure of CDSE.
(1) The number of CFE blocks: There are two ways to reduce the size of the feature maps: the first is to insert the same number of CFE blocks in each residual block; the second is to insert a different number of CFE blocks in each residual block. We designed three experiments to find a suitable way: the first model, named CFE_1, uses one CFE block in each residual block, halving the feature maps; the second model, named CFE_2, uses two CFE blocks in each residual block, downsampling by a factor of four; the third model, named CFE_stage, adopts the dynamic adaptive strategy to determine the number of CFE blocks, making the output of the CFE blocks the same size as the smallest output of the baseline's last stage. Table 4 shows the classification performance of the three experiments and ROC curves are shown in Fig. 7(a).
Results show that the model CFE_stage achieves the best classification performance. It has several advantages. First, in the traditional channel attention module, global information is squeezed directly into a channel descriptor regardless of the size of the input feature maps; CFE_stage condenses the global information of the feature maps before squeezing, which helps extract channel-wise weights more accurately. Second, CFE_stage has fewer parameters than CFE_1 and CFE_2, improving the inference speed and relieving the overfitting problem. As shown in Fig. 7(a), CFE_stage also displays the highest AUC.
(2) The downsampling layer of the CFE shortcut: In the shortcut connection of the CFE block, the input feature map is downsampled, and the output is then added to the output of the residual connection. There are several downsampling options: 1×1 convolution, average pooling and max pooling; different choices affect the information distribution in the feature map. Table 5 shows the classification performance of the different downsampling layers and ROC curves are shown in Fig. 7(b). Using 1 × 1 convolution clearly achieves better accuracy.
(3) The downsampling location in CFE_res: There are two convolution layers in the residual connection of the CFE block. We conducted a set of comparative experiments to explore where to downsample. As shown in Table 6, downsampling in the first convolution layer works better than in the second, improving the accuracy by 0.641%; downsampling in the second layer also increases memory usage. ROC curves are shown in Fig. 7(c).
(4) The activation function of the CFE block: The activation function is essential for CNNs because it strengthens the learning ability of the network. ReLU is a common activation function, but it suffers from the dying neuron problem, leading to the loss of gradient information and making it difficult for CNNs to learn effective lesion information on chest X-rays.
Mish is a self-regularized non-monotonic activation function that preserves a few negative values. Table 7 shows the classification performance of the different activation functions. The results in Fig. 7(d) demonstrate that Mish performs better than ReLU.
(5) The normalization layer of the CFE block: The normalization layer reduces the dependence on initial values, which is beneficial for gradient flow. Batch normalization mitigates the internal covariate shift problem and improves generalization. IEBN introduces a self-attention mechanism to further improve generalization, refining the noise produced by traditional BN. As shown in Table 8, IEBN achieves superior classification performance. ROC curves are shown in Fig. 7(e).
(6) The choice of r in mSE: In this part, we study the reduction ratio r in the mSE block, an important hyperparameter that controls the capacity of the channel attention module. Table 9 shows that as r increases from 2 to 32, the network performance first improves and then drops. When r is set to 16, the model reaches the highest performance. ROC curves are shown in Fig. 7(f).
The above experiments and discussions finally determined the structure of CDSE. We adopt Conv-IEBN-Mish, a new combination of operations, to replace the traditional Conv-BN-ReLU combination in the CDSE module. This structure produces better gradient flow and extracts the channel weights of feature maps more accurately. We adopt a varying number of CFE blocks with the dynamic adaptive strategy to condense the information of the feature maps before using the mSE block to generate channel descriptors.

E. Ablation Experiments on MCSA
(1) The activation function of the MCSA module: ROC curves are shown in Fig. 7(g). The introduction of Mish improves the accuracy, recall, precision and F1-score by 0.80%, 0.98%, 0.73% and 0.87%, respectively.
(2) The location of the MCSA module: In this part, we discuss the positional relationship between MCSA and CDSE. Because MCSA and CDSE focus on extracting effective spatial information and channel information from feature maps, respectively, aggregating the two attention modules in sequential order brings a significant performance improvement for chest X-ray classification. Table 11 shows that the CDSE-first order improves the accuracy by 0.641%. This ordering strengthens the learning ability of the network, showing that CDSE and MCSA are complementary attention modules. ROC curves of the different arrangements are shown in Fig. 7(h).

F. Visualization of Network
In order to verify the effectiveness of our proposed method, we visualize the output of the CNNs using Gradient-weighted Class Activation Mapping (Grad-CAM) [49]. As shown in Fig. 8, the first row contains five pneumonia chest X-ray images from the ChestXray2017 dataset, and the second and third rows are the visualization results of ResNet34 and SE-ResNet34, respectively. The visualization results of SE-ResNet34 are better than those of ResNet34: the introduction of the SE attention module suppresses useless information in the chest X-ray images more effectively than the network without attention, as shown in CXR0, CXR1 and CXR3 in the second row. The visualization results of PCXRNet are shown in the fourth row and are better than those of ResNet34 and SE-ResNet34, indicating that PCXRNet utilizes the lesion information in the chest X-ray images more effectively.
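Grad-CAM localizes the evidence for a class by weighting each feature map of the last convolutional layer with the global average of its gradients. A minimal NumPy sketch of the core computation, assuming the activations and gradients have already been captured (e.g., via framework hooks); shapes here are illustrative:

```python
import numpy as np

def grad_cam(activations, gradients):
    """Core Grad-CAM computation.

    activations: (C, H, W) feature maps of the last conv layer
    gradients:   (C, H, W) d(class score) / d(activations)
    returns:     (H, W) heatmap normalized to [0, 1]
    """
    # Channel weights: global average pooling of the gradients
    weights = gradients.mean(axis=(1, 2))               # (C,)
    # Weighted sum of feature maps, then ReLU to keep only
    # features with a positive influence on the target class
    cam = np.maximum((weights[:, None, None] * activations).sum(axis=0), 0.0)
    # Normalize for visualization (upsampled and overlaid on the X-ray)
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam

# Toy example with hypothetical shapes
rng = np.random.default_rng(1)
acts = rng.standard_normal((8, 7, 7))
grads = rng.standard_normal((8, 7, 7))
heatmap = grad_cam(acts, grads)
assert heatmap.shape == (7, 7)
assert heatmap.min() >= 0.0 and heatmap.max() <= 1.0
```

The resulting heatmap highlights the spatial regions that most increased the predicted class score, which is what the comparison across rows in Fig. 8 inspects.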

G. Comparisons with Other CNNs on ChestXRay2017
This part compares the classification results of PCXRNet with other state-of-the-art CNNs on ChestXRay2017. Asnaoui et al. [50] preprocess the chest X-ray images with intensity normalization and use pre-trained CNNs to recognize pneumonia images. Liang et al. [51] combine a residual network with dilated convolution to recognize childhood pneumonia; their method was pre-trained on the ChestX-ray14 dataset to improve classification performance. Chouhan et al. [52] propose an ensemble model of five CNNs pre-trained on ImageNet to diagnose pneumonia. Our model PCXRNet does not use any pre-trained model in any experiment but still achieves the best accuracy and F1-score of 97.436% and 97.241%, respectively, as can be clearly seen from Table 12. This demonstrates that our proposed method has higher accuracy and stability. Model 9 has the best recall, reaching 99.62%, which exceeds that of our proposed PCXRNet by 2.78%, but its precision of 93.28% is much lower than that of our method: PCXRNet achieves a precision of 97.705%, outperforming model 9 by 4.425%. Model 4 has the best precision, reaching 99.01%, but its recall is only 89.44%; PCXRNet achieves a recall of 96.838%, outperforming model 4 by 7.398%. The above analysis shows that PCXRNet is stable and achieves remarkable performance in the recognition of pneumonia images compared with other state-of-the-art CNNs.

H. Results on Other pneumonia Datasets
In this section, we verify the effectiveness of PCXRNet on other pneumonia datasets.
(1) Results on COVID-19 Radiography Dataset. Rahman et al. [43] first used a modified U-Net to segment the lung area and then applied a classification network to recognize COVID-19. No additional data augmentation was used in this experiment. Table 13 shows that PCXRNet achieves the best performance in accuracy, recall, precision and F1-score of 94.132%, 94.403%, 94.682% and 94.682%, respectively. Fig. 9(a) is the confusion matrix of PCXRNet on the COVID-19 dataset. Compared with the deeper ResNet50 and ResNet101, PCXRNet makes better use of the information in the feature maps and has stronger generalization ability. This experiment shows that PCXRNet not only performs well on the childhood pneumonia dataset but also achieves remarkable performance on the COVID-19 dataset.

(2) Results on Tuberculosis Chest X-ray Database. It is worth noting that we do not use any image augmentation in this part. The input images were resized to 224×224 pixels, the initial learning rate was 0.001, and the maximum number of epochs was 100. The images were trained with a batch size of 16 using the SGD optimizer. Fig. 9(b) is the confusion matrix of PCXRNet on the Tuberculosis dataset. Table 14 shows that PCXRNet achieves state-of-the-art results on the Tuberculosis dataset compared with traditional CNNs.
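The Tuberculosis experiment's training settings can be collected in one place; a sketch as a plain configuration dictionary (the field names are illustrative, since the paper does not specify a configuration format):

```python
# Training configuration reported for the Tuberculosis experiment.
# Field names are hypothetical; the values come from the text above.
config = {
    "input_size": (224, 224),   # images resized to 224x224 pixels
    "optimizer": "SGD",
    "learning_rate": 1e-3,      # initial learning rate
    "batch_size": 16,
    "max_epochs": 100,
    "augmentation": None,       # no image augmentation in this experiment
}

print(config["batch_size"])  # 16
```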

V. CONCLUSION
In this paper, we proposed a novel deep learning network named PCXRNet, which contains two attention modules: CDSE and MCSA. Unlike traditional channel attention modules, CDSE first condenses the global information channel by channel using grouped depthwise convolution and then extracts channel-wise attention. MCSA replaces the original convolution with three consecutive convolutions of smaller kernel size in order to reduce parameters and introduce more nonlinearity. Extensive ablation experiments prove that CDSE and MCSA are two complementary attention modules: CDSE extracts channel-wise attention weights to exploit the inter-channel relationships of feature maps, while MCSA extracts spatial-wise attention weights to capture disease-specific information within feature maps. For the qualitative analysis, we visualize the output of PCXRNet using the Grad-CAM algorithm; the results show that our proposed method utilizes the lesion information in chest X-ray images more effectively. We conduct extensive experiments on three chest X-ray recognition datasets to verify the effectiveness of our proposed method, which consistently achieves state-of-the-art results.
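The parameter saving that motivates MCSA's stacked small convolutions can be checked with simple arithmetic. Assuming, purely for illustration, that a single 7×7 convolution is replaced by three stacked 3×3 convolutions with C input and output channels (the exact kernel sizes and channel counts of MCSA are not restated in this excerpt):

```python
def conv_params(k, c_in, c_out, bias=False):
    """Parameter count of a single k x k convolution layer."""
    return k * k * c_in * c_out + (c_out if bias else 0)

C = 64
single_large = conv_params(7, C, C)      # one 7x7 convolution
three_small = 3 * conv_params(3, C, C)   # three stacked 3x3 convolutions

# Both cover a 7x7 receptive field, but the stacked version is cheaper
# and interleaves three nonlinearities instead of one.
print(single_large)  # 200704
print(three_small)   # 110592
assert three_small < single_large
```

The stacked form uses roughly 27/49 of the parameters of the single large kernel while adding two extra activation functions, which is the "fewer parameters, more nonlinearity" trade-off the conclusion describes.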