key: cord-0448582-y2ymujva
authors: Shi, Jun; Yi, Huite; Ruan, Shulan; Wang, Zhaohui; Hao, Xiaoyu; An, Hong; Wei, Wei
title: DARNet: Dual-Attention Residual Network for Automatic Diagnosis of COVID-19 via CT Images
date: 2021-05-14
sha: 11f4fdc114391e22016687cffb5315d24a417a71
doc_id: 448582
cord_uid: y2ymujva

(The work is supported by the National Key Research and Development Program of China, Grant No. 2017YFB0202002.)

The ongoing global pandemic of Coronavirus Disease 2019 (COVID-19) poses a serious threat to public health and the economy. Rapid and accurate diagnosis of COVID-19 is crucial to prevent the further spread of the disease and to reduce its mortality. Chest computed tomography (CT) is an effective tool for the early diagnosis of lung diseases, including pneumonia. However, detecting COVID-19 from CT is demanding and prone to human error, as some early-stage patients may have negative findings on images. Recently, many deep learning methods have achieved impressive performance in this regard. Despite their effectiveness, most of these methods underestimate the rich spatial information preserved in the 3D structure or suffer from the propagation of errors. To address this problem, we propose a Dual-Attention Residual Network (DARNet) to automatically distinguish COVID-19 from other common pneumonia (CP) and healthy people using 3D chest CT images. Specifically, we design a dual-attention module consisting of channel-wise and depth-wise attention mechanisms. The former is utilized to enhance channel independence, while the latter is developed to recalibrate the depth-level features. We then integrate them in a unified manner to extract and refine the features at different levels, further improving diagnostic performance. We evaluate DARNet on a large public CT dataset and obtain superior performance. Besides, the ablation study and visualization analysis prove the effectiveness and interpretability of the proposed method.

Coronavirus Disease 2019 (COVID-19), caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), is spreading rapidly across the world through extensive person-to-person transmission [1]. The World Health Organization (WHO) officially declared COVID-19 a pandemic on 11 March 2020. As of 23 August 2021, COVID-19 has infected more than 211 million people in more than 192 countries and territories and caused more than 4.43 million deaths [2]. Due to its high infectivity and fatality rate, the COVID-19 pandemic has had a devastating impact on public health and the economy. Early diagnosis of COVID-19 is therefore of great importance, both for preventing the further spread of the disease and for delivering a proper treatment regimen. The real-time reverse transcription-polymerase chain reaction (RT-PCR) test is the gold standard for the diagnosis of COVID-19 infection [3]. However, the high false-negative rate [1] of RT-PCR may delay the diagnosis of potential cases. As a complementary strategy, chest X-ray and computed tomography (CT) are widely used in the early diagnosis of patients suspected of SARS-CoV-2 infection [4]. Compared with X-ray images, chest CT scans have higher sensitivity in diagnosing COVID-19 infection and can provide more detailed information about the lesions, which is helpful for quantitative analysis [5].
Early investigations have observed typical radiographic features on chest CT images, such as ground-glass opacities (GGO), multifocal patchy consolidation, and vascular dilation in the lesions [6]-[9]. However, detecting COVID-19 from CT images is demanding and prone to human error, as some early-stage patients may have normal imaging features. Besides, the similarity of imaging findings between COVID-19 cases and common pneumonia (CP) cases makes them difficult to differentiate.

Recently, many deep learning methods have been applied to the automatic diagnosis of COVID-19 using chest CT images and have achieved impressive performance. Some keyframe-based methods [8], [10] use local abnormal slices rather than 3D images to make diagnostic decisions, while [11]-[13] focus on segmenting the lesion area and then extracting specific features for diagnosis. Despite their effectiveness, most of these methods adopt a multi-phase framework, which means that errors in upstream tasks propagate to subsequent stages. For instance, the keyframe-based methods rely heavily on the accurate classification of abnormal slices; otherwise, incorrect results will negatively affect subsequent tasks. Furthermore, these methods usually have high requirements for annotated data because of the additional upstream tasks. Building on traditional 2D neural networks, other methods [14], [15] extend them to classify 3D CT images and obtain promising results. However, such a simple network transformation cannot take full advantage of the 3D properties of CT images, so the resulting diagnostic performance may not meet actual clinical needs.

To this end, in this paper we propose a dual-attention residual network (DARNet) to automatically diagnose COVID-19 against CP and healthy people using CT images. In DARNet, a 3D variant of ResNet-18 [16] is used as the backbone network, which takes a full 3D chest CT image as input. To fully leverage the 3D spatial information, we design a dual-attention module to extract and refine the representation features at different levels. The module mainly consists of two parts: 1) channel-wise attention and 2) depth-wise attention. The former was first proposed in [17], and we implement its 3D extension; the latter, developed in this study, can adaptively assign depth-level weights to each feature map during training. We evaluate our method on what is, to the best of our knowledge, the largest public CT image dataset. The experimental results show that DARNet is superior to existing methods. We further provide ablation studies that prove the effectiveness of the proposed dual-attention module in improving both the classification accuracy and the interpretability of the model. In summary, our work makes three major contributions:

• We propose DARNet to realize automatic and accurate diagnosis of COVID-19 using 3D chest CT images. In addition to superior classification performance, our method is more sensitive to the location of the lesion regions in visual attention.
• To make full use of the 3D spatial information of CT images, we design a dual-attention module, which can refine the learned features at different levels. The experimental results prove the effectiveness of this module in improving the classification performance and the interpretability.
• We evaluate DARNet on a large public CT dataset and achieve superior performance over existing methods.

Recently, the successful application of artificial intelligence (AI) in medical image analysis [18] has promoted the development of radiological diagnosis technology.
To combat the current pandemic, plenty of research efforts have been carried out over the past few months to design AI systems for the early diagnosis of COVID-19 via radiological imaging. [19]-[21] employed convolutional neural networks (CNNs) to automatically identify COVID-19 infection from chest X-ray images and obtained impressive results. However, these methods are still limited by the low contrast of X-ray images and the lack of significant features caused by the high overlap of ribs and soft tissues. Compared with a single X-ray image, a chest CT scan composed of hundreds of 2D slices can reflect more detailed radiographic features of the lesions, such as GGO and consolidation. To simplify the computation, several keyframe-based methods [8], [10] were proposed to diagnose COVID-19 in CT images and achieved promising results, but these methods underestimate the 3D spatial information of CT images and rely heavily on the accurate detection of abnormal slices. [11]-[13] proposed segmentation-based approaches that can generate more specific lesion information, such as the number and volume of lesions, which is valuable for quantitative analysis in COVID-19 diagnosis. However, obtaining large amounts of CT data with segmentation labels is the primary challenge of these methods. Besides, most of the above methods adopt a multi-stage framework, which means that they may be affected by error propagation. [14], [15] directly transfer 2D neural networks to classify 3D CT images, but their performance may not meet actual clinical needs. We thus develop DARNet to diagnose COVID-19 in an end-to-end fashion; it takes a complete chest CT image as input and achieves competitive classification performance.

The attention mechanism is an effective way to improve network performance by enhancing the learned features. Hu et al. [17] proposed channel-wise attention (CA) to refine the hidden features at the channel level during training, which can make the network focus more on the important regions. In other words, the CA module amplifies the difference between channel features by highlighting the features with a greater response and suppressing the others. Most importantly, this adjustment mechanism is completely dynamic and learnable. The effectiveness of the CA module has been proved in many applications [22]-[24]. At the same time, many variations and extensions have appeared. For example, [25], [26] proposed joint attention modules based on the CA module, which bring significant improvements in segmentation performance. These studies show that multi-attention fusion has great potential for improving network performance. Inspired by this, we design a novel attention mechanism called depth-wise attention (DA) to recalibrate the depth-level features. By combining this module with the CA module, we construct a dual-attention module to improve the representation ability of 3D neural networks.

As shown in Fig. 1(a), the overall architecture of DARNet mainly consists of three submodules: 1) an input module, 2) dual-attention modules, and 3) an output module. Considering the computational complexity and GPU memory capacity, we use the 3D ResNet-18 [16] as the backbone network. Specifically, the input module is composed of a 3D convolutional layer (Conv3D) with a kernel size of (3, 7, 7) and a stride of (1, 2, 2), a batch normalization layer (BN), and a ReLU activation layer. Besides, unlike the original ResNet-18, we remove the max-pooling layer.
In this way, the input 3D CT image is downsampled by a factor of 8 in the depth dimension and a factor of 16 in the other two dimensions. The higher-resolution feature maps retain more contextual information, which is also conducive to visual analysis. In the feature extraction part, a total of 8 dual-attention modules with residual connections constitute the main structure. Each dual-attention module consists of two consecutive convolutional layers with a kernel size of (3, 3, 3), followed by BN, ReLU, and two attention mechanisms: 1) channel-wise attention and 2) depth-wise attention. More detailed information about this module is given in the next subsection. In the output module, a global average pooling (GAP) layer is first used to squeeze the input features. A fully connected layer with a softmax layer then generates the corresponding prediction probabilities, and the network returns the predicted category based on these probabilities.

A complete CT image is usually composed of hundreds of 2D slices stacked in sequence. These slices have high spatial continuity and content relevance, constituting the complete contextual information of the lungs. Moreover, we observe that lesions of various sizes appear at arbitrary locations in the lungs, so only a portion of the slices contain visible disease characterizations. When a 3D CNN is used to classify CT images directly, the spatial correlations across different dimensions and the inter-slice information are entangled by the 3D convolution operator. To refine the hidden features, Hu et al. [17] proposed the channel-wise attention module, which enhances channel independence and thereby improves network performance. But this module has limitations in our task, due to the sparse distribution of lesion features at the depth level. Motivated by this observation, we design a complementary mechanism called the depth-wise attention module for 3D CNNs to recalibrate the depth-level features, which can make the network more sensitive to the important regions of the images. By integrating the DA and CA modules, we construct the dual-attention module used in DARNet.

1) Channel-wise Attention Module: We implement a 3D version of the CA module based on the original idea in [17], as shown in Fig. 1(b). First, the input features are squeezed by a GAP layer. Consider the input feature map $F_{in} \in \mathbb{R}^{C \times D \times H \times W}$ with $F_{in} = [f_1, f_2, \ldots, f_C]$, where $C$, $D$, $H$, and $W$ are the number of input channels, depth, height, and width, respectively, and $f_i \in \mathbb{R}^{D \times H \times W}$. The output of the GAP is $Z \in \mathbb{R}^{C \times 1 \times 1 \times 1}$, with elements

$$z_i = \frac{1}{D \times H \times W} \sum_{d=1}^{D} \sum_{h=1}^{H} \sum_{w=1}^{W} f_i(d, h, w). \quad (1)$$

The above operation embeds the global spatial information in the vector $Z$. This vector is transformed into the weight vector $\hat{Z} = \sigma(W_2(\xi(W_1 Z)))$, where $W_1 \in \mathbb{R}^{\frac{C}{r} \times C}$ and $W_2 \in \mathbb{R}^{C \times \frac{C}{r}}$ are the weights of two fully connected layers, $\xi(\cdot)$ is the ReLU function, and $\sigma(\cdot)$ is the sigmoid function. The parameter $r$ refers to the reduction ratio and is set to 16 in this study. The recalibrated output is

$$F_{out} = [\hat{z}_1 f_1, \hat{z}_2 f_2, \ldots, \hat{z}_C f_C]. \quad (2)$$

Each element in $\hat{Z}$ indicates the importance of the corresponding channel and is used to dynamically amplify or suppress the input response. In this way, the CA module can enhance the important features and ignore the irrelevant ones. However, directly extending this module to 3D neural networks for CT image classification is of limited benefit: due to the sparse distribution of lesions, the information between slices varies greatly, so the performance improvement achieved by differentiating channel-level features alone is not very significant.
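To make the 3D CA module concrete, below is a minimal PyTorch sketch of Eqs. (1)-(2). The class and variable names are our own illustration and are not taken from the authors' released code:

```python
import torch
import torch.nn as nn

class ChannelAttention3D(nn.Module):
    """Sketch of a 3D squeeze-and-excitation channel attention block.

    Follows Eqs. (1)-(2) in the text; names are illustrative only.
    """
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool3d(1)           # GAP over (D, H, W)
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # W1, reduction ratio r
            nn.ReLU(inplace=True),                       # xi(.)
            nn.Linear(channels // reduction, channels),  # W2
            nn.Sigmoid(),                                # sigma(.)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c = x.shape[:2]
        z = self.squeeze(x).view(n, c)                   # Eq. (1): Z in R^C
        z_hat = self.excite(z).view(n, c, 1, 1, 1)       # weight vector Z-hat
        return x * z_hat                                 # Eq. (2): recalibration
```

Applied to a feature map of shape (N, C, D, H, W), the block rescales each channel by a learned scalar in (0, 1), leaving the spatial dimensions untouched.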
Therefore, we design the DA module to compensate for this deficiency.

2) Depth-wise Attention Module: In the DA module, as in the CA module, spatial information is first aggregated by a GAP layer, but here the depth dimension is preserved, as shown in Fig. 1(c). Consider the input feature map $U_{in} \in \mathbb{R}^{C \times D \times H \times W}$ with $U_{in} = [u_{1,1}, u_{1,2}, \ldots, u_{i,j}, \ldots, u_{C,D}]$ and $u_{i,j} \in \mathbb{R}^{H \times W}$. The output of the GAP is $T \in \mathbb{R}^{C \times D \times 1 \times 1}$, with elements

$$t_{i,j} = \frac{1}{H \times W} \sum_{h=1}^{H} \sum_{w=1}^{W} u_{i,j}(h, w). \quad (3)$$

Then, a gating mechanism is designed to learn the non-linear and non-mutually-exclusive relationships in the depth dimension. The gating mechanism is parameterized by two fully connected layers and two non-linear activation functions. The output is $\hat{T} = \sigma(W_2(\xi(W_1 T)))$, where $W_1 \in \mathbb{R}^{\frac{CD}{r} \times CD}$ and $W_2 \in \mathbb{R}^{CD \times \frac{CD}{r}}$ are the weights of the two fully connected layers. The parameter $r$ here is set equal to the number of input channels. Finally, the resultant tensor is used to refine $U_{in}$ to

$$U_{out} = [\hat{t}_{1,1} u_{1,1}, \hat{t}_{1,2} u_{1,2}, \ldots, \hat{t}_{i,j} u_{i,j}, \ldots, \hat{t}_{C,D} u_{C,D}]. \quad (4)$$

The DA module recalibrates the depth-level features by adaptively assigning weights, which makes the network focus more on the important regions that are sparsely distributed along the depth dimension. This module makes up for the deficiency of the CA module. We then build the dual-attention module of DARNet as the serial combination of the two, which can refine the learned features at different levels.

We conduct experiments on a public dataset provided by the China Consortium of Chest CT Image Investigation (CC-CCII, available at http://ncov-ai.big.ac.cn/download?lang=en) [11] to evaluate our method. In this section, the construction of the dataset and the implementation details are described first. Then, we compare different networks in terms of diagnostic performance and perform ablation studies to validate the effectiveness of the proposed dual-attention module. Finally, class activation mapping (CAM) [27] is employed to visualize the discriminative regions of these networks in diagnosing COVID-19, which helps to explore the interpretability of the different methods.

The CC-CCII dataset contains a total of 4,178 chest CT images from 2,742 patients, including 1,544 CT images from 929 COVID-19 patients, 1,556 CT images from 964 CP patients, and 1,078 CT images from 849 healthy controls. As shown in Table I, we separate the dataset into two parts. The first part (training set) is used for training and includes 1,245 COVID-19 images, 1,137 CP images, and 856 images of healthy controls. The second part (test set) serves for independent testing and includes 299 COVID-19 images, 419 CP images, and 222 images of healthy controls. In particular, the split is done at the patient level, which means that all images of the same subject are kept in either the training set or the test set. In the training stage, the training set is randomly divided into five folds at the patient level for cross-validation.

For evaluation, we use five classification metrics: the area under the receiver operating characteristic curve (AUC), accuracy, sensitivity, specificity, and F1-score. Denoting true positives, true negatives, false positives, and false negatives by TP, TN, FP, and FN, respectively, accuracy, sensitivity, and specificity are defined as

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \quad \mathrm{Sensitivity} = \frac{TP}{TP + FN}, \quad \mathrm{Specificity} = \frac{TN}{TN + FP}.$$

PyTorch is adopted to implement our proposed method.
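As a concrete illustration of the module described above, here is a minimal PyTorch sketch of the DA module (Eqs. (3)-(4)) and one possible serial composition with the ChannelAttention3D sketch given earlier. The depth D must be known at construction time, and the CA-then-DA order is our assumption, since the text only specifies a serial combination:

```python
import torch
import torch.nn as nn

class DepthAttention3D(nn.Module):
    """Sketch of the depth-wise attention module (Eqs. (3)-(4)).

    Recalibrates each (channel, depth-slice) pair; the reduction ratio r
    is set to the number of input channels, as stated in the text.
    """
    def __init__(self, channels: int, depth: int):
        super().__init__()
        cd = channels * depth
        self.squeeze = nn.AdaptiveAvgPool3d((None, 1, 1))  # GAP over (H, W), D kept
        self.excite = nn.Sequential(
            nn.Linear(cd, cd // channels),                 # W1 (cd // r with r = C)
            nn.ReLU(inplace=True),                         # xi(.)
            nn.Linear(cd // channels, cd),                 # W2
            nn.Sigmoid(),                                  # sigma(.)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, d = x.shape[:3]
        t = self.squeeze(x).view(n, c * d)                 # Eq. (3): T in R^{C*D}
        t_hat = self.excite(t).view(n, c, d, 1, 1)         # depth-level weights
        return x * t_hat                                   # Eq. (4): recalibration

class DualAttention3D(nn.Module):
    """Serial combination of CA and DA; the order is our assumption."""
    def __init__(self, channels: int, depth: int):
        super().__init__()
        self.ca = ChannelAttention3D(channels)             # sketch given earlier
        self.da = DepthAttention3D(channels, depth)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.da(self.ca(x))
```

Note that, unlike the CA module, the DA gating layers operate on a flattened C*D vector, so the module's parameter count grows with the depth of the feature maps at each stage.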
For training the networks, we use the Adam optimizer [29] to minimize the cross-entropy loss with an initial learning rate of 10^-3. The convolutional layer weights are initialized by Kaiming normal initialization [30] and the biases are set to 0. Besides, we apply a multi-step decay strategy to control the learning rate during training: the learning rate is reduced every 30 epochs with a decay factor of 0.1. All models are trained from scratch on two NVIDIA Tesla P40 GPUs. Given the limitation of GPU memory, the batch size is set to 8 and the size of all images is fixed to 64 × 224 × 224 by under-sampling or up-sampling. In each fold, the model is evaluated on the validation set at the end of each training epoch, and the best model within 80 epochs is finally evaluated on the independent test set. To alleviate overfitting, we conduct online data augmentation including random flipping, rotation, translation, and scaling. The code used in the experiments is available at https://github.com/shijun18/COVID-19_CLS.

We compare the performance of DARNet with four existing methods. For a fair comparison, the test sets used by these methods are also drawn from the same CT dataset provided by CC-CCII, and we directly quote the results reported in the related papers. As shown in Table II, DARNet achieves the best performance on four indicators, with a sensitivity of 96.86%, a specificity of 97.19%, an F1-score of 95.49%, and an AUC of 0.995. As for accuracy, the performance of DARNet is slightly lower than that of [15]. In particular, [15] proposed an ensemble learning method that uses multiple classifiers to make the diagnostic decision; although this method achieves high accuracy, it is also demanding in terms of classifier design and integration strategy. [14] provides a benchmark for COVID-19 detection using deep learning models; the benchmark tests multiple models, and we select the best-performing one for comparison. According to its results, we observe that directly transferring 2D neural networks to classify 3D CT images has limited effectiveness, mainly because this approach ignores the rich spatial information preserved in the 3D structure. Moreover, [11] and [28] are segmentation-based methods, which rely heavily on accurate segmentation of the lesions. Such multi-stage frameworks often suffer from error propagation; for example, incorrect segmentation results directly degrade subsequent tasks. In contrast, DARNet is an end-to-end model that avoids this problem. Besides, the proposed dual-attention module effectively improves the feature extraction ability of the model, which helps to obtain higher classification performance than naive CNN-based methods. The results in Table II prove the superiority of DARNet in identifying COVID-19 from CP and healthy people.

The overall experiments have proved the superiority of DARNet. However, it is still unclear which module plays the more important role in the performance improvement. Therefore, we conduct an ablation study to validate the effectiveness of each module, including the CA, DA, and dual-attention modules. Table IV quantitatively compares the performance of the different networks on the independent test set. For COVID-19 versus the other two classes (CP and healthy controls), DARNet achieves the highest AUC, accuracy, sensitivity, and F1-score. Meanwhile, DARNet obtains the best results on all performance indicators for the three-way classification.
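For reference, the per-class figures in Tables II and IV can be reproduced from a model's hard predictions using the confusion-matrix definitions given earlier. The following is a minimal NumPy sketch, our own illustration rather than the evaluation code released with the paper; the label coding in the example is hypothetical:

```python
import numpy as np

def binary_metrics(y_true: np.ndarray, y_pred: np.ndarray, positive: int) -> dict:
    """One-vs-rest metrics for a single class from hard label arrays."""
    tp = np.sum((y_pred == positive) & (y_true == positive))
    tn = np.sum((y_pred != positive) & (y_true != positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    sensitivity = tp / (tp + fn)                 # a.k.a. recall
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }

# Example with a hypothetical label coding: 0 = COVID-19, 1 = CP, 2 = healthy.
y_true = np.array([0, 0, 1, 2, 1, 0, 2])
y_pred = np.array([0, 1, 1, 2, 1, 0, 2])
print(binary_metrics(y_true, y_pred, positive=0))
```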
The results of the ablation experiments reveal the importance of each part. When the individual modules are removed, we observe varying degrees of decline in model performance. Among them, the dual-attention module has the biggest impact on model performance: by applying the dual-attention module, DARNet achieves a significant improvement on all performance indicators, while the parameter count increases by only about 6.4%, as shown in Table III. Moreover, removing either the CA or the DA module alone also has a negative impact on network performance. These observations further prove the effectiveness of the dual-attention module.

To further explore the interpretability of DARNet, we employ CAM [27] to visualize the discriminative regions of the different networks in diagnosing COVID-19. Fig. 2 shows the visualization results on three COVID-19 cases with different degrees of infection (mild, moderate, and severe), highlighting the regions that the network focuses on when making decisions. We observe that DARNet can accurately locate lung lesions that vary greatly in size and distribution. However, after removing the CA or DA module, the localization ability of the network declines significantly. For instance, for the severe COVID-19 case in Fig. 2, we can see diffuse lesions in both lungs and consolidation in the lower lobe of the left lung. When we remove the CA and DA modules in turn, the highlighted area in the right lung gradually shrinks. In particular, the network without these two modules has very low sensitivity to the lesions and may even be disturbed by information outside the lung area. The above results demonstrate that the DA and CA modules enhance the learned features to ensure, to a certain extent, that the decisions made by the network depend mainly on the infection regions rather than on irrelevant parts of the images. More importantly, the results also show that DARNet has better interpretability and reliability in diagnosing COVID-19.

In this work, we proposed a dual-attention residual network that realizes automatic and accurate diagnosis of COVID-19 using 3D chest CT images. In our method, we constructed the dual-attention module by combining the CA and DA modules to refine the hidden features by adaptively assigning weights during training. This module can effectively improve the classification performance and interpretability of 3D ResNet while only slightly increasing the computational complexity. We evaluated our method on a large public CT dataset, achieving state-of-the-art results. To further explain the decisions of the proposed method, we presented visual evidence revealing the discriminative regions used by the model for diagnosis. In future work, we will investigate the generalization capability of the proposed method, and more work will be devoted to analyzing the relationship between these discriminative regions and the imaging findings.
[1] A familial cluster of pneumonia associated with the 2019 novel coronavirus indicating person-to-person transmission: a study of a family cluster.
[2] An interactive web-based dashboard to track COVID-19 in real time.
[3] Coronavirus Disease 2019 (COVID-19): A perspective from China.
[4] Development of a machine-learning system to classify lung CT scan images into normal/COVID-19 class.
[5] Correlation of chest CT and RT-PCR testing for Coronavirus Disease 2019 (COVID-19) in China: A report of 1014 cases.
[6] The novel coronavirus originating in Wuhan, China: Challenges for global health governance.
[7] The extent of transmission of novel coronavirus in Wuhan, China, 2020.
[8] Artificial intelligence-enabled rapid diagnosis of patients with COVID-19.
[9] Deep learning system to screen coronavirus disease 2019 pneumonia.
[10] Classification of COVID-19 coronavirus, pneumonia and healthy lungs in CT scans using Q-deformed entropy and deep learning features.
[11] Clinically applicable AI system for accurate diagnosis, quantitative measurements, and prognosis of COVID-19 pneumonia using computed tomography.
[12] Dual-sampling attention network for diagnosis of COVID-19 from community acquired pneumonia.
[13] Using artificial intelligence to detect COVID-19 and community-acquired pneumonia based on pulmonary CT: Evaluation of the diagnostic accuracy.
[14] Automated model design and benchmarking of deep learning models for COVID-19 detection with chest CT scans.
[15] Classification of COVID-19 chest CT images based on ensemble deep learning.
[16] Deep residual learning for image recognition.
[17] Squeeze-and-excitation networks.
[18] A survey on deep learning in medical image analysis.
[19] Viral pneumonia screening on chest X-ray images using confidence-aware anomaly detection.
[20] COVID-MobileXpert: On-device patient triage and follow-up using chest X-rays.
[21] DeepCOVIDExplainer: Explainable COVID-19 diagnosis from chest X-ray images.
[22] GCNet: Non-local networks meet squeeze-excitation networks and beyond.
[23] Res2Net: A new multi-scale backbone architecture.
[24] M2Det: A single-shot object detector based on multi-level feature pyramid network.
[25] Recalibrating fully convolutional networks with spatial and channel "squeeze and excitation" blocks.
[26] Dual attention network for scene segmentation.
[27] Learning deep features for discriminative localization.
[28] Development and evaluation of an artificial intelligence system for COVID-19 diagnosis.
[29] Adam: A method for stochastic optimization.
[30] Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification.