key: cord-129728-fpoqjmes authors: Ouyang, Xi; Huo, Jiayu; Xia, Liming; Shan, Fei; Liu, Jun; Mo, Zhanhao; Yan, Fuhua; Ding, Zhongxiang; Yang, Qi; Song, Bin; Shi, Feng; Yuan, Huan; Wei, Ying; Cao, Xiaohuan; Gao, Yaozong; Wu, Dijia; Wang, Qian; Shen, Dinggang title: Dual-Sampling Attention Network for Diagnosis of COVID-19 from Community Acquired Pneumonia date: 2020-05-06 journal: nan DOI: nan sha: doc_id: 129728 cord_uid: fpoqjmes The coronavirus disease (COVID-19) is rapidly spreading all over the world, and has infected more than 1,436,000 people in more than 200 countries and territories as of April 9, 2020. Detecting COVID-19 at early stage is essential to deliver proper healthcare to the patients and also to protect the uninfected population. To this end, we develop a dual-sampling attention network to automatically diagnose COVID- 19 from the community acquired pneumonia (CAP) in chest computed tomography (CT). In particular, we propose a novel online attention module with a 3D convolutional network (CNN) to focus on the infection regions in lungs when making decisions of diagnoses. Note that there exists imbalanced distribution of the sizes of the infection regions between COVID-19 and CAP, partially due to fast progress of COVID-19 after symptom onset. Therefore, we develop a dual-sampling strategy to mitigate the imbalanced learning. Our method is evaluated (to our best knowledge) upon the largest multi-center CT data for COVID-19 from 8 hospitals. In the training-validation stage, we collect 2186 CT scans from 1588 patients for a 5-fold cross-validation. In the testing stage, we employ another independent large-scale testing dataset including 2796 CT scans from 2057 patients. Results show that our algorithm can identify the COVID-19 images with the area under the receiver operating characteristic curve (AUC) value of 0.944, accuracy of 87.5%, sensitivity of 86.9%, specificity of 90.1%, and F1-score of 82.0%. 
With this performance, the proposed algorithm could potentially aid radiologists in diagnosing COVID-19 from CAP, especially in the early stage of the COVID-19 outbreak.

The disease caused by the novel coronavirus, Coronavirus Disease 2019 (COVID-19), is spreading quickly around the globe. It has infected more than 1,436,000 people in more than 200 countries and territories as of April 9, 2020 [1]. On February 12, 2020, the World Health Organization (WHO) officially named the disease caused by the novel coronavirus Coronavirus Disease 2019 (COVID-19) [2]. The number of COVID-19 patients is now increasing dramatically every day around the world [3]. Compared with the prior Severe Acute Respiratory Syndrome (SARS) and Middle East Respiratory Syndrome (MERS), COVID-19 has spread to more places and caused more deaths, despite its relatively lower fatality rate [4], [5]. Given the pandemic, it is important to detect COVID-19 early, which could help slow viral transmission and thus contain the disease.

In clinics, real-time reverse-transcription polymerase chain reaction (RT-PCR) is the gold standard for a definitive diagnosis of COVID-19 infection [6]. However, the high false negative rate [7] and the unavailability of RT-PCR assays in the early stage of an outbreak may delay the identification of potential patients. Because the virus is highly contagious, such delays create a high risk of infecting a larger population. At the same time, thoracic computed tomography (CT) is relatively easy to perform and can produce a fast diagnosis [8]. For example, almost all COVID-19 patients have some typical radiographic features in chest CT, including ground-glass opacities (GGO), multifocal patchy consolidation, and/or interstitial changes with a peripheral distribution [9]. Thus chest CT has been recommended as a major tool for clinical diagnosis, especially in hard-hit regions such as Hubei, China [6].
Considering the need for high-throughput screening by chest CT and the workload of radiologists, especially during the outbreak, we design a deep-learning-based method to automatically distinguish COVID-19 infection from community acquired pneumonia (CAP) infection. With the development of deep learning [11], [12], [13], [14], [15], the technology has found a wide range of applications in medical image processing, including disease diagnosis [16] and organ segmentation [17]. The convolutional neural network (CNN) [18], one of the most representative deep learning technologies, has been applied to reading and analyzing CT images in many recent studies [19], [20]. For example, Koichiro et al. use a CNN to differentiate liver masses on dynamic contrast agent-enhanced CT images [21]. Other studies focus on the diagnosis of lung diseases in chest CT, e.g., pulmonary nodules [22], [23] and pulmonary tuberculosis [24].

Although deep learning has achieved remarkable performance in diagnosing abnormalities in medical images [16], [25], [26], physicians remain concerned about the lack of model interpretability and understanding [27], which is important for the diagnosis of COVID-19. To provide more insight into model decisions, the class activation mapping (CAM) [28] and gradient-weighted class activation mapping (Grad-CAM) [29] methods have been proposed to produce localization heatmaps highlighting important regions that are closely associated with the predicted results.

In this study, we propose a dual-sampling attention network to classify COVID-19 and CAP infection. To focus on the lung, our method leverages a lung mask to suppress the image context of non-lung regions in chest CT. At the same time, we refine the attention of the deep learning model through an online mechanism, in order to better focus on the infection regions in the lung. In this way, the model facilitates interpreting and explaining the evidence for the automatic diagnosis of COVID-19.
The experimental results also demonstrate that the proposed online attention refinement can effectively improve classification performance. In our work, an important observation is that COVID-19 cases usually have more severe infection than CAP cases [30] , although some COVID-19 cases and CAP cases do have similar infection sizes. To illustrate it, we use an established VB-Net toolkit [10] to automatically segment lungs and pneumonia infection regions on all the cases in our training-validation (TV) set (with details of our TV set provided in Section IV), and show the distribution of the ratios between the infection regions and lungs in Fig. 1 . We can see the imbalanced distribution of the infection size ratios in both COVID-19 and CAP data. In this situation, the conventional uniform sampling on the entire dataset to train the network could lead to unsatisfactory diagnosis performance, especially concerning the limited cases of COVID-19 with small infections and also the limited cases of CAP with large infections. To this end, we train the second network with the size-balanced sampling strategy, by sampling more cases of COVID-19 with small infections and also more cases of CAP with large infections within mini-batches. Finally, we apply ensemble learning to integrate the networks of uniform sampling and size-balanced sampling to get the final diagnosis results, by following the dual-sampling strategy. As a summary, the contributions of our work are in threefold: • We propose an online module to utilize the segmented pneumonia infection regions to refine the attention for the network. This ensures the network to focus on the infection regions and increase the adoption of visual attention for model interpretability and explainability. • We propose a dual-sampling strategy to train the network, which further alleviates the imbalanced distribution of the sizes of pneumonia infection regions. 
• To our knowledge, we have used the largest multi-center CT data in the world for evaluating automatic COVID-19 diagnosis. In particular, we conduct extensive crossvalidations in a TV dataset of 2186 CT scans from 1588 patients. Moreover, to better evaluate the performance and generalization ability of the proposed method, a large independent testing set of 2796 CT scans from 2057 patients is also used. Experimental results demonstrate that our algorithm is able to identify the COVID-19 images with the area under the receiver operating characteristic curve (AUC) value of 0.944, accuracy of 87.5%, sensitivity of 86.9%, specificity of 90.1%, and F1-score of 82.0%. Chest X-ray (CXR) is one of the firstline imaging modality to diagnose pneumonia, which manifests as increased opacity [31] . The CNN networks have been successfully applied to pneumonia diagnosis in CXR images [16] , [32] . As the release of the Radiological Society of North America (RSNA) pneumonia detection challenge [33] dataset, object detection methods (i.e., RetinaNet [34] and Mask R-CNN [35] ) have been used for pneumonia localization in CXR images. At the same time, CT has been used as a standard procedure in the diagnosis of lung diseases [36] . An automated classification method has been proposed to use regional volumetric texture analysis for usual interstitial pneumonia diagnosis in highresolution CT [37] . For COVID-19, GGO and consolidation along the subpleural area of the lung are the typical radiographic features of COVID-19 patients [9] . Chest CT, especially high-resolution CT, can detect small areas of ground glass opacity (GGO) [38] . Some recent works have focused on the COVID-19 diagnosis from other pneumonia in CT images [39] , [40] , [41] . It requires the chest CT images to identify some typical features, including GGO, multifocal patchy consolidation, and/or interstitial changes with a peripheral distribution [9] . Wang et al. 
[39] propose a 2D CNN network to classify between COVID-19 and other viral pneumonia based on manually delineated regions. Xu et al. [40] use a V-Net model to segment the infection region and apply a ResNet18 network for the classification. Ying et al. [41] use a ResNet50 network to process all the slices of each 3D chest CT images to form the final prediction for each CT images. However, all these methods are evaluated in small datasets. In this paper, we have collected 4982 CT scans from 3645 patients, provided by 8 collaborative hospitals. To our best knowledge, it is the largest multi-center dataset for COVID-19 till now, which can prove the effectiveness of the method. Note that, in the context of pneumonia diagnosis, lung segmentation is often an essential preprocessing step in analyzing chest CT images to assess pneumonia. In the literature, Alom et al. [42] utilize U-net, residual network and recurrent CNN for lung lesion segmentation. A convolutional-deconvolutional capsule network has also been proposed for pathological lung segmentation in CT images. In this paper, we use an established VB-Net toolkit for lung segmentation, which has been reported with high Dice similarity coefficient of > 98% in evaluation [10] . Also, this VB-Net toolkit achieves Dice similarity coefficient of 92% between automatically and manually delineated pneumonia infection regions, showing the state-of-the-art performance [43] . For more related works, a recent review paper of automatic segmentation methods on COVID-19 could be found in [43] . For network training in the datasets with long-tailed data distribution, there exist some problems for the universal paradigm to sample the entire dataset uniformly [45] . In such datasets, some classes contain relatively few samples. The information of these cases may be ignored by the network if applying uniform sampling. 
To address this, some class resampling strategies have been proposed in the literature [46] , [47] , [48] , [49] , [50] . The aim of these methods is to adjust the numbers of the examples from different classes within mini-batches, which achieves better performance on longtailed dataset. Generally, class re-sampling strategies could be categorized into two groups, i.e., over-sampling by repeating data for minority classes [46] , [47] , [48] and under-sampling by randomly removing samples to make the number of each class to be equal [47] , [49] , [50] . The COVID-19 data is hard to collect and precious, so abandoning data is not a good choice. In this study, we adapt the over-sampling strategies [46] on the COVID-19 with small infections and also CAP with large infections to form a size-balanced sampling method, which can better balance the distribution of the infection regions of COVID-19 and CAP cases within mini-batches. However, over-sampling may lead to over-fitting upon these minority classes [51] , [52] . We thus propose the dual-sampling strategy to integrate results from the two networks trained with uniform sampling and size-balanced sampling, respectively. Attention mechanism has been widely used in many deep networks, and can be roughly divided into two types: 1) activation-based attention [53] , [54] , [55] and 2) gradientbased attention [28] , [29] . The activation-based attention usually serves as an inserted module to refine the hidden feature maps during the training, which can make the network to focus on the important regions. For the activation-based attention, the channel-wise attention assigns weights to each channel in the feature maps [55] while the position-wise attention produces heatmaps of importance for each pixel of the feature maps [53] , [54] . The most common gradient-based attention methods are CAM [28] and Grad-CAM [29] , which reveal the important regions influencing the network prediction. 
These methods are normally conducted offline and provide a pattern of model interpretability during the inference stage. Recently, some studies [56] , [57] argue that the gradient-based methods can be developed as an online module during the training for better localization. In this study, we extend the gradient-based attention to composing an online trainable component and the scenario of 3D input. The proposed attention module utilizes the segmented pneumonia infection regions to ensure that the network can make decisions based on these infection regions. The overall framework is shown in Fig. 2 . The input for the network is the 3D CT images masked in lungs only. We use an established VB-Net toolkit [10] to segment the lungs for all CT images, and perform auto-contouring of possible infection regions as shown in Fig. 3 . The VB-Net toolkit is a modified network that combines V-Net [58] with bottleneck layers to reduce and integrate feature map channels. The toolkit is capable of segmenting the infected regions as well as the lung fields, achieving Dice similarity coefficient of 92% between automatically and manually delineated infection regions [10] . By labeling all voxels within the segmented regions to 1, and the rest part to 0, we can get the corresponding lung mask and then input image by masking the original CT image with the corresponding lung mask. As shown in Fig. 2 , the training pipeline of our method consists of two stages: 1) using different sampling strategies to train two 3D ResNet34 models [44] with the online attention module; 2) training an ensemble learning layer to integrate the predictions from the two models. The details of our method are introduced in the following sections. Fig. 2 . Illustration of the pipeline of the proposed method, including two steps. 1) We train two 3D ResNet34 networks [44] with different sampling strategies. 
Also, the online attention mechanism generates attention maps during training, which refer to the segmented infection regions to refine the attention localization. 2) We use the ensemble learning to integrate predictions from the two trained networks. In this figure, "Attention RN34 + US" means the 3D ResNet34 (RN34) with attention module and uniform sampling (US) strategy, while "Attention RN34 + SS" means the 3D ResNet34 with attention module and size-balanced sampling (SS) strategy. "GAP" indicates the global average pooling layer, and "FC" indicates the fully connected layer. "1 × 1 × 1 Conv" refers to the convolutional layer with 1 × 1 × 1 kernel, and takes the parameters from the fully connected layer as the kernel weights. "MSE Loss" refers to the mean square error function. Infection Mask We use the 3D ResNet34 architecture [44] as the backbone network. It is the 3D extended version of residual network [13] , which uses the 3D kernels in all the convolutional layers. In 3D ResNet34, we set the stride of each dimension as 1 in the last residual block instead of 2. This makes the resolution of the feature maps before the global average pooling (GAP) [59] operation into 1/16 of the input CT image in each dimension. Compared with the case of downsampling the input image by a factor of 32 in each dimension in the original 3D ResNet34, it can greatly improve the quality of the generated attention maps based on higher-resolution feature maps. To exhaustively learn all features that are important for classification, and also to produce the corresponding attention maps, we use an online attention mechanism of 3D class activation mapping (CAM). The key idea of CAM [28] , [29] , [56] is to back-propagate weights of the fully-connected layer onto the convolutional feature maps for generating the attention maps. In this study, we extend this offline operation to become an online trainable component for the scenario of 3D input. 
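To make the link between the fully connected layer and the 1 × 1 × 1 convolution concrete, the following minimal sketch (plain Python, not the authors' implementation) applies the FC weights of a single output unit as a 1 × 1 × 1 kernel over toy feature maps. It assumes a single-logit classifier without a bias term; the function names and toy values are illustrative only. Because both operations are linear, averaging the resulting map over all voxels reproduces the score obtained by GAP followed by the FC layer.

```python
def conv1x1x1(features, fc_weights):
    """Weighted sum over channels at every voxel (a 1x1x1 convolution).

    features: list of C channels, each a nested list indexed [x][y][z].
    fc_weights: list of C floats, reused from the fully connected layer.
    """
    size_x = len(features[0])
    size_y = len(features[0][0])
    size_z = len(features[0][0][0])
    return [[[sum(w * ch[x][y][z] for w, ch in zip(fc_weights, features))
              for z in range(size_z)]
             for y in range(size_y)]
            for x in range(size_x)]


def flatten(volume):
    """Flatten a nested [x][y][z] volume into a flat list of voxels."""
    return [v for plane in volume for row in plane for v in row]


def gap_then_fc(features, fc_weights):
    """Global average pooling per channel, then the FC layer (no bias)."""
    pooled = [sum(flatten(ch)) / len(flatten(ch)) for ch in features]
    return sum(w * p for w, p in zip(fc_weights, pooled))


# Toy example: 2 channels on a 1 x 1 x 2 grid (values are made up).
features = [[[[1.0, 3.0]]], [[[2.0, -4.0]]]]
fc_weights = [0.5, 1.0]

cam_values = flatten(conv1x1x1(features, fc_weights))
attention = [max(0.0, v) for v in cam_values]  # ReLU keeps positive evidence only
```

In this toy case the class score and the mean of the unrectified map coincide, which is why the FC parameters can be reused directly as the convolution kernel.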
Let f denote the feature maps before the GAP operation and w denote the weight matrix of the fully-connected layer. To make our attention generation procedure trainable, we use w as the kernel of a 1 × 1 × 1 convolution layer and apply a ReLU layer [60] to generate the attention feature map A as:

$$A = \mathrm{ReLU}(\mathrm{conv}(f, w)), \tag{1}$$

where A has the shape X × Y × Z, and X, Y, Z are 1/16 of the corresponding sizes of the input CT image. Given the attention feature map A, we first upsample it to the input image size, then normalize it to have intensity values between 0 and 1, and finally apply a sigmoid for soft masking [57], as follows:

$$T(A) = \frac{1}{1 + \exp(-\alpha (A - \beta))}, \tag{2}$$

where the values of α and β are set to 100 and 0.4, respectively. T(A) is the attention map generated by this online attention module, where A is defined in Eq. 1. During training, the parameters of the 1 × 1 × 1 convolution layer are always copied from the fully-connected layer and are only updated by the binary cross entropy (BCE) loss for the classification task.

The main idea of size-balanced sampling is to repeat the data sampling for the COVID-19 cases with small infections and also the CAP cases with large infections in each mini-batch during training. Normally, we use uniform sampling over the entire dataset for network training (i.e., the "Attention RN34 + US" branch in Fig. 2). Specifically, each sample in the training dataset is fed into the network only once, with equal probability, within one epoch. Thus, the model can review the entire dataset while maintaining the intrinsic data distribution. Due to the imbalanced distribution of infection size, we train a second network via the size-balanced sampling strategy (i.e., the "Attention RN34 + SS" branch). It aims to boost the sampling probability of the small-infection-area COVID-19 cases and the large-infection-area CAP cases in each mini-batch.
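The size-balanced idea can be sketched as a two-step draw: choose a size group according to a weight, then choose a case uniformly within that group. The snippet below is only an illustration under stated assumptions: the cases are assumed to be already split into size groups, and the group weights and case IDs are hypothetical placeholders rather than values used by the authors.

```python
import random


def size_balanced_batch(groups, group_weights, batch_size, rng):
    """Draw one mini-batch: pick a group by weight, then a case uniformly."""
    names = list(groups)
    weights = [group_weights[name] for name in names]
    batch = []
    for _ in range(batch_size):
        group = rng.choices(names, weights=weights, k=1)[0]
        batch.append(rng.choice(groups[group]))
    return batch


# Hypothetical case IDs; real groups would come from the infection/lung ratio.
groups = {
    "covid_small": ["c1", "c2"],
    "covid_large": ["c3", "c4", "c5", "c6", "c7", "c8"],
    "cap_small": ["p1", "p2", "p3", "p4", "p5", "p6"],
    "cap_large": ["p7", "p8"],
}
# Hypothetical weights: the two rare groups are drawn more often than
# their share of the data would suggest under uniform sampling.
group_weights = {
    "covid_small": 2.0,
    "covid_large": 1.0,
    "cap_small": 1.0,
    "cap_large": 2.0,
}

batch = size_balanced_batch(groups, group_weights, batch_size=20,
                            rng=random.Random(0))
```

Sampling with replacement inside a group is what lets rare cases repeat within an epoch, which is also the source of the over-fitting risk that motivates combining this branch with the uniformly sampled one.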
To this end, we split the data into 4 groups according to the volume ratio of the pneumonia infection regions and the lung: 1) small-infection-area COVID-19, 2) large-infection-area COVID-19, 3) small-infection-area CAP, and 4) large-infection-area CAP. For COVID-19, we define the cases with a ratio < 0.030 as small-infection-area COVID-19, and the rest as large-infection-area COVID-19. For CAP, we define the cases with a ratio > 0.001 as large-infection-area CAP, and the rest as small-infection-area CAP. We define the numbers of samples for the 4 groups [...], and uniformly pick a sample from the selected group. This strategy gives a higher probability of sampling cases from the two groups of 1) COVID-19 with small infections and 2) CAP with large infections. We apply the size-balanced sampling strategy to all mini-batches when training the "Attention RN34 + SS" model.

Two losses are used to train the "Attention RN34 + US" and "Attention RN34 + SS" models, i.e., the classification loss $L_c$ and the extra attention loss $L_{ex}$ for COVID-19 cases. We adopt the binary cross entropy as the COVID-19/CAP classification loss $L_c$. For the COVID-19 cases, given the pneumonia infection segmentation mask $M$, we can use it to directly refine the attention maps from our model, and $L_{ex}$ is thus formulated as:

$$L_{ex} = \frac{\sum_{ijk} \left( T(A_{ijk}) - M_{ijk} \right)^2}{\sum_{ijk} T(A_{ijk}) + \sum_{ijk} M_{ijk}}, \tag{3}$$

where $T(A_{ijk})$ is the attention map generated by our online attention module (Eq. 2), and $i$, $j$ and $k$ index the $(i, j, k)$-th voxel in the attention map. The proposed $L_{ex}$ is modified from the traditional mean square error (MSE) loss, using the sum of the regions of the attention map $T(A_{ijk})$ and the corresponding mask $M_{ijk}$ as an adaptive normalization factor. It can adjust the loss value dynamically according to the sizes of the pneumonia infection regions. Then, the overall objective function for training the "Attention RN34 + US" and "Attention RN34 + SS" models is expressed as:

$$L = L_c + \lambda L_{ex}, \tag{4}$$

where λ is a weight factor for the attention loss.
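As a rough illustration of how the soft mask and the two loss terms fit together, the sketch below works on flattened voxel lists in plain Python. It is not the authors' code. It assumes min-max normalization for the "normalize to [0, 1]" step, the label convention 1 = COVID-19 and 0 = CAP, an attention loss of the form "squared error divided by the summed attention and mask", and a small `eps` added to avoid division by zero; all function names are hypothetical.

```python
import math


def soft_mask(attention, alpha=100.0, beta=0.4):
    """Min-max normalize the (already upsampled) attention, then apply a sigmoid."""
    lo, hi = min(attention), max(attention)
    span = hi - lo
    normed = [(a - lo) / span if span > 0 else 0.0 for a in attention]
    return [1.0 / (1.0 + math.exp(-alpha * (v - beta))) for v in normed]


def attention_loss(soft, mask, eps=1e-8):
    """Squared error normalized by the total mass of attention map and mask."""
    squared_error = sum((t - m) ** 2 for t, m in zip(soft, mask))
    normalizer = sum(soft) + sum(mask) + eps  # eps is an added safeguard
    return squared_error / normalizer


def bce(prob, label):
    """Binary cross entropy for one predicted probability."""
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def total_loss(prob, label, soft, mask, lam):
    """Classification loss, plus the weighted attention loss for COVID-19 only."""
    loss = bce(prob, label)
    if label == 1:  # assumed convention: 1 = COVID-19, 0 = CAP
        loss += lam * attention_loss(soft, mask)
    return loss


# Toy check: attention that matches the infection mask gives a near-zero loss,
# attention on exactly the wrong voxels gives a loss close to 1.
soft = soft_mask([0.0, 0.0, 1.0, 1.0])
matched = attention_loss(soft, [0, 0, 1, 1])
mismatched = attention_loss(soft, [1, 1, 0, 0])
```

In this sketch `lam` plays the role of λ, and the normalizer grows with the infection size, so a large lesion does not dominate the loss simply because it covers more voxels.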
The weight factor λ is set to 0.5 in our experiments. For the CAP cases, only the classification loss $L_c$ is used for model training.

The size-balanced sampling method pays more attention to the minority classes and remedies the infection area bias in COVID-19 and CAP patients. A drawback is that it may over-fit these minority classes. In contrast, the uniform sampling method can learn feature representations from the original data distribution in a relatively robust way. To take advantage of both sampling methods, we propose a dual-sampling method via an ensemble learning layer, which sets the weights for the prediction results produced by the two models. After training the two models with different sampling strategies, we use the ensemble learning layer to integrate the predictions from the two models into the final diagnosis result. We combine the prediction scores with different weights for different ratios of the pneumonia infection regions and the lung:

$$P_{final} = w P_{US} + (1 - w) P_{SS}, \tag{5}$$

where $P_{US}$ and $P_{SS}$ are the prediction scores of the "Attention RN34 + US" and "Attention RN34 + SS" models, respectively, and $w$ is the weight factor. In our experiment, it is set to 0.35 for the cases where the ratio meets the criterion < 0.001 or > 0.030, and to 0.96 for the remaining cases. The factor values are determined with a hyperparameter search on the TV set. Then, $P_{final}$ is the final prediction result of the dual-sampling model. As presented in Eq. 5, the dual-sampling strategy combines the characteristics of uniform sampling and size-balanced sampling. For the minority classes, i.e., COVID-19 with small infections as well as CAP with large infections, we assign extra weight to the "Attention RN34 + SS" model. For the remaining cases, more weight is assigned to the "Attention RN34 + US" model.

In this study, we use a large multi-center CT dataset to evaluate the proposed method for the diagnosis of COVID-19. In particular, we have collected a total of 4982 chest CT images (slice thickness < 2 mm) from 3645 patients, including 3389 COVID-19 CT images and 1593 CAP CT images.
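Returning to the ensemble step, the ratio-dependent weighting can be sketched in a few lines. This is a hedged illustration, assuming the final score is the convex combination w · P_US + (1 − w) · P_SS, which is consistent with the statement that the uniformly sampled model receives more weight (0.96) for the non-extreme cases. The function and parameter names are hypothetical; the thresholds and weights are the ones quoted in the text.

```python
def dual_sampling_score(p_us, p_ss, infection_ratio,
                        w_extreme=0.35, w_other=0.96,
                        low=0.001, high=0.030):
    """Blend the two model scores with a weight chosen by the infection ratio.

    p_us: score of the uniformly sampled model.
    p_ss: score of the size-balanced model.
    infection_ratio: infection volume divided by lung volume.
    """
    is_extreme = infection_ratio < low or infection_ratio > high
    w = w_extreme if is_extreme else w_other
    return w * p_us + (1.0 - w) * p_ss


# A case with a large infection ratio leans on the size-balanced model,
# a mid-range case leans almost entirely on the uniformly sampled model.
large_case = dual_sampling_score(0.8, 0.4, infection_ratio=0.05)
mid_case = dual_sampling_score(0.8, 0.4, infection_ratio=0.01)
```

Since the weights come from a hyperparameter search on the TV set, they would need to be re-tuned for any other pair of models or any other dataset.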
All recruited COVID-19 patients were confirmed by RT-PCR test. Here, the images were provided by the Tongji Hospital of Huazhong University [...] Table I. Thin-slice chest CT images are used in this study, with the CT slice thickness ranging from 0.625 to 1.5 mm. CT scanners include uCT 780 from UIH; Optima CT520, Discovery CT750, and LightSpeed 16 from GE; Aquilion ONE from Toshiba; SOMATOM Force from Siemens; and SCENARIA from Hitachi. The scanning protocol includes: 120 kV, with breath hold at full inspiration. All CT images were anonymized before being sent for this research project. The study was approved by the Institutional Review Boards of the participating institutes. Written informed consent was waived due to the retrospective nature of the study.

Data are pre-processed in the following steps before being fed into the network. First, we resample all CT images and the corresponding masks of lungs and infection regions to the same spacing (0.7168 mm, 0.7168 mm, and 1.25 mm for the x, y, and z axes, respectively) to normalize them to the same voxel size. Second, we down-sample the CT images and segmentation masks to approximately half size for efficient computation. To avoid morphological changes in down-sampling, we use the same scale factor in all three dimensions and pad zeros to ensure a final size of 138 × 256 × 256. We should emphasize that our method is capable of handling full-size images. Third, we apply "window/level" (window: 1500, level: -600) scaling to the CT images for contrast enhancement. That is, we truncate the CT image to the window [-1350, 150], which sets intensity values above 150 to 150, and values below -1350 to -1350. Finally, following the standard protocol of data pre-processing, we normalize the voxel-wise intensities in the CT images to the interval [0, 1].

We implement the networks in PyTorch [61], and use NVIDIA Apex for lower memory consumption and faster computation.
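The window/level step of the pre-processing can be written out as a small worked example. The sketch below is illustrative only: it assumes that the final normalization to [0, 1] is a linear rescaling of the clipped window, and the function name is hypothetical. A level of -600 with a window of 1500 gives the bounds -600 ± 750, i.e., [-1350, 150].

```python
def window_normalize(hu_values, level=-600.0, window=1500.0):
    """Clip intensities to the window around `level`, then rescale to [0, 1]."""
    lower = level - window / 2.0   # -1350 for the settings in the text
    upper = level + window / 2.0   # 150 for the settings in the text
    clipped = [min(max(v, lower), upper) for v in hu_values]
    return [(v - lower) / (upper - lower) for v in clipped]


# Values below the window map to 0, values above it map to 1,
# and the window level itself maps to the midpoint.
normalized = window_normalize([-2000.0, -1350.0, -600.0, 150.0, 1000.0])
```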
We also use the Adam [62] optimizer with momentum set to 0.9, a weight decay of 0.0001, and a learning rate of 0.0002 that is reduced by a factor of 10 after every 5 epochs. We set the batch size to 20 during training. In our experiments, all the models are trained from scratch. In the TV set, we conduct 5-fold cross-validation. In each fold, the model is evaluated on the validation set at the end of each training epoch. The checkpoint with the best evaluation performance within 20 epochs is used as the final model and then evaluated on the test set. All the models are trained on 4 NVIDIA TITAN RTX graphics processing units, and the inference time for one sample is approximately 4.6 s on one NVIDIA TITAN RTX GPU.

For evaluation, we use five different metrics to measure the classification results of the model: area under the receiver operating characteristic curve (AUC), accuracy, sensitivity, specificity, and F1-score. AUC represents the degree or measure of separability. In this study, we calculate the accuracy, sensitivity, specificity, and F1-score at the threshold of 0.5.

First, we conduct 5-fold cross-validation on the TV set. The experimental results are shown in Table II, which combines the results of all 5 validation sets. The receiver operating characteristic (ROC) curve is also shown in Fig. 4(A). We can see that the models with the proposed attention refinement technique improve the AUC and sensitivity scores. At the same time, we can see that "Attention RN34 + DS" achieves the highest performance in AUC, accuracy, sensitivity, and F1-score when combining the two models with different sampling strategies. As for the specificity, the performance of the dual-sampling method is slightly lower than that of ResNet34 with uniform sampling.

We further investigate the generalization capability of the model by deploying the five trained models of the five individual folds on the independent testing dataset. From Fig.
4(B-F), we can see that the trained model of each fold achieves similar performance, implying consistent performance with different training data. Compared with the results on the TV set in Fig. 4(A), the AUC score of the model with the proposed attention module ("Attention RN34 + DS") on the independent test set drops from 0.988 to 0.944, while the AUC score of "RN34 + US" drops from 0.984 to 0.934. This suggests that our model, trained with the attention module, is more robust against possible over-fitting. The proposed attention module can also ensure that the decisions made by the model depend mainly on the infection regions, suppressing the contributions from unrelated parts of the images. All 501 CAP images in the test set are from a single site that was not included in the TV set. The "Attention RN34 + US" and "Attention RN34 + DS" models achieve ≥ 90.0% specificity for these images. We can see that our algorithm maintains good performance on data acquired from different centers. In the next section, the effects of different sampling strategies are presented.

To confirm whether using the proposed attention module makes a significant difference, paired t-tests are applied. The p-values between "RN34 + US" and the three proposed methods are calculated. All the p-values are smaller than 0.01, implying that the proposed methods yield significant improvements over "RN34 + US".

To demonstrate the effectiveness in diagnosing pneumonia of different severity, we use the VB-Net toolkit [10] to get the lung mask and the pneumonia infection regions for all CT images. Based on the quantified volume ratio of pneumonia infection regions over the lung, we roughly divide the data into 3 groups in both the TV set and the test set, according to the ratios, i.e., 1) < 0.005, 2) 0.005 − 0.030, and 3) > 0.030.
As shown in Table III, most COVID-19 images have high ratios (higher than 0.030), while most CAP images have ratios lower than 0.005, which may indicate that the severity of COVID-19 is usually higher than that of CAP in our collected dataset. Furthermore, the classification results for COVID-19 are highly related to the ratio. In Table III, we can see that the sensitivity scores are relatively high for the large infected region group (> 0.030), while the specificity scores are relatively high for the small infected region group (< 0.005). This performance matches the nature of COVID-19 and CAP in the collected dataset.

Table III. Group-wise results on the TV set and the test set. Based on the volume ratio of pneumonia regions and the lung, the data is divided into 3 groups: the volume ratios that meet the [...]

[...] [10]. For the attention results, we show the Grad-CAM results of "RN34 + US" (4th row), and the attention maps obtained by our proposed attention module of the "Attention RN34 + US" and "Attention RN34 + SS" models (5th and 6th rows).

When the size-balanced sampling strategy ("Attention RN34 + SS") is applied in the training procedure, we find that the sensitivity of the small infected region group (< 0.005) increases from 0.534 to 0.569, compared with the case of using the uniform sampling strategy ("Attention RN34 + US"). The specificity of the large infected region group (> 0.030) also increases, from 0.642 to 0.667. These results demonstrate that the size-balanced sampling strategy can effectively improve the classification robustness when a bias in the pneumonia area exists. However, if we only use the size-balanced sampling strategy in the training process, the sensitivity of the large infected region group (> 0.030) decreases from 0.965 to 0.955, and the specificity of the small infected region group (< 0.005) decreases from 0.933 to 0.896. This reflects that some advantages of the network may be sacrificed in order to meet specific requirements.
To achieve a dynamic balance between the two extreme conditions, we present the results of ensemble learning with the dual-sampling model (i.e., "Attention RN34 + DS"). Judging from the sensitivity and specificity in both the small and large infected region groups, the dual-sampling strategy can preserve the classification ability obtained by uniform sampling, and slightly improve the classification performance on the COVID-19 cases in the small infected region group and the CAP cases in the large infected region group. Furthermore, the p-values between "Attention RN34 + US" and "Attention RN34 + DS" in both the small infected region group (< 0.005) and the large infected region group (> 0.030) are calculated. All the p-values are smaller than 0.01, which supports the effectiveness and necessity of the dual-sampling strategy.

Finally, we show typical attention maps obtained by our models (Fig. 5) trained in one fold. For comparison, we show the attention results of the naive ResNet34 ("RN34 + US") in the same fold, trained without the online attention module and the infection mask refinement, and apply a model explanation technique (Grad-CAM [29]) to get the heatmaps for classification. We can see that the output of Grad-CAM roughly indicates the infection localization, yet sometimes appears far outside of the lung. In contrast, the attention maps from our models ("Attention RN34 + US" and "Attention RN34 + SS") reveal the precise locations of the infection. These conspicuous areas in the attention maps are similar to the infection segmentation results, which suggests that the final classification results determined by our model are reliable and interpretable. The attention maps could thus possibly be used as a basis for deriving the COVID-19 diagnosis in clinical practice.

We also show two failure cases in Fig. 6, where the COVID-19 cases are mistakenly classified as CAP by all the models. As can be observed from the results shown in Fig.
5, the attention maps from all the models are incorrectly activated on many areas unrelated to pneumonia. The "RN34 + US" model even generates many highlighted areas in the non-lung region instead of focusing on the lungs. With the proposed attention constraint, the attention maps of "Attention RN34 + US" and "Attention RN34 + SS" partially alleviate this problem, but the visual evidence is still insufficient to reach a correct final prediction.

For COVID-19, it is important to obtain the diagnosis result as soon as possible. Although RT-PCR is the current ground truth for diagnosing COVID-19, it can take days to get the final results, and the testing capacity is also limited in many places, especially early in the outbreak [8]. CT has been shown to be a powerful tool and can provide chest scan results within several minutes. It is therefore beneficial to develop an automatic diagnosis method based on chest CT to assist COVID-19 screening. In this study, we explore a deep-learning-based method to automatically distinguish COVID-19 from CAP in chest CT images. We evaluate our method on what is, to the best of our knowledge, the largest multi-center CT dataset in the world. To further evaluate the generalization ability of the model, we use independent data from different hospitals (not included in the TV set), achieving an AUC of 0.944, accuracy of 87.5%, sensitivity of 86.9%, specificity of 90.1%, and F1-score of 82.0%. At the same time, to better understand the decisions of the deep learning model, we also refine the attention module and show the visual evidence, which reveals the important regions used by the model for diagnosis. Our proposed method could be further extended to the differential diagnosis of pneumonia, which could greatly assist physicians.

There also exist several limitations in this study.
First, when longitudinal data become available, the proposed model should be tested for its consistency in tracking the development of COVID-19 during treatment, as considered in [63]. Second, although the proposed online attention module could largely improve the interpretability and explainability of COVID-19 diagnosis in comparison to conventional methods such as Grad-CAM, future work is still needed to analyze the correlation between these attention localizations and the specific imaging signs that are frequently used in clinical diagnosis. There also exist some failure cases in which the visualization results do not fall correctly on the pneumonia infection regions, as shown in Fig. 6. This motivates us to further improve the attention module in future research, so that it better focuses on the related regions and reduces the distortion that confounding visual information introduces into the classification task. Third, we also notice that the accuracy on small-infection-area COVID-19 cases is not quite satisfactory. This indicates the necessity of combining CT images with clinical assessment and laboratory tests for precise diagnosis of early COVID-19, which will also be covered by our future work. Last but not least, the CAP cases used in this study do not include subtype information, i.e., bacterial, fungal, and non-COVID-19 viral pneumonia. Assisting the clinical diagnosis of pneumonia subtypes would also be beneficial.

To conclude, we have developed a 3D CNN with both online attention refinement and a dual-sampling strategy to distinguish COVID-19 from CAP in chest CT images. The generalization performance of this algorithm is also verified on what is, to the best of our knowledge, the largest multi-center CT dataset in the world.
[1] Coronavirus disease 2019 (COVID-19): situation report
[2] WHO director-general's remarks at the media briefing on
[3] Coronavirus disease (COVID-2019) situation reports
[4] Characteristics of and important lessons from the coronavirus disease 2019 (COVID-19) outbreak in China: summary of a report of 72 314 cases from the Chinese Center for Disease Control and Prevention
[5] Coronavirus: COVID-19 has killed more people than SARS and MERS combined, despite lower case fatality rate
[6] Coronavirus disease 2019 (COVID-19): A perspective from China
[7] A familial cluster of pneumonia associated with the 2019 novel coronavirus indicating person-to-person transmission: a study of a family cluster
[8] Correlation of chest CT and RT-PCR testing in coronavirus disease 2019 (COVID-19) in China: a report of 1014 cases
[9] CT imaging features of 2019 novel coronavirus
[10] Lung infection quantification of COVID-19 in CT images with deep learning
[11] Deep learning
[12] ImageNet classification with deep convolutional neural networks
[13] Deep residual learning for image recognition
[14] Densely connected convolutional networks
[15] Estimating CT image from MRI data using 3D fully convolutional networks
[16] ChestX-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases
[17] U-Net: Convolutional networks for biomedical image segmentation
[18] Backpropagation applied to handwritten zip code recognition
[19] Automatic lung segmentation based on texture and deep features of HRCT images with interstitial lung disease
[20] Lung segmentation on HRCT and volumetric CT for diffuse interstitial lung disease using deep convolutional neural networks
[21] Deep learning with convolutional neural network for differentiation of liver masses at dynamic contrast-enhanced CT: a preliminary study
[22] Added value of computer-aided CT image features for early lung cancer diagnosis with small pulmonary nodules: a matched case-control study
[23] End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography
[24] Deep learning at chest radiography: automated classification of pulmonary tuberculosis by using convolutional neural networks
[25] CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison
[26] A deep learning architecture for image representation, visual interpretability and automated basal-cell carcinoma cancer detection
[27] Visual interpretability for deep learning: a survey
[28] Learning deep features for discriminative localization
[29] Grad-CAM: Visual explanations from deep networks via gradient-based localization
[30] Large-scale screening of COVID-19 from community acquired pneumonia using infection size-aware classification
[31] Imaging of community-acquired pneumonia
[32] CheXNet: Radiologist-level pneumonia detection on chest X-rays with deep learning
[33] Radiological Society of North America
[34] Focal loss for dense object detection
[35] Mask R-CNN
[36] Radiological diagnosis in lung disease: factoring treatment options into the choice of diagnostic modality
[37] Automated classification of usual interstitial pneumonia using regional volumetric texture analysis in high-resolution CT
[38] Guidelines for management of incidental pulmonary nodules detected on CT images: from the Fleischner Society
[39] A deep learning algorithm using CT images to screen for corona virus disease
[40] Deep learning system to screen coronavirus disease 2019 pneumonia
[41] Deep learning enables accurate diagnosis of novel coronavirus (COVID-19) with CT images
[42] Recurrent residual convolutional neural network based on U-Net (R2U-Net) for medical image segmentation
[43] Review of artificial intelligence techniques in imaging data acquisition, segmentation and diagnosis for COVID-19
[44] Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet
[45] The devil is in the tails: Fine-grained classification in the wild
[46] BBN: Bilateral-branch network with cumulative learning for long-tailed visual recognition
[47] A systematic study of the class imbalance problem in convolutional neural networks
[48] Relay backpropagation for effective learning of deep convolutional neural networks
[49] Learning from imbalanced data
[50] The class imbalance problem: A systematic study
[51] Class-balanced loss based on effective number of samples
[52] SMOTE: synthetic minority over-sampling technique
[53] Non-local neural networks
[54] Dual attention network for scene segmentation
[55] Squeeze-and-excitation networks
[56] Attention branch network: Learning of attention mechanism for visual explanation
[57] Tell me where to look: Guided attention inference network
[58] V-Net: Fully convolutional neural networks for volumetric medical image segmentation
[59] Network in network
[60] Rectified linear units improve restricted Boltzmann machines
[61] PyTorch: An imperative style, high-performance deep learning library
[62] Adam: A method for stochastic optimization
[63] CLASSIC: consistent longitudinal alignment and segmentation for serial image computing