title: Robust Automated Framework for COVID-19 Disease Identification from a Multicenter Dataset of Chest CT Scans
authors: Heidarian, Shahin; Afshar, Parnian; Enshaei, Nastaran; Naderkhani, Farnoosh; Rafiee, Moezedin Javad; Oikonomou, Anastasia; Shafiee, Akbar; Fard, Faranak Babaki; Plataniotis, Konstantinos N.; Mohammadi, Arash
date: 2021-09-19

The objective of this study is to develop a robust deep learning-based framework to distinguish COVID-19, Community-Acquired Pneumonia (CAP), and Normal cases based on chest CT scans acquired in different imaging centers using various protocols and radiation doses. We show that although our proposed model is trained on a relatively small dataset acquired from only one imaging center using a specific scanning protocol, it performs well on heterogeneous test sets obtained by multiple scanners using different technical parameters. We also show that the model can be updated via an unsupervised approach to cope with the data shift between the train and test sets and to enhance its robustness upon receiving a new external dataset from a different center. We adopted an ensemble architecture to aggregate the predictions from multiple versions of the model. For initial training and development purposes, an in-house dataset of 171 COVID-19, 60 CAP, and 76 Normal cases was used, which contained volumetric CT scans acquired from one imaging center using a constant standard-radiation-dose scanning protocol. To evaluate the model, we retrospectively collected four different test sets to investigate the effects of shifts in data characteristics on the model's performance. Among the test cases, there were CT scans with characteristics similar to those of the train set, as well as noisy low-dose and ultra-low-dose CT scans. In addition, some test CT scans were obtained from patients with a history of cardiovascular diseases or surgeries. The entire test dataset used in this study contained 51 COVID-19, 28 CAP, and 51 Normal cases. Experimental results indicate that our proposed framework performs well on all test sets, achieving a total accuracy of 96.15% (95%CI: [91.25-98.74]), a COVID-19 sensitivity of 96.08% (95%CI: [86.54-99.5]), and a CAP sensitivity of 92.86% (95%CI: [76.50-99.19]).

Since the emergence of the novel coronavirus disease and the consequent global pandemic, healthcare authorities have used different diagnostic technologies to rapidly and accurately detect infected cases. Among such diagnostic technologies, chest Computed Tomography (CT) scans have been widely used, providing informative images of the lung parenchyma. More importantly, CT scans are highly sensitive for the diagnosis of COVID-19 infection, particularly given its characteristic abnormality pattern and infection distribution in the lung 1. To analyze a CT scan, radiologists must review numerous 2D images (slices) that jointly create a 3D representation of the body. Consequently, the analysis of a CT scan requires careful review of all slices. Furthermore, the lung imaging manifestations of COVID-19 overlap considerably with those of Community-Acquired Pneumonia (CAP), making the diagnosis even more challenging for radiologists.
The aforementioned issues have motivated the development of Artificial Intelligence (AI)-based diagnostic solutions that use advancements in Deep Learning (DL) to analyze volumetric CT scans and provide diagnostic labels in a timely fashion 2. Despite the recent surge of interest and the success of DL-based diagnostic solutions, such models commonly fail to achieve acceptable performance when there is heterogeneity in data characteristics between the train and test sets, which is common when acquiring data from multiple imaging centers 3. Developing a robust framework that minimizes the effect of the gap between the train and test sets and provides acceptable results on varied external datasets is therefore of utmost importance. In the case of CT scans, several factors contribute to the characteristics of the images, among which scanner type, scanner manufacturer, and scanning protocol have the most influence on the quality and characteristics of the scans 4, 5. Furthermore, the patients' clinical and surgical history can add further complexity and undesired artifacts to the CT scans that the trained model has never seen 6. Capitalizing on the above discussion, this study aims to develop a robust deep learning-based framework that generalizes to varied external datasets, with high flexibility to update itself upon receiving new external datasets. In this context, on the one hand, the paper introduces an automated two-stage classification framework based on Capsule Networks, which is tailored to robustly classify volumetric chest CT scans into one of the three target classes (COVID-19, CAP, or normal). The proposed Capsule Network-based framework integrates a scalable enhancement approach to boost its performance and robustness in the presence of gaps between the train and test sets regarding types of scanners, imaging protocols, and technical parameters. On the other hand, the paper introduces a unique test dataset, referred to as the SPGC-COVID dataset, which is available for public access through Figshare 7. The SPGC-COVID dataset consists of COVID-19, CAP, and normal cases acquired with various imaging settings from different medical centers. It contains four subsets, illustrated in Fig. 1, including images with different slice thicknesses, radiation doses, and noise levels. In addition to different technical parameters, the dataset includes CT scans of patients who have heart diseases or have undergone heart surgery, in addition to having COVID-19 or CAP infection. The SPGC-COVID dataset was used as the test set in the 2021 Signal Processing Grand Challenge (SPGC) on COVID-19 diagnosis, which the authors organized as part of the 2021 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). The performance of our proposed Capsule Network-based framework is compared with state-of-the-art approaches from the COVID-19 grand challenge. The results demonstrate that our proposed framework outperforms all the submitted models, achieving an overall accuracy of 96.15% (95%CI: [91.25-98.74]).

Our proposed framework adopts a two-stage architecture based on Capsule Networks (CapsNets) 8, as shown in Fig. 2, which is fed by a volumetric CT scan and provides the probability of the input scan belonging to one of the three target classes. In brief, the first stage identifies CT slices demonstrating infection and passes them to the second stage to be classified as one of the target classes. A minimal sketch of this two-stage flow is shown below.
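For illustration only, the end-to-end inference flow just described can be sketched as follows. This is a minimal sketch rather than the authors' released code: `stage1_model`, `stage2_model`, and their `predict` interfaces are hypothetical placeholders, and the 3% normal-case filter detailed in the next paragraph is included for completeness.

```python
import numpy as np

def classify_ct_volume(slices, stage1_model, stage2_model, normal_threshold=0.03):
    """Two-stage inference sketch: Stage 1 flags infectious slices,
    Stage 2 classifies them, and slice votes are aggregated per patient.

    slices       : preprocessed 2D lung images, shape (n_slices, 256, 256)
    stage1_model : returns P(infectious) per slice (hypothetical interface)
    stage2_model : returns per-slice probabilities over (COVID-19, CAP, Normal)
    """
    # Stage 1: identify slices demonstrating infection.
    p_infectious = stage1_model.predict(slices)            # shape (n_slices,)
    infectious = slices[p_infectious > 0.5]

    # Normal-case filter: fewer than 3% infectious slices -> Normal.
    if len(infectious) / len(slices) < normal_threshold:
        return "Normal"

    # Stage 2: three-way classification of the candidate slices.
    probs = stage2_model.predict(infectious)               # shape (n_inf, 3)
    votes = probs.argmax(axis=1)

    # Majority voting transfers slice-level predictions to the patient level.
    return ["COVID-19", "CAP", "Normal"][np.bincount(votes, minlength=3).argmax()]
```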
The output of the first stage is also used to filter normal cases by applying a 3% threshold on the involvement of the lung parenchyma (i.e., the ratio of infectious slices in the whole volume). In addition to the proposed framework, four partially enhanced models are developed (based on the four test sets), and the final model aggregates the outputs of the partially enhanced models to provide the final predictions. The proposed enhancement approach extracts confidently predicted images from each test set in an unsupervised fashion, which are then used to update the model's parameters. To evaluate the performance of the proposed model and the effectiveness of its unsupervised enhancement approach, we used the first three test sets to enhance the benchmark model and kept the fourth test set aside for evaluation purposes only.

The results obtained by applying the enhanced ensemble model to all of the test sets are shown in Table 2. The Area Under the ROC Curve (AUC) is calculated based on the micro average of the values obtained for each class. In addition, to further validate the obtained results, confidence intervals for the total accuracy and sensitivity are provided using the method introduced in 9 (a sketch of this interval computation appears below). To elaborate on the effect of the proposed unsupervised enhancement approach, we have provided the performance of the benchmark model (i.e., before enhancement) as well as the models enhanced by individual test sets (i.e., before averaging the outputs) in Table 3. Results shown in Table 3 imply that, in some misclassified cases, the probability of the input CT scan belonging to the target class has been on the thresholding edge (close to 0.5) and could be corrected after incorporating the models enhanced on other test sets.

Table 3. The ratio of correctly classified cases over total cases in the associated class, obtained for the proposed model, the benchmark model, and the partially enhanced models.

In addition to the final patient-level predictions, we have evaluated the performance of the first stage on the validation set in detecting slices demonstrating infection, to gain clearer insight into the internal components of the framework. The first stage achieved an accuracy of 93.41%, a sensitivity of 91.04%, and a specificity of 94.26% in the binary (infectious vs. non-infectious) classification task. As slice-level labels (i.e., binary labels indicating the existence of infection in a CT slice) are not available for the test sets, only the result on the validation set is reported. Moreover, as mentioned earlier, the output of the first stage can be used to identify most normal cases before entering the next stage. We found that nearly all of the normal cases in the four test sets (45/46 cases) have been identified correctly by the thresholding mechanism applied to the output of the first stage, while none of the COVID-19 and CAP cases have been misclassified as normal using this thresholding approach. In Fig. 3, the ROC curves for COVID-19 and CAP cases against the other classes (e.g., COVID-19 vs. CAP and Normal) are plotted. The associated AUC values are also provided. We have compared our proposed framework with the top six models 10-15 developed during the Signal Processing Grand Challenge (SPGC) on COVID-19 diagnosis, which was organized by the authors as part of the 2021 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP).
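Regarding the interval computation referenced above, the following is a minimal sketch assuming the adjusted-Wald (Agresti-Coull) form of the approximate binomial interval advocated in reference 9; the exact values reported in the paper may differ slightly depending on the variant used. The worked example simply re-derives the COVID-19 sensitivity interval from 49 of 51 correctly identified cases.

```python
from math import sqrt

def approx_binomial_ci(successes, n, z=1.96):
    """Adjusted-Wald (Agresti-Coull) 95% interval for a binomial proportion.

    Adds z^2/2 pseudo-successes and z^2/2 pseudo-failures before applying
    the standard Wald formula, which behaves better for small samples than
    the "exact" interval.
    """
    n_adj = n + z**2                        # adjusted sample size
    p_adj = (successes + z**2 / 2) / n_adj  # adjusted proportion
    half_width = z * sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - half_width), min(1.0, p_adj + half_width)

# Example: COVID-19 sensitivity, 49 of 51 cases correctly identified.
low, high = approx_binomial_ci(49, 51)
print(f"{49/51:.2%} (95% CI: [{low:.2%}, {high:.2%}])")
```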
In the first phase of this SPGC, participants had access to the same train and validation sets as those used in this study to develop and evaluate their models. In the second phase, they were provided with the first three test sets and had two weeks to submit their final models. Finally, the best-performing models based on the first three test sets were evaluated on the fourth test set to determine the overall performance. Experimental results demonstrate that our proposed framework outperforms its counterparts proposed in the SPGC. Furthermore, it benefits from a scalable enhancement approach that can be integrated into most state-of-the-art models to improve their performance when tested on a heterogeneous dataset. In what follows, the six best-performing models from the SPGC on COVID-19 diagnosis are briefly described; their corresponding performances on the entire test set are presented in Table 4.

• Ref. 10: In this model, slice-level predictions are acquired from an EfficientNet-based classifier 16, and a weighted majority voting scheme is proposed to obtain the final patient-level labels. To train this classifier, the authors first trained two separate binary classifiers to detect slices demonstrating infection in COVID-19 and CAP cases. Then, they fed these models with unlabelled cases to provide the training set for the main classifier. Additionally, they considered only the middle slices (e.g., the 80 middle slices) of a volumetric CT scan during the training phase.

• Ref. 11: This model aggregates the outputs of six classifiers developed based on the 3D ResNet101 model 17. One model in this framework is a three-way classifier trained on all of the cases, while the other five models are binary classifiers independently trained on COVID-19 and CAP cases using different combinations of train and validation sets.

• Ref. 12: This model presents a feature extraction-based approach in which a modified pre-trained ResNet50 model classifies each slice into the target classes and the penultimate fully connected layer is extracted as the feature map. Next, a max-pooling layer followed by two fully connected layers is used to generate patient-level predictions from the slice-level feature maps. The output of this model is then aggregated with two BiLSTM patient-level classifiers, which are fed by the same slice-level feature maps, to provide the final patient-level labels.

• Ref. 13: The pre-trained 3D ResNet50 18 is the backbone of this model. The authors first doubled the number of slices for each case using a 3D cubic interpolation method. Then, they extracted the lung area using a pixel-based segmentation approach, followed by classical image processing techniques such as pixel filling and border cleaning. Finally, a subset of slices is selected from each volumetric CT scan based on their lung area and an experimentally set threshold; the selected slices are then resized into a (224 × 224 × 224) volume, again using 3D cubic interpolation, providing the patient-level input for training and evaluation purposes.

• Ref. 14: This model utilizes a two-stage framework in which the first stage performs a multi-task classification that simultaneously classifies 2D slices into one of the target groups and identifies the location of each slice in the sequence of CT images.
The model at the first stage uses an ensemble of four popular CNN-based classifiers (i.e., ResNeXt50 19, DenseNet161 20, Inception-V3 21, and Wide-ResNet 22), followed by an aggregation mechanism that divides the whole volumetric CT scan into 20 groups of slices and calculates the percentage of infected slices related to the COVID-19 and CAP classes in each group. The values obtained for all groups are then concatenated and fed into an XGBoost classifier 23 in the second stage to generate patient-level predictions.

• Ref. 15: The model proposed in this work begins with a slice-level EfficientNet-B1 classifier 16 that classifies slices and generates feature maps (intermediate layers) to be used by a subsequent sequence classifier. In the sequence classifier, several weak classifiers are trained and their outputs are aggregated using an adaptive weighting mechanism to obtain the final patient-level results. To further enhance the performance of the model and cope with the imbalanced training set, a combination of weak and strong data augmentations is applied to the training cases, forcing the model to produce similar labels for both types of augmented images. Furthermore, to improve the robustness of the model when tested on varied datasets, a K-Means clustering method (K = 3) 24 is adopted to develop a single classifier for each cluster of the data and aggregate the results via a majority voting approach.

In addition to the aforementioned models, we have further compared our proposed framework with another model, which utilized the same train and test sets (excluding the fourth test set) to target the same classification task 25. A brief description of this model is as follows:

• Ref. 25: This model aims to introduce a robust training algorithm and classification framework capable of being updated upon receiving new datasets, to deal with characteristic shifts in different test sets. First, it adopts a two-stage architecture similar to the COVID-FACT model proposed in reference 26 and trains the benchmark model in a self-supervised fashion 27, with majority voting adopted to obtain patient-level labels. The backbone model used in this study is DenseNet169 20, and strict slice preprocessing and sampling methods are applied to the training set. Such methods include pixel-based approaches with fixed thresholds used to extract lung areas and select the slices with the most visible lung area. Next, each test set is divided into four quarters, which are then used in an unsupervised updating process: the quarters are passed to the model sequentially, and confident predictions are selected to fine-tune the slice-level classifiers. A slice-level prediction is considered confident in this study if it achieves a probability of at least 0.9 in agreement with the patient-level label.

Table 4 illustrates the performance of seven automated models developed to tackle the same task as that of this study using the same train and test datasets. We have also compared the overall performance of our proposed framework with the aforementioned models using McNemar's statistical test 28 with a significance level of 0.05, testing the hypothesis that the models have the same proportion of errors on the entire test set. The corresponding p-values are reported in Table 4 and indicate that the hypothesis is rejected for all models except the first, whose p-value is slightly above 0.05. A minimal sketch of this test is given below.
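The comparison just described can be reproduced with a standard continuity-corrected McNemar's test on paired per-case outcomes. The sketch below is our own illustration: the simulated correctness vectors are hypothetical placeholders, not values from Table 4.

```python
import random
from scipy.stats import chi2

def mcnemar_test(model_a_correct, model_b_correct):
    """Continuity-corrected McNemar's test on paired per-case predictions.
    Inputs are booleans indicating whether each model classified each of the
    same test cases correctly; only discordant pairs inform the statistic."""
    b = sum(a and not m for a, m in zip(model_a_correct, model_b_correct))
    c = sum(m and not a for a, m in zip(model_a_correct, model_b_correct))
    if b + c == 0:
        return 0.0, 1.0  # no discordant pairs: the models agree everywhere
    statistic = (abs(b - c) - 1) ** 2 / (b + c)
    return statistic, chi2.sf(statistic, df=1)  # chi-squared, 1 dof

# Hypothetical example over the 130 test cases (51 + 28 + 51).
random.seed(0)
ours = [random.random() < 0.96 for _ in range(130)]
other = [random.random() < 0.88 for _ in range(130)]
stat, p = mcnemar_test(ours, other)
print(f"statistic={stat:.3f}, p={p:.4f}")  # p < 0.05: error rates differ
```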
In other words, there is a significant difference in the proportion of errors between our proposed framework and six of the aforementioned models, while such a difference is not significant in the case of the model proposed in Ref. 10.

In this paper, we expanded the fully-automated framework developed in our previous study 26 to tackle the three-way classification task (i.e., identification of COVID-19, CAP, and Normal cases) based on volumetric CT scans acquired from multiple centers using different imaging protocols. We also proposed an unsupervised enhancement approach, which can enable deep learning-based frameworks to adapt to the heterogeneity of different test sets. In Table 5, the numbers of slices extracted from each test set to augment the train set are presented. The low number of normal slices demonstrates the high performance of the first stage in identifying slices with and without evidence of infection. Another advantage of the proposed framework is the capability of the Capsule Network-based model to be trained on a relatively small dataset, which is of utmost importance in medical image processing, and for COVID-19 in particular, where typically only small annotated datasets are available. A further noteworthy advantage is that the model does not require any infection annotation, which is a challenging and time-consuming task. The only segmentation used in our study is lung area segmentation (i.e., extracting the lung parenchyma using a pre-trained U-Net model 29), which is a well-studied task and does not add much complexity to the model.

We would like to highlight the effect of the suggested 3% threshold used to identify normal cases based on the outcome of the first stage. As mentioned earlier, 3% is a safe threshold to identify normal cases, as it is extremely rare to observe less than 3% involvement of the lung parenchyma in COVID-19 cases. However, it is possible that the number of slices identified as infectious in a normal case exceeds this 3% threshold. This happens mainly in CT scans with a large slice thickness and fewer slices (e.g., less than 100 slices). In such cases, a minor error (a small number of slices misclassified by the first stage) can mistakenly indicate a large involvement of the lung parenchyma. Such errors can be avoided by increasing the 3% threshold or by using an adaptive threshold (e.g., based on the slice thickness and number of slices) when dealing with fewer slices per patient. In this study, only one normal case was misclassified, and increasing the threshold to 6% removed this error without affecting the other cases. The promising results and benefits of the first stage in identifying slices demonstrating infection indicate its significant potential for use in other CT scan-related models, both to identify normal cases and to concentrate on a subset of slices rather than the whole volume.

Furthermore, we would like to highlight that the results shown in Table 3 demonstrate that a model enhanced on a given test set is unable to improve performance on that same set. This is mainly because the additional data used to update the benchmark model is constructed from the cases with the highest probability scores (whether correct or not); incorporating them into the train set forces the model to further increase the corresponding probability scores while having little effect on other slices.
As such, in the test phase, it is more reasonable to aggregate the outputs obtained by all enhanced models except the one associated with the target test set. It is also worth mentioning that, due to the nature of the data (i.e., medical images), obtaining a large and diversified dataset from different countries is challenging. However, we will continue to expand the diversity of the dataset to perform more comprehensive investigations of the generalizability of our proposed framework on other test sets, as well as to determine the maximum level of shift in image characteristics that can be compensated for by our proposed framework. Finally, it is worth noting that more advanced techniques could be designed to select cases and images from new test sets using metrics introduced in the field of Active Learning 30, 31, through which the cases that bring more diversity to the training set and the associated feature maps are detected and used for training purposes. In addition to enhancement techniques from Active Learning, there have recently been several studies on using Generative Adversarial Networks (GANs) to cope with data and domain shift in medical images 32, 33 where labeled data is not available in the target domain. The main goal of such frameworks is to achieve a domain-invariant image representation that can efficiently embed the important features of the image regardless of the imaging modality or technique. Similarly, in 34, an auto-encoder and feature augmentation-based approach is proposed to adapt the model to various imaging modalities obtained by different scanners. However, in this study, we are dealing with only one imaging modality (i.e., CT scans), and the level of characteristic shift between the images is lower compared to the images investigated in the aforementioned studies. Moreover, we could achieve high performance using a far less complicated mechanism. In conclusion, we have proposed an approach to update the model's parameters by extracting confident predictions from the test sets and utilizing them to re-train the model, increasing its capability and robustness in the presence of gaps between imaging protocols and patients' clinical histories. We showed that we can train different versions of the model based on different test sets and combine their outputs to generate final predictions that are more accurate and robust.

In this section, we first introduce the datasets utilized in this study. We then describe the components of the two-stage Capsule Network-based classification framework, followed by a detailed description of the proposed unsupervised enhancement approach. In what follows, the different datasets used in this study are described individually, followed by supplementary information about the demographic data, imaging protocols, acquisition settings, de-identification, and the labeling process. The utilized dataset consists of a training set and a test set, where the training dataset is the COVID-CT-MD 35, which we introduced previously; it was acquired from one imaging center using similar scanning parameters. The test dataset, referred to as SPGC-COVID, comprises four different sets, each with specific characteristics, to evaluate the robustness and generalizability of the DL model from different aspects. The SPGC-COVID dataset is publicly available on Figshare 7. An overview of the different datasets and imaging centers is visualized in Fig. 1.
Different components of the utilized dataset are as follows:

• Train Set: We used our in-house and publicly available dataset 35, referred to as "COVID-CT-MD", as the training dataset. It contains CT scans of COVID-19, CAP, and normal cases acquired by the "SIEMENS, SOMATOM Scope" scanner using the standard radiation dose at Babak Imaging Center, Tehran, Iran. A subset of 55 COVID-19 and 25 CAP cases was analyzed by one radiologist (M.J.R.) to identify slices demonstrating infection. The labeled subset of the data contains 4,993 slices demonstrating infection and 18,416 slices without evidence of infection. 30% of the cases in this set were randomly selected as the validation set.

• The SPGC-COVID Test Set: This dataset, which is released through this manuscript, comprises the following four subsets:
- Test Set 1: Low-dose and ultra-low-dose CT scans of COVID-19 and normal cases acquired from the same imaging center as that of the train set. This dataset is a subset of our in-house dataset of low-dose CT scans 36 and is publicly available.
- Test Set 2: CT scans of COVID-19, CAP, and normal cases acquired at a different imaging center (Tehran Heart Center, Iran) using the "SIEMENS SOMATOM Emotion 16" scanner and different scanning parameters. Some cases in this dataset have an additional history of cardiovascular diseases/surgeries with specific CT imaging findings, which are not present in the train set.

Additional statistical and demographic information about the train and test sets used in this study is provided in Table 1. Sample slices from the first three test sets are shown in Fig. 4. Various scanning protocols and settings have been used to obtain the train and test datasets used in this study. The important parameters that contribute the most to the image quality and characteristics are presented in Table 6.

De-identification: The data used in this study complies with DICOM Supplement 142 (Clinical Trial De-identification Profiles) 38, which ensures that all personal information is removed or obfuscated, including names, UIDs, dates, times, comments, and center-related information. Some demographic and acquisition attributes related to the patients' gender and age, scanner type, and image acquisition settings have been preserved to provide useful information about the dataset.

Labeling Process: The diagnosis of the cases scanned in Center 1 was obtained by consensus among three experienced radiologists, who considered three main criteria to label the data. For the cases acquired from Center 2, (13/18) COVID-19 cases have positive RT-PCR test results, and the remaining cases were labeled by one experienced radiologist following the same criteria. Slice-level labels were provided by one radiologist to identify and label slices with evidence of infection. A subset of 15 random cases was further reviewed by two other radiologists to confirm the accuracy of the slice-level labels.

In this study, we have developed a two-stage framework similar to the model proposed in our previous study 26, referred to as "COVID-FACT", as our benchmark model to classify volumetric CT scans into the three target classes of COVID-19, CAP, and normal. We then use unlabeled data from the test sets to boost the performance and robustness of the framework on unseen cases. The pipeline of the proposed framework is shown in Fig. 2.
Different components of the proposed framework are described below:

• Preprocessing: Raw CT scans typically contain uninformative components and unwanted artifacts (e.g., metallic artifacts), which can negatively affect the performance of the DL model. In addition, image sizes may vary and pixel intensities may lie in different ranges when images are acquired by different scanners. As such, we first extracted the lung areas from the CT images to remove insignificant and distracting components. In this regard, we used a well-trained U-Net-based segmentation model 29, fine-tuned on COVID-19 cases, to specify the lung areas. We then down-sampled all images to a size of 256 × 256 to reduce memory allocation and complexity without significant loss of information. Furthermore, we normalized each 2D image into the [0, 1] interval.

• Stage 1: The first stage performs the infection identification task, which aims to find slices with evidence of infection (caused by CAP or COVID-19) for each patient. The identified slices are then classified into one of the three target classes in the second stage. The input of Stage 1 is the normalized lung area as a 2D image, and the output is a label indicating whether the input image demonstrates infection or not. The classification model used in this stage is based on Capsule Networks (CapsNets) 8, which have shown superior discriminative capability compared to their CNN-based counterparts, especially when trained on small datasets 39-42. Each capsule layer consists of multiple capsules, which are groups of neurons represented by a vector. The Capsule Network benefits from an iterative process, known as "Routing by Agreement", which evaluates the agreement between the capsules in a lower layer on the existence of an object in the higher layer. Using the Routing by Agreement process, the model can recognize the relation between multiple instances in an image. Furthermore, CapsNets have lower time and space complexity compared to conventional CNNs 26. Such advantages make CapsNet-based models ideal for COVID-19, where small annotated datasets are available and disease manifestations show specific spatial distributions in the lung. The detailed structure of the classification model in the first stage is shown in Fig. 5(a). For the first stage, we adopted the same architecture as the model proposed in 26. More specifically, the model in this stage uses a stack of four convolution layers, one batch normalization layer, and one max pooling layer to generate initial feature maps. Next, the output of the last convolution layer is reshaped to form the first capsule layer, followed by three consecutive capsule layers, as shown in Fig. 5(a). The last layer contains two capsules representing the two target classes (i.e., slices with and without evidence of infection). The length of each capsule represents the probability of the corresponding class being present. Different from COVID-FACT, residual connections are added between the convolution layers to transfer low-level features to the deeper layers. This modification further assists the model in identifying informative features. Additionally, we have added a dropout layer before the capsule layers to mitigate overfitting during training. The labeled subset of the training dataset was used to train this stage over 100 epochs using the Adam optimizer with a learning rate of 1e-4.
To account for the imbalanced number of slices in each class, we have used a weighted loss function to increase the contribution of the minority group (i.e., slices demonstrating infection) to the final loss value and balance the influence of each class. The balanced loss function used to train Stage 1 is given by

$\text{loss} = w_1 \, \text{loss}_1 + w_2 \, \text{loss}_2$, with $w_1 = \frac{N_2}{N_1 + N_2}$ and $w_2 = \frac{N_1}{N_1 + N_2}$, (1)

where $w_1$ and $w_2$ represent the weights corresponding to the loss values calculated for negative and positive samples, respectively. Term $\text{loss}_1$ denotes the loss associated with negative samples, while $\text{loss}_2$ is the loss associated with positive samples. Term $N_1$ represents the number of negative samples, and $N_2$ is the number of positive samples.

• Stage 2: The second stage takes the candidate slices from the previous stage and classifies them into one of the COVID-19, CAP, or normal classes. More specifically, we used the slices demonstrating infection recognized by the first stage for all cases in the train set (with or without slice-level labels) to train a three-way classification model. Stage 2 utilizes a CapsNet architecture similar to the one used in the first stage, but with smaller dimensions and three capsules in the last layer to represent the three target classes. The architecture of Stage 2 is shown in Fig. 5(b). Similar to the first stage, we used a weighted loss function to cope with the imbalanced number of samples in some categories. At this stage, the loss weights associated with the normal and CAP classes are set to 5, and the weight for the COVID-19 class is set to 1. Note that, as normal cases are extremely rare at this stage, the weights are set differently from those calculated by Eq. (1) to maintain the stability of the training process while forcing the model to pay more attention to the minority classes. We also used the binary cross-entropy loss function, which translates the three-way classification problem at hand into three binary classification tasks; the loss value is calculated separately for each binary label associated with a target class (i.e., COVID-19, CAP, normal). Finally, a majority voting mechanism is adopted to transfer slice-level predictions into patient-level ones and determine the final label. It is worth noting that an accurate model in the first stage detects only a few candidate slices from normal cases. We can therefore apply a thresholding mechanism to the output of the first stage to identify cases with only a few infectious slices and label them as normal. We have used a threshold of 3% to specify normal cases immediately after the first stage: if less than 3% of the slices in a volumetric CT scan are classified as infectious, the corresponding CT scan is classified as a normal case. Based on 43, the minimum lung lesion involvement in patients with COVID-19-related CT findings is 4%. In addition, the minimum percentage of slices demonstrating infection in our training dataset is 7%. If the model in Stage 1 misclassifies more than 3% of the slices of a normal case, there is still a chance for the second stage to classify those slices as normal. A sketch of the class weighting described above is given below.
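As an illustration of the class weighting described above, the following sketch implements a weighted binary cross-entropy in the spirit of Eq. (1). The framework-free formulation and function names are our own, not the authors' implementation; the slice counts are taken from the labeled training subset described earlier.

```python
import numpy as np

def class_weights(n_negative, n_positive):
    """Inverse-frequency weights as in Eq. (1): the minority class (here,
    slices demonstrating infection) receives the larger weight."""
    total = n_negative + n_positive
    return n_positive / total, n_negative / total  # (w1, w2)

def weighted_bce(y_true, y_prob, w1, w2, eps=1e-7):
    """Weighted binary cross-entropy: w1 scales the negative-sample loss
    (loss_1), w2 scales the positive-sample loss (loss_2)."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    loss_1 = -(1 - y_true) * np.log(1 - y_prob)   # negative samples
    loss_2 = -y_true * np.log(y_prob)             # positive samples
    return np.mean(w1 * loss_1 + w2 * loss_2)

# 18,416 non-infectious and 4,993 infectious slices in the labeled subset.
w1, w2 = class_weights(18416, 4993)
print(f"w1={w1:.3f}, w2={w2:.3f}")  # minority (infectious) class weighted up
print(weighted_bce(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.7]), w1, w2))
```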
• Unsupervised Enhancement: The benchmark model is further updated in an unsupervised fashion. In other words, inspired by ideas from "Active Learning" 30, 31, 44, where different data samples are extracted to train the model in different stages, and "Semi-Supervised Learning" 45, 46, where a label is assigned to unlabelled cases based on a pre-defined metric, we developed an autonomous mechanism to extract and label a portion of the data in the test sets using a probabilistic selection criterion with reduced complexity. The selected samples and the assigned labels are then used to re-train and boost the initially trained model. More specifically, we selected those test cases for which the model generated the most confident results (i.e., high probability). Similarly, among the selected cases, those with high confidence in slice-level predictions are used. To define confident results, the probability of a volumetric CT scan belonging to a specific target class is considered equal to the ratio of the slices belonging to that class over the total number of slices (all slices containing the lung lesion), which can be written as

$P(C_i \mid X) = \frac{n_{C_i}}{\sum_{j=1}^{C} n_{C_j}}$, (2)

where $X$ represents the input volumetric CT scan, $C$ represents the number of target classes, and $n_{C_i}$ denotes the number of slices belonging to the target class $C_i$. Then, we introduced a confidence threshold value and considered a prediction confident if the probability of the input CT scan belonging to any of the target classes is more than the pre-set threshold. In this study, we have used 80% as the confidence threshold. A similar approach is used to extract confident slices and their corresponding labels. In this case, the probability of a slice belonging to a target class is determined by the output of the CapsNet classifier in Stage 2, which is the length (L2 norm) of the capsules in the last layer. It is worth mentioning that for those normal cases identified in the first stage using the described thresholding mechanism, we only select the slices that are misclassified as infectious with a high probability (e.g., more than the confidence threshold); such slices are labeled as normal in the enhancement phase. Following the aforementioned steps, we obtain a set of slices and their corresponding labels to augment the training dataset, aiming to make the model more aware of the new features available in the unseen datasets and to achieve more robust feature maps. Therefore, for each test set, we obtained a set of confident slices and their associated labels, which were added to the train set to re-train the model of the second stage. It is worth noting that the first stage is kept unchanged in this approach. Finally, after re-training the benchmark model based on the confident slices acquired from each test set, we obtained several enhanced models (each related to one test set) and averaged the associated patient-level probability scores to achieve the final prediction. This aggregation mechanism depends on the target test set. More specifically, to apply the model to each test set, we take the average of the predictions obtained by the models enhanced on the other test sets. For instance, the model developed for the diagnosis of cases in Test Set 1 takes the average of the probability scores provided by the models enhanced on Test Sets 2 and 3. The main reason for using such an aggregation mechanism is that enhancement based on a specific test set will further boost the probability scores of confidently predicted slices while having limited influence on other cases in the same set. A sketch of this confidence-based selection and leave-one-out averaging follows.
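The following is a minimal sketch of the selection rule of Eq. (2) and the leave-one-out averaging just described, assuming per-slice class predictions are already available. All function and variable names, including the `predict_proba` interface, are illustrative placeholders rather than the authors' code.

```python
import numpy as np

def confident_case(slice_classes, n_classes=3, confidence=0.8):
    """Eq. (2): a case is confident if the fraction of its infectious slices
    assigned to any single class exceeds the confidence threshold (80%)."""
    counts = np.bincount(np.asarray(slice_classes), minlength=n_classes)
    probs = counts / counts.sum()
    return probs.max() >= confidence, probs.argmax()

def leave_one_out_prediction(target_set, enhanced_models, volume):
    """Average patient-level probability scores over the models enhanced on
    all test sets except the target one (e.g., Test Set 1 is evaluated with
    the models enhanced on Test Sets 2 and 3)."""
    scores = [model.predict_proba(volume)
              for name, model in enhanced_models.items() if name != target_set]
    return np.mean(scores, axis=0).argmax()  # index of the final class
```

The leave-one-out exclusion mirrors the rationale above: the model enhanced on a given test set mostly re-confirms its own confident predictions there, so it contributes little when evaluating that same set.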
As such, incorporating the model enhanced on a given test set will not bring any further improvement to the evaluation process of that same set. The results presented in Table 3 further support this discussion. It is worth noting that we used the first three test sets to enhance the benchmark model and kept the fourth test set aside for evaluation purposes only. As such, upon receiving new test datasets, we can aggregate the results of the models enhanced on the individual test sets (each representing a specific center or scanning protocol) to provide classification results for the new cases. The unsupervised model enhancement described above, along with the subsequent ensemble averaging, makes the entire framework a robust automated framework that can be easily improved and updated upon receiving new datasets from different imaging centers.

References

1. Sensitivity of Chest CT for COVID-19: Comparison to RT-PCR.
2. Diagnosis/Prognosis of COVID-19 Chest Images via Machine Learning and Hypersignal Processing: Challenges, opportunities, and applications.
3. Clinically Applicable AI System for Accurate Diagnosis, Quantitative Measurements, and Prognosis of COVID-19 Pneumonia Using Computed Tomography.
4. Reproducibility of CT Radiomic Features within the Same Patient: Influence of Radiation Dose and CT Reconstruction Settings.
5. Effects of contrast-enhancement, reconstruction slice thickness and convolution kernel on the diagnostic performance of radiomics signature in solitary pulmonary nodule.
6. Cardiovascular disease potentially contributes to the progression and poor prognosis of COVID-19.
7. SPGC-COVID Dataset. Figshare.
8. Matrix capsules with EM routing. 6th Int. Conf. on Learning Representations (ICLR 2018), Conf. Track Proc.
9. Approximate is Better than "Exact" for Interval Estimation of Binomial Proportions. The American Statistician.
10. Detecting Covid-19 and Community Acquired Pneumonia Using Chest CT Scan Images With Deep Learning.
11. A Multi-Stage Progressive Learning Strategy for Covid-19 Diagnosis Using Chest Computed Tomography with Imbalanced Data.
12. Multi-Scale Residual Network for Covid-19 Diagnosis Using Ct-Scans.
13. Covid-19 Diagnostic Using 3d Deep Transfer Learning for Classification of Volumetric Computerised Tomography Chest Scans.
14. CNR-IEMN: A Deep Learning Based Approach to Recognise COVID-19 from CT-Scan.
15. Diagnosing COVID-19 from CT Images Based on an Ensemble Learning Framework.
16. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. arXiv.
17. Deep Residual Learning for Image Recognition.
18. Introducing Transfer Learning to 3D ResNet-18 for Alzheimer's Disease Detection on MRI Images.
19. Aggregated Residual Transformations for Deep Neural Networks.
20. Densely Connected Convolutional Networks.
21. Rethinking the Inception Architecture for Computer Vision.
22. Wide Residual Networks.
23. XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
24. Some methods for classification and analysis of multivariate observations.
25. ICAS 2021 - 2021 IEEE International Conference on Autonomous Systems (ICAS).
26. COVID-FACT: A Fully-Automated Capsule Network-Based Framework for Identification of COVID-19 Cases from Chest CT Scans.
27. Self-supervised Visual Feature Learning with Deep Neural Networks: A Survey.
28. Note on the sampling error of the difference between correlated proportions or percentages.
29. Automatic lung segmentation in routine imaging is primarily a data diversity problem, not a methodology problem.
30. MedAL: Accurate and Robust Deep Active Learning for Medical Image Analysis.
31. O-MedAL: Online active deep learning for medical image analysis.
32. Adversarial Discriminative Domain Adaptation.
33. Target-Independent Domain Adaptation for WBC Classification Using Generative Latent Search.
34. Unsupervised Domain Adaptation to Classify Medical Images Using Zero-Bias Convolutional Auto-Encoders and Context-Based Feature Augmentation.
35. COVID-19 computed tomography scan dataset applicable in machine learning and deep learning.
36. Human-level COVID-19 Diagnosis from Low-dose CT Scans Using a Two-stage Time-distributed Capsule Network.
37. Chest computed tomography using iterative reconstruction vs filtered back projection (Part 1): evaluation of image noise reduction in 32 patients.
38. DICOM Supplement 142: Clinical Trial De-identification Profiles.
39. MIXCAPS: A capsule network-based mixture of experts for lung nodule malignancy prediction.
40. BayesCap: A Bayesian Approach to Brain Tumor Classification Using Capsule Networks.
41. Ct-Caps: Feature Extraction-Based Automated Framework for Covid-19 Disease Identification From Chest Ct Scans Using Capsule Networks.
42. Hybrid Deep Learning Model For Diagnosis Of Covid-19 Using Ct Scans And Clinical/Demographic Data.
43. Lung involvement in patients with coronavirus disease-19 (COVID-19): a retrospective study based on quantitative CT findings.
44. A survey on active learning and human-in-the-loop deep learning for medical image analysis.
45. Heterogeneous image features integration via multi-modal semi-supervised learning model.
46. A Survey on Semi-, Self- and Unsupervised Learning for Image Classification.

Data Availability: The datasets generated and/or analyzed during the current study are available for public access.
Competing Interests: The authors declare no competing interests.