key: cord-1013125-8tz6yqnj authors: Tartaglione, Enzo; Barbano, Carlo Alberto; Berzovini, Claudio; Calandri, Marco; Grangetto, Marco title: Unveiling COVID-19 from CHEST X-Ray with Deep Learning: A Hurdles Race with Small Data date: 2020-09-22 journal: Int J Environ Res Public Health DOI: 10.3390/ijerph17186933 sha: 90fd519d56164052857b84c455c5c566e009b854 doc_id: 1013125 cord_uid: 8tz6yqnj

The possibility to use widespread and simple chest X-ray (CXR) imaging for early screening of COVID-19 patients is attracting much interest from both the clinical and the AI community. In this study we provide insights and also raise warnings on what is reasonable to expect by applying deep learning to COVID classification of CXR images. We provide a methodological guide and critical reading of an extensive set of statistical results that can be obtained using currently available datasets. In particular, we take up the challenge posed by the current small size of COVID data and show how significant the bias introduced by transfer learning from larger public non-COVID CXR datasets can be. We also contribute by providing results on a medium size COVID CXR dataset, just collected by one of the major emergency hospitals in Northern Italy during the peak of the COVID pandemic. These novel data allow us to help validate the generalization capacity of preliminary results circulating in the scientific community. Our conclusions shed some light on the possibility to effectively discriminate COVID using CXR.

The COVID-19 virus has rapidly spread in mainland China and into multiple countries worldwide [1]. As of 9 August 2020, 19,432,244 patients with COVID-19 have been recorded, and 721,594 of them died [2]. Early diagnosis is a key element for proper treatment of the patients and prevention of the spread of the disease.
Given the high tropism of COVID-19 for respiratory airways and lung epithelium, identification of lung involvement in infected patients can be relevant for treatment and monitoring of the disease. Virus testing is currently considered the only specific method of diagnosis. The Centers for Disease Control and Prevention (CDC) in the US recommends collecting and testing specimens from the upper respiratory tract (nasopharyngeal and oropharyngeal swabs) or from the lower respiratory tract when available (bronchoalveolar lavage, BAL) for viral testing with reverse transcription polymerase chain reaction (RT-PCR) assay [3]. Testing on BAL samples provides higher accuracy; however, this test is uncomfortable for the patient, possibly dangerous for the operator due to aerosol emission during the procedure, and cannot be performed routinely. Nasopharyngeal swabs are instead easily executable can be used to detect and classify ILD tissue. The authors of [13] focus on the design of a CNN tailored to match the ILD CT texture features, e.g., small filters and no pooling to guarantee spatial locality. Fewer contributions focus on classification of chest X-ray images to help SARS diagnosis: in [14] lung segmentation is followed by feature extraction, and three classification algorithms, namely decision tree, shallow neural network and classification and regression tree, are compared, the latter yielding the highest accuracy on the SARS detection task. However, on the pneumonia classification task, NN-based approaches show encouraging results. In [15] texture features for SARS identification in radiographic images are proposed and designed using signal processing tools. In recent days a number of pre-prints targeting COVID classification with CNNs on radiographic images have begun to circulate thanks to open access archives. Many approaches have been taken to tackle the problem of classifying chest X-ray scans to discriminate COVID-positive cases. For example, Sethy et al.
compare the classification performance of some of the best-known convolutional architectures [16]. In particular, they use a transfer learning-based approach: they take pre-trained deep networks and use these models to extract features from images. Then, they train an SVM on these "deep features" for the COVID classification task. A similar approach is also used by Apostolopoulos et al.: they pre-train a neural network on a similar task, and then they use the trained convolutional filters to extract features, on top of which a classifier attempts to select COVID features [17]. Narin et al. make use of ResNet-based architectures and the recent Inception v3, and they use a 5-fold cross validation strategy [18]. Finally, Wang et al. propose a new neural network architecture to be trained on the COVID classification task [19]. All of these approaches use a very small dataset, COVID-ChestXRay [20], consisting of approximately 100 COVID cases considering CXR only, at the time of writing. Furthermore, in order to build COVID negative cases, data are typically sampled from other datasets (mostly from ChestXRay). However, this introduces a potential issue: if any bias is present in the dataset (a label in the corners, a medical device, or other contingent factors like similar age, same sex etc.) the deep model could learn to recognize these dataset biases, instead of focusing on COVID-related features. These works present some potential issues to be investigated:
• Transfer learning: in the literature it is widely recognized that transfer learning-based approaches prove to be effective, also for medical imaging [21]. However, it is very important to be careful about the particular task the feature extractor is trained on: if such a task is very specific, or contains biases, then the transfer learning approach should be carried out carefully.
• Hidden biases in the dataset: most of the current works rely on very small datasets, due to the limited availability of public data on COVID positive cases. These few data, moreover, contain little or even no metadata on age, gender, other pathologies also present in these subjects, and other information necessary to spot this kind of bias.

In this work we do not mean to answer whether and how CXR can be used in the early diagnosis of COVID, but to provide a methodological guide and critical reading of the statistical results that can be obtained using currently available datasets and learning mechanisms. Our main contribution is an extensive experimental evaluation of different combinations of existing datasets for pre-training and transfer learning of standard CNN models. Such analysis allows us to raise some warnings on how to build datasets, pre-process data and train deep models for COVID classification of X-ray images. We show that, given the fact that datasets are still small and geographically local, subtle biases in the pre-trained models used for transfer learning can emerge, dramatically impacting the significance of the performance one achieves.

In this section we describe the proposed deep-learning approach, based on a quite standard pipeline, namely chest image pre-processing and lung segmentation followed by a classification model obtained with transfer learning. Data pre-processing is fundamental to remove any bias present in the data: we will show that it is easy for a deep model to recognize these biases, which then drive the learning process. Given the small size of COVID datasets, a key role is played by the larger datasets used for pre-training. Therefore, we first discuss which datasets can be used for our goals. For the experiments we are going to show, six different datasets are used.
Four of these datasets provide a label for the COVID classification task (COVID-ChestXRay, CORDA, ChestXRay and RSNA) while the other two (Montgomery County X-ray Set and Shenzhen Hospital X-ray Set) provide a segmentation mask for lungs; these two are used in the pre-processing phase only. In the following we briefly recall the main characteristics of each dataset:

For our simulations we propose a pre-processing strategy aiming at removing bias in the data. This step is very important in a setting in which we train to discriminate different classes belonging to different datasets: a neural network-based model might learn the distinction between the different dataset biases and from them "learn" the classification task. The proposed pre-processing chain is summarized in Figure 1 and is based on the following steps:
• Histogram equalization: when acquiring a CXR, the so-called radiographic contrast depends on a large variety of factors, typically subject contrast, receptor contrast or other factors like scatter radiation [25]. Hence, the raw acquisition has to be filtered through a Value Of Interest transformation. However, due to different calibrations, different dynamic ranges can be covered, and this is potentially a bias. Histogram equalization is a simple means to guarantee a fairly uniform dynamic range across the images in the data.
• Lung segmentation: the lung segmentation problem has already been faced and successfully tackled [26] [27] [28]. Being able to segment the lungs only, discarding all the rest of the CXR, potentially prunes away possible bias sources, such as the presence of medical devices (typically correlated with sick patients), various text which might be embedded in the scan etc. In order to address this task, we train a U-Net [29] on Montgomery County X-ray Set and Shenzhen Hospital X-ray Set. The lung masks obtained are then blurred to avoid sharp edges using a 3 pixel radius. An example of the segmentation outcome is shown in Figure 2.
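The two pre-processing steps above can be illustrated with a minimal sketch. It assumes an 8-bit grayscale image stored as a list of rows and a binary lung mask of the same shape (in the paper the mask comes from the trained U-Net). The function names, the pure-Python representation and the choice of a box blur to soften the mask are illustrative assumptions, not the implementation in the released source code.

```python
def equalize_histogram(image, levels=256):
    """Histogram-equalize an 8-bit grayscale image given as a list of rows."""
    flat = [p for row in image for p in row]
    hist = [0] * levels
    for p in flat:
        hist[p] += 1
    # Cumulative distribution function (CDF) of the gray levels.
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)
    total = len(flat)
    if total == cdf_min:  # constant image: nothing to equalize
        return [row[:] for row in image]
    # Map the CDF linearly onto the full output range [0, levels - 1].
    scale = (levels - 1) / (total - cdf_min)
    return [[round((cdf[p] - cdf_min) * scale) for p in row] for row in image]


def box_blur(mask, radius=3):
    """Soften a binary mask by averaging over a (2*radius+1)^2 window.

    The box kernel is an assumption; the text only states a 3 pixel radius.
    """
    h, w = len(mask), len(mask[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            window = [mask[j][i] for j in ys for i in xs]
            out[y][x] = sum(window) / len(window)
    return out


def apply_mask(image, soft_mask):
    """Keep lung pixels only, weighting each pixel by the soft mask."""
    return [[round(p * m) for p, m in zip(img_row, mask_row)]
            for img_row, mask_row in zip(image, soft_mask)]
```

In this sketch a pre-processed image would be obtained as `apply_mask(equalize_histogram(image), box_blur(mask))`; a real pipeline would use an array library instead of nested lists.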
After data have been pre-processed, a deep model is trained. To this end, the following choices have been made:
• Pre-training the feature extractor (i.e., the convolutional layers of the CNN). In particular, the pre-training is performed on a related task, like pneumonia classification for CXRs. It has been shown that such an approach can be effective for medical imaging [11], in particular when the amount of available data is limited as in our classification task. Clearly, pre-training the feature extractor on a larger dataset containing related features may allow us to exploit deeper models, potentially exploiting richer image features.
• The feature extractor is then fine-tuned on COVID data. Freezing it certainly prevents over-fitting the small COVID data; however, we have no guarantee that COVID-related features can be extracted at the output of a feature extractor trained on a similar task. Of course, its initialization on a similar task helps in the training process, but in any case fine-tuning is still necessary [30].
• Proper sizing of the encoder to be used is an issue to be addressed. Although many recent works use deeper architectures to extract features for the COVID classification task, larger models are prone to over-fit data. Considering the minimal amount of data available, the choice of the appropriate deep network complexity significantly affects the performance.
• Balancing the training data is yet another extremely important issue to be considered. Unbalanced data favor biases in the learning process [31] and the choice of the data to include in the learning process is critical.
• Data augmentation techniques should be carefully used in such a context. No generic plastic deformations of the CXR images can be safely introduced, since the basic lung structure is typically the same for any human subject, and should be consistently realistic through all the augmented samples.
To this end, rigid transformations (translation, rotation) are the only data augmentation transformations safely applicable in such a context.
• Testing with different data than those used at training time is also fundamental. Excluding from the test-set exams taken from patients already present in the training-set is important to correctly evaluate the performance and to exclude that the deep model has learned a "patient's lung shape" feature.
Of course many other issues have to be taken into account at training time, like the use of a validation-set to tune the hyper-parameters, using a good regularization policy etc., but these very general issues have been exhaustively discussed in many other works [32] [33] [34]. An overall view of pre-training, training and testing is given in Figure 3.

The experiments discussed in the following have been designed to investigate three key aspects:
• Pre-training of the feature extractor: the feature extractor can be pre-trained on large generic CXR datasets, or not pre-trained at all.
• Composition of the training-set: the CORDA dataset is unbalanced (in fact, there is a prevalence of positive COVID cases) and some data balancing is possible, borrowing samples from publicly available non-COVID datasets. A summary of the dataset composition is displayed in Table 1. For all the datasets we used 70% of data at training time and 30% as test-set. Training data are then further divided into training-set (80%) and validation-set (20%). Training-set data are finally balanced between COVID+ and COVID−: where possible, we increased the COVID− cases (CORDA&ChestXRay, CORDA&RSNA); where not possible, we sub-sampled the more populated class. These percentages were not used for the COVID-ChestXRay dataset: in this case only 15 samples are used for testing in order to compare with other works [16] [17] [18] that use the same partitioning.
Please notice that, across all the datasets, test data are mutually exclusive with training ones, and are never used at training time.
• Testing on different datasets: in order to observe the possible presence of hidden biases, testing on different, qualitatively-similar datasets is a necessary step.

Table 1. Dataset composition.
      train  207  105    -    -    -  102    -    -  207  207
      test    90   45    -    -    -   45    -    -   90   90
AC    train  207  105    -  102    -    -    -    -  207  207
      test    90   45    -   45    -    -    -    -   90   90
AD    train  116  105    -    -    -    -   49   24  165  129
      test    90   45    -    -    -    -   10    5  100   50
D     train    -    -    -    -    -    -   98   24   98   24
      test     -    -    -    -    -    -   10    5   10    5

Figure 3. Summary of the training strategy. The feature extractor is (optionally) pre-trained on CXR pathology datasets and then fine-tuned on the COVID datasets. The presence of gears indicates training/fine-tuning of the specific part, while the lock indicates that the part is not modified.

A summary of the most salient experimental results obtained on a combination of different datasets (Table 1) is reported in Table 2. The complete results are reported in Appendix A. All the simulations have been run on a Tesla T4 GPU using PyTorch 1.4. The source code is available at https://github.com/EIDOSlab/unveiling-covid19-from-cxr. In Table 2 we compare four alternative neural network architectures, i.e., ResNet-18 [36], ResNet-50 [36], COVID-Net [19] and DenseNet-121 [37], with different combinations of datasets used for pre-training, training and testing, respectively (see columns 2-4, where datasets are identified according to the labels in Table 1). As in many works in the literature [17, 19], we observe that, using the same COVID dataset as source for training and testing images, the performance of the deep learning models looks amazingly good. As an example, DenseNet-121 trained on images from the COVID-ChestXRay dataset (D) yields a BA as high as 0.9 when testing is done on the corresponding test-set D.
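For readers less familiar with the metrics quoted throughout (sensitivity, specificity, balanced accuracy BA and diagnostic odds ratio DOR), the following minimal sketch computes them from a confusion matrix, assuming the standard definitions of these diagnostic accuracy measures. The function name and the example counts are hypothetical and are not taken from Table 2.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Return sensitivity, specificity, BA and DOR from confusion counts.

    tp/fn: COVID+ samples classified correctly/incorrectly.
    tn/fp: COVID- samples classified correctly/incorrectly.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    # Balanced accuracy: mean of sensitivity and specificity.
    balanced_accuracy = (sensitivity + specificity) / 2
    # DOR = (TP * TN) / (FP * FN); reported as infinity here when the
    # denominator is zero (a convention chosen for this sketch).
    dor = float("inf") if fp * fn == 0 else (tp * tn) / (fp * fn)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": balanced_accuracy,
        "dor": dor,
    }


# Hypothetical test-set with 45 COVID+ and 45 COVID- samples.
metrics = diagnostic_metrics(tp=40, fn=5, tn=36, fp=9)
```

A classifier that labels almost every sample as one class keeps BA close to 0.5 and DOR close to 1 even when its accuracy on an unbalanced test-set looks high, which is why these two scores are convenient for exposing the dataset biases discussed here.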
However, when testing on extra data still belonging to the same domain (they are still CXR images which undergo the same pre-processing as the training images), the performance drops significantly. In the DenseNet-121 case, we report a BA of 0.53 when the same model is tested with images from the CORDA dataset (A).

In this section we analyze the results found in Section 3.3. Considering the complexity and the importance of the considered topic, we divide our analysis into three main aspects:
• impact of pre-training for COVID detection (Section 4.1) and how it should be performed (Section 4.2);
• effect of augmenting the COVID datasets with negative cases (Section 4.3);
• selection of the proper architecture for the COVID detection (Sections 4.4 and 4.5).

One very important issue to pay attention to is whether to pre-train the feature extractor or not. Given the large availability of public data for pneumonia classification (for example, in this scope we used ChestXRay and RSNA), it could be a good move to pre-train the encoder, and this is indeed what we observe looking at Table 2. For example, if we focus on the results obtained training on the CORDA dataset, without a pre-trained encoder, BA and DOR are lower than when pre-training with ChestXRay or RSNA. Although the sensitivity remains very similar, pre-training the encoder helps in improving the specificity: on the test-set extracted from CORDA, using an encoder pre-trained on RSNA, the specificity is 0.80, while it is only 0.58 with no pre-trained feature extractor. Similar improvements in the specificity can be observed also on test-sets extracted from all the other datasets, except for ChestXRay. In general, a similar behavior can be observed when comparing results for differently pre-trained encoders trained on the same dataset. Pre-training is important; however, we cannot just "freeze" the encoder on the pre-trained values.
Since the encoder is pre-trained on a similar, but different task, there is no guarantee that the output features are optimal for the given classification task, and a fine-tuning step is typically required [38].

Focusing on pre-trained encoders, we show results for encoders pre-trained on two different datasets: ChestXRay and RSNA. While RSNA is a more generic pneumonia-segmentation dataset, ChestXRay also contains information about the type of pneumonia (bacterial or viral); so, at first glance it looks like a better fit for the pre-training. However, if we look at training on the CORDA dataset, we see that for the same sensitivity value, we typically get higher specificity scores for RSNA pre-training. This is not what we observe when we compare results on the publicly-available COVID-ChestXRay: in this case, sensitivity and specificity are higher when we pre-train on ChestXRay. Looking at the same pre-trained encoder, say ChestXRay, we can compare results training on CORDA and on COVID-ChestXRay, which are the two COVID datasets: CORDA shows a lower sensitivity, but in general a higher specificity, except for the ChestXRay dataset. Having very little data at training time, pre-training introduces some priors in the choice of the features to be used, and depending on the final classification task, performance changes, yielding very good metrics in some cases. Pre-training on more general datasets, like RSNA, in general looks like a slightly better choice than using a more specific dataset like ChestXRay.

For each and every simulation, performance on different test-sets is evaluated. This gives us hints on possible biases introduced by different datasets used at training time. A general trend can be observed for many COVID− augmented training-sets: the BA and DOR scores measured on the test-set built from the same dataset used at training time are typically very high. Let us focus on the ChestXRay pre-trained encoder.
When we train on CORDA&ChestXRay, the BA score measured on the test-set from the same dataset is 0.9 and the DOR is 122.67. However, its generalization capability for a different composition of the test-set, say CORDA&RSNA, is far lower: the BA is only 0.56 and the DOR only 2.26. The same scenario can be observed when we train on CORDA&RSNA: on its test-set the BA is 0.90 and the DOR 122.64, while on the test-set of CORDA&ChestXRay the BA is 0.59 and the DOR 2.47. The key to understanding these results lies again in the specificity score: this score is extremely high for the test-set of the same dataset the training is performed on (for example, for CORDA&RSNA it is 0.95 and for CORDA&ChestXRay it is 0.94) while for the others it is extremely low. Such a behavior is due to the presence of some common features in all the data belonging to the same augmenting dataset. This can be observed, for example, in Figure 4a, where the features extracted by an encoder pre-trained on ChestXRay and trained on CORDA&ChestXRay are clustered using t-distributed stochastic neighbor embedding (t-SNE) [39] (blue and orange dots represent ChestXRay and CORDA data samples respectively, regardless of the COVID label). t-SNE is a popular nonlinear dimensionality reduction algorithm which maps each high-dimensional object to a low-dimensional point for visualization purposes: the farther apart two low-dimensional points are, the more different the two objects are. It can be noted that CORDA samples, regardless of the COVID+ or COVID− label, are clearly separable from ChestXRay data. Of course, all ChestXRay images have a COVID− label, so someone could argue that the COVID feature has been captured. Unfortunately we have a counterexample: in Figure 4b we compare CORDA vs. RSNA samples, using the same ChestXRay pre-trained encoder, and now RSNA and CORDA samples no longer form clear clusters.
Hence, the deep model specializes not in recognizing COVID features, but in learning the common features of the dataset used at training time. We would like to remark that for all the data used at training or at test time, all the pre-processing presented in Section 3.2 has been used. We ran the same experiments without that pre-processing, and performance on datasets different from the one used at training time gets even worse. For example, pre-training and training on CORDA&ChestXRay without pre-processing lowers the BA to 0.73 and the DOR to 8.31 on CORDA, while from Table 2 we have higher scores on the test set (BA of 0.91 and DOR of 122.67). Dealing with the generality of the results is a very delicate matter: what can be seen in Table 2 is that augmenting data with COVID− data needs to be done very thoughtfully, since the classification performance may vary from very high accuracy down to almost useless discriminative power. Nonetheless, training using only COVID datasets yields some promising scores: for example, using the ChestXRay pre-trained encoder and CORDA for training and testing, the BA we achieve is 0.56 and the DOR is 1.64. Including also COVID-ChestXRay for training (which means having more COVID+ and COVID− examples) improves the BA to 0.62 and the DOR to 2.93. In this case, however, the specificity is an issue, since we lack COVID− data. However, these results show some promise that can be confirmed only by collecting a large amount of data in the next months.

After reviewing results on ResNet-18, we move to similar experiments run on the deeper ResNet-50 and DenseNet-121 shown in Table 2. The hope is that a deeper network could extract more representative features for the classification task. Given the discussion in Section 4.1, we show only the cases with pre-training of the feature extractor. Using these deeper architectures, we can observe that all the observations made for ResNet-18 still hold.
In some cases performance worsens slightly: for example, the DOR score on CORDA&ChestXRay for ResNet-18 was 122.67 while for ResNet-50 and DenseNet-121 it drops to 73.35 and 86.56 respectively. This is a sign of over-fitting: given the very small quantity of data currently available, using a small convolutional neural network is sufficient and safer. Taking the opposite approach, we tried to use a smaller artificial neural network, made of 8 convolutional layers and a final fully-connected layer, which takes inspiration from the ALL-CNN-C architecture [40]. We call this architecture "Conv8". The results on this smaller architecture are similar to those observed in Table 2. For example, training the model on the CORDA dataset, on Conv8 we have a BA of 0.61 and a DOR of 2.38, while for ResNet-18 with the encoder pre-trained on RSNA we have a BA of 0.67 and a DOR of 4.78. We can conclude that using a smaller architecture than ResNet-18 does not give relevant training advantages, while by using larger architectures we might over-fit data.

All the observations on train and test data made above are also valid for the recently published results on COVID classification from CXR [16] [17] [18] [19]. One very promising approach is COVID-Net [19]. Its authors also share the source code and the trained model, available at https://github.com/lindawangg/COVID-Net. In Table 2 we compare the classification metrics obtained with COVID-Net and our ResNet-18 and DenseNet-121 models: all of the models have been trained using COVID-ChestXRay, and tested on both CORDA and COVID-ChestXRay. In line with the discussion above, we can note that all three models yield surprising results when the same dataset is used for training and testing. The performance of COVID-Net on the COVID-ChestXRay test-set (the same dataset used at training time) is very high (BA of 0.85 and DOR of 36.0) while it drops significantly when tested on CORDA, where the BA is only 0.55 and the DOR is 6.68.
This drop can be explained by looking at the sensitivity and specificity values: it is evident that the model classifies almost all the data as COVID−. A similar behavior can also be observed in the ResNet-18 and DenseNet-121 models: the observed performance is apparently extremely high (the BA on the test-set reaches 1.0 for ResNet-18), and similar numbers are also claimed in the other works on ResNet-like architectures [16] [17] [18]. However, testing on CORDA reveals that deep models likely learn some hidden biases in COVID-ChestXRay and tend to misclassify COVID− samples as COVID+ (the specificity is here 0.20 for ResNet-18 and even 0.07 for DenseNet-121, a phenomenon similar to what was observed in Section 4.3). Despite some claims of having a deep model properly designed to extract the COVID feature [19], the currently low data availability limits the possibilities for deep learning to succeed in this task. Certainly, having a pre-trained encoder to extract features from radiographic images is the most promising direction to pursue.

One of the very recent challenges for both the clinical and the AI community is to use deep learning to discriminate COVID from cheap and widespread CXR. Some recent works [16] [17] [18] [19] highlighted the possibility of successfully tackling this problem, despite the currently small quantity of publicly available data. In this work we have highlighted many obstacles to the successful training of a deep model. Removing known biases like medical devices or textual information in the radiography, and providing the deep model with information strictly related to the lung content, is the first practice necessary to remove some biases. Having larger and more heterogeneous datasets could also help in removing more non-trivial biases, like different settings for the acquisition machines, or age, gender and ethnicity-related biases.
Unfortunately, given the complexity only when the available data will scale-up by at least a factor two, or even more. Currently, the limited quantity of available data prevents the use of large models: indeed, training smaller models is a safer choice since they are less prone to over-fit data. Very large models like DenseNet-121, if not properly regularized, tend to memorize the whole dataset, with negative effects on the generalization capability. The ongoing collection and sharing of large amounts of CXR data is the only way to further investigate whether promising CNN results can aid in the fight against the COVID pandemic.

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

The following abbreviations are used in this manuscript:

In this Appendix we report the full results obtained on the variety of datasets presented in Table 1. In addition to what is presented in Section 3.3, we also include, as performance metrics, the accuracy and the F-score.

Coronavirus disease 2019 (COVID-19): A perspective from China
ACR Recommendations for the Use of Chest Radiography and Computed Tomography (CT) for Suspected COVID-19 Infection
Laboratory diagnosis and monitoring the viral shedding of 2019-nCoV infections
Utilizzo Della Diagnostica Per Immagini Nei Pazienti Covid 19
The Role of Chest Imaging in Patient Management during the COVID-19 Pandemic: A Multinational Consensus Statement from the Fleischner Society
Radiological findings from 81 patients with COVID-19 pneumonia in Wuhan, China: A descriptive study
Frequency and Distribution of Chest Radiographic Findings in COVID-19 Positive Patients
A role for CT in COVID-19? What data really tell us so far
A clinicopathological study of three cases of severe acute respiratory syndrome (SARS)
Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning
Convolutional Neural Network for Categorization of Lung Tissue Patterns in Interstitial Lung Diseases
Lung Pattern Classification for Interstitial Lung Diseases Using a Deep Convolutional Neural Network
Computer Aided Detection of SARS Based on Radiographs Data Mining
Texture classification of SARS infected region in radiographic image
Detection of coronavirus Disease (COVID-19) based on Deep Features
Automatic detection from X-Ray images utilizing Transfer Learning with Convolutional Neural Networks. arXiv 2020
Automatic Detection of Coronavirus Disease
A Tailored Deep Convolutional Neural Network Design for Detection of COVID-19 Cases from Chest Radiography Images. arXiv 2020
COVID-19 image data collection
Pre-trained convolutional neural networks as feature extractors for tuberculosis detection
Labeled optical coherence tomography (oct) and chest X-ray images for classification
ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases
Chest X-ray analysis of tuberculosis by deep learning with segmentation and augmentation
Lung segmentation in chest radiographs using anatomical atlases with nonrigid registration
Automatic lung segmentation for accurate quantitation of volumetric X-ray CT images
A generic approach to pathological lung segmentation
U-net: Convolutional networks for biomedical image segmentation
Convolutional neural networks for medical image analysis: Full training or fine tuning?
Survey of resampling techniques for improving classification performance in unbalanced datasets
Early stopping-but when?
On early stopping in gradient descent learning
Post-synaptic potential regularization has potential
Measures of diagnostic accuracy: Basic definitions
Deep residual learning for image recognition
Implementing efficient convnet descriptor pyramids. arXiv
Best practices for fine-tuning visual classifiers to new domains
Visualizing data using t-SNE
Striving for simplicity: The all convolutional net