key: cord-0593944-f2rvh5gb authors: Ridzuan, Muhammad; Bawazir, Ameera Ali; Navarette, Ivo Gollini; Almakky, Ibrahim; Yaqub, Mohammad title: Challenges in COVID-19 Chest X-Ray Classification: Problematic Data or Ineffective Approaches? date: 2022-01-16 journal: nan DOI: nan sha: 7c738c61778eced797510bb70350c4e990d14b35 doc_id: 593944 cord_uid: f2rvh5gb

The value of quick, accurate, and confident diagnoses cannot be overstated in mitigating the effects of COVID-19 infection, particularly for severe cases. Enormous effort has been put towards developing deep learning methods to classify and detect COVID-19 infections from chest radiography images. However, questions have recently been raised about the clinical viability and effectiveness of such methods. In this work, we carry out extensive experiments on a large COVID-19 chest X-ray dataset to investigate the challenges of creating reliable AI solutions from both the data and machine learning perspectives. Accordingly, we offer an in-depth discussion of the challenges faced by some widely used deep learning architectures in chest X-ray COVID-19 classification. Finally, we include some possible directions and considerations to improve the performance of the models and the data for use in clinical settings.

On January 30, 2020, the World Health Organization (WHO) declared a global health emergency due to the coronavirus disease 2019 (COVID-19) outbreak (Burki, 2020). SARS-CoV-2 infection is five times more deadly than the flu, and its main symptoms are fever, cough, shortness of breath, and loss or change of smell and taste (Struyf et al., 2021). The fast-paced rise in infections and the rate at which the disease spread around the globe exposed many challenges with diagnosis and treatment. Access to screening strategies and treatment was minimal due to the depletion of resources, especially at the beginning of the crisis (Coccolini et al., 2021).
Polymerase Chain Reaction (PCR) became the closest gold-standard assay for COVID-19 screening. Nevertheless, the limited number of tests and the high rate of false negatives (100% on the day of infection, decreasing to 38% on day 5, when the first symptoms appear) gave radiographers grounds to define chest imaging not as a routine screening standard, but as an integral tool for assessing complications and disease progression (Inui et al., 2021). Chest imaging is especially necessary for symptomatic patients who develop pneumonia, which is characterized by an increase in lung density due to inflammation and fluid in the lungs (Cleverley et al., 2020). The Radiological Society of North America (RSNA) developed a standard nomenclature for the imaging classification of COVID-19 pneumonia composed of four categories: negative for pneumonia, typical appearance, indeterminate appearance, and atypical appearance of COVID-19 pneumonia (Simpson et al., 2020). The presence of Ground Glass Opacities (GGOs) and the extent to which they cover lung regions allow radiologists to diagnose COVID-19 pneumonia in chest radiographs. Accordingly, the RSNA classifies a case as "typical" if the GGOs are multifocal, round-shaped, present in both lungs, and peripheral with a lower lung-predominant distribution. In an "indeterminate" case, typical findings are absent and the GGOs are unilateral with a predominant distribution in the center or upper sections of the lung. If no GGO is seen and another cause of pneumonia (i.e., pneumothorax, pleural effusion, pulmonary edema, lobar consolidation, solitary lung nodule or mass, diffuse tiny nodules, cavity) is present, the case is categorized as "atypical" (Litmanovich et al., 2020).
The possibility of using Artificial Intelligence (AI) to aid the fight against COVID-19 motivated researchers to turn to deep learning approaches, especially convolutional neural networks (CNNs), for the detection and classification of COVID-19 infections (Alghamdi et al., 2021). Many studies have reported high-performing classification approaches using Chest X-ray Radiographies (CXRs) (Dongsheng et al., 2021) and Computed Tomography (CT) (Barstugan et al., 2020; Pathak et al., 2020; Jia et al., 2021). Despite the high accuracies reported by these methods (above 96%), questions have been raised regarding their clinical usefulness due to the bias of small datasets, poor integration of multistream data, variability of international sources, difficulty of prognosis, and the lack of collaborative work between clinicians and data analysts (Roberts et al., 2021). Therefore, in this work we delve deep into the obstacles hindering the development of a clinically viable AI-based solution for COVID-19 infection classification from chest radiographs. In this paper, we utilize an available large chest X-ray dataset of COVID-19 patients to train deep learning models that have proven effective on computer vision benchmarks, including deep CNNs, Vision Transformers, and most recently ConvMixer. Following this, the models are evaluated and the results are analysed to identify potential weaknesses in the models or training approaches. The nature of the data and classes is also analysed, keeping in mind the clinical needs for the development of such models. Finally, we present an in-depth discussion of the main challenges associated with this task from the architecture, training approach, and data perspectives.

The SIIM-FISABIO-RSNA COVID-19 Detection dataset was curated by an international group of 22 radiologists (Lakhani et al., 2021).
It includes data from the Valencian Region Medical ImageBank (BIMCV) (de la Iglesia Vayá et al., 2020) and the Medical Imaging Data Resource Center (MIDRC) - RSNA International COVID-19 Open Radiology Database (RICORD) (Tsai et al., 2021). The dataset available for training is composed of 6,054 individual cases (6,334 radiographs), with each case labelled as negative, typical, indeterminate, or atypical appearance of pneumonia. Class imbalance is a challenging aspect of this dataset, but it reflects the distribution of cases in reality. The two larger classes, negative and typical, account for about 75% of the total number of samples, with 1,676 and 2,855 samples respectively. Indeterminate and atypical samples account for the remaining 25%, with 1,049 and 474 samples respectively. In this work, we discuss the impact of this data imbalance on classification performance. The results in this paper are reported on an 80-20% stratified train-test split, after which the best performing models were evaluated with 5-fold stratified cross-validation. In addition to the SIIM-FISABIO-RSNA dataset, the CheXpert dataset (Irvin et al., 2019) was used for pre-training. It is composed of 224,316 chest radiographs of 65,240 patients, annotated for the presence of 14 chest abnormalities. This dataset was chosen for pre-training because of its large number of chest radiographs containing various manifestations in the lungs, such as opacities, pleural effusion, and consolidation.

In this work, we trained different deep model architectures to classify each input X-ray image into one of four classes: negative for pneumonia, or typical, indeterminate, or atypical for COVID-19. This section details the data preprocessing, augmentations, and various models trained in both supervised and self-supervised approaches.

Medical images are inherently different from natural images: radiographs are larger, grayscale, and present similar spatial structures across images.
Therefore, not all traditional augmentations are appropriate (Eaton-Rosen et al., 2018). Starting with the most commonly used and clinically validated data preprocessing and augmentations for chest X-rays, we experimentally determined the best preprocessing and augmentations to be winsorization at the 92.5th percentile, horizontal flip, rotation of up to ±10 degrees, and scaling of up to 20%. A mirrored lung replacement strategy (generating new images with a mirrored lung that presents GGOs) is proposed (Figure 1a). We also developed a left and right (L/R) lung replacement strategy (replacing the L/R lung with the opposite lung of a different patient within the same class) (Figure 1b). Another approach to tackling the class imbalance problem is to over- and under-sample the training set. Specifically, we aim to balance all classes to the size of the negative class. Hence, we undersample the typical class and oversample the atypical and indeterminate classes to the size of the negative class. Finally, adding class weights to the loss function assigns a higher weight to the minority classes during training.

The baseline was chosen from four CNN architectures to explore the performance of lightweight models like MobileNet (Howard et al., 2017) and EfficientNet (Tan and Le, 2020) against dense models such as ResNet (He et al., 2015) and DenseNet (Huang et al., 2018). DenseNet-121 was selected for comparison and evaluation of the different approaches due to its balance between accuracy and training speed. The model was trained using the following hyperparameters: image size of 224 × 224, batch size of 16, cross-entropy loss, Adam optimizer, and learning rate of 0.001.

3.3. Self-supervised pre-training
Self-supervised pre-training has proven effective in numerous medical tasks with a scarcity of labelled training data.
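As an illustration of the class-balancing steps described in the preprocessing discussion above, the following is a minimal sketch and not the authors' implementation. It resamples every class to the size of the negative class and computes inverse-frequency class weights from the class counts reported for the dataset. The function names, the fixed seed, and the inverse-frequency weighting formula are assumptions made for this example; the paper does not state which weighting scheme was used.

```python
import random
from collections import Counter

# Class counts reported for the SIIM-FISABIO-RSNA training data.
CLASS_COUNTS = {"negative": 1676, "typical": 2855,
                "indeterminate": 1049, "atypical": 474}


def resample_to_target(labels, target_class="negative", seed=0):
    """Return sample indices with every class resampled to the target class size.

    Larger classes are undersampled without replacement; smaller classes keep
    all their samples and are topped up with randomly repeated indices.
    (Hypothetical helper, written for illustration only.)
    """
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    target = len(by_class[target_class])

    balanced = []
    for indices in by_class.values():
        if len(indices) >= target:
            balanced.extend(rng.sample(indices, target))
        else:
            extra = rng.choices(indices, k=target - len(indices))
            balanced.extend(indices + extra)
    rng.shuffle(balanced)
    return balanced


def inverse_frequency_weights(counts):
    """Weight each class by N / (K * n_c); an assumed, commonly used scheme."""
    total = sum(counts.values())
    num_classes = len(counts)
    return {label: total / (num_classes * n) for label, n in counts.items()}


# Build a label list matching the reported counts, then balance it.
labels = [label for label, n in CLASS_COUNTS.items() for _ in range(n)]
balanced_indices = resample_to_target(labels)
balanced_counts = Counter(labels[i] for i in balanced_indices)
class_weights = inverse_frequency_weights(CLASS_COUNTS)
```

In a training framework, the resulting weights would typically be passed to the loss function in class-index order, while the resampled indices would define the training subset for one epoch.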
Self-supervised deep CNN models have also been employed to classify COVID-19 cases from chest X-ray images and to deal with the class imbalance problem (Gazda et al., 2021). In this section, we discuss our self-supervised learning (SSL) approaches, which pre-train deep CNN models on a large unlabelled dataset, CheXpert (Irvin et al., 2019), and then fine-tune the model to classify the above-mentioned four classes. Building on the work of Sowrirajan et al. (2021), we extended the augmentations used in the MoCo architecture by adding horizontal translation, random scaling, and a decrease in the color temperature value. Further, we pre-trained the DenseNet-121 model using the modified MoCo-CXR approach (Sowrirajan et al., 2021), MoCo-V2 (Chen et al., 2020), and MoCo-V2 with balanced data. Inspired by the work of Pathak et al. (2016), we explored the impact of focused lung masking inpainting on the model's ability to learn effective representations to identify chest abnormalities. We applied targeted lung masking by approximating the location of both lungs and masking regions of varying sizes up to 32 × 32. Using this as a pretext task, we substituted the original AlexNet encoder with DenseNet-121, used the Mean Squared Error (MSE) loss, and omitted the adversarial loss to focus on transferability rather than fine reconstruction. Additionally, center inpainting was explored, where a 100 × 100 mask is created at the center of the X-ray images and the model is tasked with reconstructing the original masked region. Figure 5 in Appendix A shows the center mask and the left and right targeted lung masks with the reconstructed images.

Vision Transformer (ViT) models (Dosovitskiy et al., 2021) emerged recently and outperform CNN models on many vision tasks, including medical ones in some settings (Matsoukas et al., 2021). In this work, we explored the performance of ViT models, where we fed 16 × 16 patches of the chest X-ray images to both pre-trained and fine-tuned ViT models.
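The 16 × 16 patching mentioned above can be made concrete with a short sketch. This is an illustration of the general ViT input step rather than the authors' code: it splits a 2D image into flattened, non-overlapping patches. The 224 × 224 input size is an assumption carried over from the baseline hyperparameters, and `split_into_patches` is a hypothetical helper written for this example.

```python
def split_into_patches(image, patch_size=16):
    """Split a 2D image (list of rows) into flattened, non-overlapping patches.

    Patches are returned in row-major order: left to right, then top to bottom.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [image[row][col]
                     for row in range(top, top + patch_size)
                     for col in range(left, left + patch_size)]
            patches.append(patch)
    return patches


# Synthetic 224 x 224 grayscale "image" whose pixel value encodes its position.
image = [[row * 224 + col for col in range(224)] for row in range(224)]
patches = split_into_patches(image)

# Each flattened patch would then be linearly projected to a token embedding.
num_patches = len(patches)      # (224 / 16) ** 2 patches
patch_dim = len(patches[0])     # 16 * 16 values per grayscale patch
```

Under these assumptions, each radiograph becomes a sequence of 196 tokens of 256 raw pixel values, which is the sequence the transformer attends over.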
For pre-training, a wide variety of X-ray datasets was used, including CheXpert (Irvin et al., 2019), tuberculosis radiography data (Rahman et al., 2020), and NIH data, while the SIIM-FISABIO-RSNA dataset was used for fine-tuning. For pre-training, we created a mask over the input image and tasked the ViT with reconstructing the masked region patch. We also explored the recent ConvMixer model (Anonymous, 2022), in which standard convolutions operate directly on the patches as input to achieve mixing between the spatial and channel dimensions. We explored several ConvMixer architectures (ConvMixer-1024/20, ConvMixer-1536/20, and ConvMixer-768/32) with different patch and kernel sizes, starting both from ImageNet pre-trained weights and from random initialization.

Tables 1 and 2 summarize our experimental results on the SIIM-FISABIO-RSNA dataset. Assessing the best performing models from the 80-20% split (Table 2) on the 5-fold cross-validation set (Table 1) allows for better judgment of the generalization ability of the models. We also focus on the F1-macro score to ensure a fair comparison between the models, considering the unbalanced nature of the classes. Our best performing baseline architecture was DenseNet-121, consistent with its reported success with CXRs in the literature (Rajpurkar et al., 2017). Interestingly, the lung augmentation strategies intended to address class imbalance led to a deterioration in the performance of the baseline model. This can be explained by the unnatural-looking X-rays that were introduced to the dataset (Appendix A, Figure 6). The best performing models use the MoCo and inpainting weights pre-trained on CheXpert, with average F1-scores of 0.4583 and 0.4794, respectively. ViT performs the worst, likely due to the dissimilarity between ImageNet and CXRs and the lack of a large enough X-ray dataset for pre-training.
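To see why the F1-macro score is preferred over accuracy for unbalanced classes, consider a small worked example. The sketch below is not the authors' evaluation code. It assumes a hypothetical classifier that always predicts the majority class ("typical") and, purely for illustration, evaluates it on the full class counts reported for the training data rather than on the actual test split.

```python
CLASS_COUNTS = {"negative": 1676, "typical": 2855,
                "indeterminate": 1049, "atypical": 474}


def f1_per_class(y_true, y_pred, classes):
    """Per-class F1 = 2TP / (2TP + FP + FN); 0.0 when the denominator is zero."""
    scores = {}
    for cls in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        scores[cls] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of the per-class F1 scores."""
    scores = f1_per_class(y_true, y_pred, classes)
    return sum(scores.values()) / len(classes)


classes = list(CLASS_COUNTS)
y_true = [label for label, n in CLASS_COUNTS.items() for _ in range(n)]
y_pred = ["typical"] * len(y_true)   # hypothetical majority-class predictor

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1_macro = macro_f1(y_true, y_pred, classes)
print(f"accuracy = {accuracy:.3f}, F1-macro = {f1_macro:.3f}")
```

The majority-class predictor reaches an accuracy near 47% while its F1-macro is far lower, because three of the four classes contribute an F1 of zero to the unweighted mean.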
For a qualitative comparison of our results, we present the GradCAM (Selvaraju et al., 2016) heatmap outputs of the baseline against the SSL models (Figure 2). The effectiveness of self-supervised learning is evident in that the model is better able to focus on the lung regions when using the MoCo and inpainting pre-trained weights, while the decision-making appears to be more sporadic in the baseline model. This suggests that using an SSL model pre-trained on a larger, related dataset may result in better predictions and fewer false positives than fully supervised training. Comparing the MoCo and inpainting SSL methods, we have not observed a consistent trend distinguishing the qualitative outputs of the two. Nevertheless, the gains from these methods are still inadequate for practical clinical use in the classification of COVID-19 appearances, with the maximum F1-macro score not exceeding 0.5. We outline below some challenges with the images, classes, and labels of this dataset for consideration.

Figure 2: GradCAM heatmaps for "Negative" (Class 0) and "Typical" (Class 1) appearance of pneumonia (importance increases from red to blue). Bounding boxes show the ground truth radiologists' annotations.

Given the complexity of chest X-rays, where 3D anatomical features are superimposed in a 2D format, and the abstract appearance of diseases, it is difficult even for experienced radiologists to precisely distinguish different pathological patterns on CXRs. The pathologies usually do not have well-defined shapes, sizes, or edges, but rather are characterized by intensity variations and locations relative to other organs. An important consideration in the curation of this dataset is the difference between the detection of visual symptoms and the inference of a disease.
For typical and indeterminate, the descriptions of their appearance suggest that location is a determining factor in distinguishing between the classes, where typical is bilateral and primarily found in the lower lung, and indeterminate is unilateral but primarily found in the central or upper lung (Litmanovich et al., 2020). However, some inconsistencies appear in the labeling of the dataset (Figure 3; Appendix A, Figure 7), resulting in uncertainty in the ground truth labels that may hinder the learning of the model and contribute to poorer performance. As for the atypical class, the challenge comes from the loaded terminology. "Atypical" is an umbrella term that covers an array of abnormalities uncommonly reported for COVID-19 pneumonia, including "pneumothorax or pleural effusion, pulmonary edema, lobar consolidation, solitary lung nodule or mass, diffuse tiny nodules, [and] cavity" (Litmanovich et al., 2020). Given the multiple possible appearances of atypical, coupled with the lack of available data belonging to this class (8%), the model may not have adequate samples of each abnormality from which to learn meaningful representations. This may be the reason the self-supervised models pre-trained on CheXpert are able to perform better than the fully supervised models, especially since CheXpert contains images with some of these abnormalities. For future consideration, we suggest that such loaded class labels be avoided or untangled to preserve the integrity of the ground truths.

It is common in clinical practice to designate uncertain cases as "indeterminate", "suspected", or the like. According to Simpson et al. (2020), "indeterminate features are those that have been reported in COVID-19 pneumonia but are not specific enough to arrive at a relatively confident radiologic diagnosis."
Typically, such cases warrant further tests or investigation to verify the status and condition of the patient, because the visual cues are insufficient to confidently reach the true diagnosis. From a machine learning perspective, the incorporation of such classes conflicts with the need for highly accurate ground truths for effective loss calculation and performance evaluation. We hypothesized that the removal of such a class would improve the predictive power of the model. However, Table 3 shows similar performance in the binary classifications of all pairs of positive classes, including typical-atypical; while the accuracy is higher, this is due to class imbalance, and the F1-score remains similar. A recent study suggests that over 25% of COVID-19 patients exhibit co-occurring symptoms, thus blurring the distinction between classes and adding variability and uncertainty to the ground truth labels (Kim et al., 2020). Most SOTA methods have been developed for use with natural images.

Our work has shown that the current state of data curation and SOTA machine learning architectures are still insufficient for the accurate classification of COVID-19 pneumonia from CXRs. We have outlined several challenges in creating a viable AI solution for COVID-19 classification. While it is easy to suggest better curation of datasets, such effort is time-consuming and challenging, especially given the lack of available resources (Allyn, 2020). As long as manual annotators are involved, there will always be room for errors and subjectivity. It is thus important for the community, both clinicians and machine learning researchers, to acknowledge that unless a gold standard is used to annotate the labels, 100% accuracy is unlikely and in fact undesirable. In the case of COVID-19, a gold standard does not yet exist (Hernández-Huerta et al., 2021).
In terms of machine learning, the challenge surrounding visual cues calls for a shift in approach, where effort has to be put into developing methods that focus on intensity variations rather than edge detection. The superiority of DenseNet-121 over other CNNs suggests that a method that exploits and propagates earlier low-level features to later layers in a CNN may be of particular benefit for the development of a successful medical machine learning model. The presence of co-infections calls for a re-evaluation of the labeling strategy and perhaps the use of multilabel classification for COVID-19 appearances. Additionally, to pave a path to certainty for the indeterminate class, we suggest emulating the clinicians' workflow and integrating the clinical context of the patients along with other relevant information through multimodal learning. Such a solution would reduce the uncertainty in the labelling and have the added benefit of increasing the model's explainability and the confidence of potential end-user clinicians.

Deep learning approaches for detecting covid-19 from chest x-ray images: A survey
International radiology societies tackle radiologist shortage
Patches are all you need? In Submitted to The Tenth International Conference on Learning Representations
Coronavirus (covid-19) classification using ct images by machine learning methods
Outbreak of coronavirus disease 2019. The Lancet Infectious Diseases
Improved baselines with momentum contrastive learning
The role of chest radiography in confirming covid-19 pneumonia
A pandemic recap: lessons we have learned
Bimcv covid-19+: a large annotated dataset of rx and ct images from covid-19 patients
Research on classification of covid-19 chest x-ray image modal feature fusion based on deep learning
An image is worth 16x16 words: Transformers for image recognition at scale
Improving data augmentation for medical image segmentation
Self-supervised deep convolutional neural network for chest x-ray classification
Deep residual learning for image recognition
Should rt-pcr be considered a gold standard in the diagnosis of covid-19
Mobilenets: Efficient convolutional neural networks for mobile vision applications
The role of chest imaging in the diagnosis, management, and monitoring of coronavirus disease 2019 (covid-19)
Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison
Classification of covid-19 chest x-ray and ct images using a type of dynamic cnn modification method
Rates of co-infection between sars-cov-2 and other respiratory pathogens
The 2021 siim-fisabio-rsna machine learning covid-19 challenge: Annotation and standard exam classification of covid-19 chest radiographs
Review of chest radiograph findings of covid-19 pneumonia and suggested reporting language
Is it time to replace cnns with transformers for medical images?
Context encoders: Feature learning by inpainting
Deep transfer learning based classification model for covid-19 disease. Irbm
Classification of covid-19 chest x-rays with deep learning: new models or fine tuning?
Reliable tuberculosis detection using chest x-ray with deep learning, segmentation and visualization
Chexnet: Radiologist-level pneumonia detection on chest x-rays with deep learning
Common pitfalls and recommendations for using machine learning to detect and prognosticate for covid-19 using chest radiographs and ct scans
Grad-cam: Why did you say that? visual explanations from deep networks via gradient-based localization
Radiological society of north america expert consensus document on reporting chest ct findings related to covid-19: endorsed by the society of thoracic radiology, the american college of radiology, and rsna
Moco-cxr: Moco pretraining improves representation and transferability of chest x-ray models
Signs and symptoms to determine if a patient presenting in primary care or hospital outpatient settings has covid-19
Rethinking model scaling for convolutional neural networks
The rsna international covid-19 open radiology database (ricord)
Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases

Figure 4: Mask region constraint for targeted L/R inpainting. This is performed by posing the following constraints: 10% from the left and right, 15% from the top, and 20% from the bottom of the chest X-rays.
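The mask region constraint in Figure 4, combined with the "varying sizes up to 32 × 32" of the targeted lung masking, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The split of the allowed region at the image midline into one half per lung, the minimum mask size of 8, the zero fill value, and all function names are hypothetical choices made for this example.

```python
import random


def allowed_region(height, width, side=0.10, top=0.15, bottom=0.20):
    """Return (row_min, row_max, col_min, col_max) after applying the margins.

    Margins follow the Figure 4 caption: 10% from the left and right,
    15% from the top, and 20% from the bottom.
    """
    return (int(height * top), int(height * (1 - bottom)),
            int(width * side), int(width * (1 - side)))


def sample_lung_masks(height, width, min_size=8, max_size=32, rng=None):
    """Sample one square mask per image half, fully inside the allowed region.

    Assumes the image is large enough for a max_size mask to fit in each half.
    Returns a list of (top, left, size) tuples.
    """
    rng = rng or random.Random()
    row_min, row_max, col_min, col_max = allowed_region(height, width)
    mid = width // 2   # assumed boundary between the two lung halves
    boxes = []
    for lo, hi in ((col_min, mid), (mid, col_max)):
        size = rng.randint(min_size, max_size)
        top = rng.randint(row_min, row_max - size)
        left = rng.randint(lo, hi - size)
        boxes.append((top, left, size))
    return boxes


def apply_masks(image, boxes, fill=0):
    """Return a copy of the image with each box overwritten by the fill value."""
    masked = [row[:] for row in image]
    for top, left, size in boxes:
        for r in range(top, top + size):
            for c in range(left, left + size):
                masked[r][c] = fill
    return masked


# Synthetic 224 x 224 image of ones so masked pixels are easy to count.
image = [[1] * 224 for _ in range(224)]
boxes = sample_lung_masks(224, 224, rng=random.Random(0))
masked = apply_masks(image, boxes)
```

In an inpainting pretext task, `masked` would be the network input and the original pixels inside `boxes` would be the reconstruction target for the MSE loss.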