key: cord-0725650-bhgwmarx
authors: Dorr, Francisco; Chaves, Hernán; Serra, María Mercedes; Ramirez, Andrés; Costa, Martín Elías; Seia, Joaquín; Cejas, Claudia; Castro, Marcelo; Eyheremendy, Eduardo; Slezak, Diego Fernández; Farez, Mauricio F.
title: COVID-19 Pneumonia Accurately Detected on Chest Radiographs with Artificial Intelligence
date: 2020-11-19
journal: Intell Based Med
DOI: 10.1016/j.ibmed.2020.100014
sha: a584c4c57c471633bf2abe98c217d5d56944c44a
doc_id: 725650
cord_uid: bhgwmarx

PURPOSE: To investigate the diagnostic performance of an Artificial Intelligence (AI) system for detection of COVID-19 in chest radiographs (CXR), and to compare results to those of physicians working alone or with AI support.

MATERIALS AND METHODS: An AI system was fine-tuned to discriminate confirmed COVID-19 pneumonia from other viral and bacterial pneumonias and non-pneumonia cases, using 302 CXR images from adult patients retrospectively sourced from nine different databases. Fifty-four physicians, blinded to diagnosis, were invited to interpret images from a test set under identical conditions and were randomly assigned either to receive or not to receive support from the AI system. Diagnostic performance of physicians working with and without AI support was then compared. AI system performance was evaluated using the area under the receiver operating characteristic curve (AUROC), and physician sensitivity and specificity were compared to those of the AI system.

RESULTS: Discrimination of COVID-19 pneumonia by the AI system yielded AUROCs of 0.96 and 0.83 in the validation and external test sets, respectively. The AI system outperformed physicians overall, with a 70% increase in sensitivity and a 1% increase in specificity (p<0.0001). When working with AI support, physicians increased their diagnostic sensitivity from 47% to 61% (p<0.001), although specificity decreased from 79% to 75% (p=0.007).

CONCLUSIONS: Our results suggest that AI-supported interpretation of chest radiographs (CXR) increases physician diagnostic sensitivity for COVID-19 detection. This approach, involving a human-machine partnership, may help expedite triaging efforts and improve resource allocation in the current crisis.

HIGHLIGHTS
- An AI system predicted COVID-19 pneumonia with AUROCs of 0.96 and 0.83 in the validation and external test sets, respectively.
- The AI system outperformed physicians overall, with a 70% increase in sensitivity and a 1% increase in specificity (p<0.0001).
- With AI support, physicians increased their sensitivity (47% vs. 61%, p<0.001) but decreased their specificity (79% vs. 75%, p=0.007).

The virus was designated SARS-CoV-2 by the Coronavirus Study Group of the International Committee on Taxonomy of Viruses [4]. During the first two months of 2020, the virus causing the disease known as COVID-19 spread worldwide, showing evidence of human-to-human transmission between close contacts [5]. The World Health Organization declared the coronavirus outbreak a pandemic on March 11, and countries around the world struggled with an unprecedented surge in confirmed cases [6]. SARS-CoV-2 causes varying degrees of illness, the most common symptoms of which include fever and cough. However, acute respiratory distress syndrome may develop in a subset of patients, requiring admission to intensive care and mechanical ventilation support; some of these patients die from multiple organ failure [7,8]. Current COVID-19 guidelines rely heavily on clinical, laboratory, and imaging findings to triage patients [9-12].
The World Health Organization interim guidance for laboratory testing has recommended use of nucleic acid amplification tests, such as real-time reverse transcriptase-polymerase chain reaction (RT-PCR), for COVID-19 diagnosis in suspected cases [13]. However, due to overwhelming levels of demand, RT-PCR kit shortages have been widely reported [14,15]. Moreover, RT-PCR from nasopharyngeal and oropharyngeal swabs (the most common respiratory tract sampling sites) obtained within the first 14 days of illness onset shows varying sensitivity, ranging from 29.6% to 73.3%, and takes several hours to process [16]. Although chest radiographs (CXR) and computed tomography (CT) are key imaging tools for pulmonary disease diagnosis, their role in the management of COVID-19 has not been clearly defined. Formal statements have been issued both by a multinational consensus from the Fleischner Society, proposing CXR as a surrogate for RT-PCR in resource-constrained environments [12], and by the American College of Radiology, which recently recommended avoiding chest CT as a first-line test for COVID-19 diagnosis and endorsed use of portable CXR instead in specific cases [17]. Artificial intelligence (AI) has proven useful for CXR analysis in numerous clinical settings [18-22], including preliminary work on COVID-19 [23-26]. However, the performance of these algorithms and their impact on clinical practice have not been thoroughly evaluated. Thus, we aimed to investigate the diagnostic performance of a fine-tuned AI system, based on the DenseNet-121 architecture, for detection of COVID-19, and to compare results to those of radiologists and emergency care physicians working with or without AI support.

For training and validation, a total of 302 CXR images from adult patients were randomly sourced from nine different databases, eight of them public and published online and one from a local institution (patient age range: 17-90 years; gender: 97 female, 156 male, 49 not available). The CXR images comprised three distinct groups: COVID-19 pneumonia (n=102), non-COVID-19 pneumonia (n=100), and normal CXR images or other non-pneumonia findings (n=100). For inclusion in the COVID-19 group, prior confirmatory RT-PCR was required (the study was retrospective). The final database was curated by a radiologist who reviewed every CXR against quality eligibility criteria (i.e., adequate exposure and no major artifacts). In cases for which age data were not available (n=51/302, see appended database), CXR images were double-checked for complete skeletal ossification to confirm adult status. An independent test set including 60 CXR images (age range: 20-80 years; gender: 29 female, 25 male, 6 not available), equally distributed among the three groups, was assembled and curated using the same criteria.

We based our COVID-19 CXR detection model on a pre-existing deep learning (DL) CXR model, previously trained for the CheXpert competition and applied to a wide range of pathologies including pneumonia, pleural effusion, pneumothorax, and cardiomegaly, among others [27]. The model was trained using the DenseNet-121 architecture [28], in which final outputs (i.e., labels) are assigned by the last fully connected layer, with one neuron for each label, resulting in a multi-label prediction.
To perform transfer learning, we replaced the last layer with a new fully connected layer sized to the three output groups (sketched below). To exploit the limited number of COVID-19 cases, we used the whole training set and applied a 5-fold cross-validation method, splitting the dataset into 80% for training and 20% for internal validation on each fold. We calculated the area under the receiver operating characteristic (AUROC) curves for the three groups on each fold. Once training was complete for each fold, we selected the epoch with the best average metric across the cross-validation folds (epoch 20) and retrained the algorithm with those parameters on the whole training set. The performance of the algorithm was then validated using a completely independent test set (n=60). We evaluated the performance of the algorithm on this dataset using sensitivity and specificity, as well as the AUROC. Given that the model output was multilabel, we selected the output class with the highest probability, converting the task to a multiclass problem before calculating the metrics. For example, if the multilabel sigmoid output prediction was (0.2, 0.6, 0.9), we took the maximum probability (0.9) and returned the vector (0, 0, 1). We found that this approach, rather than retraining the model explicitly with a multi-class loss and a softmax output, yielded better performance and avoided a bias toward labeling almost everything as COVID-19 positive.

To evaluate the diagnostic performance of physicians interpreting CXRs with and without support of the DL model, we conducted an online survey. Physicians (radiologists [n=23] and emergency care physicians [n=31]) had to decide whether CXR findings were compatible with COVID-19 pneumonia, non-COVID-19 pneumonia, or neither. Sixty cases in total (i.e., the entire test set: 20 COVID-19 pneumonia, 20 non-COVID-19 pneumonia, and 20 non-pneumonia CXRs) were shown to each survey respondent. An AI prediction was shown, in randomized fashion, for half the cases in each subset. Physicians had a maximum of 20 minutes to complete the survey. A full set of answers is available online.

To evaluate AI system performance, the AUROC was estimated using the normalized Mann-Whitney U statistic. We then compared the sensitivity and specificity of physicians to the optimal cutoff point of the AI system. To establish the effect of AI support on physician performance, we constructed a mixed model with a repeated-measures design, including presence or absence of AI support, seniority level (junior vs. senior, based on years since specialty degree: under or over 5 years), and type of specialty (radiologists vs. other specialists), together with their interactions, as independent variables, and sensitivity and specificity as dependent variables (Supplementary Table). Statistical analyses were conducted using the Python scikit-learn library and Stata version 12.1. Unless otherwise noted, mean ± standard deviation is reported. Two-tailed P values <0.05 were considered statistically significant.

Because the DL system source code used for this analysis contains proprietary information, it cannot be made fully available for public release. However, non-proprietary portions of the code have been released in a public repository at https://bitbucket.org/aenti/entelai-covid-paper. All study experiments and implementation methods are described in detail, and the tool itself is available online at https://covid.entelai.com to enable independent replication. Local datasets and links to the image repositories used in the study are publicly available online.
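The layer replacement described above can be illustrated with a short sketch. This is a minimal example assuming PyTorch and torchvision, since the authors' actual code is only partly public; the commented-out weight file, the dropout placement, and the optimizer settings are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 3  # COVID-19 pneumonia, non-COVID-19 pneumonia, non-pneumonia

# Base model: DenseNet-121, as used for the CheXpert competition model.
model = models.densenet121(weights=None)
# Hypothetical: load pre-trained CheXpert-style weights here, e.g.
# model.load_state_dict(torch.load("chexpert_densenet121.pth"), strict=False)

# Transfer learning: replace the last fully connected layer with a new one
# sized to the three output groups (one sigmoid neuron per label).
in_features = model.classifier.in_features
model.classifier = nn.Sequential(
    nn.Dropout(p=0.5),  # dropout regularization is mentioned in the Results
    nn.Linear(in_features, NUM_CLASSES),
)

# Multi-label objective: an independent sigmoid per label.
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```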
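The 5-fold cross-validation and epoch-selection scheme can be sketched as well. To keep the example self-contained and runnable, it substitutes a scikit-learn SGD classifier trained incrementally on random stand-in features for the actual DenseNet; only the selection logic (track per-epoch validation AUROC per fold, pick the epoch with the best average, retrain on the full set) mirrors the procedure described above.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

# Toy stand-in data: 302 samples, 3 classes (as in the training set).
rng = np.random.default_rng(0)
X = rng.normal(size=(302, 64))    # stand-in image features
y = rng.integers(0, 3, size=302)  # 0: COVID-19, 1: other pneumonia, 2: other

NUM_EPOCHS = 30  # illustrative; the paper selects epoch 20
kf = KFold(n_splits=5, shuffle=True, random_state=0)
val_auc = np.zeros((5, NUM_EPOCHS))  # per-fold, per-epoch validation AUROC

for fold, (tr, va) in enumerate(kf.split(X)):
    clf = SGDClassifier(loss="log_loss", random_state=0)
    for epoch in range(NUM_EPOCHS):
        clf.partial_fit(X[tr], y[tr], classes=np.arange(3))  # one "epoch"
        prob = clf.predict_proba(X[va])
        # Mean one-vs-rest AUROC across the three classes for this epoch.
        val_auc[fold, epoch] = roc_auc_score(y[va], prob, multi_class="ovr")

best_epoch = int(np.argmax(val_auc.mean(axis=0)))

# Retrain on the whole training set for the selected number of epochs.
final = SGDClassifier(loss="log_loss", random_state=0)
for _ in range(best_epoch + 1):
    final.partial_fit(X, y, classes=np.arange(3))
```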
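The multilabel-to-multiclass conversion and the AUROC-as-normalized-Mann-Whitney-U estimate both fit in a few lines of NumPy/SciPy. The first part of this sketch reproduces the (0.2, 0.6, 0.9) example from the text; the scores in the second part are made-up values for illustration.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Multilabel sigmoid outputs -> single predicted class via argmax,
# e.g. (0.2, 0.6, 0.9) -> one-hot (0, 0, 1), as in the example above.
probs = np.array([0.2, 0.6, 0.9])
pred = np.zeros_like(probs)
pred[np.argmax(probs)] = 1
print(pred)  # [0. 0. 1.]

# AUROC equals the Mann-Whitney U statistic normalized by the number of
# positive-negative pairs, computed one-vs-rest for a given class.
def auroc(pos_scores, neg_scores):
    # SciPy >= 1.7 returns the U statistic of the first sample.
    u, _ = mannwhitneyu(pos_scores, neg_scores, alternative="two-sided")
    return u / (len(pos_scores) * len(neg_scores))

pos = np.array([0.9, 0.8, 0.7, 0.4])  # scores of COVID-19-positive cases
neg = np.array([0.3, 0.5, 0.2, 0.1])  # scores of the remaining cases
print(auroc(pos, neg))                # 0.9375
```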
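For the repeated-measures mixed model, the paper used Stata 12.1. As a rough Python equivalent, a linear mixed model with a random intercept per physician could be fitted with statsmodels, as sketched below on fabricated stand-in data; the formula mirrors the described design (AI support, seniority, specialty, and their interactions), but this is a named substitute for the Stata analysis, not the authors' code.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Fabricated stand-in data: 54 physicians, each measured with and without AI.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "physician": np.repeat(np.arange(54), 2),
    "ai_support": np.tile([0, 1], 54),
    "seniority": np.repeat(rng.choice(["junior", "senior"], 54), 2),
    "specialty": np.repeat(rng.choice(["radiology", "other"], 54), 2),
})
df["sensitivity"] = rng.normal(0.47 + 0.14 * df["ai_support"], 0.1)

# Random intercept per physician captures the repeated-measures structure.
model = smf.mixedlm(
    "sensitivity ~ ai_support * seniority * specialty",
    data=df,
    groups=df["physician"],
)
print(model.fit().summary())
```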
We fine-tuned a pre-established AI system using a dataset of 302 CXR images of COVID-19 pneumonia, other pneumonia, and non-pneumonia cases. After 20 epochs of training, we obtained a mean AUROC across the 5 cross-validation folds of 0.96 ± 0.02 (see Figure 2 and Table 1).

Given the opacity of deep learning models, it is important to understand which image features the models are using to support their predictions [29]. We analyzed activation maps for COVID-19 and compared them to those of other pneumonias, to validate the model and identify potential sources of information. The activation maps were obtained by taking the output of the average pooling layer and averaging across the channel dimension [30] (a sketch of this extraction appears below). As shown in Figure 3, the activation maps generated by this AI system relied heavily on the lower pulmonary lobes, and on peripheral lung regions in particular. Of note, peripheral infection patterns have recently been described as a key feature of COVID-19 [8,31], suggesting the AI system was able to predict COVID-19 diagnosis using relevant information from CXRs.

Since training can overfit predictions to a particular dataset, we generated an independent test set comprising 60 images (20 per category) to evaluate AI system performance. AUROC, Brier, and mean absolute error scores were obtained on a one-vs-rest basis. Brier scores in particular are widely used in medical research to assess and compare model prediction accuracy [32]. Values range from 0 to 1, with 0 being the best possible score. Although the Brier score can be used as a single multiclass score, in this study we report Brier scores by class, to obtain a better idea of how well the model performed for each one (a computation sketched below). As shown in Table 2 and Figure 4, performance of the model on the test set was lower but nevertheless acceptable: the AI system was able to predict COVID-19 with a sensitivity and specificity of 80% and an AUROC of 0.84. The difference between the cross-validation and test results could be explained by the datasets used. Since the number of instances in each dataset is low, it is almost impossible to obtain perfect generalization. The model may have learned particularities of the training set that, in spite of cross-validation and dropout regularization, meant that overfitting to the specific dataset could not be completely overcome. More data will be needed to achieve similar scores between the cross-validation and the test set.

Figure 4. Receiver operating characteristic (ROC) curve and area under the curve (AUC) of the AI system on the training and test sets; physician performance with and without AI support is compared.

We next analyzed whether identification and separation of COVID-19 by physicians was adequate, given the novelty of the disease and the lack of worldwide experience. To this end, we tested the performance of 60 physicians from several different referral centers in South America. Six physicians were excluded for not completing the survey in time, leaving 54 respondents. For each physician, an AI prediction had been shown for 50% of the images (and could be correct or incorrect, as per the system's performance on the same test set). The AI system prediction was shared with physicians as a likelihood percentage for each condition; physicians then had to give the most likely diagnosis, taking the AI suggestion into account. As shown in Figure 4, physicians' sensitivity and specificity for COVID-19 prediction based on CXR were 47% and 79%, respectively, with an increase in sensitivity to 61% (p<0.001) and a decrease in specificity to 74% (p=0.007) when using AI support.
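Returning to the activation maps described above: as an illustration of how they can be extracted, the sketch below assumes torchvision's DenseNet-121, in which global average pooling is applied functionally inside forward(). Hooking the output of the feature extractor (the maps feeding that pooling step) is therefore the closest accessible point, and the channel-wise mean follows the paper's description; the model weights and the input image are random stand-ins.

```python
import torch
import torch.nn.functional as F
from torchvision import models

model = models.densenet121(weights=None).eval()  # stand-in for fine-tuned model
image = torch.randn(3, 224, 224)                 # stand-in preprocessed CXR

feature_maps = {}

def hook(module, inputs, output):
    # For a 224x224 input, output has shape (1, 1024, 7, 7).
    feature_maps["out"] = output.detach()

handle = model.features.register_forward_hook(hook)
with torch.no_grad():
    _ = model(image.unsqueeze(0))
handle.remove()

# Mean across the channel dimension yields a coarse activation map,
# upsampled here to the input resolution for overlay as a heatmap.
amap = feature_maps["out"].mean(dim=1, keepdim=True)       # (1, 1, 7, 7)
amap = F.interpolate(amap, size=image.shape[1:], mode="bilinear",
                     align_corners=False).squeeze()        # (224, 224)
```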
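The per-class, one-vs-rest Brier scores mentioned above can be computed directly with scikit-learn. In this sketch, y_true and y_prob are toy arrays standing in for the test-set labels and the model's sigmoid outputs.

```python
import numpy as np
from sklearn.metrics import brier_score_loss

def brier_by_class(y_true_onehot, y_prob):
    """One-vs-rest Brier score per class (0 is the best possible score)."""
    return [
        brier_score_loss(y_true_onehot[:, k], y_prob[:, k])
        for k in range(y_prob.shape[1])
    ]

# Toy example with three cases and three classes:
y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
y_prob = np.array([[0.8, 0.1, 0.1], [0.3, 0.6, 0.1], [0.2, 0.2, 0.9]])
print(brier_by_class(y_true, y_prob))
```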
No significant differences between radiologists and emergency care physicians were observed, nor did years of training affect overall performance results (data not shown).

In the setting of the COVID-19 pandemic, it is probable that RT-PCR tests will become more robust, quicker, and ubiquitous. However, given the current shortage and limitations of RT-PCR kits, diagnostic imaging modalities such as CXR and CT have been proposed as surrogate methods for COVID-19 triage. Some researchers have even reported chest CT as showing higher sensitivity for COVID-19 detection than RT-PCR from swab samples [33,34]. Mei et al. went further, using AI to integrate chest CT findings with clinical symptoms, exposure history, and laboratory testing, achieving an AUROC of 0.92 and a sensitivity equal to that of a senior thoracic radiologist [35]. However, the American College of Radiology currently recommends that CT be reserved for hospitalized, symptomatic patients with specific clinical indications [17]. CT also increases exposure to radiation, is less cost-effective, is not widely available, and requires appropriate infection control procedures during and after examination, including closing scanning rooms for up to 1 hour as an airborne precaution measure [36]. This is why CXR (the most commonly performed diagnostic imaging examination) has been proposed as the first-line imaging study when COVID-19 is suspected, especially in resource-constrained scenarios [11,12]. Portable X-ray units are particularly suitable, as they can be moved to the emergency department (ED) or intensive care unit and easily cleaned afterwards [17].

Most clinicians have less experience interpreting CXRs than radiologists. In the ED setting, however, physicians with no formal radiology training are the ones most often reporting CXR findings. Gatt et al. found sensitivity levels as low as 20% for CXR evaluation by emergency care physicians [37]. One would expect this sensitivity to improve with experience: sensitivities as high as 100% have been reported in experienced radiologists [39]. In our study we noted a much lower sensitivity (in both radiologists and emergency care physicians) for the diagnosis of COVID-19 pneumonia. This could be explained by the fact that, at the time of the clinical study, most physicians who participated in the survey had been exposed to few COVID-19 cases. Low sensitivity could also be related to the online survey design, as physicians evaluated CXRs in a different fashion from their usual clinical practice, with a limited amount of time to give a diagnosis. We also noted decreased specificity, due to an increased number of false positives in the AI-supported group. In every case, false positives arose from doubts over the "Other Pneumonias" category; although the AI model correctly predicted and presented the label "Other Pneumonias", physicians were still inclined to favor a COVID-19 diagnosis. The significance and clinical impact of this effect is unclear and deserves further evaluation.

AI has proven useful in CXR analysis for many diseases [18-22]. In the setting of COVID-19, a multireader evaluation of an AI system on chest radiographs was recently reported [40]. The authors compared the performance of the AI system to radiologist performance, but did not evaluate the change in diagnostic accuracy of radiologists without and with AI support, as we did. Considering the prevalence of adults in the COVID-19 group, we chose to exclude pediatric databases to avoid major bias in training and testing.
Early diagnosis, isolation, and prompt clinical management are the three public health strategies that collectively contribute to containing the spread of COVID-19. AI models building on the first of these premises might be significant [41].

In conclusion, our data suggest that physician performance can be improved using AI systems such as the one described here. We showed an increase in sensitivity from 47% to 61% for COVID-19 prediction based on CXR. Future prospective studies are needed to further evaluate the clinical and public health impact of the combined work of physicians and AI systems.

REFERENCES
- The continuing 2019-nCoV epidemic threat of novel coronaviruses to global health - The latest 2019 novel coronavirus outbreak in Wuhan, China.
- Outbreak of pneumonia of unknown etiology in Wuhan, China: The mystery and the miracle.
- Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China.
- Severe acute respiratory syndrome-related coronavirus: The species and its viruses - a statement of the Coronavirus Study Group.
- Early Transmission Dynamics in Wuhan, China, of Novel Coronavirus-Infected Pneumonia.
- World Health Organization. Coronavirus disease (COVID-19) outbreak.
- Epidemiological and clinical characteristics of 99 cases of 2019 novel coronavirus pneumonia in Wuhan, China: a descriptive study.
- Coronavirus Disease 2019 in China.
- Manejo en urgencias del COVID-19 [Emergency management of COVID-19].
- The British Society of Thoracic Imaging, BSTI NHSE COVID-19 Radiology.
- The Role of Chest Imaging in Patient Management During the COVID-19 Pandemic: A Multinational Consensus Statement From the Fleischner Society.
- Coronavirus disease (COVID-19) technical guidance: laboratory testing for 2019-nCoV in humans.
- ASM Expresses Concern about Coronavirus Test Reagent Shortages.
- The New Yorker. Why Widespread Coronavirus Testing Isn't Coming Anytime Soon.
- Evaluating the accuracy of different respiratory specimens in the laboratory diagnosis and monitoring the viral shedding of 2019-nCoV infections.
- ACR Recommendations for the use of Chest Radiography and Computed Tomography (CT) for Suspected COVID-19 Infection.
- Deep learning in chest radiography: Detection of findings and presence of change.
- Deep learning for chest radiograph diagnosis: A retrospective comparison of the CheXNeXt algorithm to practicing radiologists.
- Deep Learning Applications in Chest Radiography and Computed Tomography: Current State of the Art.
- Identifying Medical Diagnoses and Treatable Diseases by Image-Based Deep Learning.
- How far have we come? Artificial intelligence for chest radiograph interpretation.
- COVID-Net: A Tailored Deep Convolutional Neural Network Design for Detection of COVID-19 Cases from Chest Radiography Images.
- CoroNet: A Deep Neural Network for Detection and Diagnosis of COVID-19 from Chest X-ray Images.
- COVIDX-Net: A Framework of Deep Learning Classifiers to Diagnose COVID-19 in X-Ray Images.
- A Capsule Network-based Framework for Identification of COVID-19 Cases from X-ray Images.
- CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison.
- Densely Connected Convolutional Networks. arXiv:1608.06993 [cs].
- Explainable Artificial Intelligence (XAI): Concepts, Taxonomies, Opportunities and Challenges toward Responsible AI. arXiv:1910.10045 [cs].
- Learning Deep Features for Discriminative Localization.
- Coronavirus Disease 2019 (COVID-19): A Systematic Review of Imaging Findings in 919 Patients.
- Use of Brier score to assess binary predictions.
- Correlation of Chest CT and RT-PCR Testing in Coronavirus Disease 2019 (COVID-19).
- Sensitivity of Chest CT for COVID-19: Comparison to RT-PCR.
- Artificial intelligence-enabled rapid diagnosis of patients with COVID-19.
- Policies and Guidelines for COVID-19 Preparedness: Experiences from the University of Washington.
- Chest radiographs in the emergency department: is the radiologist really necessary?
- Frequency and Distribution of Chest Radiographic Findings in COVID-19 Positive Patients.
- Chest x-ray in the COVID-19 pandemic: Radiologists' real-world reader performance.
- COVID-19 on Chest Radiographs: A Multireader Evaluation of an Artificial Intelligence System.
- Artificial Intelligence (AI) applications for COVID-19 pandemic.
- Human-machine partnership with artificial intelligence for chest radiograph diagnosis.

Funding: This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Declaration of competing interests: Mauricio F. Farez has received professional travel/accommodation stipends from Merck-Serono Argentina, Teva Argentina and Novartis Argentina. The rest of the authors declare no competing interests.