key: cord-1026021-ilc2bzkx
authors: Zhou, M.; Chen, Y.; Wang, D.; Xu, Y.; Yao, W.; Huang, J.; Jin, X.; Pan, Z.; Tan, J.; Wang, L.; Xia, Y.; Zou, L.; Xu, X.; Wei, J.; Guan, M.; Feng, J.; Zhang, H.; Qu, J.
title: Improved deep learning model for differentiating novel coronavirus pneumonia and influenza pneumonia
date: 2020-03-30
journal: nan
DOI: 10.1101/2020.03.24.20043117
sha: dd6fdf2cc09a1ccaf015c0c74902bc2d3b297d4f
doc_id: 1026021
cord_uid: ilc2bzkx

Background: Chest CT has high sensitivity in diagnosing novel coronavirus pneumonia (NCP) at an early stage, giving it an advantage over nucleic acid detection in a time of crisis. Deep learning has been reported to discover intricate structures in clinical images and to achieve expert-level performance in medical image analysis. We aimed to develop and validate an integrated deep learning framework on chest CT images for auto-detection of NCP, particularly focusing on differentiating NCP from influenza pneumonia (IP).

Methods: 35 confirmed NCP cases were consecutively enrolled as the training set from 1138 suspected patients in three NCP-designated hospitals, together with 361 confirmed viral pneumonia patients from Center 1, including 156 IP patients, from May 2015 to February 2020. The external validation set enrolled 57 NCP patients and 50 IP patients from eight centers.

Results: 96.6% of NCP lesions were larger than 1 cm and 76.8% had an intensity below -500 HU, indicating less consolidation than IP lesions, which had nodules ranging from 5 to 10 mm. The classification schemes accurately distinguished NCP and IP lesions, with an area under the receiver operating characteristic curve (AUC) above 0.93. The Trinary scheme was more device-independent and more consistent with specialists than the Plain scheme, and it achieved an F1 score of 0.847, higher than the Plain scheme (0.774), specialists (0.785) and residents (0.644).
Conclusions: Our study potentially provides an accurate early diagnosis tool on chest CT for NCP with high transferability, and shows high efficiency in differentiating NCP and IP, helping to reduce misdiagnosis and contain the pandemic transmission.

The spread of novel coronavirus pneumonia (NCP) induced by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has now entered a new phase in which new confirmed cases continue to decline in China while the novel virus is spreading rapidly across the world, with infected patients surging in hot spots such as the European and Eastern Mediterranean regions. Early diagnosis is critical for both epidemic control and prompt medical intervention. Notably, pneumonia identified outside high-incidence areas is likely to be induced by a broad spectrum of pathogens, especially influenza, which has a high incidence in winter and spring. Influenza pneumonia (IP) places a huge burden on the healthcare system because of its high morbidity and mortality. In the United States, influenza accounted for more than 29 million infections and 16,000 deaths in 2019 (1). It was reported that oral oseltamivir accelerates symptom alleviation and reduces the risk of lower respiratory tract complications in influenza (2). Therefore, early diagnosis and separation of IP patients from NCP patients will improve prognosis and optimize the allocation of medical resources.

However, apart from overlapping symptoms and laboratory abnormalities, IP and NCP manifest similar chest CT findings (3), making it difficult to differentiate these two kinds of viral pneumonia. The diagnostic efficiency of nucleic acid detection (4) is constrained by the following limitations: 1) a high false negative rate owing to low virus load at the early stage of infection (5) or possible genetic mutations (6); 2) a shortage of detection reagents; and 3) a long waiting time.
It was found that some early-onset NCP patients who had already presented abnormal chest CT findings still got negative results on the initial nucleic acid test. As a result, the category of clinically diagnosed NCP was added in the fifth version of the diagnosis and treatment scheme released by the Chinese National Health Commission (7), referring to suspected cases showing characteristics of viral pneumonia on chest CT, with the intention of reducing the mortality rate and the occurrence of cross infection while patients wait for laboratory confirmation. Additionally, CT has the advantage of evaluating the severity and monitoring the dynamic progress of pneumonia (8).

The key issues in improving the capability to distinguish NCP from IP on chest CT scans are how to find the lesions quickly and how to make an accurate differential diagnosis. The problem could be alleviated by deep learning, a technique that has witnessed striking advances in healthcare applications (9, 10). It can achieve expert-level performance in medical image analysis with minimal time and labor cost, for example in the detection of diabetic retinopathy and the classification of skin cancer (11, 12). Deep learning is also widely used to automatically detect pneumonia on chest X-ray images (13, 14) and to discriminate usual interstitial pneumonia from nonspecific interstitial pneumonia on chest CT images (15). In this study, we developed and validated an integrated deep learning framework on chest CT images for auto-detection of NCP, particularly focusing on differentiating NCP from IP, ensuring prompt implementation of isolation.

medRxiv preprint, doi: https://doi.org/10.1101/2020.03.24.20043117. This preprint was not peer-reviewed. The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.
To alleviate the transferability problem, in which a well-trained deep learning model performs poorly on data from unseen sources (16), we proposed a novel training scheme (the Trinary scheme) to encourage the model to learn device-independent features.

This retrospective study was conducted in eight tertiary referral centers (Centers 1-8). In three designated hospitals for NCP screening, 35 confirmed NCP cases were ... (17, 18). The external validation set enrolled 57 NCP patients and 50 IP patients from eight hospitals. Inclusion and exclusion criteria, the distribution of patients and the flow chart of this study are shown in Figure 1. More details are in Method E1.

In our study, the lesion regions of each CT image were annotated by two radiologists, who had more than 10 years of experience in pulmonary-thoracic disease and were aware of the clinical history of infection. We used YOLOv3 to perform lesion detection on the selected images (19). The structure of YOLOv3 is presented in Figure E1, and detailed information and CT slice thickness are presented in Table E1 and Method E2.

Because of the limited number of annotations, we chose VGGNet as the classification model (20). VGGNet builds on AlexNet. To better fit our problem, we made some modifications to the original VGGNet and used transfer learning based on previous reports (16, 21). We denoted the normal training process as the Plain scheme. To better address the transferability problem of deep learning, we proposed a possible solution for device-specific features, named the Trinary training scheme. The process for the Trinary scheme is described in Method E3.

Patient-level classification was based on lesion-level classification results. We summed the predicted probabilities over all lesions of a patient and then normalized the sums between NCP and IP to obtain the patient-level classification. This simple averaging step can be considered a model ensemble (22) for patient-level classification.

To compare the performance of the deep learning framework and radiologists in the external validation group, a panel of ten radiologists was recruited. They were instructed to independently provide a classification decision of NCP or IP each time. Radiologists also classified lesions, so that we could determine which scheme was closer to the judgement of human experts. Details on evaluation are given in Method E4.

To better understand the performance difference between the Plain scheme and the Trinary scheme on different CT devices, we divided the NCP data in the external validation set into two categories. The first category contained 20 cases from centers (Centers 1-3) that also appeared in the training set. The second category contained the remaining 37 cases from Centers 4-8. We compared the performance of the two schemes on the two categories.

The classification metrics used included area under the receiver operating characteristic curve (AUC), sensitivity, specificity, accuracy, precision and F1 score. Details of the statistics are in Method E5.

Details of clinical information for patients in the training, validation, test and external validation sets are shown in Table E2 and Result E1.
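The patient-level aggregation described in the Methods (summing per-lesion predicted probabilities and normalizing between NCP and IP) can be illustrated with a minimal sketch. The function name, the tuple layout `(p_ncp, p_ip)` and the example probabilities are assumptions made for illustration; they are not taken from the study's code or data.

```python
from typing import Sequence, Tuple


def patient_level_probability(
    lesion_probs: Sequence[Tuple[float, float]],
) -> Tuple[float, float]:
    """Combine per-lesion (p_ncp, p_ip) pairs into one patient-level pair.

    Sums each class probability over all lesions, then normalizes the two
    sums so that they add up to 1 (hypothetical helper, illustration only).
    """
    if not lesion_probs:
        raise ValueError("at least one lesion is required")
    sum_ncp = sum(p_ncp for p_ncp, _ in lesion_probs)
    sum_ip = sum(p_ip for _, p_ip in lesion_probs)
    total = sum_ncp + sum_ip
    if total == 0:
        raise ValueError("all lesion probabilities are zero")
    return sum_ncp / total, sum_ip / total


# Hypothetical patient with three detected lesions.
lesions = [(0.9, 0.1), (0.6, 0.4), (0.2, 0.8)]
p_ncp, p_ip = patient_level_probability(lesions)
label = "NCP" if p_ncp >= p_ip else "IP"
```

Because every lesion contributes to the sums, a single confidently misclassified lesion is outvoted by the others, which is the sense in which the step acts like an ensemble.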
We further performed a joint analysis of imaging features for the 35 NCP patients and 156 IP patients, with 499 NCP lesions and 1178 IP lesions. 96.6% of NCP lesions were larger than 10 mm and 35.3% of the lesions were inhomogeneous, which was significantly different from IP (p=0.0094). Lesions with an intensity below -500 HU accounted for 76.8% of lesions in NCP, indicating less consolidation than IP. 5.4% of lesions in IP were nodules (HU>0), and 21 (5.6%) nodules of IP were 5-10 mm. Detailed information is presented in Result E2, Table E3 and Figure E5.

The results showed that the detection performance was not sensitive to the confidence score as long as the cutoff for the confidence score was in a reasonable range (Table E4). The detection model achieved an F1 score of 0.742 at a confidence cutoff of 0.1.

We further used the annotated lesions to train and evaluate the model. The Trinary scheme (AUC 0.95) performed better than the Plain scheme (AUC 0.93) (Figure E6). More performance measures can be found in Table E5. Two experienced specialists classified the lesions on which the two schemes made very different predictions (with a probability difference of no less than 0.5) (Figure E7). Of 540 NCP lesions, 366 were correctly identified by the Trinary scheme and 174 by the Plain scheme. Detailed analysis showed that the Plain scheme tended to yield unreasonably high or low lesion probabilities depending on whether the lesions came from centers in the training set. The results indicated that the Trinary scheme was more consistent with specialists than the Plain scheme on lesion-level classification. Detailed information is presented in Results E3-E5.

The performance of human experts for patient classification is shown in Table E6 and Result E6. Both the specialist group and the resident group reached good consistency, with intraclass correlation coefficients (ICC) of 0.899 and 0.798, respectively. Correlation for the 10 radiologists is presented in Table E7.

For both the Plain and Trinary schemes, it took 10 seconds on average to detect and classify all detected lesions for a single patient. Figure 3 shows the ROC curves on the test set and the external validation set for both training schemes. On the test set, the Plain and Trinary schemes performed similarly well, with an AUC of 0.99 (Figure 3A). These AUCs were much higher than the AUCs at the lesion level owing to the ensemble effect. For both schemes, the sensitivity was 100%. The specificities were 92.5% and 95% for the Plain and Trinary schemes, respectively (Table E8). ... (Figure 3C). Importantly, the Trinary scheme performed better (AUC 0.91) than the Plain scheme (AUC 0.87) on the second category (data not from centers included in the training set) (Figure 3D). In terms of the F1 measure, the Trinary scheme achieved a score of 0.847, which is higher than the Plain scheme (0.774) and also much higher than the specialist group (average 0.785) and the resident group (average 0.644). The Trinary scheme was better correlated with specialists in both categories (Table E9). More details are in Result E7.

Table 1 summarizes the CT devices on which both schemes and the ten radiologists made wrong classifications on cases from the external validation set.
We first observed that the IP cases came from 10 CT devices, despite the fact that they were all from the same center. The majority of the tested IP cases were correctly classified by both schemes. The only exception was uCT 528, a new CT device. On uCT 528, eight patients were examined. Seven of them were IP cases from Center 1, of which six and five were misclassified by the Plain and Trinary schemes, respectively. Yet more than three of these patients were also misdiagnosed by the specialists. The main manifestations were peripheral single or multiple ground-glass opacities with or without patchy consolidation in the lower lobe or in a bilateral distribution, which mimic the findings of NCP and led to the misclassification (Figure E8). The remaining patient, from Center 8, was an NCP case that was misdiagnosed by all specialists (Figure 4A). The Trinary scheme performed better than the Plain scheme in this situation.

In Table 1, the Plain scheme misclassified 16 cases and the Trinary scheme reduced this to 10. The error rates of both schemes for the two CT devices from Center 6 (SOMATOM Definition Flash and LightSpeed VCT) were exceptionally high compared with all other CT devices. As both devices contributed only IP training cases to the model, the classification model may have learned device-specific features and wrongly treated them as specific to IP during training. During testing, the schemes would therefore tend to misclassify NCP cases from these devices. Similar problems have been observed in a previous study (23). The Trinary scheme performed better than the Plain scheme on these devices, implying that the Trinary scheme is less influenced by device-specific features. Detailed information is in Result E8.
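The classification metrics reported in these results (sensitivity, specificity, accuracy, precision and F1 score) all follow from the four counts of a binary confusion matrix. The sketch below shows the arithmetic under two assumptions: NCP is treated as the positive class, and the counts in the example are made up for illustration and are not the study's data.

```python
from typing import Dict


def safe_div(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is 0."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute standard binary metrics from confusion-matrix counts.

    tp/fn: positive (here assumed NCP) cases classified correctly/incorrectly.
    tn/fp: negative (here assumed IP) cases classified correctly/incorrectly.
    """
    sensitivity = safe_div(tp, tp + fn)  # also called recall
    specificity = safe_div(tn, tn + fp)
    precision = safe_div(tp, tp + fp)
    accuracy = safe_div(tp + tn, tp + fp + tn + fn)
    # F1 is the harmonic mean of precision and sensitivity.
    f1 = safe_div(2 * precision * sensitivity, precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=8, fp=3, tn=7, fn=2)
```

F1 ignores true negatives, so it responds to missed NCP cases and to IP cases wrongly flagged as NCP, but not to how many IP cases were correctly ruled out.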
The escalating crisis caused by SARS-CoV-2, with its high infectivity and multiple routes of transmission, is complicated by its co-occurrence with seasonal influenza, as is happening in the United States, where some COVID-19 deaths have been misdiagnosed as influenza. The similarities in clinical symptoms between NCP and IP, along with the shortage and high false negative rate of nucleic acid detection kits, make the differential diagnosis difficult (24) (25) (26), prompting clinicians to investigate new diagnostic methods. Chest CT has high sensitivity in diagnosing NCP at an early stage, giving it an advantage over nucleic acid detection in a time of crisis. This is why the Hubei Provincial Government adopted characteristic chest CT findings as an important criterion for the diagnosis of NCP at the peak of the outbreak. However, the similar chest CT manifestations of NCP and IP will inevitably lead to inaccurate diagnosis even for experienced physicians, and increase the risk of over-diagnosis and cross infection (3, 8). The main challenge in employing CT as a predominant diagnostic tool is to improve the accuracy and speed of identifying specific lesions on chest CT images.

We first annotated NCP and IP lesions and analyzed the differences in their chest CT features. We found that 76.8% of lesions in NCP had an intensity below -500 HU, 96.1% of NCP patients had bilateral lung damage and 33.3% had all five lung lobes affected, consistent with the pathophysiology of NCP. SARS-CoV-2 is presumed to bind to the angiotensin converting enzyme 2 (ACE2) receptor (27), which is concentrated on alveolar type-2 epithelial cells. These cells undergo apoptosis after infection, leading to diffuse alveolar damage and interstitial fluid absorption disorder (28). Pathological findings of NCP showed pulmonary edema and hyaline membrane formation (29). In contrast, influenza viruses primarily cause damage to the tracheal epithelial cells, leading to necrotizing bronchitis and diffuse alveolar damage to the upper respiratory tract (3). It was reported that the size of nodules helps to differentiate types of infection; for instance, the nodules of viral infection are ordinarily less than 10 mm (30). Consistent with previous reports, we found that nodules are present more often in IP, with sizes ranging from 5 to 10 mm.

Based on the above observations, we constructed an integrated artificial intelligence (AI) framework consisting of two deep learning models. The YOLOv3 model is applied to identify lesions, followed by lesion classification by the modified VGGNet. While developing the deep learning model, the first problem we met was transferability (5, 6). The model performs better on cases from CT devices appearing in the training set than on cases from CT devices not included. To address this problem, especially when classifying image data from multiple CT devices, we proposed a Trinary classification scheme that penalizes the network for extracting device-specific features during learning. Doing so incurs a high cost on random-region inputs, forcing the model to extract more lesion-specific features. Although it is impossible to exclude all device-specific features, we observed a visible improvement in performance (AUC from 0.85 to 0.89) on patient-level classification. Such performance is comparable with the judgement of experienced specialists.
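The exact Trinary procedure is given in Method E3, which is not reproduced here, so the following is only a hedged sketch of one way a three-class setup with random-region inputs could be assembled. It assumes that random regions are sampled from the same CT slice, that they must not overlap annotated lesions, that one random region is drawn per lesion, and that the third class is dropped at inference by renormalizing between NCP and IP. All names, box sizes and ratios are hypothetical.

```python
import random
from typing import List, Optional, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels

NCP, IP, RANDOM_REGION = 0, 1, 2  # hypothetical class indices


def boxes_overlap(a: Box, b: Box) -> bool:
    """True if two axis-aligned boxes share any interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def sample_random_region(
    image_size: Tuple[int, int],
    box_size: Tuple[int, int],
    lesion_boxes: Sequence[Box],
    rng: random.Random,
    max_tries: int = 100,
) -> Optional[Box]:
    """Draw a box inside the image that avoids every annotated lesion."""
    width, height = image_size
    w, h = box_size
    for _ in range(max_tries):
        x = rng.randint(0, width - w)
        y = rng.randint(0, height - h)
        candidate = (x, y, x + w, y + h)
        if not any(boxes_overlap(candidate, lesion) for lesion in lesion_boxes):
            return candidate
    return None  # give up if the slice is too crowded


def build_trinary_samples(
    lesion_boxes: Sequence[Box],
    lesion_label: int,
    image_size: Tuple[int, int],
    box_size: Tuple[int, int],
    rng: random.Random,
) -> List[Tuple[Box, int]]:
    """Pair each lesion crop with its label and add random-region crops
    from the same slice as a third class (assumed 1:1 ratio)."""
    samples = [(box, lesion_label) for box in lesion_boxes]
    for _ in lesion_boxes:
        region = sample_random_region(image_size, box_size, lesion_boxes, rng)
        if region is not None:
            samples.append((region, RANDOM_REGION))
    return samples


def renormalize_ncp_ip(probs: Sequence[float]) -> Tuple[float, float]:
    """Drop the random-region class and rescale NCP and IP to sum to 1."""
    total = probs[NCP] + probs[IP]
    if total == 0:
        return 0.5, 0.5
    return probs[NCP] / total, probs[IP] / total


rng = random.Random(0)
lesions = [(100, 100, 180, 180)]
samples = build_trinary_samples(lesions, IP, (512, 512), (64, 64), rng)
p_ncp, p_ip = renormalize_ncp_ip((0.3, 0.1, 0.6))
```

The intuition matching the text is that a random region carries the scanner's appearance but no lesion, so a network that leaned on device cues to predict NCP or IP would be penalized on those inputs.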
13 (22.8%) NCP patients presenting uncommon CT findings, such as a small ground-glass opacity (GGO) in the central part, were correctly classified by our Trinary scheme but were misdiagnosed by three specialists.

We have verified the clinical applicability of our AI model by including data from multiple machines and centers. We first demonstrated that the AI model performed well using training and test data from four machines at three centers, with an AUC of 0.99. The similar performance of the model and specialists on independent verification data from fifteen machines at eight centers further suggests good clinical applicability.

Although our AI system achieved good performance, it misclassified a small number of NCP and IP patients, which may be caused by the poor spatial resolution of some of the images. In this study, we used a 5 mm rather than a 1 mm layer thickness in CT reconstruction, which would limit our capability to detect small lesions. Nevertheless, a 5 mm layer thickness is a standard parameter in most hospitals and is sufficient to identify the major imaging differences between NCP and IP, as demonstrated by our study. Therefore, it is worthwhile to sacrifice some accuracy to give the deep learning model wider applicability.

Currently, SARS-CoV-2 is spreading widely around the world, and efficient and accurate diagnosis of NCP is crucial for prevention and control. Our deep learning model potentially provides an accurate early diagnostic tool for NCP, especially when nucleic acid test kits are in short supply, which is a common problem during outbreaks.
This could help reduce the missed diagnosis rate and diagnosis time, ensure prompt patient isolation and early treatment, improve prognosis and largely prevent transmission. The high efficiency of our model in differentiating NCP and IP could be very beneficial for reducing the misdiagnosis rate and optimizing the allocation of medical resources, particularly in areas with a high prevalence of both NCP and IP. The Trinary scheme not only improves the performance of the model in discriminating NCP from IP, but also behaves more similarly to specialists than the Plain scheme. Because the proposed Trinary scheme is designed for general purposes, we believe that it can be applied to a wide range of medical image classification tasks.

We would like to thank all the radiologists who helped with the analysis and interpretation of the imaging data.

None.

Train Test EV S1 S2 S3 S4 S5 R1 R2 R3 R4 R5 Plain Trinary
IP, Center 1, iCT 256: 8 3 3 1 1 1 1 0 0 1 1 1 0 1
...
0 0 1 1 0 1 1 1 1 1 1 0 1 1 1
SOMATOM Perspective: 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0
Center 8, uCT 528: 0 0 3 1 1 1 1 1 1 1 1 1 1 1 0
uCT 510: 0 0 2 0 0 0 0 0 0 0 0 0 0 2 1
Total: 35 15 57 14 9 16 17 26 26 24 20 25 27 16 10

Train, Test and EV show the number of NCP or IP cases in the training set, the test set and the external validation set, respectively. The number of misclassified patients in the external validation group is presented for the specialist group (S1-S5), the resident group (R1-R5) and both deep learning schemes.

Coronavirus Disease 2019 and Influenza
Oseltamivir treatment for influenza in adults: A meta-analysis of randomised controlled trials
CT Imaging Features of 2019 Novel Coronavirus (2019-nCoV)
Early Transmission Dynamics in Wuhan, China, of Novel Coronavirus-Infected Pneumonia
Correlation of Chest CT and RT-PCR Testing in Coronavirus Disease 2019 (COVID-19) in China: A Report of 1014 Cases
Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding
National Health Commission of the People's Republic of China. Guidelines on Diagnosis and Treatment of COVID-19 (Version 5)
Coronavirus Disease 2019 (COVID-19): A Perspective from China
A guide to deep learning in healthcare
On the Prospects for a (Deep) Learning Health Care System
Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs
Dermatologist-level classification of skin cancer with deep neural networks
Development and Validation of a Deep Learning-Based Automated Detection Algorithm for Major Thoracic Diseases on Chest Radiographs
Deep learning ...
Deep learning for classifying fibrotic lung disease on high-resolution computed tomography: a case-cohort study
A Survey on Transfer Learning
Diagnosis and treatment of community-acquired pneumonia in adults: 2016 clinical practice guidelines by the Chinese Thoracic Society
Infectious Diseases Society of America/American Thoracic Society consensus guidelines on the management of community-acquired pneumonia in adults
YOLOv3: An Incremental Improvement
Very Deep Convolutional Networks for Large-Scale Image Recognition
Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study
Epidemiological and clinical characteristics of 99 cases of 2019 novel coronavirus pneumonia in Wuhan, China: a descriptive study
Viral pneumonia
Clinical Characteristics of Coronavirus Disease 2019 in China
Receptor recognition by novel coronavirus from Wuhan: An analysis based on decade-long structural studies of SARS
Expression of elevated levels of pro-inflammatory cytokines in SARS-CoV-infected ACE2+ cells in SARS patients: relation to the acute lung injury and pathogenesis of SARS
Pathological findings of COVID-19 associated with acute respiratory distress syndrome
Infectious Pulmonary Nodules in Immunocompromised Patients: Usefulness of Computed Tomography in Predicting Their Etiology

The performance of the Trinary scheme is better than that of the Plain scheme (AUC 0.91 and 0.87).