key: cord-034686-y0y5ltxs
authors: Gieraerts, Christopher; Dangis, Anthony; Janssen, Lode; Demeyere, Annick; De Bruecker, Yves; De Brucker, Nele; van Den Bergh, Annelies; Lauwerier, Tine; Heremans, André; Frans, Eric; Laurent, Michaël; Ector, Bavo; Roosen, John; Smismans, Annick; Frans, Johan; Gillis, Marc; Symons, Rolf
title: Prognostic Value and Reproducibility of AI-assisted Analysis of Lung Involvement in COVID-19 on Low-Dose Submillisievert Chest CT: Sample Size Implications for Clinical Trials
date: 2020-10-22
journal: Radiol Cardiothorac Imaging
DOI: 10.1148/ryct.2020200441
sha:
doc_id: 34686
cord_uid: y0y5ltxs

PURPOSE: To compare the prognostic value and reproducibility of visual versus AI-assisted analysis of lung involvement on submillisievert low-dose chest CT in COVID-19 patients.

MATERIALS AND METHODS: This was a HIPAA-compliant, institutional review board-approved retrospective study. From March 15 to June 1, 2020, 250 RT-PCR confirmed COVID-19 patients were studied with low-dose chest CT at admission. Visual and AI-assisted analysis of lung involvement was performed by using a semi-quantitative CT score and a quantitative percentage of lung involvement. Adverse outcome was defined as intensive care unit (ICU) admission or death. Cox regression analysis, Kaplan-Meier curves, and cross-validated receiver operating characteristic curve with area under the curve (AUROC) analysis were performed to compare model performance. Intraclass correlation coefficients (ICCs) and Bland-Altman analysis were used to assess intra- and interreader reproducibility.

RESULTS: Adverse outcome occurred in 39 patients (11 deaths, 28 ICU admissions). AUROC values from AI-assisted analysis were significantly higher than those from visual analysis for both semi-quantitative CT scores and percentages of lung involvement (all P<0.001). Intrareader and interreader agreement was significantly higher for AI-assisted analysis than for visual analysis (all ICC ≥0.960 versus ≥0.885).
Variability in the quantitative percentage of lung involvement (coefficient of variation) was 17.2% for AI-assisted analysis versus 34.7% for visual analysis. The sample size needed to detect a 5% change in lung involvement with 90% power and an α error of 0.05 was 250 patients with AI-assisted analysis and 1014 patients with visual analysis.

CONCLUSION: AI-assisted analysis of lung involvement on submillisievert low-dose chest CT outperformed conventional visual analysis in predicting outcome in COVID-19 patients while reducing CT variability. Lung involvement on chest CT could be used as a reliable metric in future clinical trials.

The full spectrum of COVID-19 severity is still being clarified but appears to be wide, ranging from asymptomatic status or mild upper respiratory tract symptoms to severe viral pneumonia, multiple organ dysfunction, and even death (4). Chest computed tomography (CT) has emerged as an accurate tool for the initial diagnosis of patients with possible COVID-19 infection (5). Additionally, CT may represent a non-invasive tool for patient prognostication, as the extent of lung involvement on chest CT appears to be an important prognostic marker (6, 7). Multiple artificial intelligence (AI) software packages are currently being developed to aid radiologists in the quantification of lung involvement in COVID-19. However, little is known about the reproducibility of these software packages and how they may improve outcome prediction. We hypothesized that the use of semiautomated AI may both improve CT reproducibility and allow for more accurate patient prognostication. We assessed COVID-19 patients who underwent chest CT at our institution by conventional visual and AI-based quantification of lung injury. We also determined the impact of chest CT variability on sample size estimates that would be applicable in a clinical trial (e.g., to determine the potential response to novel antiviral therapies).
The aim of this study was therefore to determine reader and software variability in the measurement of lung injury in COVID-19 and assess its impact on patient prognosis.

This retrospective study was compliant with the Health Insurance Portability and Accountability Act (HIPAA) and was approved by our institutional review board (Imelda Hospital, Bonheiden, Belgium). Informed consent was waived. From March 15 to June 1, 2020, 250 consecutive patients with clinical suspicion of COVID-19 pneumonia were tested with both RT-PCR and CT within 2 hours of hospital admission. Epidemiological, demographic, clinical, and laboratory data at admission were obtained from the electronic patient management system.

Two PCR platforms (Aries system, Luminex, Austin, USA and Rotorgene Q, Qiagen, Hilden, Germany) were used to detect SARS-CoV-2 in nasopharyngeal swabs (eSwab, Copan Diagnostics, Brescia, Italy), both using the E-gene as target. Primers and probe sequences for the E-gene were provided by the Belgian National Reference Center (University Hospitals Leuven, Belgium). No cross-reactivity with other human coronaviruses, influenza, or respiratory syncytial virus (RSV) has been shown for either platform. Part of the patient population has been previously reported in studies assessing the accuracy of chest CT for COVID-19 diagnosis and the impact of gender on the extent of lung injury (5, 8).

Adverse outcome was defined as death or intensive care unit (ICU) admission. In patients with multiple events, only the first event was considered for event-free survival analysis. Only patients with a final outcome (death or discharge) were included in the final analysis. No patients were excluded from analysis after initial inclusion. No adverse event occurred from the chest CT exams.

All patients underwent non-contrast low-dose chest CT by using a Somatom Definition AS 64-slice 0.6 mm detector scanner (Siemens Healthineers, Forchheim, Germany).
We used vendor-supplied software (CareDose 4D and CarekV, Siemens Healthineers) to calculate size-specific radiation dose estimates for the low-dose chest CT protocol, which was adapted from the protocol used for lung cancer screening, with reference values in an average patient of 100 kVp and 20 mAs (9). We used a 0.5-second rotation time and a pitch of 1.2 to limit motion artifacts in dyspneic patients. Effective radiation dose was calculated by multiplying the dose-length product (DLP) by 0.014 mSv/(mGy·cm) as the constant k-value for thoracic imaging (10). Reconstruction parameters were: 1 mm/0.7 mm slice thickness/increment with a standard lung-tissue kernel (I50f medium sharp) and 3 mm/3 mm slice thickness/increment with a standard soft-tissue kernel (I31f medium smooth), sinogram-affirmed iterative reconstruction (SAFIRE) strength 3, 450 mm FOV, and 512 x 512 matrix size.

Visual analysis of lung involvement was performed by using a semi-quantitative scoring system as previously described (5). In short, each lobe was scored from 0 to 5 with a total score ranging from 0 to 25: score 0, 0% involvement; score 1, <5% involvement; score 2, 5-25% involvement; score 3, 26-50% involvement; score 4, 51-75% involvement; score 5, 76-100% involvement. Involvement was visually defined as any area of ground-glass opacity (GGO), crazy-paving, or consolidation, and the percentage was estimated by combining axial, coronal, and sagittal reconstructions. For the semi-quantitative score, a higher number indicated a higher ranking and involvement (e.g., a score of >7 indicates all scores from 8 to 25).

AI-powered analysis of lung involvement was performed at a dedicated workstation using CT pneumonia analysis v.2.0 (Siemens Healthineers, Forchheim, Germany). The algorithm uses non-contrast CT data to automatically identify and 3D-segment both the lung parenchyma and abnormal areas of GGO and consolidation (11).
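The dose conversion and the lobe-wise scoring bands described above can be sketched in a few lines of Python. This is a minimal illustration, not the software used in the study: all function names are hypothetical, and because the published bands are stated for whole percentages (for example, 5-25% and 26-50%), the sketch assumes that each band includes its upper bound so that fractional percentages are also covered.

```python
K_CHEST_MSV_PER_MGY_CM = 0.014  # k-value for thoracic imaging quoted in the text


def effective_dose_msv(dlp_mgy_cm: float) -> float:
    """Effective dose (mSv) = dose-length product (mGy*cm) x k-value."""
    return dlp_mgy_cm * K_CHEST_MSV_PER_MGY_CM


def lobe_score(percent_involved: float) -> int:
    """Map one lobe's percentage of involvement to the 0-5 score.

    Assumption: each band includes its upper bound (e.g. 25.4% scores 3).
    """
    if not 0 <= percent_involved <= 100:
        raise ValueError("percentage must lie between 0 and 100")
    if percent_involved == 0:
        return 0
    if percent_involved < 5:
        return 1
    if percent_involved <= 25:
        return 2
    if percent_involved <= 50:
        return 3
    if percent_involved <= 75:
        return 4
    return 5


def total_ct_score(lobe_percentages) -> int:
    """Sum the five lobe scores into the 0-25 semi-quantitative CT score."""
    if len(lobe_percentages) != 5:
        raise ValueError("expected one percentage for each of the five lobes")
    return sum(lobe_score(p) for p in lobe_percentages)
```

For instance, `effective_dose_msv(43.2)` evaluates to roughly 0.60 mSv, and `total_ct_score([0, 3, 10, 40, 80])` adds the lobe scores 0 + 1 + 2 + 3 + 5.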
The software outputs a percentage of total lung involvement (both GGO and consolidation). This percentage was translated to the same semi-quantitative scoring system used for visual analysis. Segmentation errors were manually corrected by trained readers. In cases of bacterial pneumonia coinfection, the total area of GGO and consolidation was included. The following outcome measures were thus evaluated by the readers:
Semi-quantitative CT score (ranging from 0 to 25): CT scores from visual analysis, AI without manual correction (AI-auto), and AI with manual correction (AI-manual).
Percentage of lung involvement (ranging from 0 to 100%): percentage scores of lung involvement (combined GGO and consolidation) from visual analysis, AI-auto, and AI-manual.
Both metrics of lung involvement are reported because there is precedent for both approaches to assess the extent of lung involvement in COVID-19 (6, 7). The truly quantitative approach with percentages of lung involvement is likely more accurate and will increasingly become available through the rapid development of multiple AI-based software packages for COVID-19. However, we opted to include the semi-quantitative approach as it has been used in early COVID-19 studies with good prognostic value and may be the only approach available to some institutions for the foreseeable future (6).

Intra- and interreader reproducibility were assessed for both visual analysis and AI-based analysis with manual correction. Six radiologists (C.G., A.Da., L.J., Y.D.B., A.De., and R.S.) independently scored the lung involvement on a subset of the patient population. Two cardiothoracic radiologists (C.G. and R.S., with 8 and 7 years of cardiothoracic imaging experience, respectively) assessed reproducibility. One reader (R.S.) reread a random sample of 50 scans after 1 week to assess intrareader reproducibility. Fifty randomly selected cases first read by another reader were reread by C.G. after 1 week to assess interreader reproducibility.

All statistical analysis was performed by using R v.4.0.0 (R Foundation for Statistical Computing, Vienna, Austria). Data were tested for normal distribution with the Shapiro-Wilk test. Summary statistics for all continuous variables are reported as means ± standard deviations (SD) or as medians with interquartile ranges (IQR), as appropriate. Summary statistics for categorical variables are reported as absolute numbers and percentages. For continuous variables, a threshold that balances sensitivity and specificity, as identified by the Youden index, was calculated from receiver operating characteristic (ROC) curve analysis (12). It is important, however, to realize that this is just one approach to cutting the ROC curve, and future, larger studies are needed to determine optimal thresholds considering other predictors of adverse outcome. We assessed discrimination with the 5-fold cross-validated area under the ROC curve (AUROC), reported with corresponding 95% confidence intervals (13). Survival curves were estimated using the Kaplan-Meier method and compared by using the log-rank test. Cox-model results were shown by hazard ratio (HR) estimates with 95% confidence intervals (CI). We checked the proportional-hazards assumption for each variable by testing Schoenfeld residuals and using the double-log plot method. In case of violation of the proportional-hazards assumption, the restricted mean survival time (RMST) was calculated as a measure of average survival from time 0 to a specified time point and estimated as the area under the survival curve (AUC) up to that point (14).

Intra- and interreader agreement were assessed by using intraclass correlation coefficients (ICCs), Bland-Altman analysis with 95% limits of agreement (LOAs), Spearman rank correlation r, and coefficient of variation (CV) (15). A two-way model with measures of agreement was used to calculate the ICC values.
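The Bland-Altman part of the agreement analysis can be illustrated with a short Python sketch. It uses the standard definition of the bias (mean of the paired differences) and the 95% limits of agreement (bias ± 1.96 × SD of the differences); the function name and the sample readings are made up for illustration and are not study data.

```python
from statistics import mean, stdev


def bland_altman(first_read, second_read):
    """Return (bias, lower_loa, upper_loa) for paired measurements."""
    if len(first_read) != len(second_read):
        raise ValueError("both reads must contain the same number of cases")
    if len(first_read) < 2:
        raise ValueError("at least two paired measurements are required")
    diffs = [a - b for a, b in zip(first_read, second_read)]
    bias = mean(diffs)   # systematic difference between the two reads
    sd = stdev(diffs)    # sample SD of the differences (n - 1 denominator)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical percentages of lung involvement from two reads of four scans
reader_1 = [10.0, 20.0, 30.0, 40.0]
reader_2 = [12.0, 18.0, 33.0, 37.0]
bias, lower, upper = bland_altman(reader_1, reader_2)
print(f"bias={bias:.2f}, 95% LOA=({lower:.2f}, {upper:.2f})")
```

A bias close to zero with narrow limits of agreement indicates that two reads can be used interchangeably; wide limits indicate poor reproducibility even when the average difference is small.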
ICCs of >0.75 and of 0.40-0.75 were considered to indicate strong and average agreement, respectively. A difference between ICCs was considered to be statistically significant when there was no overlap between their respective 95% CI limits. There were no missing data elements for the analyses. P<0.05 was considered to indicate a statistically significant difference.

Sample size estimates were derived from the interreader SD of lung involvement as described by Machin and Altman (16, 17). The sample size required by chest CT to show a change with 90% power and an α error of 0.05 was calculated by using the following formula:

$$ n = f(\alpha, P) \cdot \sigma^{2} \cdot \frac{2}{\delta^{2}} $$

where α is the significance level, P is the study power, f is the value of the factor for different values of α and P (f = 10.5 for a P of 90% and an α error of 0.05), σ is the interstudy standard deviation, δ is the desired percentage difference to be detected, and n is the sample size needed (18). Chest CT reproducibility and sample size were calculated for both a visual and an AI-assisted analysis, as defined above.

Patient demographics, CT findings and dose parameters, and outcome data are summarized in Table 1. The mean age for all patients was 67 years ± 17 (SD), with fever, cough, and dyspnea as the most frequent clinical symptoms at presentation. Median time between symptom onset and ER presentation with RT-PCR and chest CT was 7 days (IQR: 4-10 days). Median time between CT scan acquisition and report was 20 minutes (IQR: 12-42 minutes). Median time for automated AI analysis was 9 minutes (IQR: 8-9 minutes), which increased to 12 minutes (IQR: 8-13 minutes) with manual correction. Manual correction was required in 154 patients (65.6%). However, manual correction changed the percentage of lung involvement by more than 1% in only 33 patients (13.2%) when compared to the automated AI analysis (Figure 3F). Mean DLP for all patients was 43.2 ± 24.9 mGy·cm, resulting in an effective radiation dose of 0.60 ± 0.35 mSv (Table 1).
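As a rough illustration of the sample-size calculation in the statistical methods, the sketch below evaluates n = f(α, P) · σ² · 2/δ² in Python. Two things are assumptions rather than statements from the source: the function and constant names are hypothetical, and the coefficients of variation reported in the abstract (17.2% for AI-assisted and 34.7% for visual analysis) are plugged in directly as σ.

```python
import math

F_90_POWER_ALPHA_05 = 10.5  # f(alpha, P) for 90% power and alpha = 0.05, as given in the text


def sample_size_per_group(sigma: float, delta: float, f: float = F_90_POWER_ALPHA_05) -> int:
    """Evaluate n = f * sigma^2 * 2 / delta^2, rounded up to a whole patient.

    sigma: interstudy standard deviation (same units as delta)
    delta: difference to be detected
    """
    if sigma <= 0 or delta <= 0:
        raise ValueError("sigma and delta must be positive")
    return math.ceil(f * 2 * sigma ** 2 / delta ** 2)


# Assumption: the reported coefficients of variation are used as sigma
n_ai = sample_size_per_group(sigma=17.2, delta=5.0)
n_visual = sample_size_per_group(sigma=34.7, delta=5.0)
print(n_ai, n_visual)
```

Under these assumptions the sketch yields 249 and 1012 patients, close to the 250 and 1014 reported in the study; the small gap may reflect rounding of the published variability values. Because n scales with σ², halving the measurement variability cuts the required sample size roughly fourfold.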
Adverse outcome occurred in 39 patients ( for AI analysis with manual correction (Figure 1). AUROC values from automated AI analysis and AI analysis with manual correction were significantly higher than those from visual analysis for both semi-quantitative CT scores and percentages of lung involvement (all P<0.001). Kaplan-Meier curve analysis using the identified cutoffs showed that these values could be used to predict patient outcome (P<0.001 by log-rank test for all analyses) (Figure 2). Visually, it was clear that most adverse events occurred within the first week after chest CT, which was confirmed by analysis of Schoenfeld residuals showing violation of the proportional-hazards assumption (20). The restricted mean survival time (RMST) was estimated at 1 week, and the difference and ratio of RMST were estimated by bootstrap simulation (Table 2). For example, for AI analysis with manual correction, a percentage of lung involvement of more than 20.5% resulted in an RMST difference of -2.5 days (95% CI: -3.2 to -1.7 days) and an RMST ratio of 0.640 (95% CI: 0.539-0.760), which significantly favored the group with less lung involvement (both P<0.001). Additional Kaplan-Meier curves with groups based on quartiles of lung involvement are presented in Figure E1.

Intrareader agreement was high for both visual and AI-assisted analysis with manual correction (Figure E2). Interreader agreement was also high for both visual and AI-assisted analysis with manual correction (Figure E3).

For semi-quantitative CT scores, visual analysis demonstrated average agreement with AI-assisted analysis without and with manual correction (ICC: 0.670 and 0.682, respectively), whereas the agreement between both AI-assisted analyses was excellent (ICC: 0.990). Overall, no significant bias was observed with Bland-Altman analysis across the different types of CT analysis (Table 4, Figure 3).
However, in patients with more extensive lung involvement, there was a tendency for visual analysis to yield higher semi-quantitative CT scores when compared to AI-assisted analysis (Figure 3A-3B). For quantitative percentage of lung involvement, visual analysis demonstrated excellent agreement with AI-assisted analysis without and with manual correction (ICC: 0.873 and 0.871, respectively). Agreement between both AI-assisted analyses, however, was even better (ICC: 0.997). No significant bias was observed with Bland-Altman analysis across the different types of CT analysis (Table 4, Figure 3). Example analyses are shown in Figures 4 and 5.

On the basis of the interreader variability of chest CT, we estimated sample sizes needed to detect significant decreases in lung involvement during a clinical trial (Figure 6). For example, a clinical trial intended to show a change of 5% in lung involvement over time (i.e., a change from 20% to 15% in lung involvement) with a power of 90% would require 250 patients in each group for an AI-assisted analysis, whereas 1014 patients would be required in each group for a visual analysis.

The extent of lung involvement on chest CT in COVID-19 patients has important prognostic value and is associated with short-term clinical deterioration. Improved risk stratification of COVID-19 patients is crucial for cost-effective patient management by prompting safe hospital discharge of low-risk patients and prolonged in-hospital and follow-up surveillance of high-risk patients. The role of chest CT as a potential tool for COVID-19 diagnosis has been extensively studied with conflicting recommendations, ranging from using CT as a first-line screening modality to warnings against its overuse and a false sense of security (21, 22). Our results suggest that chest CT may be viewed as a risk stratification tool rather than a diagnostic tool per se.
However, it is important to realize that chest CT should not be viewed as the sole prognosticator in COVID-19 subjects, as multiple clinical and biochemical factors have been previously shown to be associated with adverse outcome (4, 8, 23, 24).

Importantly, we found that an AI-assisted approach improved patient risk stratification

Previous studies, however, have suggested very low interstudy variability in lung volume and nodule assessment on chest CT exams (28, 29). Therefore, interstudy variability can be

Hosts and sources of endemic human coronaviruses
Clinical features of patients infected with 2019 novel coronavirus in Wuhan
World Health Organization. Coronavirus disease (COVID-19) outbreak
Baseline Characteristics and Outcomes of 1591 Patients Infected With SARS-CoV-2 Admitted to ICUs of the Lombardy Region
Accuracy and reproducibility of low-dose submillisievert chest CT for the diagnosis of COVID-19
CT image visual quantitative evaluation and clinical classification of coronavirus disease (COVID-19)
Well-aerated lung on admitting chest CT to predict adverse outcome in COVID-19 pneumonia
Impact of gender on extent of lung injury in COVID-19
Reduced lung-cancer mortality with volume CT screening in a randomized trial
The 2007 recommendations of the International Commission on Radiological Protection
Estimation of the Youden Index and its associated cutoff point
Package 'cvAUC'
The use of restricted mean survival time to estimate the treatment effect in randomized clinical trials when the proportional hazards assumption is in doubt
Statistical methods for assessing agreement between two methods of clinical measurement
Sample sizes for clinical, laboratory and epidemiology studies
Practical statistics for medical research
Coronary CT Angiography: Variability of CT Scanners and Readers in Measurement of Plaque Volume
Frailty and mortality in hospitalized older adults with COVID-19: retrospective observational study
Why Test for Proportional Hazards?
Correlation of chest CT and RT-PCR testing in coronavirus disease 2019 (COVID-19) in China: a report of 1014 cases
A role for CT in COVID-19? What data really tell us so far
Clinical predictors of mortality due to COVID-19 based on an analysis of data of 150 patients from Wuhan, China
Presenting characteristics, comorbidities, and outcomes among 5700 patients hospitalized with COVID-19 in the New York City area
Automatic segmentation of MR brain images with a convolutional neural network
Segmenting retinal blood vessels with deep neural networks
Automated segmentation of lungs with severe interstitial lung disease in CT
Lung volume reproducibility under ABC control and self-sustained breath-holding
Evaluating variability in tumor measurements from same-day repeat CT scans of patients with non-small cell lung cancer
Feasibility of Dose-reduced Chest CT with Photon-counting Detectors: Initial Results in Humans