key: cord-0833856-ulescamr
authors: Wickstrom, K.; Vitelli, V.; Carr, E.; Holten, A. R.; Bendayan, R.; Reiner, A. H.; Bean, D.; Searle, T.; Shek, A.; Kraljevic, Z.; Teo, J. T.; Dobson, R.; Tonby, K.; Kohn-Luque, A.; Amundsen, E. K.
title: Regional performance variation in external validation of four prediction models for severity of COVID-19 at hospital admission: An observational multi-centre cohort study
date: 2021-03-26
journal: nan
DOI: 10.1101/2021.03.26.21254390
sha: e6dea12675d4c386e897eb71a632e4e06b0249e0
doc_id: 833856
cord_uid: ulescamr

Background: Several prediction models for coronavirus disease 2019 (COVID-19) have been published. Prediction models should be externally validated to assess their performance before implementation. This observational cohort study aimed to validate published models of severity for hospitalized patients with COVID-19 using clinical and laboratory predictors.

Methods: Prediction models fitting relevant inclusion criteria were chosen for validation. The outcome was either mortality or a composite outcome of mortality and ICU admission (severe disease). 1295 patients admitted with symptoms of COVID-19 at King's College Hospital (KCH) in London, United Kingdom, and 307 patients at Oslo University Hospital (OUH) in Oslo, Norway, were included. The performance of the models was assessed in terms of discrimination and calibration.

Results: We identified two models for prediction of mortality (referred to as Xie and Zhang1) and two models for prediction of severe disease (Allenbach and Zhang2). The performance of the models was variable. For prediction of mortality, Xie had good discrimination at OUH, with an area under the receiver operating characteristic curve (AUROC) of 0.87 [95% confidence interval (CI) 0.79-0.95], and acceptable discrimination at KCH, with an AUROC of 0.79 [0.76-0.82]. In prediction of severe disease, Allenbach had acceptable discrimination (OUH AUROC 0.81 [0.74-0.88] and KCH AUROC 0.72 [0.68-0.75]). The Zhang models had moderate to poor discrimination. Initial calibration was poor for all models but improved with recalibration.

Conclusions: The performance of the four prediction models was variable. The Xie model had the best discrimination for mortality, while the Allenbach model had acceptable results for prediction of severe disease.

Introduction

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) was discovered in Wuhan, China, in December 2019. The virus was shown to cause viral pneumonia, later designated coronavirus disease 2019 (COVID-19) [1]. The disease has evolved into a pandemic with a large number of severe cases and high mortality [2]. Several biomarkers and clinical and epidemiological parameters have been associated with disease severity [3, 4]. Practical tools for prediction of prognosis in COVID-19 patients are still lacking in clinical practice [5, 6]. Prediction models can be crucial for prioritizing patients needing hospitalization, intensive care treatment, or future individualized therapy.

Since the onset of the pandemic, the number of prediction models for COVID-19 patients has been growing continuously [7]. Prediction models should be validated in different populations, with a sufficient number of patients reaching the outcome, before implementation [8-10]. A validation study of 22 prediction models at one site was recently published [6].
Interestingly, this study found that none of the models performed better than oxygen saturation alone, even though the performance at the original study sites was in most cases much better.

This study aimed to validate published prediction models of severity and mortality for hospitalized COVID-19 patients, based on laboratory and clinical values, in cohorts from London (United Kingdom) and Oslo (Norway).

Methods

The study is reported according to the guidelines in "Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis" (TRIPOD) [11] and also follows recommendations from the "Prediction model Risk Of Bias ASsessment Tool" (PROBAST) [12].

Literature searches were performed with the words "COVID-19" and "prediction model" or "machine learning" or "prognosis model". Prediction models included in the review by Wynants et al. [7] in May 2020 were also investigated, as were articles and preprints citing Wynants et al., identified with Google Scholar on 18 May 2020.

The inclusion criteria for selection of multivariable prediction models were: (1) symptomatic hospitalized patients over 18 years with PCR-confirmed COVID-19; (2) outcomes including respiratory failure, intensive care unit (ICU) admission, death, or composite outcomes of these; (3) the predictive models had to include at least one laboratory test, as we wanted to explore models that combined clinical and laboratory variables; (4) […].

Predictive variables were collected at admission to the emergency department (ED). If not available in the ED, the first available values within 24 hours of hospital admission were used. Missing values (i.e., no recorded values within 24 hours) were generally imputed using k-nearest neighbors.

Results

The inclusion process is illustrated in Figure 1. However, since one of the models was developed at KCH, […]. Information on the predictor variables and outcomes of the four models is summarized in Table 1.

All prediction models were based on multiple logistic regression and presented coefficients and intercepts for the different variables, which enabled calculation of risk predictions for our cohorts (a minimal sketch of this calculation is given below). Allenbach additionally provided an 8-point scoring system derived from the logistic regression model; however, we used the regression model for our calculations, as it retains as much information as possible.
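Because each model publishes an intercept and one coefficient per predictor, a predicted probability can be computed directly with the logistic function. The sketch below illustrates the calculation for a hypothetical two-predictor model; the coefficient values and predictor names are placeholders for illustration, not the published Xie, Zhang, or Allenbach coefficients.

```python
import math

def predicted_risk(intercept: float, coefficients: dict, values: dict) -> float:
    """Predicted probability from a published logistic regression model:
    p = 1 / (1 + exp(-(intercept + sum(beta_i * x_i))))."""
    linear_predictor = intercept + sum(
        beta * values[name] for name, beta in coefficients.items()
    )
    return 1.0 / (1.0 + math.exp(-linear_predictor))

# Placeholder coefficients for illustration only, not the published values.
model = {"intercept": -4.0, "coefficients": {"age": 0.05, "ldh_u_per_l": 0.004}}
patient = {"age": 70, "ldh_u_per_l": 450}
print(f"Predicted risk: {predicted_risk(model['intercept'], model['coefficients'], patient):.2f}")
```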
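The Methods state that missing admission values were imputed with k-nearest neighbors, and the comparison just below found no AUROC differences across imputation methods. A minimal sketch of KNN imputation follows; the analyses in the paper were run in R [19], so the scikit-learn implementation, the choice of k, and the example values here are illustrative assumptions.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Illustrative admission values (age, CRP, LDH); np.nan marks values
# not recorded within 24 hours of admission. In practice features are
# typically standardized first so distances are comparable.
X = np.array([
    [60.0, 45.0, np.nan],
    [72.0, np.nan, 310.0],
    [55.0, 30.0, 250.0],
    [81.0, 120.0, 540.0],
])

# Each missing entry is replaced by the mean of that feature over the
# k nearest patients, with nearness measured on the observed features.
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```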
A comparison of imputation methods showed no differences between AUROCs calculated with the different methods (see Table S2). Thus, the simple k-nearest-neighbor imputation method was used for the rest of this paper. At KCH, the number of missing values was very high for LDH (87.8%) and relatively high for SpO2 (33.3%) and the WHO scale (33.8%).

The OUH cohort consisted of 307 patients and the KCH cohort of 1295 patients (Figure S1). In the OUH cohort the median age was 60 years with 57% males, while in the KCH cohort the median age was 69 years with 59% males. In the OUH cohort, 32 patients died in hospital […].

The percentage of patients with hypertension and diabetes was higher in the KCH cohort (54% and 35%, respectively) than in the OUH cohort (34% and 21%, respectively). The patients at KCH also had higher levels of CRP, creatinine, and LDH, and possibly a lower number of lymphocytes, than the OUH patients; all of these are known predictors of severe COVID-19.

In Table 3, univariate associations with mild/moderate versus severe disease are presented for the KCH and OUH cohorts. In general, the same variables were predictive of severe disease at KCH and OUH, except for ischemic heart disease, temperature, and platelets, which were associated with severe disease at OUH but not at KCH.

The validation of the four prediction models in both the OUH and KCH cohorts is presented in terms of discrimination (AUROC) and calibration (slope and intercept) in Table 2. […] For the Xie and Allenbach models, discrimination at OUH was similar to that in the development cohorts (Figure 2), and, although the difference was not statistically significant at the 0.05 significance level, we found better discrimination for both of these models at OUH than at KCH.

The calibration plots (after recalibration) are shown in Figure 3. Figure S3 in the supplementary material shows the calibration results before and after recalibration for the Xie and Allenbach models. Recalibration will not render models with poor discrimination more useful; thus, we focused on recalibration of the Xie and Allenbach models, as these had the best discrimination. Recalibration […].
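Discrimination above is reported as an AUROC with a 95% CI. The paper does not state how its intervals were derived, so the percentile bootstrap in the sketch below is an assumption; the data are synthetic and for illustration only.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Illustrative data: y is the observed outcome (1 = died), p is the
# model's predicted risk for each patient.
y = rng.integers(0, 2, size=300)
p = np.clip(y * 0.3 + rng.uniform(0, 0.7, size=300), 0, 1)

auroc = roc_auc_score(y, p)

# Percentile bootstrap for a 95% confidence interval.
boot = []
for _ in range(2000):
    idx = rng.integers(0, len(y), size=len(y))
    if len(np.unique(y[idx])) < 2:  # resample must contain both classes
        continue
    boot.append(roc_auc_score(y[idx], p[idx]))
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"AUROC {auroc:.2f} [95% CI {lo:.2f}-{hi:.2f}]")
```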
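Recalibration of the intercept adjusts predictions toward the local outcome frequency without changing the ranking of patients, so discrimination is unaffected (the logit shift is monotone). The paper does not give its exact procedure; the sketch below shows one standard approach, logistic recalibration with the calibration slope fixed at 1 via an offset, and should be read as an assumption rather than the authors' implementation.

```python
import numpy as np
import statsmodels.api as sm

def logit(p):
    return np.log(p / (1 - p))

def recalibrate_intercept(y, p):
    """Estimate a corrected intercept with the calibration slope fixed at 1:
    fit logit(P(y=1)) = a + logit(p), i.e. an intercept-only logistic
    regression with logit(p) as an offset, then shift all predictions by a."""
    model = sm.GLM(y, np.ones_like(p), family=sm.families.Binomial(),
                   offset=logit(p))
    a = model.fit().params[0]
    return 1 / (1 + np.exp(-(a + logit(p))))

# Illustrative data: predictions systematically too high for this cohort.
rng = np.random.default_rng(1)
p = np.clip(rng.uniform(0.05, 0.9, 500), 0.01, 0.99)
y = rng.binomial(1, np.clip(p - 0.15, 0.01, 0.99))

p_recal = recalibrate_intercept(y, p)
print(f"mean predicted {p.mean():.2f} -> {p_recal.mean():.2f}, "
      f"observed {y.mean():.2f}")
```

Fitting the slope as a free coefficient instead of fixing it at 1 would give the calibration slope reported alongside the intercept in Table 2.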
Discussion

In this study, we validated four prediction models for prognosis in hospitalized COVID-19 patients from London, UK, and Oslo, Norway. We found varying performance of the models in the two cohorts. The models performed better in the OUH cohort, with discrimination similar to that in the original studies. The Xie and Allenbach models had the best performance for prediction of death and severe disease, respectively.

Initial calibration was poor for all models but improved after recalibration of the intercept according to the frequency of the outcome in our cohorts. This improves the accuracy of the prediction for each patient without affecting discrimination and is recommended in several publications [5, 11, 18]. Local, or possibly regional or national, recalibration is likely to be important for COVID-19 prediction models, since the frequency of severe disease and death varies widely across studies.

In some cases, we found poorer discrimination in the validation cohorts than in the development cohorts. This is consistent with past evidence showing discrimination in development cohorts to be better than at external validation, owing to overfitting and differences in the characteristics of the cohorts [23, 24]. The cohorts in the original studies and at KCH and OUH differed in many respects, such as mortality, age, and the frequencies of severe disease and comorbidities. The UK and Norway differ in the structure of their healthcare systems, and the incidence of COVID-19 has been far higher in the UK. These factors may have affected the selection of patients for hospital and ICU admission, which might have resulted in a patient population that was more homogeneous with regard to severity at KCH. Discrimination is expected to be poorer when the population is more homogeneous.

These findings underline the importance of validation at several external sites. This is particularly true for a new disease like COVID-19, with rapidly developing treatment guidelines and an overwhelming effect on healthcare resources in some locations but not in others.

[…] which is not a substantial improvement over single univariate predictors of severity. Thus, the finding that the Xie and Allenbach models perform well both at the original study sites and at our […]. SpO2 is a strong predictor of mortality, while LDH is probably a weaker predictor [6]. The number of missing values at OUH was low and probably did not affect the validation.

In conclusion, following the TRIPOD guidelines, our study validated previously developed models for prediction of prognosis in COVID-19 and showed that these models have variable performance in different cohorts. The Xie and Allenbach models clearly had the best performance, and we suggest that these models be included in future studies of COVID-19 prediction models. However, the performance of these models at our two validation sites was not similar, which […]
References

1. Clinical course and risk factors for mortality of adult inpatients with COVID-19 in Wuhan, China: a retrospective cohort study.
2. Clinical course and mortality risk of severe COVID-19.
3. Hematologic, biochemical and immune biomarker abnormalities associated with severe illness and mortality in coronavirus disease 2019 (COVID-19): a meta-analysis. Clinical Chemistry and Laboratory Medicine.
4. Can we predict the severity of COVID-19 with a routine blood test? Polish Archives of Internal Medicine.
5. Performance of prediction models for COVID-19: the Caudine Forks of the external validation.
6. Systematic evaluation and external validation of 22 prognostic models among hospitalised adults with COVID-19: an observational cohort study.
7. Prediction models for diagnosis and prognosis of covid-19 infection: systematic review and critical appraisal.
8. Internal and external validation of predictive models: a simulation study of bias and precision in small samples.
9. Prognosis and prognostic research: validating a prognostic model.
10. Sample size considerations for the external validation of a multivariable prognostic model: a resampling study.
11. Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD): explanation and elaboration.
12. PROBAST: a tool to assess the risk of bias and applicability of prediction model studies.
13. CogStack - experiences of deploying integrated information retrieval and extraction services in a large National Health Service Foundation Trust hospital.
14. Evaluation and improvement of the National Early Warning Score (NEWS2) for COVID-19: a multi-hospital study.
15. Biological responses to COVID-19: insights from physiological and blood biomarker profiles.
16. Review: a gentle introduction to imputation of missing values.
17. Multiple imputation using chained equations: issues and guidance for practice.
19. R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing.
20. Development and external validation of a prognostic multivariable model on admission for hospitalized patients with COVID-19.
21. Risk prediction for poor outcome and death in hospital in-patients with COVID-19: derivation in Wuhan.
22. Multivariable prediction model of intensive care unit transfer and death: a French prospective cohort study of COVID-19 patients.
24. External validation of prognostic models: what, why, how, when and where?