key: cord-0885718-wl3hk7n3
authors: Ottenhoff, M. C.; Ramos, L. A.; Potters, W.; Janssen, M. L.; Hubers, D.; Pina-Fuentes, D.; Thomas, R.; van der Horst, I. C.; Herff, C.; Kubben, P.; Elbers, P. W.; Marquering, H. A.; Welling, M.; Hu, S.; Simsek, S.; de Kruif, M. D.; Dorman, T.; Fleuren, L. M.; Schinkel, M.; Noordzij, P. G.; van den Bergh, J. P.; Wyers, C. E.; Buis, D. T.; Wiersinga, J.; van den Hout, E. H.; Reidinga, A. C.; Rusch, D.; Sigaloff, K. C.; Douma, R. A.; de Haan, L.; Fridgeirsson, E. A.; Gritters van de Oever, N. C.; Rennenberg, R. J.; van Wingen, G.; Aries, M. J.; Beudel, M.
title: Predicting mortality of individual COVID-19 patients: A multicenter Dutch cohort
date: 2020-10-13
journal: nan
DOI: 10.1101/2020.10.10.20210591
sha: ed8ca93f7298990ba9e789ce760afd8c85e1eab0
doc_id: 885718
cord_uid: wl3hk7n3

Objective: Develop and validate models that predict mortality of SARS-CoV-2 infected patients admitted to the hospital. Design: Retrospective cohort study. Setting: A multicenter cohort across ten Dutch hospitals including patients from February 27 to June 8, 2020. Participants: SARS-CoV-2 positive patients (age ≥ 18) admitted to the hospital. Main Outcome Measures: 21-day mortality evaluated by the area under the receiver operating characteristic curve (AUC), sensitivity, specificity, positive predictive value and negative predictive value. The predictive value of age was explored by comparison with age-based rules used in practice and by excluding age from analysis. Results: 2273 patients were included, of whom 516 had died or had been discharged to palliative care within 21 days after admission. Five feature sets, including premorbid, clinical presentation and laboratory & radiology values, were derived from 80 features. Additionally, an ANOVA-based data-driven feature selection selected the ten features with the highest F-values: age, number of home medications, urea nitrogen, lactate dehydrogenase, albumin, oxygen saturation (%), whether oxygen saturation was measured on room air, whether oxygen saturation was measured on oxygen therapy, blood gas pH and history of chronic cardiac disease. A linear logistic regression (LR) and a non-linear tree-based gradient boosting (XGB) algorithm fitted the data with an AUC of 0.81 (95% confidence interval 0.77 to 0.85) and 0.82 (0.79 to 0.85), respectively, using the ten selected features. Both models outperformed age-based decision rules used in practice (AUC of 0.69, 0.65 to 0.74 for age > 70). Furthermore, performance remained stable when excluding age as a predictor (AUC of 0.78, 0.75 to 0.81). Conclusion: Both models showed excellent performance and had better test characteristics than age-based decision rules, using ten admission features readily available in Dutch hospitals. The models hold promise to aid decision making during a hospital bed shortage.

The first wave of the COVID-19 pandemic had a dramatic effect on our society and severely disrupted our daily lives, economies and healthcare systems. During the peak of the first wave, hospitals and intensive care units (ICU) throughout Europe were overwhelmed and resources were exhausted. Implementation of public health policies reduced the infection rate; however, there is a considerable risk that relaxation of these policies leads to a second pandemic wave, of which the first signs are already seen in different European countries. [1] The progression of the Spanish flu pandemic showed that a second wave could impose an even higher demand on the healthcare system and result in a higher number of casualties than the first wave.

Given the novelty of the virus, accurate information about the clinical course and prognosis of individual patients is still largely lacking, which led to the use of crude limits to unilaterally withhold advanced life support measures when facing the large numbers of patients with pulmonary insufficiency during the first wave. Although criticized, several hospitals in Europe have already used age as the sole triage criterion. [2] Many publications have developed and evaluated triage selection criteria, but there remains a significant knowledge gap and the final criteria are subject to socio-ethical debate. [3] [4] [5] Preferably, triage is averted, but when necessary, the decision should be based on medical criteria with an evidence base. Since March 2020, many studies have been published regarding the clinical characteristics of patients suffering from a SARS-CoV-2 infection in both smaller (n=58 [6], n=200 [7]) and larger cohorts (n > 5000 [8] [9] [10]). However, these studies have reported notable differences in the clinical characteristics that were associated with an adverse outcome. Importantly, these studies only provide information about clinical characteristics and risk factors at group level, and therefore do not provide information about the prognosis for individual patients. To prevent reuse of crude limits when hospital and ICU resources are exhausted during a second wave, a prognostic model using multivariable analysis could be of great value. Such a model can provide information about the individual patient's chance of survival, despite largely unknown underlying risk factors.

Within the ongoing socio-ethical debate in the Netherlands on whether age should be included in the triage selection criteria, such a predictive model could make it possible to exclude age or to include it in combination with clinical characteristics. All published prognostic models are continuously reviewed by Wynants et al. [11], who identified 145 prediction models, of which 23 were tailored towards predicting mortality. The authors identified that all studies were at high risk of bias and likely to underperform in clinical practice. However, a recent paper, not yet reviewed by Wynants et al., showed promising results on predicting mortality with excellent performance, using a very large cohort (n > 50,000) from the United Kingdom. [12] The uncertainty and risk of bias in almost all published COVID-19 related prognostic models stress the importance of thorough methodology in variable selection, internal and external model validation and performance evaluation. [11] In addition, it is important that a constant interplay between data scientists and clinicians is in place during model development.

medRxiv preprint doi: https://doi.org/10.1101/2020.10.10.20210591; this version posted October 13, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.

Furthermore, studies developed and performed independently with similar methodology are more valuable than ever to reduce the uncertainty of published models and the risk of spurious publications. [13, 14] Therefore, a prognostic model that predicts 21-day mortality was developed and evaluated, utilizing data from 2273 SARS-CoV-2 infected patients from 10 hospitals across the Netherlands.

[...] crisis and in accordance with national guidelines and European privacy law, the need for informed consent was waived and an opt-out procedure was communicated by press release.

Despite this, individual centers used local guidelines to obtain consent retrospectively from patients or representatives. In all centers, measures were taken to ensure adequate and safe data pseudonymisation and storage.

To support decisions on (ICU) treatment during scarcity, we aimed to predict an unfavorable outcome of COVID-19 patients at hospital admission. Given the amount of data, predicting each possible outcome, such as mortality, palliative care, discharge, and continued hospitalization, could increase the risk of biased models and overfitting. Therefore, the prediction goal was modelled as a binary classification problem, where an unfavorable outcome corresponds to patients that either died or were discharged for palliative care within 21 days after hospital admission. Palliative discharge refers to discharge to end-of-life care that focuses on patient comfort rather than treatments with curative intentions. A favorable outcome corresponds to patients that were discharged to home, nursing homes or rehabilitation units within 21 days and patients that were alive and still hospitalized at 21 days after hospital admission. Patients that were still hospitalized but for fewer than 21 days, were transferred to other hospitals (including transfers to participating hospitals), were re-admitted or had an unknown outcome were excluded from further analysis.
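The outcome definition above can be sketched as a small labelling function. This is a minimal illustration, not the authors' code: the category names (`"death"`, `"home"`, `"hospitalized"`, and so on) are hypothetical, and treating day 21 itself as "within 21 days" and an event after day 21 as "alive and hospitalized at 21 days" are assumptions about boundary cases the text does not spell out.

```python
from typing import Optional

# Hypothetical category names; the source does not list the database codes.
UNFAVORABLE = {"death", "palliative_discharge"}
FAVORABLE_DISCHARGE = {"home", "nursing_home", "rehabilitation"}
EXCLUDED = {"transfer", "readmitted", "unknown"}


def label_outcome(outcome: str, day: int, horizon: int = 21) -> Optional[int]:
    """Return 1 (unfavorable), 0 (favorable) or None (excluded from analysis).

    `day` is the number of days between admission and the recorded event, or
    between admission and data export for patients still in hospital.
    """
    if outcome in EXCLUDED:
        return None
    if outcome == "hospitalized":
        # Still admitted: usable only if follow-up reached the horizon.
        return 0 if day >= horizon else None
    if outcome not in UNFAVORABLE | FAVORABLE_DISCHARGE:
        raise ValueError(f"unknown outcome category: {outcome!r}")
    if day > horizon:
        # Assumption: the event happened after day 21, so the patient was
        # alive and hospitalized at the horizon.
        return 0
    return 1 if outcome in UNFAVORABLE else 0


labels = [
    label_outcome("death", 5),
    label_outcome("home", 10),
    label_outcome("hospitalized", 12),
    label_outcome("hospitalized", 25),
]
```

Records labelled `None` would be dropped before modelling, which mirrors the exclusion of patients without a usable 21-day outcome.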

The rCCRF was filled in manually by a large team of researchers and doctors the electronic [...]

Furthermore, collinearity was assessed by a Pearson correlation matrix.

The obtained models could ultimately change the clinicians' decision and thus directly influence the life of a patient. It is therefore of utmost importance that the obtained models are both robust and interpretable. [17] To comply with these requirements, two models with a fundamentally different modelling approach were selected: a logistic regression (LR) that fits the data linearly, and a tree-based gradient boosting algorithm that fits the data non-linearly.

The models were implemented using the Python 3 libraries Scikit-learn [18] and XGBoost (XGB) [19], respectively. Both models can be interpreted relatively easily and XGB often shows state-of-the-art results in multiple tasks. The models were trained and validated using leave-one-hospital-out cross-validation (LOHO-cv). By iteratively training the models on all but one hospital and testing performance on the left-out hospital, the performance of the model represents the ability to predict the outcome on independent data and thereby incorporates possible data heterogeneity between hospitals. To prevent skewed performance on individual folds due to a small number of samples, we combined the data from the two hospitals with the smallest number of samples and considered them as a single hospital in LOHO-cv for further analysis. In addition to LOHO-cv, internal 10-fold random subsampling cross-validation using data from all hospitals was performed to facilitate a comparison of the results with other studies that typically only perform internal cross-validation.
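As an illustration of LOHO-cv, the sketch below uses scikit-learn's `LeaveOneGroupOut` with the hospital identifier as the group label and merges the two smallest hospitals into one fold first. It is a minimal sketch on synthetic data, not the authors' pipeline: the helper names `merge_two_smallest` and `loho_auc` are hypothetical, and a plain logistic regression stands in for the tuned models.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneGroupOut


def merge_two_smallest(groups):
    """Relabel the two smallest groups so that they form a single fold."""
    groups = np.asarray(groups).copy()
    labels, counts = np.unique(groups, return_counts=True)
    smallest = labels[np.argsort(counts)[:2]]
    groups[groups == smallest[1]] = smallest[0]
    return groups


def loho_auc(X, y, groups):
    """Train on all hospitals but one and report the AUC on the left-out one."""
    aucs = {}
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
        model = LogisticRegression(max_iter=1000)
        model.fit(X[train_idx], y[train_idx])
        prob = model.predict_proba(X[test_idx])[:, 1]
        aucs[int(groups[test_idx][0])] = roc_auc_score(y[test_idx], prob)
    return aucs


# Synthetic stand-in data: four "hospitals" of different sizes.
rng = np.random.default_rng(0)
sizes = [60, 70, 120, 150]
hospital = np.repeat(np.arange(len(sizes)), sizes)
X = rng.normal(size=(sum(sizes), 3))
y = (X[:, 0] + rng.normal(size=sum(sizes)) > 0).astype(int)

merged = merge_two_smallest(hospital)
fold_aucs = loho_auc(X, y, merged)
```

Each entry of `fold_aucs` is the performance on a hospital the model never saw during training, so the spread across entries gives a rough view of between-hospital heterogeneity.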

Features that had more than 50% missing values and subsequently patient records that had more than 80% missing values were removed. The remaining missing values were imputed using a multivariate iterative imputer (trained on the training set and applied to the test set).

The imputer, a Bayesian ridge regression, models the missing values in each feature as a function of all other features and therefore provides a more sophisticated approach than traditional imputation methods, such as mean, median or mode imputation. [20] After imputation, each feature was scaled to its interquartile range (IQR). IQR scaling is known to be robust to outliers and often gives better results than z-score or min-max scaling. [21] The data was then split into folds using LOHO-cv, where each iteration consists of a training fold with nine hospitals and a test fold with one hospital. The data-driven feature selection of set (6) was performed on the training fold by selecting the ten features showing the highest ANOVA F-value. Because the training fold consists of a different set of hospitals in each iteration, the selected features with the highest F-values can differ due to heterogeneity between hospitals. To be able to describe the ten most predictive features in further analysis, the features selected most often over all iterations are presented. If two feature sets were selected equally often, the set with the highest summed F-values was chosen. Both missing value imputation and feature selection were fitted on the training fold only and then applied to the test fold. After feature selection, both models were fitted and parameters optimized by a 50-iteration randomized grid search using a stratified shuffle split cross-validation. A schematic overview of all the processing steps is shown in figure 1 and the grid search parameters are shown in supplementary table 2. All code in the pipeline was implemented using the Scikit-learn Python package. [18] To adhere to guidelines on transparent reporting of multivariable prediction models, the TRIPOD checklist is included in supplementary table 3. [22] All code used in this paper is available at DOI:10.5281/zenodo.4077342
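The preprocessing and tuning steps described above can be sketched with standard scikit-learn components. This is a hedged illustration rather than the published pipeline (which is available at the DOI above): the function name `build_search`, the search space for `C` and the number of shuffle splits are assumptions, `RobustScaler` is used as the IQR-based scaler, and a logistic regression stands in for both models (an `xgboost.XGBClassifier` could be placed in the last step instead).

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge, LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, StratifiedShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler


def build_search(k=10, n_iter=50, n_splits=5, seed=0):
    """Imputation, IQR scaling, ANOVA selection and a randomized search."""
    pipeline = Pipeline([
        # Each feature's missing values are modelled from the other features.
        ("impute", IterativeImputer(estimator=BayesianRidge(), random_state=seed)),
        # RobustScaler centres on the median and scales by the IQR.
        ("scale", RobustScaler()),
        # Keep the k features with the highest ANOVA F-value.
        ("select", SelectKBest(score_func=f_classif, k=k)),
        ("model", LogisticRegression(max_iter=1000)),
    ])
    return RandomizedSearchCV(
        pipeline,
        param_distributions={"model__C": list(np.logspace(-3, 2, 50))},  # assumed space
        n_iter=n_iter,
        scoring="roc_auc",
        cv=StratifiedShuffleSplit(n_splits=n_splits, test_size=0.2, random_state=seed),
        random_state=seed,
    )


# Small synthetic example with roughly 10% missing values.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 8))
y = (X[:, 0] + 0.5 * rng.normal(size=150) > 0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan

search = build_search(k=5, n_iter=4, n_splits=2)  # reduced settings for speed
search.fit(X, y)
selected = search.best_estimator_.named_steps["select"].get_support()
```

Because every step sits inside one `Pipeline`, the imputer, scaler and feature selector are re-fitted on the training portion of each split and only applied to the held-out portion, which keeps information from the test data out of the fitted transformations.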

Model performance was assessed using area under the curve (AUC), sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV). Except for AUC, the metrics require a binary classification rather than a predicted likelihood, and therefore the cutoff threshold was tuned to the shortest distance to the upper-left corner in the receiver operating characteristic (ROC) plot; this is referred to as the 'optimal' threshold in further analysis. In addition, a confusion matrix was derived over the complete dataset and for each center, also tuned to the optimal threshold.
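The threshold rule above can be written out directly: for every candidate cutoff, compute sensitivity and specificity and keep the cutoff whose ROC point lies closest to (false positive rate 0, sensitivity 1). The sketch below is a plain-Python illustration with made-up probabilities; the function names are hypothetical and both classes are assumed to be present.

```python
import math


def confusion_counts(y_true, y_prob, threshold):
    """Count TP, FP, TN and FN when predicting 1 for prob >= threshold."""
    tp = fp = tn = fn = 0
    for truth, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and truth == 1:
            tp += 1
        elif predicted == 1 and truth == 0:
            fp += 1
        elif predicted == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def ratio(numerator, denominator):
    return numerator / denominator if denominator else float("nan")


def summarize(y_true, y_prob, threshold):
    tp, fp, tn, fn = confusion_counts(y_true, y_prob, threshold)
    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


def optimal_threshold(y_true, y_prob):
    """Cutoff with the shortest distance to the upper-left ROC corner."""
    best, best_distance = None, math.inf
    for threshold in sorted(set(y_prob)):
        metrics = summarize(y_true, y_prob, threshold)
        distance = math.hypot(1 - metrics["sensitivity"], 1 - metrics["specificity"])
        if distance < best_distance:
            best, best_distance = threshold, distance
    return best


# Illustrative values only, not study data.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.70, 0.90]
cutoff = optimal_threshold(y_true, y_prob)
metrics = summarize(y_true, y_prob, cutoff)
```

This criterion weights sensitivity and specificity equally; a triage setting that penalizes false positives more heavily could justify a different cutoff.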

During a large influx of patients suffering from life-threatening lung infections, ICU capacity is most likely to be exhausted first because of the limited number of beds and limited invasive ventilation capacity. It is therefore important to analyze whether the model also performs well on ICU-admitted patients, as triage might be dependent on ICU capacity. In the Netherlands, triage was prevented by distributing patients to districts with fewer admissions or to German hospitals.

However, possible bias may already be present in the selection of patients, because, for example, certain patients might not be admitted to the ICU because of old age, premorbid characteristics, presentation with multi-organ failure or the patients' own wishes regarding treatment restrictions. For these reasons, both LR and XGB performances were assessed by training on the complete dataset and on the ICU patient subgroup.

To compare the models to clinical practice, the performance was compared with two age-based decision rules that have been applied in practice during crisis. [2] The rules were translated as follows: 1) If age is above 70 then the outcome is considered unfavorable and 2) If age is above 80 then the outcome is considered unfavorable.
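An age-based rule can be scored like any other binary classifier. Since such a rule yields a single operating point on the ROC plot, the trapezoidal AUC reduces to the mean of sensitivity and specificity. The sketch below illustrates this with invented ages and outcomes; whether the reported AUCs for the rules were computed in exactly this way is an assumption, and the function names are hypothetical.

```python
def age_rule(age, cutoff):
    """Predict an unfavorable outcome (1) when age is above the cutoff."""
    return 1 if age > cutoff else 0


def rule_auc(ages, outcomes, cutoff):
    """AUC of a single-threshold rule: (sensitivity + specificity) / 2."""
    predictions = [age_rule(age, cutoff) for age in ages]
    pairs = list(zip(predictions, outcomes))
    tp = sum(1 for p, o in pairs if p == 1 and o == 1)
    fn = sum(1 for p, o in pairs if p == 0 and o == 1)
    tn = sum(1 for p, o in pairs if p == 0 and o == 0)
    fp = sum(1 for p, o in pairs if p == 1 and o == 0)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Invented example values, not study data.
ages = [45, 62, 71, 75, 83, 88, 66, 79]
outcomes = [0, 0, 0, 1, 1, 0, 0, 1]
auc_70 = rule_auc(ages, outcomes, cutoff=70)
auc_80 = rule_auc(ages, outcomes, cutoff=80)
```

A rule that flags almost nobody or almost everybody ends up near 0.5 under this formula, because high specificity is then paid for with low sensitivity or the reverse.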

Furthermore, it was assessed whether age is important for the final prediction, in order to contribute to the ongoing socio-ethical debate in the Netherlands. In July 2020, a discussion between ethicists, medical professionals and policy makers was started about criteria for triage to decide which patients receive ICU care during an acute hospital care shortage. The main point of discussion was that the Dutch government was firmly opposed to using an age-based decision rule because it is in violation of the constitution, which states that everyone should be treated equally and that discrimination on any ground is illegal. To contribute to this discussion, the effect of age on the best performing model was assessed by retraining the model on the same feature set while excluding age as a feature.


The database included 2527 patients from ten different hospitals as of June 8, 2020. 223 patients were excluded because they had no recorded outcome at the time of analysis (e.g. the patient was still in the hospital, but for fewer than 21 days). Subsequently, 31 patients were excluded because they did not have a confirmed COVID-19 infection, resulting in 2273 patients included for modelling. Of these patients, 1195 were discharged home and not re-admitted, 76 were discharged to a nursing home and 232 were discharged to a rehabilitation unit. Furthermore, 509 patients died and 7 patients were discharged to palliative care. Of the remaining 254 patients that were still in the hospital at 21 days after admission, 112 patients were at the ward or medium care and 142 in the ICU. In total, the data included 516 unfavorable outcomes and 1757 favorable outcomes. To better balance the samples per hospital, the two smallest hospitals (n=59 and n=70) were combined. The resulting ratio of unfavorable outcomes to total patients per hospital was 19% (n=261), 14% (n=169), 10% (n=118), 31% (n=317), 14% (n=113), 21% (n=401), 27% (n=325), 27% (n=440) and 19% (n=129).

Figure 2A shows a comparison of the AUCs per feature set.

Sensitivity and specificity were comparable between the algorithms. Overall, the NPV was high and the PPV was low, as the number of patients with a favorable outcome was considerably higher than the number of patients with an unfavorable outcome. This implies that the model can make accurate predictions of favorable outcomes, but less accurate predictions of unfavorable outcomes. All results are shown in table 2. The results from internal cross-validation were comparable and are shown in supplementary table 5.
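The dependence of PPV and NPV on class balance follows from Bayes' rule and can be checked numerically. In the sketch below, the prevalence of an unfavorable outcome (516 of 2273) is taken from the cohort, while the sensitivity and specificity of 0.75 are purely illustrative placeholders and not the values reported in table 2.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given outcome prevalence."""
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    false_pos = (1 - specificity) * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


prevalence = 516 / 2273  # unfavorable outcomes in the cohort
# Placeholder test characteristics, chosen only to illustrate the effect.
ppv, npv = predictive_values(sensitivity=0.75, specificity=0.75, prevalence=prevalence)
```

With these placeholder values the PPV comes out at roughly 0.47 and the NPV at roughly 0.91, even though sensitivity and specificity are equal, which shows how a prevalence of about 23% alone pushes the two predictive values apart.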

The between-hospital performance variation was small for both algorithms. XGB showed narrow 95% confidence intervals in AUC (widths of 0.02 to 0.06) and a low standard deviation (0.01). LR showed wider confidence intervals (0.04 to 0.07) with an equal standard deviation (0.01). The narrow confidence intervals indicate that the models fitted the data robustly, with XGB being more robust than LR. The robustness is supported by the relatively equal ratios between correct and incorrect predictions, as shown in figure 4, which shows the confusion matrix per hospital for XGB trained on the 10 best predicting features, using the optimal threshold derived from the complete dataset.

With increased duration of stay within the hospital, the uncertainty of the patients' outcome may also increase. The patients' chance of survival might change, because patients that have a longer hospital stay are likely to have a more complicated clinical course and/or get different types of treatments. Additionally, a prolonged hospital stay simply allows more events to happen. To assess whether the models' performance changes with the duration of hospital stay, the patients were grouped by duration of stay and the performance per group was assessed. The result, presented in figure 5, shows that model performance does not deteriorate as the duration of hospital stay increases: the fraction of correct predictions remains between 0.6 and 0.9 and shows no trend.
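The per-duration analysis amounts to grouping predictions by day of outcome and computing correct / (correct + incorrect) in each group. The sketch below uses invented values and a hypothetical function name; pooling all stays of 21 days or longer into the day-21 group follows the description in the caption of figure 5.

```python
from collections import defaultdict


def accuracy_per_day(days, y_true, y_pred, max_day=21):
    """Map each day to (number correct, fraction correct) for that group."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for day, truth, predicted in zip(days, y_true, y_pred):
        bucket = min(day, max_day)  # stays of max_day or longer are pooled
        total[bucket] += 1
        correct[bucket] += int(truth == predicted)
    return {day: (correct[day], correct[day] / total[day]) for day in sorted(total)}


# Invented example values, not study data.
days = [1, 1, 2, 2, 2, 21, 30]
y_true = [0, 1, 0, 0, 1, 0, 0]
y_pred = [0, 0, 0, 1, 1, 0, 1]
per_day = accuracy_per_day(days, y_true, y_pred)
```

Reporting the absolute count next to the fraction matters here, because groups late in the stay contain few patients and their fractions are correspondingly noisy.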


[...] (AUC 0.86, 0.82 to 0.89). Compared with the results on the complete dataset, the performance dropped notably on ICU patients, with the AUC decreasing by 0.04 to 0.20. The confidence intervals also widened, overall ranging from 0.03 to 0.18. Although the discriminative power of the models decreased, the decrease is considered acceptable, as the initially best performing feature sets decreased only slightly and retained small confidence intervals. The decrease was expected, given that performance on a smaller subgroup is inevitably lower. In addition, the prognosis of ICU-admitted patients might change, for example, due to receiving distinct interventions only available at the ICU. When applied with caution, the models' performance on ICU patients should not impede possible application in practice.


Of the 2273 patients, the age of 19 patients was missing and these patients were thus excluded from this analysis. Of the remaining 2254 patients, 1061 were older than 70 and 415 were older than 80. The age > 70 rule therefore 'predicted' that 1193 patients would survive and 1061 would die. For age > 80 the predictions were 1839 and 415, respectively. Age > 70 showed an AUC of 0.69 (0.65 to 0.74) whereas age > 80 showed a lower AUC (0.61, 0.57 to 0.65). Figure 6 shows the confusion matrices of LR and XGB trained on the 10 best features and of both age-based decision criteria. To compare both models with the age-based rules, the results were tuned to the shortest distance to the upper-left corner in the ROC plot. Both LR and XGB show a higher AUC than either age-based decision criterion. The results show that the presented models can outperform triage rules applied earlier during crises and can thus provide better information based on individual medical data.

The best performing model, XGB-10, was retrained and evaluated without age as a feature. While the performance was expected to drop significantly, given that age was the most predictive feature by both the feature selection and SHAP analysis (Figure 3) [...] it was decided not to exclude features beforehand. Nonetheless, the high variance inflation factor (VIF) indicates that the information present in age is latently present in two or more other features, which could explain the retained performance.


The mortality of individual SARS-CoV-2 infected patients can be predicted at hospital admission with excellent performance using both linear (LR; AUC 0.81, 0.77 to 0.85) and non-linear (XGB; 0.82, 0.79 to 0.85) models using 10 admission features that are readily available in most hospitals. Both models showed improved performance over age-based decision rules already used in practice during acute hospital bed shortage. [2] Although XGB trained on all 80 features and on the 10 best features performed comparably, the model using the 10 best features may be preferred for easier translation to clinical practice.

To the best of our knowledge, the presented models are based on one of the world's largest multi-hospital cohorts (www.covid-predict.org) of hospitalized COVID-19 patients for which the detailed admission and clinical course has been systematically tracked by clinical personnel.

The present cohort represents approximately 16% of the total hospital admissions in the Netherlands due to COVID-19 [NICE, consulted October 7th]. [24] We strived to develop robust models and reduce the risk of bias as much as possible, for example by adhering to the recommendations by Wynants et al. [11]. Hence, there was continuous interaction between data scientists and clinicians to achieve the best of both worlds. However, our data only concerns a Dutch cohort and therefore it is uncertain whether the performance remains comparable when tested on patients from other countries, where different decision-making processes could be in place. Furthermore, our models were able to make accurate predictions about favorable outcomes, given the high negative predictive value, but less accurate predictions about unfavorable outcomes. A higher positive predictive value would reduce the number of false positives, though this skewed performance should not hurt the main intended purpose of the model, namely selecting patients based on a good estimated prognosis.

Compared with other prognostic studies, the current study not only has a larger population (N=2273) but also uses external validation with data from multiple hospitals.

One recent study by Knight et al. [12] showed similar results using a very large multicenter cohort (n > 50,000) in the United Kingdom. The authors presented two models that predicted mortality with excellent performance using comparable methodology and predictive models.

Given the worldwide pressure to publish new information quickly, which has also led to retracted papers in high-end journals [13] [14], independent studies showing similar results are more important than ever, as they reduce the risk of reporting overoptimistic or spurious results.

Ten features were derived that were considered the most predictive features for an unfavorable outcome: age, number of home medications, admission blood values (urea nitrogen, LDH and albumin), oxygen saturation (%), blood gas pH and history of chronic cardiac disease. These measures are readily available in all Dutch hospitals and in most hospitals worldwide, making the model easily applicable for prognostication or as part of a hospital triage tool. These factors may reflect premorbid factors (medications, albumin and cardiac history), disease severity and duration (LDH, urea nitrogen) and hypoxic and respiratory burden (oxygen saturation, pH) at hospital presentation. A socio-ethical consideration is whether age should be included. Based on our results, age can be considered an evidence-based predictor in combination with other features in times of crisis and scarcity. Nevertheless, our results also indicate that the models perform only slightly worse without age as a predictor, enabling model deployment when the results of a socio-ethical debate prohibit the use of age.

The current models show predictive value for hospitalized SARS-CoV-2 infected patients in a large Dutch multicenter population. The next step towards application in practice is to validate the model by testing whether the models function as expected and if they truly add value to triage. Furthermore, this paper only utilized data from the Netherlands and thus it is unknown whether the models also work on data from other countries. It might be that certain prognostic factors are country specific, related to different organizational structures or decision making in the healthcare systems.


Both LR and XGB showed excellent performance using the 10 best features, and outperformed age-based rules, with or without age included in the features. The results suggest that XGB using the 10 best features could significantly improve decision making during an acute hospital bed shortage during a COVID-19 crisis and this model is therefore recommended to be developed into a clinical tool.

Not all patients provided active informed consent, and therefore data cannot be shared. The code used in this study is made publicly available and can be found at DOI:10.5281/zenodo.4077342

The ethical boards of the Amsterdam University Medical Centers (20.131) and Maastricht University Medical Center approved the study protocol (MUMC: 2020-1323)

The lead author (the manuscript's guarantor) affirms that the manuscript is an honest, accurate, and transparent account of the study being reported; that no important aspects of the study have been omitted; and that any discrepancies from the study as originally planned (and, if relevant, registered) have been explained.

No funding to declare.

FIGURES

Figure 1 - A schematic overview of all steps involved, from data acquisition to model evaluation. The dotted line depicts the step only used during feature selection of the 10 best features.

Figure 2 - Panel A: Overall performance of both models per feature set. All models perform well above chance level. XGB generally performs better than LR, except on the premorbid feature set, where both models performed equally. The highest performance was achieved by XGB on both all features and the 10 selected features. Panel B: The confusion matrix of the best performing model, XGB trained on the 10 selected features. The prediction threshold was tuned to the shortest distance to the upper-left corner of the ROC plot to create the 'optimal' binary prediction.

Figure 3 - SHAP values of XGB trained on all features. To prevent readability issues, only the top 20 features are shown and the SHAP value range is set from -1.5 to 1.5, visually cutting off a few outliers. The color of each data point depicts the magnitude of the feature value, where red corresponds to high values and blue to low values. SHAP values above 0 suggest a positive association with the outcome. Given that the outcome is defined as mortality within 21 days, positive SHAP values translate to an association with higher mortality.

Figure 4 - Confusion matrix per center as predicted by XGB trained on the 10 selected features. The prediction threshold is optimized by the shortest distance to the upper-left corner in the ROC plot of the complete dataset. All matrices show comparable distributions, though center 4 shows relatively many false positives.

Figure 5 - Performance per day for the XGB trained on the 10 selected features. The left y-axis shows the absolute number of correct predictions and the right y-axis the relative number of correct predictions. Relative performance was calculated as correct / (correct + incorrect) and was well above chance level (0.5) for all days. The results indicate robust performance, as the relative performance showed no decrease over time, while varying between 0.6 and 0.9. The absolute performance shows that most patients have an outcome (either favorable or unfavorable) within one week after admission. The high number of patients at day 21 is caused by the aggregation of all patients who were in the hospital for 21 days or longer. LR on the 10 best features shows similar performance (figure not shown).
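The per-day calculation in this caption, including the pooling of patients at day 21, can be sketched as below. The function name `performance_per_day`, the record layout and the toy values are assumptions for illustration.

```python
from collections import defaultdict

def performance_per_day(records, max_day=21):
    """records: iterable of (outcome_day, y_true, y_pred).
    Returns {day: (n_correct, fraction_correct)}."""
    counts = defaultdict(lambda: [0, 0])  # day -> [correct, incorrect]
    for day, y_true, y_pred in records:
        bucket = min(day, max_day)  # pool day 21 and later into one bin
        counts[bucket][0 if y_true == y_pred else 1] += 1
    # Relative performance = correct / (correct + incorrect)
    return {d: (c, c / (c + w)) for d, (c, w) in sorted(counts.items())}

# Toy records (illustrative values only).
records = [
    (2, 1, 1), (2, 0, 0), (2, 0, 1),
    (5, 0, 0),
    (21, 0, 0), (30, 0, 1), (45, 1, 1),
]
per_day = performance_per_day(records)
```

In the toy data, the patients with outcomes at days 21, 30 and 45 all land in the day-21 bin, which is why that bin can hold more patients than its neighbors.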

Figure 6 - LR and XGB trained on the 10 selected features compared with two age-based decision rules. Both LR and XGB showed a higher AUC than both age-based rules. 19 patients did not have a value for age and were excluded from this analysis.
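A comparison of this kind can be sketched by scoring a binary age rule and a model's predicted probabilities with the same AUC function, after dropping patients without a recorded age. The cutoff of age > 70 is the rule named in the abstract; the helper `auc`, the tuple layout and all patient values are assumptions made up for the example.

```python
def auc(y_true, scores):
    """AUC as the probability that a positive case outscores a negative one
    (ties count as one half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy patients: (age or None, model probability, died within 21 days).
patients = [
    (82, 0.70, 1), (75, 0.40, 0), (68, 0.55, 1), (55, 0.10, 0),
    (None, 0.30, 0), (91, 0.85, 1), (73, 0.20, 0), (60, 0.15, 0),
]

# Exclude patients without an age so both predictors see the same cases.
with_age = [p for p in patients if p[0] is not None]
y = [p[2] for p in with_age]
rule_score = [1 if p[0] > 70 else 0 for p in with_age]  # age-based rule
model_score = [p[1] for p in with_age]                  # model output

auc_rule = auc(y, rule_score)
auc_model = auc(y, model_score)
```

A binary rule produces only two distinct scores, so many positive-negative pairs are tied and count as one half each. This limits the AUC such a rule can reach compared with a continuous risk score.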
