key: cord-0741464-ai4kswga
authors: Ye, Jiru; Hua, Meng; Zhu, Feng
title: Machine Learning Algorithms are Superior to Conventional Regression Models in Predicting Risk Stratification of COVID-19 Patients
date: 2021-07-29
journal: Risk Manag Healthc Policy
DOI: 10.2147/rmhp.s318265
sha: a8ccbc1ff3d93be1af056ba96bc78dd9e09a7aea
doc_id: 741464
cord_uid: ai4kswga

BACKGROUND: It is very important to determine the risk of patients developing severe or critical COVID-19, but most of the existing risk prediction models are established using conventional regression models. We aim to use machine learning algorithms to develop predictive models and compare their predictive performance with that of logistic regression models.
METHODS: The medical records of 161 COVID-19 patients who were diagnosed between January and April 2020 were retrospectively analyzed. The patients were divided into two groups: an asymptomatic-moderate group (132 cases) and a severe or above group (29 cases). The clinical features and laboratory biomarkers of these two groups were compared. Machine learning algorithms and multivariate logistic regression analysis were used to construct two COVID-19 risk stratification prediction models, and the area under the curve (AUC) was used to compare the predictive efficacy of these two models.
RESULTS: A machine learning model was constructed based on seven characteristic variables: high sensitivity C-reactive protein (hs-CRP), procalcitonin (PCT), age, neutrophil count (Neuc), hemoglobin (HGB), percentage of neutrophils (Neur), and platelet distribution width (PDW). The AUC of the model was 0.978 (95% CI: 0.960–0.996), which was significantly higher than that of the logistic regression model (0.827; 95% CI: 0.724–0.930) (P=0.002). Moreover, the machine learning model's sensitivity, specificity, and accuracy were better than those of the logistic regression model.
CONCLUSION: Machine learning algorithms improve the accuracy of risk stratification in patients with COVID-19.
Using detection algorithms derived from these techniques can enhance the identification of critically ill patients. receive timely intervention to minimize the disease's progression early. For this purpose, some COVID-19 risk prediction models have been developed and studied, but most of them were established using conventional regression models. [7] [8] [9] With the development of science and technology, various machine learning algorithms and artificial intelligence technologies have been widely used in patient tracking, vaccine development, and patient screening due to their better extensibility and faster processing ability. 10, 11 However, applying machine learning algorithms and artificial intelligence in identifying disease progression and estimating the risk of death is relatively rare. 12 Based on the general data and laboratory indexes of COVID-19 patients, we developed two multivariate prediction models using machine learning algorithms and multivariate logistic regression analysis to predict the risk stratification of COVID-19 patients and compared the prediction performance of the two models. This was a retrospective cross-sectional study. We analyzed the medical record of 170 patients treated for novel coronavirus infection in the negative pressure ward of Wuxi Fifth People's Hospital between January and April 2020. This study was conducted in accordance with the Declaration of Helsinki and was approved by the Institutional Ethics Committee for retrospective analysis (No. 2021-001-1). Since the patients' medical information was anonymous, informed consent from the participants was not a requirement. Patient inclusion criteria: ① Novel coronavirus nucleic acid positive detected by Real-time fluorescence RT-PCR. ② In line with the diagnostic criteria of the Diagnosis and Treatment Protocol for Novel Coronavirus Pneumonia (trial version 7). 13 Patient exclusion criteria: Patients under 15 years old were excluded (nine patients). 
The general data, complications, and routine laboratory test results of all patients were recorded. The clinical classification of COVID-19 patients was mostly based on symptoms and imaging findings. According to the standard established in the Diagnosis and Treatment Protocol for Novel Coronavirus Pneumonia (trial version 7) published by the General Office of the National Health Commission of China, 13 COVID-19 patients were classified as mild, moderate, severe, and critical cases. Patients with mild COVID-19 showed mild clinical symptoms without any pneumonia sign on images. Moderate cases showed fever, respiratory symptoms, and imaging findings of pneumonia. Patients who met any of the definitions described below were considered severe cases: respiratory distress (RR ≥30 breaths/min); oxygen saturation ≤93% at rest; arterial oxygen partial pressure/fraction of inspired oxygen ≤300 mmHg. Patients who met any of the criteria described below were considered critical cases: respiratory failure requiring mechanical ventilation; shock; combined with other organ failures that required ICU care. Patients whose pulmonary imaging showed that the lesions significantly progressed over 50% during 24-48 hours were managed as severe cases. For the treatment of COVID-19 patients, antiviral therapy such as interferon-α, lopinavir/ritonavir, and ribavirin can be used. Effective oxygen therapy can be given in time, and traditional Chinese medicine can also be used selectively. For severe and critical cases, in addition to the treatment to relieve the symptoms, extra care should be given to actively prevent and treat complications, treat basic diseases, prevent secondary infection, and provide organ function support in time. 
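The clinical classification criteria described above can be sketched as a simple rule cascade. This is only an illustration under stated assumptions: the `Findings` record, its field names, and the `classify` function are hypothetical and not part of the study, and the thresholds are the ones quoted in the text (RR ≥30 breaths/min, oxygen saturation ≤93% at rest, PaO2/FiO2 ≤300 mmHg, lesion progression over 50% within 24-48 hours).

```python
from dataclasses import dataclass


@dataclass
class Findings:
    """Hypothetical admission record; field names are illustrative only."""
    pneumonia_on_imaging: bool
    respiratory_rate: float          # breaths/min
    spo2_at_rest: float              # oxygen saturation, %
    pao2_fio2: float                 # PaO2/FiO2, mmHg
    lesion_progression_over_50pct: bool = False   # within 24-48 hours
    mechanical_ventilation: bool = False          # respiratory failure needing ventilation
    shock: bool = False
    organ_failure_needing_icu: bool = False


def classify(f: Findings) -> str:
    """Apply the criteria in order of severity; the first match wins."""
    # Critical: any one of the three criteria is sufficient.
    if f.mechanical_ventilation or f.shock or f.organ_failure_needing_icu:
        return "critical"
    # Severe: any one criterion, or rapid lesion progression managed as severe.
    if (f.respiratory_rate >= 30
            or f.spo2_at_rest <= 93
            or f.pao2_fio2 <= 300
            or f.lesion_progression_over_50pct):
        return "severe"
    # Moderate: imaging findings of pneumonia without severe criteria.
    if f.pneumonia_on_imaging:
        return "moderate"
    # Mild: symptoms without any pneumonia sign on images.
    return "mild"


if __name__ == "__main__":
    example = Findings(pneumonia_on_imaging=True, respiratory_rate=32,
                       spo2_at_rest=96, pao2_fio2=350)
    print(classify(example))
```

Checking the most severe category first reflects that a patient who meets a critical criterion will usually also meet a severe one; the protocol itself remains the authoritative definition.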
The following laboratory indexes were recorded: white blood cell count (WBC, reference range: 3.50-9.50×10^9/L), the percentage of neutrophils (Neur, reference range: 40.00-75.00%), neutrophil count (Neuc, reference range: 1.80-6.30×10^9/L), the percentage of lymphocytes (Lymr, reference range: 20.00-50.00%), lymphocyte count (Lymc, reference range: 1.10-3.20×10^9/L), the percentage of monocytes (Monr, reference range: 3.00-8.00%), monocyte count (Monc, reference range: 0.10-0.60×10^9/L), red blood cell count (RBC, reference range: female, 3.80-5.10×10^12/L; male, 4.30-5.80×10^12/L), hemoglobin (HGB, reference range: female, 115-150 g/L; male, 130-175 g/L), hematocrit (HCT, reference range: female, 35.0-45.0%; male, 40.0-50.0%), platelet count (PLT, reference range: 125-350×10^9/L), red blood cell distribution width (RDW, reference range: 11.50-14.90%), plateletcrit (PTC, reference range: 0.108-0.272 L/L), mean platelet volume (MPV, reference range: 6.00-11.50 fL), platelet distribution width (PDW, reference range: 15.50-18.10 fL), high sensitivity C-reactive protein (hs-CRP, reference range: 0-10 mg/L), and procalcitonin (PCT, reference range: 0-0.05 ng/mL). The routine blood test was performed using a Sysmex XN9000 hematology analyzer (Sysmex Corporation, Hyogo, Japan); hs-CRP was determined using the HP-083/4 protein analyzer (Hipro Biotechnology, Shijiazhuang, China); and PCT was detected using an Autobio A2000PLUS automatic chemiluminescence analyzer (Sym-Biotechnology, Suzhou, China). 14 Continuous variables were presented as the mean ± SD or median (Q1-Q3), and categorical variables were presented as frequency (%). The unpaired Student's t-test or the Mann-Whitney nonparametric test was used to compare continuous variables. The Pearson chi-square test and the Fisher exact test were used to analyze categorical variables. The general data and laboratory indicators for predicting risk stratification (see Tables 1 and 2) were used to train a machine learning model (eXtreme Gradient Boosting, XGBoost).
15 The optimized model hyperparameters were set as follows: booster = gbtree, objective = binary:logistic, eta = 0.3, gamma = 5, max_depth = 6, min_child_weight = 1, subsample = 1, colsample_bytree = 1. XGBoost has been reported to achieve state-of-the-art results in various medical applications and has been used in many winning entries in machine learning competitions. To evaluate the importance of the features used by the model, three importance scores (Gain, Cover, and Frequency) were calculated, of which Gain is the most relevant attribute for explaining the relative importance of each feature. The indicators selected by the machine learning algorithm were used as the independent variables and the patient risk stratification as the dependent variable; multivariate logistic regression was used to establish a prediction model containing all independent variables, and the best model parameters (the intercept and the regression coefficient of each independent variable) were estimated. For each case, the XGBoost model generated the predicted probability (P) that the patient was diagnosed as severe or above. Different cutoffs of the predicted probability were used to determine the stratification of the patient; for each cutoff, the corresponding sensitivity and specificity were calculated, a receiver operating characteristic (ROC) curve was drawn, and the area under the curve (AUC) and its 95% confidence interval (CI) were calculated. For the logistic regression model, the ROC curve was drawn from each case's Logit (P), and the AUC and 95% CI were calculated. The DeLong test was used to compare whether the AUCs of the two models were significantly different. R software, version 3.4.3 (http://www.R-project.org), was used for all statistical analyses. P<0.05 was considered statistically significant. Our study eventually included 161 COVID-19 patients, including 90 males and 71 females, with an average age of 46.7 ± 16.1 years (range: 15-91 years).
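The ROC procedure described in the methods (sweep cutoffs over the predicted probability, record sensitivity and specificity at each, then integrate) can be sketched in a few lines. This is a minimal illustration, not the authors' R code: the function names `roc_points` and `auc` and the toy labels and probabilities are assumptions made up for the example, and neither the 95% CI nor the DeLong test is implemented here.

```python
def roc_points(labels, probs):
    """Return (1 - specificity, sensitivity) pairs, one per distinct cutoff.

    labels: 1 for "severe or above", 0 otherwise.
    probs:  predicted probability of the positive class for each case.
    Both classes must be present in labels.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # cutoff above every score: nobody flagged
    for cutoff in sorted(set(probs), reverse=True):
        # A case is predicted positive when its probability reaches the cutoff.
        tp = sum(1 for y, p in zip(labels, probs) if p >= cutoff and y == 1)
        fp = sum(1 for y, p in zip(labels, probs) if p >= cutoff and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


if __name__ == "__main__":
    # Toy data for illustration only; not taken from the study.
    toy_labels = [0, 0, 1, 1]
    toy_probs = [0.1, 0.4, 0.35, 0.8]
    print(auc(roc_points(toy_labels, toy_probs)))
```

Using every distinct predicted probability as a cutoff means tied scores move along the curve together, so the trapezoidal area matches the usual rank-based AUC.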
Based on the symptoms and imaging findings at hospital admission, the patients were classified as 12 asymptomatic cases, 32 mild cases, 88 moderate cases, 27 severe cases, and 2 critical cases of COVID-19 (15 patients with moderate COVID-19 at admission progressed to severe cases two days after admission and thus they were classified as severe cases). Based on the clinical significance of treatment, we divided the patients into two groups: an asymptomatic-moderate group (132 cases) and a severe or above group (29 cases). We compared the general data between the two groups ( Table 1 ) and found that compared with the asymptomatic-moderate group, the patients in severe group were older (P<0.001), and the number of patients with hypertension, diabetes, and cerebrovascular disease was remarkably higher (all P<0.05). We noticed that the number of patients who had tumor history was also higher in severe group, but the increase was not statistically significant (P=0.084). Comparing the laboratory indexes between the two groups, we found that the percentage of neutrophils, hs-CRP, and PCT were markedly higher in severe group than in asymptomatic-moderate group (P<0.001), but the percentage of lymphocytes, percentage of monocytes, lymphocyte count, monocyte count, red blood cell count, hemoglobin and hematocrit were all remarkably lower in severe group than in asymptomatic-moderate group ( Table 2 , P<0.05). The machine learning algorithms were applied to predict the risk stratification using the general data and laboratory indexes of COVID-19 patients. The goal of classification was to identify critically ill COVID-19 patients. We sorted various factors according to their importance to the risk prediction (Table 3 and Figure 1 ) and found that hs-CRP, PCT, and age were the top three risk factors, followed by four routine hematological indexes: Neuc, HGB, Neur, and PDW. 
Among these factors, hs-CRP and PCT were more important in COVID-19 risk prediction than the four routine hematological indexes. Next, we established another prediction model using multivariate logistic regression analysis, with the above seven parameters as independent variables and whether the patient was diagnosed as a severe or above case as the dependent variable: Logit (P) = −7.05139 + 2.31599 × procalcitonin + 0.00264 × high sensitivity C-reactive protein + 0.06364 × age + 0.14735 × neutrophil count + 0.02677 × percentage of neutrophils − 0.00751 × hemoglobin + 0.02564 × platelet distribution width. P in the above formula is the probability that the patient is diagnosed as a severe or above case. Comparing the ROC curves of the machine learning model and the logistic regression model in predicting the clinical classification of COVID-19 (Figure 2), we found that the AUC value was significantly different between the two models (0.978 vs 0.827, P=0.002), and the sensitivity, specificity, and accuracy of the machine learning model were all better than those of the logistic regression model (Table 4). Our research showed that the machine learning model constructed with seven characteristic variables, including hs-CRP, PCT, age, Neuc, HGB, Neur, and PDW, has good COVID-19 risk prediction ability, which can help physicians predict disease progression and intervene in time. According to the feature importance in the XGBoost model, hs-CRP and PCT are important parameters for predicting severe COVID-19. The expression level of hs-CRP is usually low, but it increases rapidly and significantly during acute inflammation. Therefore, hs-CRP is a sensitive biomarker of inflammation, infection, and tissue damage.
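The logistic regression equation reported above can be evaluated directly. The sketch below copies the published intercept and coefficients; the dictionary keys, the function names, and the example patient are hypothetical, and the units are assumed to be those given with the reference ranges in the methods (PCT in ng/mL, hs-CRP in mg/L, age in years, Neuc in ×10^9/L, Neur in %, HGB in g/L, PDW in fL).

```python
import math

# Intercept and coefficients as printed in the Logit (P) equation.
INTERCEPT = -7.05139
COEFFICIENTS = {
    "pct": 2.31599,      # procalcitonin
    "hs_crp": 0.00264,   # high sensitivity C-reactive protein
    "age": 0.06364,
    "neuc": 0.14735,     # neutrophil count
    "neur": 0.02677,     # percentage of neutrophils
    "hgb": -0.00751,     # hemoglobin
    "pdw": 0.02564,      # platelet distribution width
}


def logit_p(patient):
    """Linear predictor: intercept plus the weighted sum of the seven variables."""
    return INTERCEPT + sum(COEFFICIENTS[name] * patient[name] for name in COEFFICIENTS)


def probability(patient):
    """Convert Logit (P) to P, the probability of a severe or above case."""
    return 1.0 / (1.0 + math.exp(-logit_p(patient)))


if __name__ == "__main__":
    # Hypothetical patient, for illustration only (not from the study data).
    patient = {"pct": 0.10, "hs_crp": 40.0, "age": 70, "neuc": 5.0,
               "neur": 80.0, "hgb": 120.0, "pdw": 16.0}
    print(round(logit_p(patient), 3), round(probability(patient), 3))
```

The inverse-logit step matters because the equation yields Logit (P), not P itself; a cutoff applied to one is not interchangeable with a cutoff applied to the other.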
16 Previous reports have also suggested that hs-CRP is an important biomarker for poor prognosis in COVID-19 patients and reflects a persistent state of inflammation, 17, 18 which may interact closely with the inflammatory storm, leading to lung injury and pulmonary edema in COVID-19 patients. 19, 20 PCT is a glycoprotein with no hormonal activity. It is a precursor of calcitonin 21,22 that can be used as a biomarker to assess the severity of sepsis and the prognosis of patients with sepsis. 23 It can also be used to guide antibiotic treatment. Some studies have shown that the level of PCT is positively correlated with the progression of COVID-19. 24, 25 It has been reported that the angiotensin converting enzyme 2 (ACE2) receptor is expressed in vascular endothelial cells. SARS-CoV-2 can bind to ACE2 receptors and invade host cells, which leads to endothelial dysfunction, increases the possibility of a cytokine storm, and produces a series of immune responses, 3,26-28 resulting in adverse clinical outcomes and death. Since the immune function of severe COVID-19 patients is low, these patients are more prone to infection, 8 which leads to an increase in inflammatory markers. Wang et al 29 also suggested that 81.7% of the deaths in COVID-19 patients were associated with a bacterial infection. Although there is a correlation between the above inflammatory markers and the severity of COVID-19, the role of these inflammatory markers in the pathogenesis of COVID-19 is not fully understood and needs further verification and in-depth study. Age is also an important determinant in the XGBoost model. SARS-CoV-2 infection may aggravate the chronic inflammation associated with conditions such as hypertension, diabetes, and cardiovascular disease in elderly patients, leading to death. Therefore, old age is also a risk factor for severe COVID-19.
30 In addition, the routine blood indexes Neuc, HGB, Neur, and PDW are COVID-19 risk prediction factors in the model, and the key roles of these markers have been confirmed in other reports. 32, 33 However, the importance of hs-CRP and PCT in COVID-19 risk prediction is much higher than that of the routine blood indexes. It has been reported that machine learning algorithms can make fuller use of clinical parameters and improve the accuracy of diagnosis. 34 Our results showed that the XGBoost model has good prediction performance: its specificity, sensitivity, and accuracy are all above 90%, and its diagnostic efficiency is better than that of the logistic regression model established using the same parameters. This indicates that machine learning algorithms are more accurate and sensitive than conventional logistic regression analysis, which is consistent with the results of Pan et al. 35 Fernandez et al 36 also pointed out that the XGBoost method is more reliable, especially when the sample size is limited. Moreover, the seven key parameters of the risk prediction model we established in this study can be obtained at admission. Therefore, early detection of these parameters can help identify patients with severe COVID-19 so that they receive timely intervention and appropriate intensive care to minimize disease progression. There are some limitations to our study. Firstly, this is a retrospective, single-center study, which may lead to biased conclusions. Secondly, since we did not conduct external validation of the model, it is necessary to establish a prospective study cohort to further verify the model's accuracy. Finally, the data used to build the COVID-19 risk prediction model come entirely from China, so the model may not apply to other regions. In this study, seven characteristic variables, namely hs-CRP, PCT, age, Neuc, HGB, Neur, and PDW, were used to construct an XGBoost model and a logistic regression model.
The machine learning method improved the accuracy of risk stratification for patients with COVID-19 and could effectively assess the severity of patients with COVID-19. The diagnostic efficiency was better than that of the logistic regression method based on the same parameters. It is helpful for clinicians to identify critical patients in the early stage, but further verification is needed before our findings can be applied in clinical practice.

All data generated or analyzed during this study are available from the corresponding author upon reasonable request.

This study was approved by the Institutional Ethics Committee of Wuxi Fifth People's Hospital for retrospective analysis (No. 2021-001-1).

Feng Zhu and Meng Hua are co-corresponding authors. All authors made a significant contribution to the work reported, whether that is in the conception, study design, execution, acquisition of data, analysis and interpretation, or in all these areas; took part in drafting, revising or critically reviewing the article; gave final approval of the version to be published; have agreed on the journal to which the article has been submitted; and agree to be accountable for all aspects of the work.

The authors report no conflicts of interest in this work.

Clinical features of COVID-19-related liver functional abnormality
Liver diseases in COVID-19: etiology, treatment and prognosis
Factors associated with mortality in patients with COVID-19. A quantitative evidence synthesis of clinical and laboratory data
Laboratory findings and a combined multifactorial approach to predict death in critically ill patients with COVID-19: a Retrospective Study
The laboratory tests and host immunity of COVID-19 patients with different severity of illness
Prompt predicting of early clinical deterioration of moderate-to-severe COVID-19 patients: usefulness of a combined score using IL-6 in a Preliminary Study
Establishing a model for predicting the outcome of COVID-19 based on combination of laboratory tests
Simple nomogram based on initial laboratory data for predicting the probability of ICU transfer of COVID-19 patients: Multicenter Retrospective Study
Combined use of the neutrophil-to-lymphocyte ratio and CRP to predict 7-day disease severity in 84 hospitalized patients with COVID-19 pneumonia: a Retrospective Cohort Study
Role of biological data mining and machine learning techniques in detecting and diagnosing the novel coronavirus (COVID-19): a systematic review
Multi-criterion intelligent decision support system for COVID-19
Clinical and inflammatory features based machine learning model for fatal risk prediction of hospitalized COVID-19 patients: results from a Retrospective Cohort Study
Diagnosis and treatment protocol for novel coronavirus pneumonia (trial version 7)
Application of a prediction model with laboratory indexes in the risk stratification of patients with COVID-19
Boosted tree model reforms multimodal magnetic resonance imaging infarct prediction in acute stroke
Clinical and laboratory predictors of in-hospital mortality in patients with coronavirus disease-2019: a Cohort Study in Wuhan, China
Tocilizumab for the treatment of severe COVID-19 pneumonia with hyperinflammatory syndrome and acute respiratory failure: a Single Center Study of 100 patients in Brescia
Clinical and biochemical indexes from 2019-nCoV infected patients linked to viral loads and lung injury
COVID-19 autopsies
C-reactive protein enhances murine antibody-mediated transfusion-related acute lung injury
Procalcitonin levels predict infectious complications and response to treatment in patients undergoing cytoreductive surgery for peritoneal malignancy
Novel applications for serum procalcitonin testing in clinical practice
Procalcitonin-guided diagnosis and antibiotic stewardship revisited
Procalcitonin levels in COVID-19 patients
Prognostic value of interleukin-6, C-reactive protein, and procalcitonin in patients with COVID-19
Pathophysiology, transmission, diagnosis, and treatment of coronavirus disease 2019 (COVID-19): a review
Angiotensin-converting enzyme 2: SARS-CoV-2 receptor and regulator of the renin-angiotensin system: celebrating the 20th anniversary of the discovery of ACE2
Coronavirus disease 2019 (COVID-19) infection and renin angiotensin system blockers
Coronavirus disease 2019 in elderly patients: characteristics and prognostic factors based on 4-week follow-up
Risk factors of severe cases with COVID-19: a meta-analysis
Prediction models for diagnosis and prognosis of covid-19 infection: systematic review and critical appraisal
Novel coronavirus disease 2019 (COVID-19): relationship between chest CT scores and laboratory parameters
Laboratory features of severe vs. non-severe COVID-19 patients in Asian populations: a systematic review and meta-analysis

We thank Dr. Renyuan Li and his group for their technical guidance during the manuscript revision process.