key: cord-0859563-mqw7adpl authors: Josephus, Bernhard O.; Nawir, Ardianto H.; Wijaya, Evelyn; Moniaga, Jurike V.; Ohyver, Margaretha title: Predict Mortality in Patients Infected with COVID-19 Virus Based on Observed Characteristics of the Patient using Logistic Regression date: 2021-12-31 journal: Procedia Computer Science DOI: 10.1016/j.procs.2021.01.076 sha: 37f45c24e44b0eb8382ac84a555427fa7546a567 doc_id: 859563 cord_uid: mqw7adpl

The spread of COVID-19 has thrown the world into disorder. At the time of writing, 5,235,452 cases have been confirmed worldwide, with 338,612 deaths. One way to predict mortality risk is a machine learning algorithm based on medical features, but obtaining those features takes time. Therefore, in this study, a Logistic Regression model is trained on 114 records and used to predict patient mortality from nonmedical features. The model can help hospitals and doctors identify patients with a high probability of death and triage patients, especially when the hospital is overrun. The model achieves an accuracy of more than 90%. Further analysis found that age is the most important predictor of patient mortality. Using this model, the death rate caused by COVID-19 could be reduced.

The new member of the coronavirus family that causes COVID-19 is a zoonotic virus that spreads from animals to humans. COVID-19 first appeared in Wuhan City, China, at the end of 2019 and quickly spread to various countries, including Indonesia [1,2]. At the time of writing, 5,235,452 people worldwide have been infected, 2,072,768 have recovered, and 338,612 have died.
In Indonesia's case, 21,745 people have been infected, 5,249 have recovered, and 1,351 have died [3]. Given how quickly the virus spreads, scientists around the world have been collecting clinical features from many infected patients. A machine learning-based prognostic model built on clinical data from Wuhan concluded that three key clinical features, lactic dehydrogenase (LDH), lymphocyte, and high-sensitivity C-reactive protein (hs-CRP), play a major role in the survival of severe COVID-19 patients [4]. However, these clinical features require a doctor to obtain an appropriate result: only a doctor can examine the patient and determine the severity using the three features.

Artificial Intelligence (AI) has been shown to be an effective tool for predicting medical conditions [5]. In this study, a predictive algorithm based on AI is used to predict the death of COVID-19 patients. The algorithm uses observed characteristics of the patients to predict mortality risk. With the help of the algorithm, a hospital can triage patients appropriately, especially when it is overcrowded.

The algorithm proposed in this paper is Logistic Regression. Logistic regression is a mathematical model that describes the relation between one or more independent variables and a qualitative dependent variable with two or more categories. If the dependent variable has two categories, the model is called a binary logistic regression model. If the dependent variable has more than two categories, the model is called a multinomial or ordinal logistic regression model [6, 7]. Other modelling approaches are possible, but the most popular is the logistic model, which is estimated by maximum likelihood [7]. What makes it so popular is the logistic function, the mathematical form on which the model is based, which is extremely flexible and easy to use [8].
Logistic Regression has been used in several studies related to COVID-19: to predict the total number of people with COVID-19 [9], to model the spread of COVID-19 in China [10], and to predict the trend of the COVID-19 epidemic [11]. The XGBoost machine learning algorithm has also been used to predict mortality risk from LDH, lymphocyte, and hs-CRP [4]. Obtaining these three features requires patients to undergo medical tests. Therefore, this paper uses Logistic Regression to predict mortality risk from several nonmedical features. The goal of this paper is to create a binary logistic regression model that can accurately predict mortality risk in COVID-19 patients, to help hospitals and medical facilities prioritize the patients with the highest probability of death and give them appropriate treatment. We hope this study can help minimize mortality due to COVID-19 and make more people aware of the disease, especially those whom our model classifies as having a high probability of death.

Assume there is one independent variable, $X$, and one dependent variable, $Y$, that has two categories. Let $\pi(x) = P(Y = 1 \mid X = x) = 1 - P(Y = 0 \mid X = x)$. The logistic regression model is

$$\pi(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}} \qquad (1)$$

The extended model in (2) applies to multiple binary logistic regression with $p$ independent variables $x_1, \dots, x_p$:

$$\pi(\mathbf{x}) = \frac{e^{\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p}}{1 + e^{\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p}} \qquad (2)$$

Here $\beta_0, \beta_1, \dots, \beta_p$ are the parameters of the model. The parameters are estimated by maximum likelihood. The first step is to define the likelihood function. Assume there are binomial random variables $Y_1, Y_2, \dots, Y_n$ with observed values $y_i \in \{0, 1\}$ and predictor vectors $\mathbf{x}_i = (x_{i1}, \dots, x_{ip})$. As the observations are assumed to be independent, the likelihood function for these random variables is

$$l(\beta) = \prod_{i=1}^{n} \pi(\mathbf{x}_i)^{y_i} \left[1 - \pi(\mathbf{x}_i)\right]^{1 - y_i} \qquad (3)$$

Taking the natural logarithm of $l(\beta)$ gives the log-likelihood

$$L(\beta) = \ln l(\beta) = \sum_{i=1}^{n} \left\{ y_i \ln \pi(\mathbf{x}_i) + (1 - y_i) \ln\left[1 - \pi(\mathbf{x}_i)\right] \right\}$$

Differentiating $L(\beta)$ with respect to $\beta_0, \beta_1, \dots, \beta_p$ and setting the results equal to zero gives

$$\sum_{i=1}^{n} \left[ y_i - \pi(\mathbf{x}_i) \right] = 0, \qquad \sum_{i=1}^{n} x_{ij} \left[ y_i - \pi(\mathbf{x}_i) \right] = 0, \quad j = 1, 2, \dots, p$$
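Because the derivatives of the log-likelihood are nonlinear in the coefficients, setting them to zero is usually done numerically. The following is a minimal sketch of iteratively reweighted least squares (which coincides with Newton-Raphson for this model), written with NumPy. The function name `fit_logistic_irls`, the synthetic data, and the coefficients used to generate it are illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def fit_logistic_irls(X, y, n_iter=25, tol=1e-8):
    """Estimate binary logistic regression coefficients by IRLS.

    Hypothetical helper for illustration; returns [beta_0, beta_1, ...].
    """
    X = np.column_stack([np.ones(len(X)), X])  # prepend the intercept column
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-(X @ beta)))  # fitted probabilities pi(x_i)
        w = pi * (1.0 - pi)                     # IRLS weights
        score = X.T @ (y - pi)                  # gradient of the log-likelihood
        info = X.T @ (X * w[:, None])           # X' W X (negative Hessian)
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:          # stop once the update is tiny
            break
    return beta


# Synthetic, non-separable data (assumed for the demo only).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
true_prob = 1.0 / (1.0 + np.exp(-(-0.5 + 1.5 * x)))
y = rng.binomial(1, true_prob)

beta_hat = fit_logistic_irls(x, y)
print("estimated coefficients:", beta_hat)
```

At convergence the gradient `X.T @ (y - pi)` is numerically zero, which is exactly the condition obtained by setting the derivatives of the log-likelihood to zero. If the two classes are perfectly separable, the estimates diverge and this simple loop is not appropriate.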
The resulting equations have no closed-form solution, so iterative methods, for example the Iterative Weighted Least Squares method, are required to obtain the estimated coefficients.

A chi-square test for independence is used to check whether two categorical variables are independent. The test makes use of contingency tables, i.e. tables whose cells correspond to cross-classifications of attributes or events. The hypotheses are $H_0$: the two variables are independent of each other, and $H_1$: the two variables are not independent. The chi-square test statistic for this test is

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

The count of the elements in cell $(i, j)$, that is, the cell in row $i$ and column $j$ (where $i = 1, 2, \dots, r$ and $j = 1, 2, \dots, c$), is denoted by $O_{ij}$. $E_{ij}$ is the expected count in cell $(i, j)$, defined by

$$E_{ij} = \frac{R_i C_j}{n}$$

where $R_i$ and $C_j$ are the total count for row $i$ and the total count for column $j$, and $n$ is the total count over all cells.

Analysis of Variance (ANOVA) is a statistical method for determining whether differences exist among several population means. In one-way ANOVA, the total variation is partitioned into variation due to differences among the groups and variation due to differences within the groups, in order to determine possible differences among the group means. The within-group variation (SSW) measures random variation. The among-group variation (SSA) measures differences from group to group. The test statistic is

$$F = \frac{SSA / (c - 1)}{SSW / (n - c)}$$

where $n$ is the number of values in all groups and $c$ is the number of groups. The hypotheses for ANOVA are $H_0: \mu_1 = \mu_2 = \dots = \mu_c$ and $H_1$: not all $\mu_j$ are equal $(j = 1, 2, \dots, c)$.

The dataset used in this paper was obtained from Kaggle and contains a total of 1085 cases and 25 features, including gender, age, and location, recorded from January 20th to February 25th, 2020. Figure 1 shows how the data pre-processing is done. First, unnamed and uninformative columns such as id, summary, source, and link are removed. Records with no target value are then removed from the dataset, leaving 222 records.
To deal with missing values, any column with more than 60% missing values is removed, any row with a missing categorical value is removed, and data imputation is performed. Next, the date of symptom onset and the date of the hospital visit are combined into a single feature, the time gap between symptom onset and hospitalization. Records with a negative hospitalization time gap (meaning the hospital visit date preceded the symptom onset date) are removed. This process leaves 114 valid records.

Feature selection is useful for building simpler and more comprehensible models, improving data-mining performance, and preparing clean, understandable data [12]. Available methods include the ANOVA test and the chi-square test. Because the dataset contains both numerical and categorical features, both methods are used. Each method gives a p-value for every feature, which is compared with a significance level. The initial features are Country, Gender, Age, Hospitalization Time Gap, From Wuhan, and Visit Wuhan. First, the chi-square test is used to measure the relation between each categorical feature and the categorical target [13]; here the categorical features are Country, Gender, From Wuhan, and Visit Wuhan. Next, ANOVA is used to measure the relation between each numerical feature and the categorical target; here the numerical features are Age and Hospitalization Time Gap. Table II shows the p-value of each numerical feature. Using the same significance level and hypothesis as in the chi-square test, all numerical features are retained. Combining all the selected features, the final features are Age, Hospitalization Time Gap, Country, Visit Wuhan, and From Wuhan.

The final dataset is used to train a Logistic Regression model with the LogisticRegression class available in Python (sklearn).
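The feature screening described above can be sketched as follows, using `chi2_contingency` and `f_oneway` from SciPy. This is only an illustration under stated assumptions: the toy records, the column names, the helper names `chi_square_p` and `anova_p`, and the significance level of 0.05 are all hypothetical and are not taken from the paper or its dataset.

```python
import pandas as pd
from scipy.stats import chi2_contingency, f_oneway

# Hypothetical toy records, NOT the paper's data.
df = pd.DataFrame({
    "age": [25, 34, 41, 52, 58, 63, 67, 70, 74, 81, 29, 45],
    "from_wuhan": ["no", "no", "yes", "no", "yes", "yes",
                   "yes", "no", "yes", "yes", "no", "no"],
    "death": [0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0],
})
ALPHA = 0.05  # assumed significance level; the paper's value is not given here


def chi_square_p(frame, feature, target):
    """p-value of the chi-square test of independence (categorical feature)."""
    table = pd.crosstab(frame[feature], frame[target])  # contingency table
    # Note: SciPy applies Yates' continuity correction to 2x2 tables by default.
    _, p_value, _, _ = chi2_contingency(table)
    return p_value


def anova_p(frame, feature, target):
    """p-value of one-way ANOVA across the target classes (numerical feature)."""
    groups = [g[feature].to_numpy() for _, g in frame.groupby(target)]
    _, p_value = f_oneway(*groups)
    return p_value


p_values = {
    "from_wuhan": chi_square_p(df, "from_wuhan", "death"),
    "age": anova_p(df, "age", "death"),
}
selected = [name for name, p in p_values.items() if p < ALPHA]
print(p_values)
print("selected features:", selected)
```

A feature is kept when its p-value falls below the chosen significance level, that is, when the null hypothesis of independence (chi-square) or of equal group means (ANOVA) is rejected.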
The dataset is randomly split into a training set and a test set with a 70:30 ratio: 70% of the final dataset is used for training and 30% for testing. Because the dataset is small, the liblinear solver is used for training. The model is then evaluated using precision, recall, F1 score, the confusion matrix, and the area under the Receiver Operating Characteristic (ROC) curve.

Out of 1085 cases, 114 records are selected, with Age, Hospitalization Gap, Location, and Country as the predictors. The final data comprise 80 patients who recovered and 34 patients who died. The training set is randomly drawn and contains 70% of the data; the remaining 30% is used for testing. The performance evaluation after testing is shown and discussed below.

Figure 2 presents the confusion matrix for the data. The figure shows that the model correctly predicts 100% of patient deaths and 96% of patient survivals, as only 1 case is missed. Precision and recall are also used to evaluate performance. Table III shows the precision, recall, F1 score, and corresponding support for each class. The precision score for death is the only one lower than 0.9, while the scores for survival, accuracy, macro average, and weighted average are all larger than 0.9. This indicates good performance of the model.

While the results above use only one threshold for the evaluation, the ROC curve plots the True Positive Rate on the y-axis against the False Positive Rate on the x-axis over a range of thresholds. To construct a ROC curve, the True Positive Rate (TPR) and False Positive Rate (FPR) are calculated for each threshold with the following formulas:

$$TPR = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{FP + TN}$$

All the quantities used in the formulas (True Positive, False Negative, False Positive, True Negative) can be obtained by constructing a confusion matrix, as in Figure 2. Using the roc_curve function in Python, the TPR, FPR, and thresholds are calculated automatically.
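The training and evaluation steps described above can be sketched with scikit-learn as follows. This is a minimal illustration, not the authors' code: the synthetic data, the two stand-in predictors, the random seeds, and the `stratify` argument (added only to guarantee that both classes appear in the toy test set) are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, classification_report, confusion_matrix, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 114 records (assumed, NOT the paper's data).
rng = np.random.default_rng(42)
n = 114
age = rng.uniform(20, 90, size=n)
gap = rng.integers(0, 15, size=n).astype(float)  # hospitalization time gap, days
logit = 0.12 * (age - 62) + 0.10 * (gap - 5)
death = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))
X = np.column_stack([age, gap])

# 70:30 random split; stratify is an added safeguard for this toy data.
X_train, X_test, y_train, y_test = train_test_split(
    X, death, test_size=0.30, random_state=0, stratify=death
)

# liblinear solver, as stated in the text for small datasets.
model = LogisticRegression(solver="liblinear")
model.fit(X_train, y_train)

# Single-threshold evaluation: confusion matrix, precision, recall, F1.
y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()
tpr_default = tp / (tp + fn)  # TPR = TP / (TP + FN)
fpr_default = fp / (fp + tn)  # FPR = FP / (FP + TN)
report = classification_report(y_test, y_pred, target_names=["survived", "died"])

# Threshold sweep: ROC curve and the area under it.
y_score = model.predict_proba(X_test)[:, 1]  # predicted probability of death
fpr, tpr, thresholds = roc_curve(y_test, y_score)
roc_auc = auc(fpr, tpr)

print(cm)
print(report)
print("AUC:", roc_auc)
```

The arrays `fpr` and `tpr` can be passed directly to a plotting library to draw the ROC curve. On a test set this small, every metric rests on a few dozen cases, so scores on real data should be read with that in mind.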
The calculated TPR and FPR values are then plotted to form the ROC curve, shown in Figure 3. The quantity of interest is the area under the curve, which ranges from 0.0 to 1.0. The higher the score, the better the model. Surprisingly, the area under the curve of the trained model is 1.0, which means the model predicts the mortality risk of COVID-19 patients very well.

Feature importance is used to find out which features matter most in predicting the outcome. Figure 4 shows the calculated feature importance score of each feature, from the highest to the lowest. Age appears to be the most important feature in predicting patient survival, followed by hospitalization time gap, from Wuhan, country, and visit Wuhan.

The mortality risk of COVID-19 patients can be predicted by a Logistic Regression model with Age, Hospitalization Time Gap, From Wuhan, Country, and Visit Wuhan as the predictors. The model developed here has shown good performance on all metrics. It can help hospitals prioritize the patients most in need and reduce the mortality rate. However, the predictive model was trained on only a small amount of data, which may limit its ability to recognize patterns. Future studies should gather more data for training.

Laboratory Testing for Coronavirus Disease 2019 (COVID-19) in Suspected Human Cases: Interim Gui. World Health Organization
Clinical Findings in a Group of Patients Infected with the 2019 Novel Coronavirus (SARS-Cov-2) Outside of Wuhan, China: Retrospective Case Series
Prediction of Survival for Severe Covid-19 Patients with Three Clinical Features: Development of a Machine Learning-Based Prognostic Model with Clinical Data in Wuhan
Predicting Mortality Risk in Patients with COVID-19 Using Artificial Intelligence to Help Medical Decision-Making. medRxiv
Logistic Regression and Growth Charts to Determine Children Nutritional and Stunting Status: A Review
Logistic Regression Using SAS: Theory and Application
Applied Logistic Regression
Estimation of the final size of the coronavirus epidemic by the logistic. medRxiv
Logistic Growth Modelling of COVID-19 Proliferation in China and Its International Implications
Feature Selection: A Data Perspective
Chi-Square Test is Statistically Significant: Now What? Practical Assessment, Research, and Evaluation