key: cord-0844727-kuc03ejq
authors: Acheme, Ijegwa David; Vincent, Olufunke Rebecca
title: Machine-learning models for predicting survivability in COVID-19 patients
date: 2021-05-21
journal: Data Science for COVID-19
DOI: 10.1016/b978-0-12-824536-1.00011-3
sha: 54d55a5d32c7869250a74c66ce0c42490e552446
doc_id: 844727
cord_uid: kuc03ejq

COVID-19 is a disease currently ravaging the world, bringing unprecedented health and economic challenges to several nations. There are presently close to five million reported cases in over 200 countries, with fatalities numbering over 300,000 persons. This study presents machine-learning models for the prediction and visualization of the significant factors that determine the survivability of COVID-19 patients. It develops prediction models using decision tree, random forest (RF), gradient boosting, and logistic regression (LR) algorithms to identify the significant factors and predict the survivability of COVID-19 patients. The results of the simulation showed that the LR model had the lowest prediction accuracy. The other three models showed over 95% correct accuracy and indicated that the essential factors in determining patients' survivability were underlying health conditions and age. The findings of this study agree with the medical claims that patients with underlying health challenges and those advanced in age are liable to have complications, hence providing research-based credence to this belief. The proposed model thus serves as a decision support system for the management of COVID-19 patients, as well as predicting a patient's chances of survival at first presentation at the hospital.

An earlier study of breast cancer survivability used a dataset of 23 predictor variables and one dependent variable, "survival," which represents the survival of the patient. Prediction models were built using decision trees, RF, neural networks, logistic regression (LR), and support vector machines (SVMs). The models showed close outcomes in terms of accuracy, with decision trees giving the lowest accuracy of 79.8%, while RF gave an accuracy of 82.7%. Furthermore, the models revealed the most correlated variables, hence the most important in determining survivability: the stage of cancer, size of the tumor, number of axillary lymph nodes removed, number of positive lymph nodes, type of primary treatment, and method of diagnosis.

The study of Ref. [24] also presented a survivability model for breast cancer patients. The research utilized the Surveillance Epidemiology and End Results (SEER) dataset covering about 30 years and containing a total of 433,272 records of breast cancer incidences. After preprocessing to remove redundancies and missing fields, 202,932 records remained, which were classified into two groups of "survived" and "not survived." Machine-learning algorithms were then applied to predict the dependent field from the 16 predictor fields. The reported prediction of survivability was over 93% accurate.

Ref. [25] showed an approach for predicting survivability in malignancy. The main factor used for predicting survival time is the initially evolved tumor-incorporated clinical feature, which is a combination of tumor stage, tumor size, and age at diagnosis. The research utilized corresponding breast cancer datasets, which were integrated using document-oriented graph databases.
The applied machine-learning methods of linear support vector regression, lasso regression, kernel ridge regression, k-neighborhood regression, and decision tree regression showed promising results in terms of accuracy of survival time prediction. Ref. [26] presented a multimodel ensemble technique for lung, stomach, and breast cancer prediction. The ensemble technique utilized several deep learning-based classifiers for predicting cancer occurrence. Ref. [27] used clinical data of patients of the Iranian Center for Breast Cancer from 1997 to 2008. The dataset contained 1189 records, 22 predictor variables, and one outcome variable. They implemented three machine-learning models for the prediction of cancer in the patients: decision trees, SVM, and artificial neural networks (ANNs). The research objective was to compare the performance of these three well-known algorithms by sensitivity, specificity, and accuracy analysis. Comprehensive reviews of several machine-learning techniques that have been applied to disease prediction and survivability are found in Refs. [28, 29]. Other studies that have reported survivability prediction in known diseases using machine-learning methods are found in Refs. [12, 30–35].

This research deploys data science techniques using machine-learning classification algorithms trained on existing clinical data of COVID-19 cases to predict the survivability of patients, thereby leading to a better understanding of the factors most responsible for fatalities. ML, a branch of artificial intelligence, utilizes tools from statistics and probabilistic optimization to allow computers to learn from data and hence detect patterns that are hard to discern in noisy, complex, and large datasets; this capability has made ML models suitable for medical research, especially in applications that depend on complex proteomic and genomic measurements.

The paper is organized as follows: Section 1 presents related ML survivability models deployed to study other diseases. Section 2 presents the procedure for data collection, wrangling, preprocessing, and feature selection. Section 3 presents the results, Section 4 presents a discussion, and Section 5 concludes.

The duration of time that a patient has had the COVID-19 virus is essential to their survivability. This study presents a framework for the survival analysis of the COVID-19 pandemic. In this case, it is crucial to know what proportion of the COVID-19 population would be expected to survive the pandemic and at what rate. For the patients who are unable to survive the virus, it is essential to note the rate of death and what the underlying ailments might be. The particular circumstances and characteristics that increase or decrease the probability of survival are also of interest.

This study utilizes the dataset of COVID-19 cases in Nigeria as a case study, which is tallied daily by the Nigerian Center for Disease Control (NCDC). The study follows the well-known data science research methodology proposed by Ref. [36], which is illustrated in Fig. 16.1 and presents the steps of the survivability analysis. The ML models used in this study are decision tree, RF, LR, and gradient boosting classifiers, while the area under the receiver operator characteristic (ROC) curve, the F1 measure, and other established evaluation metrics for binary classification problems were used for evaluation. The data collected consisted of the fields presented in Table 16.1.
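As a concrete illustration of how such case records could be assembled for modeling, the following Python sketch loads a hypothetical export of the NCDC data and encodes the binary outcome; the file name and column names (e.g., "outcome", "underlying_condition") are illustrative stand-ins for the fields of Table 16.1, not the NCDC's actual schema.

```python
# Illustrative only: hypothetical file name and column names standing in for
# the fields of Table 16.1; not the NCDC's actual export format.
import pandas as pd

df = pd.read_csv("ncdc_covid19_cases.csv")

# Binary target used throughout the chapter: 1 = survived/discharged, 0 = died
df["survival"] = (df["outcome"] == "discharged").astype(int)

# Candidate predictors mentioned in the text
predictors = ["age", "gender", "marital_status", "race", "occupation",
              "education_level", "travel_history", "underlying_condition"]

print(df[predictors + ["survival"]].head())
```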
Exploratory data analysis was then carried out to discover hidden patterns and gain further insights from the data, leading to the removal of fields considered not very relevant to the prediction of survivability during feature selection. With the data cleaned and relevant features selected, the data was then split in a 70:30 ratio for training the chosen machine-learning algorithms and testing the models, respectively. The results of the models were then evaluated using the standard metrics of area under the ROC curve (AUC), F1 measure, and log loss.

The proposed COVID-19 survivability model comprises the following phases: data collection, data preprocessing, feature selection, building ML models, and comparative analysis of the models. Fig. 16.2 describes the stages. As shown in Fig. 16.2, datasets containing records of COVID-19 cases in Nigeria are collected from the Nigerian Center for Disease Control. The data are preprocessed and cleaned; next, exploratory data analysis is carried out to gain initial insights into the distributions of the variables. Four machine-learning models are then built for comparative analysis and decision support. The dataset after cleaning consisted of 1400 multivariate instances with attributes related to the patient's age, marital status, race, occupation, gender, education level, and employment status.

Data preprocessing is an iterative process for the transformation of raw data into understandable and usable forms. Raw datasets are usually characterized by incompleteness, inconsistencies, and errors, and may lack certain behaviors or trends [37]. Preprocessing is essential to handle missing values and address inconsistencies. In this work, the data gathering was carried out to avoid out-of-range values; impossible data combinations such as (Sex: Male, Pregnant: Yes) were handled, and missing values and redundancies were also treated during the data preprocessing stage, resulting in a more reliable and relevant dataset fit for knowledge discovery. Transforming data into suitable formats for a particular machine-learning problem is an essential consideration at the beginning of the project. The presence of irrelevant, redundant, noisy, and unreliable data significantly affects the model outcomes and knowledge discovery, making the training phase more difficult. The data preparation and filtering steps take the largest share of the time spent on an ML project, but they are worth it. The steps involved include cleaning, instance selection, normalization, transformation, feature extraction, and selection. The product of data preprocessing is the training set.

Feature selection is among the essential steps in a machine-learning project and is also referred to as variable or attribute selection, since the interest is in the most critical attributes that influence the predicted variable. A good selection of features ensures simplified models that allow more natural interpretation by researchers and users, shorter training times that save computational resources, the avoidance of the curse of dimensionality, and the avoidance of overfitting [38]. Since this process reduces the number of input variables used to develop the model, it lowers the computational cost of the model as well as increases the model's performance.
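The cleaning and 70:30 split described above can be sketched as follows; it continues from the hypothetical DataFrame and predictor list of the previous snippet, and the "pregnant" column is likewise an assumed field used only to illustrate removal of impossible combinations.

```python
# Sketch of the preprocessing and 70:30 split described in the text.
# `df` and `predictors` come from the previous snippet; `pregnant` is a
# hypothetical field used to illustrate impossible-combination handling.
import pandas as pd
from sklearn.model_selection import train_test_split

# Remove impossible combinations (e.g., Sex: Male, Pregnant: Yes) and duplicates
if "pregnant" in df.columns:
    df = df[~((df["gender"] == "male") & (df["pregnant"] == "yes"))]
df = df.drop_duplicates()

# Treat missing values in the fields kept for modeling
df = df.dropna(subset=predictors + ["survival"])

# One-hot encode categorical predictors so all models can consume them
X = pd.get_dummies(df[predictors], drop_first=True, dtype=int)
y = df["survival"]

# 70% for training, 30% for testing, stratified on the binary outcome
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)
```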
A statistical feature selection method was employed in this work, which involved evaluating the relationship between the target variable and the input variables and selecting the variables with the strongest correlation. The summary of the selected features is presented in Table 16.1.

In ML, artificial intelligence is applied through different statistical, probabilistic, and optimization tools, which learn from patterns in training data to classify new data presented after training [39]. ML techniques have been applied to statistical problems for the analysis and interpretation of data. However, ML extends statistical methods by the use of programming constructs such as Boolean logic, conditional statements (if-else), and conditional probabilities for optimization, classification, and clustering problems. The foundation of ML is firmly rooted in statistics and probability. Still, it offers more robust results, as it allows inferences and decisions to be drawn from models that may not be possible with conventional techniques [40, 41]. Statistical methods used in multivariate regression or correlation analysis, for example, assume variable independence, and such strict statistical models build linear combinations of those variables; they are therefore limited by the nonlinear, interdependent, and conditional variables characteristic of most biological systems. In these kinds of situations, ML models offer better results [42]. The success of a good ML model depends on an understanding of the problem and the data used, as well as of the assumptions and limitations of the chosen algorithms, since the best models depend on the quality of the training dataset [43]. Other problems are classified under the dimensionality of variables, overtraining, and overfitting of models [44].

The decision tree model was built through the following steps:

Step 2: Start treatment and record the changes to calculate the entropy (H) and information gain (IG) on the daily treatment of attribute S.
Step 3: Select the attribute with the smallest entropy or highest information gain.
Step 4: Split S to produce a subset of the data.
Step 5: Continue the iteration on each subset, utilizing only unused attributes.

The entropy E(S) measures the randomness of the information of the medical changes in the patients, and it is defined by

$$E(S) = -\sum_{i} P_i \log_2 P_i \tag{16.1}$$

In Eq. (16.1), S represents the current state of the patient and $P_i$ is the probability of survival for any event s of state S. The information gain is computed as

$$IG(S) = E(B) - \sum_{j=1}^{K} E(j,\,\text{after}) \tag{16.2}$$

In Eq. (16.2), B is the dataset before splitting, K is the number of subsets generated, and (j, after) is the jth subset after splitting.

RFs build on simple decision trees and hence comprise a number of separate decision trees operating as an ensemble system. In a RF model, each tree produces a prediction for a class, and the class with the majority of predictions becomes the final predicted value [46]. RFs seek to deploy the power in numbers: a very large collection of decision trees that are uncorrelated but operate together as a forest produces better results than any individual constituent tree. The total importance of a feature in a RF is thus the average over all the trees, such that

$$RFfi_i = \frac{\sum_{j=1}^{T} normfi_{ij}}{T} \tag{16.3}$$

In Eq. (16.3), $RFfi_i$ is the importance of feature i, $normfi_{ij}$ is the normalized importance of feature i in tree j, and T is the total number of trees.
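To make Eqs. (16.1)-(16.3) concrete, the sketch below computes the entropy of the survival labels and the information gain of a candidate splitting attribute (weighting each subset's entropy by its relative size, the usual reading of Eq. (16.2)), and then shows how an off-the-shelf RF exposes the tree-averaged feature importance of Eq. (16.3). It continues from the hypothetical DataFrame and split of the earlier snippets and is not the authors' code.

```python
# Sketch of the splitting criterion of Eqs. (16.1)-(16.2) and the RF feature
# importance of Eq. (16.3); hypothetical column names from earlier snippets.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def entropy(labels: pd.Series) -> float:
    """E(S) = -sum_i P_i * log2(P_i)  -- Eq. (16.1)."""
    p = labels.value_counts(normalize=True).to_numpy()
    return float(-(p * np.log2(p)).sum())

def information_gain(data: pd.DataFrame, attribute: str,
                     target: str = "survival") -> float:
    """Entropy before splitting minus the size-weighted entropies of the
    subsets produced by splitting on `attribute` -- Eq. (16.2)."""
    before = entropy(data[target])
    after = sum(len(sub) / len(data) * entropy(sub[target])
                for _, sub in data.groupby(attribute))
    return before - after

# Example: pick the attribute with the highest information gain
best = max(["occupation", "underlying_condition", "travel_history"],
           key=lambda a: information_gain(df, a))
print("Best splitting attribute:", best)

# Eq. (16.3): scikit-learn's RF reports, per feature, the normalized importance
# averaged across all trees in the ensemble via `feature_importances_`.
rf = RandomForestClassifier(n_estimators=500, random_state=42).fit(X_train, y_train)
print(dict(zip(X_train.columns, rf.feature_importances_)))
```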
LR builds on the generalized linear model framework of Nelder and Wedderburn, which provides a means of applying linear regression to problems that are not directly suited to it. It is a classification algorithm widely used for building predictive models that utilize probabilities. It can be seen as a linear regression model with an associated link function called the sigmoid or logistic function. This function maps predicted class values to probability values between 0 and 1. The generalized equation is given in Eq. (16.4):

$$g(E(y)) = \alpha + \beta x_1 + \gamma x_2 \tag{16.4}$$

where $g(\cdot)$ is the link function, $E(y)$ is the expectation of the predicted variable, and $\alpha + \beta x_1 + \gamma x_2$ is the linear combination of predictors.

Gradient boosting algorithms are machine-learning techniques for classification and prediction problems. Gradient boosting works as an ensemble and optimization of several weaker models, such as decision trees. This classifier comprises three elements: a loss function that is optimized, a weak learner such as a decision tree that makes predictions, and an additive model that adds weak learners so as to minimize the loss function.

The model development involved the use of the entire dataset comprising 1400 (n = 1400) records, which had eight predictors of the survival variable. The dataset was split in the ratio 70:30 for training and testing, respectively. The four chosen models were built using IBM Watson Studio, and each was evaluated with its accuracy, sensitivity, precision, F1 score, log loss, area under the receiver operating characteristic curve (AUC), and precision-recall curve.

The decision tree was implemented utilizing the entire dataset. It processed the input data and yielded the tree with the optimal result, with an accuracy of 95% correct prediction. The root node of the decision tree signified the most important variable; it is followed by decision nodes with percentages of classification. Fig. 16.7A shows the feature importance of the decision tree classifier. In building the RF model, 70% of the dataset was utilized for training. The RF model comprised independent trees, with the default number of trees set to ntree = 500 to assess the model accuracy; the final prediction using the testing dataset (30%) yielded over 96% correct prediction. Next was the LR model, in which the log-odds of the predicted variable (survivability) were modeled as a linear combination of all the predictor variables. LR is useful for predicting binary dependent variables; in this case, survivability is encoded in the dataset as 1 for alive and 0 for death. The LR model reported the lowest accuracy on the testing dataset. The gradient boosting classifier, which is an ensemble of boosted decision tree classifiers, reported the highest accuracy. In this work, the model was built by converting the training and testing data into xgboost's matrix format for training and evaluation. The gradient boosting algorithm appeared to be the most suitable model for the prediction of survivability in COVID-19 patients.

The four machine-learning models were built, trained, and evaluated using IBM Watson Studio's AutoAI tool on the IBM cloud. The complete dataset comprising eight predictor variables and one target variable was used to build the four machine-learning models. For the evaluation of the models, the average precision, area under the ROC curve, precision, recall, F1 measure, normalized Gini coefficient, and log loss were the metrics used.
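Since the gradient boosting model was built by converting the data into xgboost's matrix format, a minimal open-source sketch of that step (continuing from the hypothetical split in the earlier snippets, with illustrative hyperparameters rather than the authors' AutoAI settings) might look as follows.

```python
# Minimal xgboost sketch: data converted to DMatrix, a binary logistic booster
# trained, and held-out predictions scored. Hyperparameters are illustrative.
import xgboost as xgb
from sklearn.metrics import (roc_auc_score, f1_score, log_loss,
                             average_precision_score)

dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

params = {"objective": "binary:logistic", "eval_metric": "logloss",
          "max_depth": 4, "eta": 0.1}
booster = xgb.train(params, dtrain, num_boost_round=200,
                    evals=[(dtest, "test")], verbose_eval=False)

prob = booster.predict(dtest)      # predicted survival probabilities
pred = (prob >= 0.5).astype(int)   # hard class labels

print("AUC-ROC :", roc_auc_score(y_test, prob))
print("F1      :", f1_score(y_test, pred))
print("Log loss:", log_loss(y_test, prob))
print("Avg prec:", average_precision_score(y_test, prob))
```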
Exploratory data analysis is the process of initial exploration and investigation of the dataset to gain initial insights; in this way, patterns and anomalies can be discovered. The results are presented as summary statistics and graphical representations in Figs. 16.3-16.5. The age distribution shown in Fig. 16.3 reveals that the age brackets with the most infected cases were 50-55 and 60-70, while the minimum reported age was 15 and the maximum reported age was 89, indicating that the reported cases were well spread across the different ages in the population. Further exploration of the data (Fig. 16.4A) revealed that about 33% of the patients admitted were business owners and self-employed, about 33% were retired from active service, the student population made up about 11%, and the fully employed were about 22%. Furthermore, as shown in Fig. 16.4B, 59.6% of the reported cases had a travel history in the last three months, while 40.39% did not. Fig. 16.5 is the frequency distribution of the patients with underlying health conditions and the total number of reported survivors after admission for one month. Fig. 16.5A reveals that 86.32% of the total cases had no known health conditions, 4.89% suffered from diabetes, 2.28% suffered from hypertension, 3.26% suffered from asthma, and 1.95% suffered from diabetes and hypertension, while about 1.3% suffered from other heart diseases. Fig. 16.5 shows that about 87% of admitted cases survived and were discharged within one month of admission, while about 13% of the cases were fatal.

The results of the decision tree, RF, and gradient boosting classifiers showed over 95% prediction accuracy, while LR showed an accuracy of 78.6% (see Table 16.4). Furthermore, a comparison of the feature importance of each algorithm was investigated, as presented in Fig. 16.7, revealing that the survivability of COVID-19 patients depended mostly on underlying health issues, followed by age and occupation.

The performances of the models were evaluated with the AUC-ROC, F1 score, precision, and recall; these are summarized in Tables 16.3 and 16.4. The AUC-ROC, which is one of the most commonly used and reliable metrics, represents the extent or measure of separability, and it reveals the degree to which the models are capable of identifying classes. Higher values of AUC indicate better predictive accuracy. The ROC is plotted with the true positive rate on the y-axis against the false positive rate on the x-axis. These values are estimated by Eqs. (16.5)-(16.7), where TP represents true positives, FN false negatives, TN true negatives, and FP false positives. Fig. 16.6 is the AUC-ROC curve for the gradient boosting classifier.
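The rates from which the ROC curve is built can be computed directly from a fitted model's confusion matrix; the sketch below derives them for the hypothetical xgboost predictions of the previous snippet (the true positive rate on the ROC's y-axis, the false positive rate on its x-axis, and the resulting AUC).

```python
# True/false positive rates and ROC-AUC from the held-out predictions of the
# previous sketch (`y_test`, `pred`, `prob` are assumed from that snippet).
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve

tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()

tpr = tp / (tp + fn)            # true positive rate (sensitivity), ROC y-axis
fpr = fp / (fp + tn)            # false positive rate, ROC x-axis
acc = (tp + tn) / (tp + tn + fp + fn)
print(f"TPR={tpr:.3f}  FPR={fpr:.3f}  accuracy={acc:.3f}")

# Full ROC curve and its area, as plotted in Fig. 16.6 for the boosted model
fpr_curve, tpr_curve, thresholds = roc_curve(y_test, prob)
print("AUC-ROC:", roc_auc_score(y_test, prob))
```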
This study implemented machine-learning models using the COVID-19 dataset, as of April 29, 2020, from the Nigerian Center for Disease Control (NCDC) to identify the most important factors responsible for the survival of infected patients. Of the four chosen machine-learning models, three (the decision tree, RF, and gradient boosting algorithms) yielded prediction accuracies of over 95%, while LR yielded the lowest accuracy of 78.6%. The models also revealed the two most important factors that determine patients' survivability: underlying health conditions and the age of the patients. Patients' occupation and education ranked far below these top two, while gender, race, travel history, and marital status did not influence patients' survivability.

Considering the increasing need for predictive medicine and the rising dependence on models of ML and data science, this work presents this approach in the study of the current outbreak of the coronavirus, which has brought unprecedented difficulties and for which there is still no known cure or vaccine. The intent is to identify the most influential factors responsible for fatalities among patients (Fig. 16.7) while demonstrating the usability of clinical data as training datasets for different types of ML algorithms and comparatively analyzing their efficiencies. Since the objective of the research was to develop machine-learning models that predict survivability among COVID-19 patients using clinical data sourced from the NCDC, it is crucial to consider the efficiencies of the chosen algorithms. The performance of each algorithm is evaluated using the receiver operator characteristic (ROC) curve, the F1 score, average precision, and log loss, as summarized in Table 16.4. Furthermore, in terms of accuracy during testing with blinded datasets, the models showed promising reliability: the LR model reported the lowest accuracy (78.6%), followed by the decision tree classifier (95.5%) and the RF (96.4%), while the gradient boosting algorithm reported 99.3% correct prediction, making it the most reliable of all the models. One of the significant strengths of this work, therefore, was the use and comparison of different machine-learning classification algorithms to determine the model with the best performance. The accuracies of the four models on the sample of the dataset are presented in Table 16.4, and the feature importance of all the models is shown in Fig. 16.7.

The gradient boosting, RF, and decision tree models all indicated well-calibrated predictions, as their calibration curves were almost diagonal; this was not the case with the LR model. The COVID-19 clinical dataset appeared to be sufficiently reliable, as the calibration measures were close to the identity. The highest accuracy was found with the gradient boosting algorithm (99%). The training dataset, which is 70% of the entire dataset, was used to train and fit the variables. Once each model was fit using the training dataset, predictions were made using the testing dataset (30%). To avoid overfitting, training was stopped when the error on the validation dataset began to increase. The training set indicated an error rate of 0.4-0.5, while the testing data indicated an error rate of 0.1-0.3 during prediction. The summary of the models' outcomes (accuracies and performance metrics) is presented in Tables 16.3 and 16.4.

This study has presented a predictive model for the survivability of COVID-19 patients using ML, which is a distinction from disease diagnostic systems. Predicting survivability involves efforts toward determining the outcome after an individual has been infected, and this is helpful for a better understanding of the risk factors. In this study, we identified significant predictors of survival of COVID-19 patients using four machine-learning models trained with clinical data. This provides evidence-based information, and the system can hence serve as decision support for better understanding and individualizing the hospital management of COVID-19 patients to improve the survival rate. The research also compares and assesses the performance of four different machine-learning algorithms to determine the most efficient algorithm; the gradient boosting ML algorithm showed the best results when compared to the decision tree, RF, and LR models.
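Because the models in this chapter were built with IBM Watson Studio's AutoAI, the exact pipeline is not reproduced here; the following sketch shows one way the same four-algorithm comparison could be approximated with scikit-learn on the hypothetical split used in the earlier snippets (default hyperparameters, not the authors' settings).

```python
# Open-source approximation of the four-model comparison; not the AutoAI pipeline.
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

models = {
    "Decision tree": DecisionTreeClassifier(random_state=42),
    "Random forest": RandomForestClassifier(n_estimators=500, random_state=42),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Gradient boosting": GradientBoostingClassifier(random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    prob = model.predict_proba(X_test)[:, 1]
    pred = model.predict(X_test)
    print(f"{name:20s} accuracy={accuracy_score(y_test, pred):.3f} "
          f"AUC={roc_auc_score(y_test, prob):.3f}")
```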
These results reveal that ML methods can be effectively utilized in the prediction of survivability in diseases that depend on several factors, and they promise higher accuracies when compared to conventional statistical or expert-based systems. Furthermore, the study reveals the two most important variables for patients' survivability: underlying health conditions and age. These findings align with the long-held scientific view that patients with underlying health conditions are far less likely to survive such pandemic infections. Though these models showed high accuracies in prediction, further studies could consider extending the dataset to other continents and obtaining datasets from different countries and ethnicities. In such cases, environmental conditions and geopolitical factors could be considered to reveal other determinants of the survivability of COVID-19 patients based on ethnicity or geo-economic analysis.

References:
First case of 2019 novel coronavirus in the United States
The origin, transmission, and clinical therapies on coronavirus disease 2019 (COVID-19) outbreak: an update on the status
World Health Organization declares global emergency: a review of the 2019 novel coronavirus (COVID-19)
CRISPR-based surveillance for COVID-19 using genomically-comprehensive machine learning design, bioRxiv
World Health Organization
Real estimates of mortality following COVID-19 infection
Molecular immune pathogenesis and diagnosis of COVID-19
Probable pangolin origin of SARS-CoV-2 associated with the COVID-19 outbreak
The epidemiology and pathogenesis of coronavirus disease (COVID-19) outbreak
The Global Macroeconomic Impacts of COVID-19: Seven Scenarios
Temperature and Latitude Analysis to Predict Potential Spread and Seasonality for COVID-19
Applications of machine learning in cancer prediction and prognosis
Intelligent and effective heart disease prediction system using weighted associative classifiers
Prediction of Chronic Kidney Disease Using Random Forest Machine Learning Algorithm, Semantic Scholar
Intelligent Parkinson disease prediction using machine learning algorithms
Treatment selection for cancer patients: application of statistical decision theory to the treatment of advanced ovarian cancer
Using neural networks to diagnose cancer
Neural networks and diagnosis in the clinical laboratory: state of the art
SELDI-TOF-based serum proteomic pattern diagnostics for early detection of cancer
Detection of single and clustered microcalcifications in mammograms using fractals models and neural networks
Predicting survival in pulmonary arterial hypertension: insights from the registry to evaluate early and long-term pulmonary arterial hypertension disease management (REVEAL)
Predicting factors for survival of breast cancer patients using machine learning techniques
Predicting breast cancer survivability: a comparison of three data mining methods
Application of machine learning models for survival prognosis in breast cancer studies
A deep learning-based multi-model ensemble method for cancer prediction
Using three machine learning techniques for predicting breast cancer recurrence
Machine learning applications in cancer prognosis and prediction
A survey of machine learning based approaches for Parkinson disease prediction
Heart disease prediction system using naive Bayes
Disease Prediction by Machine Learning Over Big Data from Healthcare Communities
Machine learning models in breast cancer survival prediction
Machine learning approaches to predict 6-month mortality among patients with cancer
Stage-specific predictive models for breast cancer survivability
Machine-learning prediction of cancer survival: a retrospective study using electronic administrative records and a cancer registry
For data science
Advanced data preprocessing for intersites web usage mining
Feature selection for classification
Machine Learning
A Bayesian based system for evaluating customer satisfaction in an online store
Introduction to Machine Learning
Machine Learning, Neural and Statistical Classification
Supervised machine learning: a review of classification techniques
Types of machine learning algorithms
Induction of decision trees
Classification and regression by random forest