key: cord-323582-7y8pt72r
authors: Ahamad, Martuza; Aktar, Sakifa; Rashed-Al-Mahfuz; Uddin, Shahadat; Lió, Pietro; Xu, Haoming; Summers, Matthew A.; Quinn, Julian M.W.; Moni, Mohammad Ali
title: A Machine Learning Model to Identify Early Stage Symptoms of SARS-Cov-2 Infected Patients
date: 2020-06-20
journal: Expert Syst Appl
DOI: 10.1016/j.eswa.2020.113661
sha: 
doc_id: 323582
cord_uid: 7y8pt72r

The recent outbreak of the respiratory ailment COVID-19 caused by novel coronavirus SARS-Cov2 is a severe and urgent global concern. In the absence of effective treatments, the main containment strategy is to reduce the contagion by the isolation of infected individuals; however, isolation of unaffected individuals is highly undesirable. To help make rapid decisions on treatment and isolation needs, it would be useful to determine which features presented by suspected infection cases are the best predictors of a positive diagnosis. This can be done by analyzing patient characteristics, case trajectory, comorbidities, symptoms, diagnosis, and outcomes. We developed a model that employed supervised machine learning algorithms to identify the presentation features predicting COVID-19 disease diagnoses with high accuracy. Features examined included details of the individuals concerned, e.g., age, gender, observation of fever, history of travel, and clinical details such as the severity of cough and incidence of lung infection. We implemented and applied several machine learning algorithms to our collected data and found that the XGBoost algorithm performed with the highest accuracy (>85%) to predict and select features that correctly indicate COVID-19 status for all age groups. Statistical analyses revealed that the most frequent and significant predictive symptoms are fever (41.1%), cough (30.3%), lung infection (13.1%) and runny nose (8.43%). While 54.4% of people examined did not develop any symptoms that could be used for diagnosis, our work indicates that for the remainder, our predictive model could significantly improve the prediction of COVID-19 status, including at early stages of infection.

There has recently been a rapid spread of the novel SARS-CoV2 coronavirus (Gorbalenya et al., 2020), which gives rise to the respiratory disease COVID-19, as designated by the World Health Organization (WHO, 2020). The first human coronaviruses, 229E and OC43, were identified during the 1960s from human nasal secretions (Lippi & Plebani, 2020). Other individual virus types classified in this family have been distinguished (such as HCoV NL63 and HKU1) and are thought to arise from zoonotic infections (Huang et al., 2020) as they are endemic in various bat populations. Known coronavirus infections were originally viewed as causing innocuous human respiratory conditions that were not life-threatening. Serious and deadly respiratory disorders attributed to members of the beta-coronavirus subfamily emerged in the last twenty years, with severe acute respiratory syndrome (SARS) and Middle East respiratory syndrome (MERS). The SARS-CoV infections arose first in Foshan, China in 2002 and MERS-CoV in 2012 in Saudi Arabia (Zhavoronkov et al., 2020), both causing international alarm and containment efforts due to their rapid spread and high mortality rates. SARS and MERS were associated with mortality rates of 9.6% and 36%, respectively (Peeri et al., 2020), among diagnosed patients. These outbreaks identified coronavirus infections as a significant threat to human health with the potential to cause extreme and lethal respiratory tract infections in people, particularly if person-to-person infection occurs easily (Chan et al., 2020).

The development and spread of the novel coronavirus (Nishiura et al., 2020) causing COVID-19 has vastly outpaced the rate of vaccine and therapeutic development. Nevertheless, within weeks of the first observations of COVID-19 disease, the virus was isolated and characterised. One of the most significant SARS-CoV2 protein targets is a 3C-like protease for which the structure is already known. Much effort has been centred around re-purposing known clinically tested drugs and virtual screening for possible targets using protein structure data (Zhavoronkov et al., 2020). Priority has been given to the identification of infected individuals in order to isolate and (if necessary) treat them. Central to this is the use of clinical symptoms to optimise identification of infected individuals.

One of the earliest published studies (Tian et al., 2020) analysed 262 individuals confirmed as COVID-19 infected in Beijing, China, to determine their clinical and epidemiological characteristics, and found that respiratory and extra-respiratory transmission routes may explain the rapid spread of disease.

In February 2020, the noted case fatality rate for COVID-19 in Wuhan, China, was 1.4%. However, accurate global estimates are far more challenging due to the vastly different responses from country to country. For example, Italy showed a case fatality rate of 7.2% during March 2020 (Onder et al., 2020). This may partly reflect the demographic differences between nations, with 23% of the Italian population being over 65. However, even when stratified by age, infection rates remain higher in Italians over 70 years of age compared to China (Onder et al., 2020). This highlights the critical need for improved screening and prediction methods to stratify those at higher risk of infection in discrete populations in different nations. To this end, machine learning algorithms are ideally suited for improving patient stratification and can be widely and rapidly applied as needed during a pandemic.

In this study, we developed a machine learning methodology to identify the most important and significant clinical symptoms that predict true COVID-19 positive cases. We validated these predictions using COVID-19 patient data from seven provinces in China. The primary features of this machine learning approach are:

• Extraction of features from unstructured raw data (hospitalized patient information in text format) using string matching algorithms and use of this data to construct a processed dataset.

• Identification of the significant symptoms of COVID-19 patients by analyzing their association using five different machine learning approaches.

• Developing a comprehensive predictive model to predict COVID-19 positive patients among suspected and confirmed individuals.

• Analyzing the relationship between patient age and COVID-19 confirmation.

• Identifying patient travel history and measuring how it influences disease progression.

• Using statistical analysis to calculate the impact and contribution of particular patient features to COVID-19 diagnosis.

We collected raw hospital data, obtained through a GitHub repository (COVID-19-tracker, 2020). A record of a person's information is made available in anonymised form when that person has presented to hospitals and clinics for diagnosis and treatment. Our datasets contained data from 6,512 patients from seven different provinces (Anhui, Guangdong, Henan, Jiangsu, Shandong, Shanxi, and Zhejiang) in China. The original dataset was written in Mandarin Chinese; it was translated using Google Translator, and the translation was checked and validated by a native Chinese speaker and researcher (Haoming Xu) to confirm its accuracy.

With the spread of the novel coronavirus, related national epidemiological data have accumulated and become available for ML studies. However, much of this data was in the form of unstructured text information, which can be difficult to process. The data used here were collected from a study by a group at Beijing University's Big Data High-accuracy Center. They collected these datasets from the official channels of the national government websites (COVID-19-tracker, 2020). The dataset contains the following details: basic information regarding gender, age, habitual residence, work and Wuhan/Hubei contact history; and trajectory information covering time, place, transportation and events up to February 20, 2020. We extracted important features from the basic information (age, gender), symptoms (fever, cough, muscle soreness), diagnostic results (lung infection, radiographic imaging), prior disease/symptom history (pneumonia, diarrhea, runny nose) and some trajectory information (isolation treatment status, travel history) that are directly or indirectly related to COVID-19 disease.

The original Chinese datasets did not indicate, for all patients, which were suspected positive and which were confirmed. A suspected case is defined as a patient who developed symptoms and had contact with confirmed COVID-19 patients but was not confirmed as COVID-19 positive after diagnosis. A confirmed case is defined as a patient who was confirmed as positive for COVID-19 in a CDC-approved test report, or whom the doctors recorded as a confirmed case after diagnosis in the source dataset. The data contain patient symptoms in text format. For this reason, we identified the symptoms of every individual patient, and some trajectory information, by applying various string matching algorithms. In detail, we selected some keywords for each feature, matched those keywords against the text data, and extracted the features individually. Lastly, we generated our final dataset, which contained the following features (described in table-1): gender, age, fever, tussis (cough), rhinorrhoea (runny nose), pneumonia, lung infection [...] 1% missing values only in the gender and age fields, and the propensity for a data point to be missing in the gender and age fields was completely random, i.e., the data were Missing Completely at Random (MCAR): there is no relationship between whether a data point is missing and any values in the dataset. Thus we imputed the gender field with random values according to the male/female ratio of the total data and imputed age with random values within the interquartile range (IQR). In our dataset most of the values were binary, but the age field was an integer value, so feature scaling was applied to the age field using the standard scaling method. Standard scaling re-scales a feature so that it has a mean of 0 and a variance of 1 (Gupta, 2019). The new value $z$ of a feature value $x$ is calculated as $z = (x - \mu)/\sigma$, where $\mu$ and $\sigma$ are the mean and standard deviation of that feature. After those two steps, we obtained a structured, clean and preprocessed dataset.
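The preprocessing steps described above (keyword matching, IQR-based age imputation and standard scaling) might look roughly like the following minimal sketch. It is not the authors' code: the keyword lists, the helper names (`extract_features`, `impute_age`, `standard_scale`) and the sample records are all hypothetical, and English keywords stand in for the matching that was applied to the source text. Plain substring matching also ignores negations such as "no fever", which a real pipeline would need to handle.

```python
import random
import statistics

# Hypothetical keyword lists; the paper does not publish the keywords it used.
KEYWORDS = {
    "fever": ["fever", "febrile"],
    "cough": ["cough", "tussis"],
    "runny_nose": ["runny nose", "rhinorrhoea"],
}

def extract_features(text):
    """Return a 0/1 flag per feature using substring matching on lowercased text."""
    lowered = text.lower()
    return {name: int(any(k in lowered for k in kws)) for name, kws in KEYWORDS.items()}

def impute_age(ages, rng):
    """Replace missing ages (None) with a random integer inside the observed IQR."""
    known = sorted(a for a in ages if a is not None)
    q1, _, q3 = statistics.quantiles(known, n=4)
    return [a if a is not None else rng.randint(round(q1), round(q3)) for a in ages]

def standard_scale(values):
    """Apply z = (x - mean) / std so the result has mean 0 and variance 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

# Invented records for illustration: (free-text note, age or None if missing)
records = [
    ("Patient developed fever and dry cough after returning from Wuhan", 43),
    ("No symptoms reported; close contact with a confirmed case", None),
    ("Runny nose and mild cough", 61),
    ("Fever on admission", 29),
    ("Asymptomatic", 55),
]
rng = random.Random(0)
features = [extract_features(text) for text, _ in records]
ages = impute_age([age for _, age in records], rng)
scaled_ages = standard_scale(ages)
```

Seeding the random generator keeps the imputation reproducible, which matters when the imputed values feed into a later train/test comparison.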

Since identifying the most predictive symptoms is challenging at the early stages of disease, we used ML models to identify them. Our methodology is shown in figure 1. As indicated, we used the training datasets to train five ML algorithms, which are described below.

Decision Tree algorithms can be utilized for both classification and regression (Karim & Rahman, 2013). A decision tree uses a tree representation in which each internal node corresponds to an attribute, each branch corresponds to a value of that attribute, and each leaf corresponds to an output. The tree is built in a recursive manner. Consider a variable $X$ whose potential values have probabilities $p_1, p_2, \ldots, p_n$. The uncertainty of $X$ estimated from the observations is known as the entropy.

The entropy $H(X)$ is characterised as (Li et al., 2009):

$$H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i$$

The main idea of Decision Tree algorithms is to build a tree for the entire data and produce a unique output at every leaf. How well a given attribute separates the training set according to the target classification can be measured by a statistical property known as information gain. An attribute at a node with high information gain can split the training data to achieve improved classification accuracy. We can calculate the information gain of an attribute $A$, relative to a set of training data $S$, where $H$ is the entropy, as:

$$\mathrm{Gain}(S, A) = H(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|} H(S_v)$$

Here, the set of values of the attribute $A$ is defined as $\mathrm{Values}(A)$ and $S_v$ is the subset of $S$ for which the attribute $A$ has value $v$. For a particular node in the tree, information gain is calculated for all the attributes, and the attribute with the highest information gain is selected as the best attribute with which to split the data.
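As a rough illustration of how entropy and information gain select a splitting attribute, the following sketch computes both on a tiny invented dataset. The function names and the toy rows are assumptions made for this example and are not taken from the study.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy of the full set minus the size-weighted entropy of each subset S_v."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    remainder = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

# Toy data invented for illustration: fever separates the classes, travel does not.
rows = [
    {"fever": 1, "travel": 1},
    {"fever": 1, "travel": 0},
    {"fever": 0, "travel": 1},
    {"fever": 0, "travel": 0},
]
labels = ["confirmed", "confirmed", "suspected", "suspected"]

gains = {a: information_gain(rows, labels, a) for a in ("fever", "travel")}
best = max(gains, key=gains.get)  # attribute chosen for the split at this node
```

In this toy case the fever attribute removes all uncertainty about the label while travel removes none, so fever would be chosen as the split.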

Random Forest is an ensemble of classification and regression trees, each trained on a resampled dataset of similar size called a bootstrap, whose outputs are combined at the end for a more accurate result. The bootstraps are created by random resampling from the training dataset (Sarica et al., 2017). Random Forests typically perform better than a single tree, and the approach can work with large, higher-dimensional datasets with comparatively greater accuracy. The model is built with the following equations (Sing, 2019). Calculate the constant value and initialise the model:

$$F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma)$$

Here, $F(x)$ is a model, $(x_i, y_i)$ is a training set and $L(y, F(x))$ is a differentiable loss function.
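The bootstrap-and-combine idea behind Random Forest can be sketched as follows. This is a deliberately simplified illustration, not the configuration used in the study: each base learner here is a one-feature lookup table (a "stump") rather than a full decision tree, and the data, helper names and number of learners are invented.

```python
import random
from collections import Counter

def bootstrap(data, rng):
    """Draw len(data) items with replacement (one bootstrap sample)."""
    return [rng.choice(data) for _ in data]

def train_stump(sample, feature):
    """Toy base learner: predict the majority label seen for each value of one feature."""
    votes = {}
    for row, label in sample:
        votes.setdefault(row[feature], Counter())[label] += 1
    overall = Counter(label for _, label in sample).most_common(1)[0][0]
    table = {value: c.most_common(1)[0][0] for value, c in votes.items()}
    # Fall back to the overall majority label for feature values unseen in the sample
    return lambda row: table.get(row[feature], overall)

def forest_predict(stumps, row):
    """Combine the base learners by majority vote."""
    return Counter(stump(row) for stump in stumps).most_common(1)[0][0]

# Invented data: (features, label) pairs with label 1 = confirmed, 0 = suspected
data = [({"fever": 1, "cough": 1}, 1)] * 4 + [({"fever": 0, "cough": 0}, 0)] * 4

rng = random.Random(0)
stumps = [
    train_stump(bootstrap(data, rng), rng.choice(["fever", "cough"]))
    for _ in range(25)
]
prediction_symptomatic = forest_predict(stumps, {"fever": 1, "cough": 1})
prediction_asymptomatic = forest_predict(stumps, {"fever": 0, "cough": 0})
```

Because every learner sees a different resample and a randomly chosen feature, an individual learner can be wrong while the majority vote remains stable.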

Gradient Boosting Machine (GBM) is a learning algorithm based on fixed-size decision trees that combines many simple predictors (Biau et al., 2019). It builds the model in a stage-wise manner, as other boosting methods do, and generalises them by allowing optimisation of an arbitrary differentiable loss function. The ultimate objective of the GBM is to find a function $F(x)$ that minimises its loss function $L(y, F(x))$ through iterative back-fitting:

$$F^* = \arg\min_{F} E_{y,x}\left[L(y, F(x))\right]$$

By definition, a boosted predictive model is a weighted linear combination of the base learners:

$$F(x) = \sum_{m=1}^{M} \beta_m h(x; a_m)$$

where $h(x; a_m)$ is a base learner with parameters $a_m$ and $\beta_m$ is its weight.
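To make the stage-wise construction concrete, here is a minimal sketch of gradient boosting for squared loss on a single numeric feature. It rests on several assumptions that are not from the paper: the base learners are threshold stumps, every stage uses the same weight (a fixed `learning_rate`), and the data are invented. For squared loss the negative gradient is simply the residual, so each stage fits a stump to the current residuals.

```python
def fit_stump(xs, residuals):
    """Find the threshold split on a 1-D feature that best fits the residuals (least squares)."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the largest value so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def gradient_boost(xs, ys, n_stages=20, learning_rate=0.5):
    """Build F(x) = F_0 + learning_rate * sum of stumps, one stage at a time."""
    f0 = sum(ys) / len(ys)  # the constant that minimises squared loss
    stumps = []
    preds = [f0] * len(xs)
    for _ in range(n_stages):
        residuals = [y - p for y, p in zip(ys, preds)]  # negative gradient of squared loss
        h = fit_stump(xs, residuals)
        stumps.append(h)
        preds = [p + learning_rate * h(x) for p, x in zip(preds, xs)]
    return lambda x: f0 + learning_rate * sum(h(x) for h in stumps)

# Invented data: the target switches from 0 to 1 between x = 3 and x = 4
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
model = gradient_boost(xs, ys)
low, high = model(2), model(5)
```

Each stage halves the remaining residual in this toy case, so the predictions approach 0 and 1 as stages are added.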

Extreme Gradient Boosting (XGBoost) is another decision tree-based machine learning algorithm that uses a gradient boosting framework. It is a scalable end-to-end tree boosting system widely used in data science. XGBoost can solve real-world scale problems using comparatively few resources (Chen et al., 2016). Suppose a dataset consists of $n$ examples and $m$ features, $D = \{(x_i, y_i)\}$, where $|D| = n$, $x_i \in \mathbb{R}^m$ and $y_i \in \mathbb{R}$. The decision tree model then uses $K$ additive functions to predict the output (Chen et al., 2016):

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F}, \quad \mathcal{F} = \{ f(x) = w_{q(x)} \} \; (q: \mathbb{R}^m \to T,\; w \in \mathbb{R}^T)$$

where $q$ indicates the structure of each tree, which maps an example to the corresponding leaf index, and $T$ is the number of leaves in the tree. Each $f_k$ corresponds to an independent tree structure $q$ and leaf weights $w$.

Support Vector Machine (SVM) is one of the most well-known and flexible supervised machine learning algorithms. It is used for both regression and classification tasks, and is typically favoured for small and medium-sized datasets. The primary objective of SVM is to find the optimal hyperplane that linearly separates the data points into two classes by maximising the margin. The SVM helps to ensure the generalisation ability of the model, so it is widely used in different fields. The goal of the support vector machine algorithm is to find a hyperplane in $N$-dimensional space ($N$ being the number of features) that distinctly classifies the data points (Wei & Hui-Mei, 2014).

There are various assessment parameters in our approach, for example, precision, recall, F1-score, Log loss, and area under the ROC curve (AUC). These parameters are used to estimate our prediction accuracy.

• Precision: Precision is a valid choice of evaluation metric when we want to be very sure of our prediction. It measures the proportion of predicted positives that are true positives, that is, TP / (TP + FP). It therefore depends on the True Positive (TP) and False Positive (FP) values (Agarwal, 2019).

• Recall: Recall is another valid choice of evaluation metric when we want to capture as many positives as possible (Agarwal, 2019). It indicates the proportion of actual positives that are correctly classified, that is, TP / (TP + FN). True Positive (TP) and False Negative (FN) values are used to measure recall.

• F1 Score: The F1 score maintains a balance between the precision and recall of a classifier. It is a number between 0 and 1 and is the harmonic mean of precision and recall (Agarwal, 2019).

• Area Under the Curve (AUC): AUC is the area under the ROC curve and indicates how well the predicted probabilities of the positive class are separated from those of the negative class. The ROC curve is built from the true positive rate (TPR), which is the proportion of actual positives captured by the model (Agarwal, 2019).

• Log Loss: Log loss is the most important classification metric based on probabilities. Raw log-loss values are difficult to interpret, yet log loss is a good metric for comparing models. A lower log-loss value implies better predictions (Kiapour, 2018). The log-loss function is:

$$H(q) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log p(y_i) + (1 - y_i) \log\left(1 - p(y_i)\right) \right]$$

where $y$ is the label of the target variable, $p(y)$ is the predicted probability of the point for the target value, $N$ is the number of points and $H(q)$ is the calculated value of log loss.
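The evaluation metrics listed above can be computed from first principles as in the sketch below. The labels and predicted probabilities are invented for illustration, the 0.5 decision threshold is an assumption, and the function names are hypothetical. AUC is omitted here to keep the example short.

```python
from math import log

def confusion_counts(y_true, y_pred):
    """Count true positives, false positives and false negatives for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, fn

def precision_recall_f1(y_true, y_pred):
    """Precision = TP/(TP+FP), recall = TP/(TP+FN), F1 = their harmonic mean.
    Assumes at least one predicted positive and one actual positive."""
    tp, fp, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += y * log(p) + (1 - y) * log(1 - p)
    return -total / len(y_true)

# Invented labels (1 = confirmed) and predicted probabilities
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
probs = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2]
y_pred = [1 if p >= 0.5 else 0 for p in probs]  # assumed 0.5 threshold

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
loss = log_loss(y_true, probs)
```

With these toy values there are three true positives, one false positive and one false negative, so precision, recall and F1 all come out at 0.75.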

In this study, some statistical analysis was also performed using the Statistical Package for the Social Sciences (SPSS) software version 25.0 (IBM Corp., Armonk, NY). The median age of the individuals studied was 43 years (range 0 to 96 years) with an interquartile range (IQR) of 32 to 55 years, and 3,367 (51.6%) were male. Table-2 shows the association between COVID-19 confirmation and selected demographic information, including symptoms. We performed the Mann-Whitney U test on the age field and the Chi-square test on the remaining fields, and found that age, travel history and isolation treatment were significant among the demographic information, and that most of the symptoms, including fever, cough, runny nose, pneumonia and lung infection, were significant with p-value <0.001. Of the patients studied, 2,971 (45.6%) displayed some symptoms, and 49.3% of these symptomatic patients were confirmed cases. Fever was the most frequent symptom, seen in 2,675 patients (41.1%), whose body temperature was equal to or above 38 degrees centigrade. Some patients had fatigue, dizziness and headache with fever. Cough was the second most common symptom, with 1,975 (30.3%) affected patients. Some of these patients had a dry cough, and some had a cough with sputum. Radiographic, pulmonary or chest imaging results showed that 855 patients (13.1%) had a lung infection. Only 26 patients (0.4%) and 37 patients (0.57%) had muscle soreness and diarrhea, respectively. Travel history is another important issue in COVID-19 infection: 4,239 patients (65.1%) had travelled recently to one or more places in China or abroad. All patients were hospitalized for treatment, but only 1,413 of them (21.7%) received treatment in full isolation. Comparing suspected and confirmed patients according to whether they developed symptoms, we found that a much larger proportion of confirmed patients (1,466; 93.26%) developed symptoms than of suspected patients (1,505; 30.47%).
Among confirmed patients, 1,242 (79.01%) had fever, 1,188 (75.57%) cough, 502 (31.93%) runny nose, 402 (25.57%) pneumonia and 786 (50%) lung infection. Among suspected patients, by contrast, 1,433 (29.01%) had fever, 787 (15.93%) cough, 47 (0.95%) runny nose, 85 (1.72%) pneumonia and 69 (1.4%) lung infection, which is much lower than in confirmed patients.
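As a worked example of the Chi-square test on one symptom, the fever counts reported above can be arranged in a 2x2 table. This sketch is an illustration only: the group sizes are derived here from the reported percentages (1,466 symptomatic confirmed patients at 93.26% implies about 1,572 confirmed patients, leaving 4,940 suspected out of 6,512), the statistic is the plain Pearson form without continuity correction, and it is not known whether SPSS was run with the same settings.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat

# Group sizes derived from the reported percentages (an assumption, see above)
confirmed_total = 1572
suspected_total = 6512 - confirmed_total

# Fever counts as reported in the text
fever_confirmed, fever_suspected = 1242, 1433

stat = chi_square_2x2(
    fever_confirmed, confirmed_total - fever_confirmed,
    fever_suspected, suspected_total - fever_suspected,
)

CRITICAL_P001_DF1 = 10.828  # chi-square critical value for p = 0.001 with 1 degree of freedom
significant = stat > CRITICAL_P001_DF1
```

A statistic above the critical value corresponds to p < 0.001 for a table with one degree of freedom, which is the threshold quoted for the symptom associations.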

Figure-2 illustrates the total number of patients by age. In the age range of 25 to 65 years, the rate of individuals affected is high. In children and older adults, the affected rate is comparatively low. However, the death rate in older men is high.

Figure-3 indicates the frequency of each feature, with most patients displaying fever, cough, lung infection and/or pneumonia. Some patients had a recent travel history; others received treatment in isolation.

First, we developed a model for our application; figure 1 shows a pictorial representation of our research. We divided our workflow into different sections. The first section is data collection, which was described earlier. We prepared our dataset so that it could be used with different machine learning (ML) approaches. After preprocessing, we divided our dataset according to four criteria (Age 0-20, Age 21-60, Age 61-96 and Age 0-96). We then divided each dataset into two parts, one part (70%) for training and another part (30%) for testing, and applied the five machine learning algorithms to train our models. The dataset was fitted to the ML approaches using the Python programming language (Python 3) (Larose et al., 2019). The algorithms used included Decision Tree, Random Forest, XGBoost, Gradient Boosting Machine (GBM) and Support Vector Machine (SVM). We then analyzed the performance of the algorithms. For each algorithm, we calculated the accuracy on the test dataset. To validate the accuracy, we computed the confusion matrix, precision, recall, F1-score, AUC and log-loss values. We then determined the feature importance for every algorithm, calculating the coefficient values for each feature that is significant for COVID-19 patients. Finally, we identified the six most significant features (shown in table-5) that are strictly related to COVID-19 positive status.
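The age stratification and 70/30 split described in this workflow could be sketched as follows. The records are synthetic, the helper names are hypothetical, and the shuffling seed is an arbitrary choice; the paper does not state which library routine performed the split.

```python
import random

# The four age criteria named in the workflow (the last one covers the whole dataset)
AGE_GROUPS = {"0-20": (0, 20), "21-60": (21, 60), "61-96": (61, 96), "0-96": (0, 96)}

def split_by_age(records):
    """Group records into the four age criteria; the 0-96 group overlaps the other three."""
    return {
        name: [r for r in records if lo <= r["age"] <= hi]
        for name, (lo, hi) in AGE_GROUPS.items()
    }

def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and hold out test_fraction of them for testing."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Synthetic records for illustration: one patient per age from 0 to 96
records = [{"age": a, "confirmed": a % 2} for a in range(97)]

groups = split_by_age(records)
splits = {name: train_test_split(group) for name, group in groups.items()}
train_all, test_all = splits["0-96"]
```

Each age group gets its own training and test sets, so a model can be trained and scored separately per criterion as well as on the combined data.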

In our analysis, we found that every algorithm achieved an accuracy score of 88% (0.88) or above. The performance of the algorithms we used on the different datasets is described below.

• Age (0 to 20): In [...] The coefficient values of every feature consistently indicated that the most significant features for this age range were lung infection, cough, fever, travel history, and pneumonia.

• Age (0-96): The accuracy measurements in table-3 give the results for individuals in the age range 0 to 96 years. We observed that the GBM and SVM algorithms achieved the highest value, 0.97, using the precision evaluation metric. XGBoost, Random Forest and Decision Tree showed 0.93, 0.92 and 0.91, respectively. On the other hand, XGBoost gained the highest score, 0.91, using the recall evaluation metric. We also analyzed the same parameters using the whole dataset combined (age 0 to 96 years). We compared the combined outcomes with the individual outcomes and found a few variations between the age groups, although lung infection and cough were the most significant features for all age groups. In addition, fever and isolation treatment in the age group 0-20, fever and pneumonia in the age groups 21-60 and 61-96, and age, runny nose and pneumonia in the age group 0-96 were also significant alongside lung infection and cough.

Figure-4 shows the feature ranking according to coefficient values for each applied algorithm. Every algorithm found almost the same sequence of features for all the age groups.

From the above analysis, we also found that those who displayed a fever had body temperatures equal to or above 38 degrees centigrade. A small number of individuals also presented with chest tightness. Some patients had a cough with sputum or a dry cough, nasal congestion, fatigue, discomfort, pharyngeal discomfort, respiratory symptoms, shortness of breath, headache, dizziness, weakness or nausea, among other symptoms.

The development of the COVID-19 pandemic currently represents a dangerous threat to global health. The key to stopping this spread is the development of methods to identify infected individuals as early as possible. This can be challenging given the delay in symptom presentation; however, machine learning algorithms provide a promising approach to address this problem that can be rapidly and cheaply applied in a pandemic situation. In our study, we developed and tested a range of machine learning approaches and found the most significant clinical COVID-19 predictive features were (in descending order): lung infection, cough, pneumonia, runny nose, travel history, fever, isolation, age, muscle soreness, diarrhea, and gender. Our models were able to predict the stage of COVID-19 based on basic patient information (age and gender), travel and isolation, and clinical symptoms (including fever, cough, runny nose and pneumonia). The accuracy of our algorithms was highest for the age range 0-20 years, where the SVM algorithm reached 93% accuracy, but it was notable that the other algorithms performed almost as well, with greater than 85% accuracy. In the age range 21 to 60 years the situation was similar, with the highest accuracy of 90% achieved by XGBoost, and the others (e.g. SVM, Random Forest, GBM and Decision Tree algorithms) also performed well. In the age range of 61 to 96 years, XGBoost again achieved the highest accuracy, at 86%, while the others gave above 80% accuracy. As might be expected given similar results across different ages (indicating that the symptoms develop similarly in individuals of any age), this pattern was also seen when the whole range of 0 to 96 years was studied, where the prediction accuracy was also above 85%. Accordingly, we were able to rank the features that are of importance to the disease prediction.

According to the statistics, the median age was 43 years with IQR 32-55, and the patients were approximately half male and half female. Most of the patients presented with fever and cough, and radiographic chest imaging results indicated that around 50% of confirmed patients had one or both lungs affected by the infection. Among suspected patients, 29.01% had a fever, whereas 79.01% of confirmed patients had a fever and 75.57% had a cough. Travel history was notable for being one of the major features associated with COVID-19 infection, as would be expected with 65.1% of patients having recently travelled a long distance. Some other symptoms were also related to COVID-19 status but were less commonly seen, including muscle soreness and diarrhea; these features, particularly diarrhea, were much more prominent in the earlier SARS epidemic. However, it is striking that 6.74% of the confirmed COVID-19 positive patients and 69.53% of the suspected patients did not develop any type of symptoms. As these patients cannot be detected or predicted by symptoms alone, our machine learning approach is of no use for assessing these people, although it is possible that they may have other factors that may lend themselves to detection in this way. However, the importance of particular social factors is likely to vary over time; notably, foreign travel may come to be less critical as local community transmission becomes the most common form of infection. Contact with infected individuals remains an excellent predictor, but this relies on rigorous contact tracing and social network analysis. The Mann-Whitney U test and chi-square tests indicated that all the features were significant except muscle soreness and diarrhea. These significant symptoms matched the findings from our machine learning analysis.

We implemented machine learning algorithms on different clinical features of patients with COVID-19 infections in a new dataset from mainland China and utilized different classifiers to examine information criteria and assess performance. Our ability to predict the probability and course of COVID-19 infection will improve the capacity of doctors to identify infected patients at an early stage by utilizing predictive clinical features. Some of the classifiers did not, however, show reliable outcomes, presumably because, while they demonstrated high accuracy, they produced biased results for these datasets. The size of the COVID-19 dataset was probably not large enough to give sufficient statistical power to resolve these issues. In future studies, using much larger datasets, we will have improved capacity to circumvent these limitations and further improve our predictive accuracy.

Accelerated gradient boosting

A familial cluster of pneumonia associated with the 2019 novel coronavirus indicating person-to-person transmission: a study of a family cluster

XGBoost: A Scalable Tree Boosting System

The species Severe acute respiratory syndrome-related coronavirus: classifying 2019-nCoV and naming it SARS-CoV-2

Clinical features of patients infected with 2019 novel coronavirus in Wuhan

Decision Tree and Naïve Bayes Algorithm for Classification and Generation of Actionable Knowledge for Direct Marketing

Bayes, E-Bayes and Robust Bayes Premium Estimation and Prediction under the Squared Log Error Loss Function

Data science using Python and R

Uncertain data decision tree classification algorithm

Procalcitonin in patients with severe coronavirus disease 2019 (COVID-19): A meta-analysis

The Extent of Transmission of Novel Coronavirus

Case-Fatality Rate and Characteristics of Patients Dying in Relation to COVID-19 in Italy

The SARS, MERS and novel coronavirus (COVID-19) epidemics, the newest and biggest global health threats: what lessons have we learned?

Random Forest Algorithm for the Classification of Neuroimaging Data in Alzheimers Disease: A Systematic Review

Characteristics of COVID-19 infection in Beijing

An improved GA-SVM algorithm

Estimating clinical severity of COVID-19 from the transmission dynamics in Wuhan, China

Potential 2019-nCoV 3C-like Protease Inhibitors Designed Using Generative Deep Learning Approaches

The 5 Classification Evaluation metrics every Data Scientist must know

Naming the coronavirus disease (COVID-19) and the virus that causes it

Mathematics behind Random forest and XGBoost