key: cord-0814040-1fp3hx67
authors: Podder, Prajoy; Bharati, Subrato; Mondal, M. Rubaiyat Hossain; Kose, Utku
title: Application of machine learning for the diagnosis of COVID-19
date: 2021-05-21
journal: Data Science for COVID-19
DOI: 10.1016/b978-0-12-824536-1.00008-3
sha: c381c580ead5ca0c4dfae000bcec32f5716b6f79
doc_id: 814040
cord_uid: 1fp3hx67

This chapter focuses on the application of machine learning algorithms to the diagnosis of the novel coronavirus disease (COVID-19). First, data visualization is provided on increases in confirmed cases, deaths, and recovered cases of COVID-19 using currently available data from Johns Hopkins University. Next, machine learning algorithms are used for the automatic diagnosis of COVID-19. Data-driven diagnosis is performed using a dataset of 5644 samples with 111 attributes provided by Hospital Israelita Albert Einstein, Brazil. As a preprocessing step, null values and categorical data are processed and standardization is performed. Next, feature selection is performed to find the attributes that are most important for a COVID-19 diagnosis. A number of algorithms including random forest, logistic regression, XGBoost, and decision tree are considered and their parameters are optimized. The performance of the classification algorithms is evaluated in terms of testing accuracy, precision, recall, miss rate, receiver operating characteristic curve, and area under the receiver operating characteristic curve. Experimental results show that serum glucose is the most influential attribute in predicting COVID-19. Our results also show that for the case of cross-validation, XGBoost has the highest accuracy of 92.67% and logistic regression has the second highest accuracy of 92.58%, whereas both XGBoost and logistic regression have a value of 93% for precision, recall, and F1 score. Moreover, for the case of the holdout method with 20% testing data, logistic regression with an accuracy of 94.06% outperforms the other classifiers in terms of accuracy, precision, recall, and F1 score.

continentally reported cases for which confirmed deaths and recovered or active cases are discussed. The mortality rate per 100 people is also described. Fig. 9.1 shows that the number of confirmed cases is the highest on the European continent (11, 65, and 661, respectively). The mortality rate is also the highest in Europe (9.55%). Fig. 9.2 depicts the top 10 countries by confirmed COVID-19 cases, recovered cases, deaths, and active cases.

This dataset [12] was generated from patients at the Hospital Israelita Albert Einstein in São Paulo, Brazil. The samples were collected anonymously from patients who underwent the SARS-CoV-2 reverse transcriptase polymerase chain reaction test and additional laboratory tests. The data were standardized so that each attribute has a mean of zero and a standard deviation of unity.
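The standardization described above is the usual z-score transform, z = (x - mean) / std, applied per attribute. A minimal sketch with scikit-learn's StandardScaler follows; the array values are hypothetical and are not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical laboratory values: rows are samples, columns are attributes.
X = np.array([[4.2, 130.0],
              [5.1, 145.0],
              [3.9, 120.0],
              [4.8, 150.0]])

# Each column is centered on its mean and divided by its standard deviation.
scaler = StandardScaler()
X_std = scaler.fit_transform(X)

col_means = X_std.mean(axis=0)  # approximately 0 for every column
col_stds = X_std.std(axis=0)    # approximately 1 for every column
print(col_means, col_stds)
```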

The dataset has 5644 rows and 111 columns. The dataset is imbalanced. There are a number of missing values in the data samples. For this, features with more than 99.8% of null values in positive cases are dropped because they are unlikely to contribute to the prediction of COVID-19. Fig. 9.3 shows that the dataset was originally imbalanced with 90.1% samples representing negative cases. After removing attributes with at least 99.8% null values, the dataset became balanced with 51.1% representing negative cases. Table 9.1 shows the list of dropped features that have at least 99.8% null values. The details of feature filtering are illustrated in Fig. 9.4.
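The null-value filter can be sketched with pandas as follows. This is an illustration under stated assumptions, not the authors' code: the helper name drop_sparse_features, the column names, and the toy values are all hypothetical, and only the 99.8% threshold comes from the text.

```python
import numpy as np
import pandas as pd

def drop_sparse_features(df, target_col, positive_label, max_null_frac=0.998):
    """Drop columns whose null fraction among positive cases exceeds max_null_frac."""
    positives = df[df[target_col] == positive_label]
    # Fraction of missing values per feature, computed on positive cases only.
    null_frac = positives.drop(columns=[target_col]).isna().mean()
    to_drop = null_frac[null_frac > max_null_frac].index.tolist()
    return df.drop(columns=to_drop), to_drop

# Hypothetical toy frame; "rare_test" is never measured for positive cases.
toy = pd.DataFrame({
    "result": ["positive", "negative", "positive", "negative"],
    "hematocrit": [0.2, np.nan, -0.5, 1.1],
    "rare_test": [np.nan, 0.3, np.nan, np.nan],
})
filtered, dropped = drop_sparse_features(toy, "result", "positive")
print(dropped)
```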

The figures show the percentage of negative and positive cases after undersampling. Thus the dataset can be considered balanced. The new dataset has 1091 rows and 61 columns; it will only have numerical features. The target feature is converted to 0 or 1, in which 1 means positive and 0 means negative.
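The text does not specify how the undersampling was carried out. One common scheme, shown here as a hedged sketch, randomly downsamples every class to the size of the smallest class and then maps the target to 0 or 1. The helper name undersample, the column names, and the toy data are hypothetical; this scheme yields an exact 50/50 split, which differs slightly from the 51.1% reported above.

```python
import pandas as pd

def undersample(df, target_col, seed=0):
    """Randomly downsample every class to the size of the smallest class."""
    n_min = df[target_col].value_counts().min()
    parts = [grp.sample(n=n_min, random_state=seed)
             for _, grp in df.groupby(target_col)]
    # Shuffle the concatenated classes so they are not stored in blocks.
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)

# Hypothetical imbalanced data: 9 negative and 3 positive samples.
toy = pd.DataFrame({
    "result": ["negative"] * 9 + ["positive"] * 3,
    "feature": range(12),
})
balanced = undersample(toy, "result")

# Encode the target as described in the text: 1 = positive, 0 = negative.
balanced["result"] = balanced["result"].map({"negative": 0, "positive": 1})
print(balanced["result"].value_counts())
```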

In this section, experiments are performed to classify normal and COVID-19 patients using samples in the dataset. This research work is implemented using the scikit-learn library of the Python programming language. The steps followed in this implementation are shown in Fig. 9.5. A number of processes are performed, including data labeling and data filtering, which are part of preprocessing. Next, important features are selected. Classification algorithms are then applied to the selected features. A number of popular classification algorithms such as random forest (RF), logistic regression (LR), decision tree (DT), and XGBoost are considered. Both cross-validation (cv) and holdout methods are considered. For cv, the KFold() function, and for holdout, the train_test_split() function from the scikit-learn library are used to split the dataset. Next, the classification models are fitted with the training data and the models are then used to predict COVID-19 samples.
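The two splitting strategies can be sketched as follows. The data are synthetic placeholders, and the shuffle, stratify, and random_state settings are assumptions, since the text names only the functions themselves.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

# Placeholder data: 20 samples, 2 features, alternating class labels.
X = np.arange(40, dtype=float).reshape(20, 2)
y = np.array([0, 1] * 10)

# Holdout: 80% of the samples for training, 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# k-fold cv: every sample is used for testing exactly once across the folds.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = [(len(train_idx), len(test_idx)) for train_idx, test_idx in kf.split(X)]
print(X_train.shape, X_test.shape, fold_sizes)
```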

There are a number of feature selection algorithms. In this case, a univariate feature selection method is considered. For this, the SelectKBest() function of the scikit-learn library is used. Table 9.3 shows the ranking of the top 25 features obtained using SelectKBest(). For the dataset considered, serum glucose is the best-ranked attribute, or the most influential feature in predicting a COVID-19 patient.
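A minimal sketch of univariate selection with SelectKBest follows. The scoring function is not stated in the text; f_classif (the scikit-learn default) is assumed here, and the synthetic data are constructed so that one column is clearly informative.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
n_samples = 200
y = rng.integers(0, 2, size=n_samples)
X = rng.normal(size=(n_samples, 5))
X[:, 2] += 3.0 * y  # column 2 is made strongly dependent on the class

# Score each feature independently against the target and keep the best k.
selector = SelectKBest(score_func=f_classif, k=2)
X_top = selector.fit_transform(X, y)

# Rank all features from highest to lowest univariate score.
ranking = np.argsort(selector.scores_)[::-1]
print(ranking, selector.get_support(indices=True))
```

In the chapter's setting k would be 25, and the ranking would be reported by feature name rather than column index.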

After the top features are selected, the feature subset is passed to the classifier training stage. In the training stage, XGBoost [15], RF [16–18], LR [19, 20], and DT are employed. Fig. 9.5 illustrates the stages of this implementation: the dataset is initially preprocessed, followed by the feature selection process. Next, the data samples are split into training and testing samples. Then, the training data are used to fit a classifier model. The testing data are then applied to the model to predict the target: in this case, COVID-19. Finally, the actual target values of the testing data are compared with the predicted values.

XGBoost is a popular implementation of gradient-boosted DTs designed for efficient hardware use. XGBoost can penalize a model for complexity using L1 and L2 regularization, which helps prevent overfitting. Algorithm 1 describes how XGBoost is used to classify COVID-19 patients.

Algorithm 1. Detection of positive COVID-19 patient using XGBoost
Input: A list of features
Output: Classification report, confusion matrix, receiver operating characteristic (ROC) curve
Process:
1. Standardize the selected features using StandardScaler() function
2. Apply XGBoost classifier using XGBClassifier(base_score=0.5, booster='gbtree', gamma=0, learning_rate=0.1, max_depth=3, n_estimators=100, objective='reg:linear', random_state=0) function on the selected features
3. Train the model using selected features
4. Predict result using test dataset
5. Evaluate the accuracy of the classifier using accuracy_score() function
6. Use confusion_matrix() function to evaluate true negative (TN), false positive (FP), false negative (FN), and true positive (TP)
7. Use classification_report() function to calculate precision, recall, and F1 score
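The steps of Algorithm 1 can be sketched as below, under several assumptions. The data are synthetic. The listed objective 'reg:linear' is a regression objective, so the sketch substitutes 'binary:logistic', which is not the authors' setting. If the xgboost package is not installed, scikit-learn's GradientBoostingClassifier is used as a stand-in, which is a different implementation of gradient boosting. The scaler is fitted on the training portion only, which is a design choice of this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

try:
    from xgboost import XGBClassifier
    # Hyperparameters follow Algorithm 1, except for the objective (assumption).
    model = XGBClassifier(base_score=0.5, booster="gbtree", gamma=0,
                          learning_rate=0.1, max_depth=3, n_estimators=100,
                          objective="binary:logistic", random_state=0)
except ImportError:
    # Stand-in when xgboost is unavailable; not the classifier used in the chapter.
    from sklearn.ensemble import GradientBoostingClassifier
    model = GradientBoostingClassifier(learning_rate=0.1, max_depth=3,
                                       n_estimators=100, random_state=0)

# Synthetic placeholder for the selected features and the binary target.
X, y = make_classification(n_samples=400, n_features=10, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Step 1: standardize; steps 2-4: fit the classifier and predict.
scaler = StandardScaler().fit(X_train)
model.fit(scaler.transform(X_train), y_train)
y_pred = model.predict(scaler.transform(X_test))

# Steps 5-7: accuracy, confusion matrix entries, and per-class report.
acc = accuracy_score(y_test, y_pred)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
print(acc, (tn, fp, fn, tp))
print(classification_report(y_test, y_pred))
```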

RF is a combination of multiple DTs. Two important concepts make this algorithm random: the randomness in the sampling of the training portion of the data and the randomness in the selection of features for the splitting nodes. The RF algorithm maintains the reliability of a large part of the dataset by handling any missing sample values. Algorithm 2 describes the stages of RF in classifying COVID-19 patients.

Algorithm 2. Detection of positive COVID-19 patient using RF
Input: A list of features according to rank
Output: Classification report, confusion matrix, accuracy
Process:
1. Standardize the selected features using StandardScaler() function
2. Apply RF using RFClassifier(n_estimators=100, criterion='gini') function with some parameters on the selected features
3. Train the model using selected features
4. K-fold parameters for K-fold cv: thresh = 0.5, k_fold_seed = 13, n_folds = 10
5. Predict the result using test dataset
6. Evaluate the accuracy of the classifier
7. Use confusion_matrix() function to evaluate TN, FP, FN, and TP
8. Use classification_report() function to calculate precision, recall, and F1 score
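A hedged sketch of Algorithm 2 with 10-fold cv follows. It assumes that RFClassifier refers to scikit-learn's RandomForestClassifier, that thresh is a cutoff applied to the predicted probability of the positive class, and that k_fold_seed seeds a shuffled KFold. The data are synthetic and the forest's random_state is added only for reproducibility.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# K-fold parameters as listed in step 4 of Algorithm 2.
THRESH, K_FOLD_SEED, N_FOLDS = 0.5, 13, 10

# Synthetic placeholder for the ranked features and the binary target.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

kf = KFold(n_splits=N_FOLDS, shuffle=True, random_state=K_FOLD_SEED)
fold_acc = []
for train_idx, test_idx in kf.split(X):
    # A fresh pipeline per fold so scaling statistics never see the test fold.
    clf = make_pipeline(
        StandardScaler(),
        RandomForestClassifier(n_estimators=100, criterion="gini", random_state=0))
    clf.fit(X[train_idx], y[train_idx])
    # Threshold the positive-class probability to obtain hard predictions.
    proba = clf.predict_proba(X[test_idx])[:, 1]
    y_pred = (proba >= THRESH).astype(int)
    fold_acc.append(accuracy_score(y[test_idx], y_pred))

mean_acc = float(np.mean(fold_acc))
print(fold_acc, mean_acc)
```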

Other popular classification algorithms are DT and LR. Algorithm 3 shows important steps of DT and LR classifiers in predicting patients affected by COVID-19.

Algorithm 3. Detection of positive COVID-19 patient using DT and LR
Input: A list of features according to rank
Output: Classification report, confusion matrix, accuracy
Process:

Fig. 9.6 shows a performance matrix and its mathematical illustration including precision, recall, TP rate, TN rate, accuracy, miss rate (FN rate), and F1 score. The performance evaluation is conducted for several classifiers including XGBoost, RF, LR, and DT for several cv numbers. For this, three- to 10-fold cv is considered. Table 9.4 presents the performance results for XGBoost using cv, reporting metrics such as precision, recall, F1 score, and testing accuracy. Table 9.4 shows that XGBoost provides the highest accuracy, precision, recall, and F1 score for cv 5, for which the highest accuracy of XGBoost is 97.2477%.
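The metrics named for the performance matrix follow the standard confusion-matrix definitions, which can be written out as a small helper. The function name binary_metrics and the example counts are hypothetical and are not results from the chapter.

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard metrics derived from the four confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # also called the TP rate or sensitivity
    tn_rate = tn / (tn + fp)           # specificity
    miss_rate = fn / (fn + tp)         # FN rate, equal to 1 - recall
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "tn_rate": tn_rate,
            "miss_rate": miss_rate, "accuracy": accuracy, "f1": f1}

# Hypothetical counts, for illustration only.
m = binary_metrics(tp=40, fp=10, fn=5, tn=45)
print(m)
```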

The performance results of RF, LR, and DT are evaluated in Tables 9.5–9.7, respectively. The accuracies of RF, LR, and DT are 95.4128%, 98.1651%, and 95.4128%, respectively, where the number of cvs is 5. Hence, fivefold cv provides the highest accuracy, precision, and recall for these four classifiers for the given dataset. Compared with the other three classifiers, LR provides the highest accuracy (98.1651%) and the same value of precision, recall, and F1 score (98%). XGBoost provides the second highest accuracy of 97.2477% (in Table 9.4). Hence, the performance results vary with the cv fold value. Next, we compare the classifiers by taking the average of results for cv values of 3–10. For example, we take the average of values of XGBoost for cv 3–10 shown in Table 9.4.

Next, the classifiers are evaluated in terms of ROC curves and the area under the curve (AUC). Fig. 9.7 shows the ROC curves for (a) XGBoost, (b) RF, (c) LR, and (d) DT, where the AUC values for XGBoost, RF, LR, and DT are 97.1%, 96.8%, 96.4%, and 94.4%, respectively.
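An ROC curve and its AUC are computed from true labels and predicted scores. The following minimal sketch uses scikit-learn's roc_curve and roc_auc_score on four hypothetical samples; the values are illustrative and unrelated to Fig. 9.7.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and predicted positive-class scores.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Sweep the decision threshold to trace false positive rate against TP rate.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# AUC is the probability that a random positive is scored above a random negative.
auc_value = roc_auc_score(y_true, y_score)
print(fpr, tpr, auc_value)
```

Here three of the four positive-negative pairs are ordered correctly, so the AUC is 0.75.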

Next, the overall performance of the classifiers is shown for the holdout method, in which the dataset is split into testing and training portions. The results vary with the split ratio. In this case, we split the dataset so that 80% of the data samples are used for training and the remaining 20% are used for testing. Table 9.9 shows the performance results for different classifiers when 20% of the data samples are used for testing. LR has the best testing accuracy of 94.06%.

Table 9.10 lists the pharmacologic parameters of different therapies for COVID-19. Some promising drug targets include nonstructural proteins and viral entry and immune regulation pathways. Nonstructural proteins such as 3-chymotrypsin-like protease, papain-like protease, and RNA-dependent RNA polymerase share homology with other novel coronaviruses.

According to Wu [44], the S protein in the genome of SARS-CoV-2 is so far the major target for COVID-19 vaccine development. Major vaccine candidates in development for prevention of COVID-19 are listed in RAPS [45] and https://www.clinicaltrials.gov/ct2/show/NCT04283461 [46]. Fig. 9.8 lists the five most active vaccine candidates for COVID-19 as of Apr. 23, 2020, as reported by the WHO.

RNA polymerase inhibitor; should not be used in case of pregnancy
Favipiravir [42, 43]; RNA polymerase inhibitor; must be avoided during pregnancy because metabolite has been found in breast milk

This chapter provides an overview of the spread of COVID-19. The United States and some European countries such as Italy, Spain, the United Kingdom, and Germany are heavily affected by the disease. This chapter uses machine learning algorithms to predict COVID-19 for a given dataset. For this particular dataset, our experimental results indicate that serum glucose is the most influential attribute in predicting COVID-19. Our results also show that for the case of cv, XGBoost has the highest accuracy of 92.67% and LR has the second highest accuracy of 92.58%, whereas both XGBoost and LR have the same 93% value for precision, recall, and F1 score. For the case of the holdout method with 20% testing data samples, LR exhibits the highest testing accuracy of 94.06%. Hence, XGBoost and LR can be used to predict COVID-19. The reliability of the diagnosis results presented in this chapter depends on the reliability of the dataset used. In the future, as more reliable datasets become available, machine learning algorithms [51, 52] should be applied to those new datasets to validate the effectiveness of the classifiers. Hybrid deep learning algorithms [47–49, 53] can also be applied to various chest X-ray or computed tomography image [50] datasets to detect COVID-19 patients.

The origin, transmission and clinical therapies on coronavirus disease 2019 (COVID-19) outbreak – an update on the status

Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding

COVID-2019) Situation Reports

A Novel coronavirus from patients with pneumonia in China

Identification of a novel coronavirus causing severe pneumonia in human: a descriptive study

A familial cluster of pneumonia associated with the 2019 novel coronavirus indicating person-to-person transmission: a study of a family cluster

The species Severe acute respiratory syndrome-related coronavirus: classifying 2019-nCoV and naming it SARS-CoV-2

Consistent Detection of 2019 Novel Coronavirus in Saliva

Functional assessment of cell entry and receptor usage for SARS-CoV-2 and other lineage B betacoronaviruses

Data analytics for novel coronavirus disease

Epidemiologic features and clinical course of patients infected with SARS-CoV-2 in Singapore

A Scalable Tree Boosting System

Comparative performance analysis of different classification algorithm for the purpose of prediction of lung cancer

Lung cancer recognition and prediction according to random forest ensemble and RUSBoost algorithm using LIDC data

Comparative performance exploration and prediction of fibrosis, malign lymph, metastases, normal lymphogram using machine learning method

Data-driven diagnosis of spinal abnormalities using feature selection and machine learning algorithms

Breast cancer prediction applying different classification algorithm with comparative analysis using WEKA

COVID-19: a recommendation to examine the effect of hydroxychloroquine in preventing infection and progression

New insights on the antiviral effects of chloroquine against coronavirus: what to expect for COVID-19?

Chloroquine and hydroxychloroquine as available weapons to fight COVID-19

National Health Commission and State Administration of Traditional Chinese Medicine, Diagnosis and treatment protocol for novel coronavirus pneumonia

Chloroquine Phosphate)

In vitro antiviral activity and projection of optimized dosing design of hydroxychloroquine for the treatment of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)

Hydroxychloroquine and azithromycin as a treatment of COVID-19: results of an open-label non-randomized clinical trial

A pilot study of hydroxychloroquine in treatment of patients with common coronavirus disease-19 (COVID-19)

Pharmacokinetics of hydroxychloroquine and its clinical implications in chemoprophylaxis against malaria caused by Plasmodium vivax

Role of lopinavir/ritonavir in the treatment of SARS: initial virological and clinical findings

Screening of an FDA-approved compound library identifies four small-molecule inhibitors of Middle East respiratory syndrome coronavirus replication in cell culture

A trial of lopinavir-ritonavir in adults hospitalized with severe COVID-19

Pharmacologic treatments for coronavirus disease 2019 (COVID-19): a review

Discovery and synthesis of a phosphoramidate prodrug of a pyrrolo[2,1-f][triazin-4-amino] adenine C-nucleoside (GS-5734) for the treatment of Ebola and emerging viruses

Remdesivir as a possible therapeutic option for the COVID-19

Comparative therapeutic efficacy of remdesivir and combination lopinavir, ritonavir, and interferon beta against MERS-CoV

Influenza virus polymerase inhibitors in clinical development

Taisho Toyama Pharmaceutical Co Ltd

Progress and concept for COVID-19 vaccine development

RAPS, Regulatory Focus, COVID-19 Tracker

Hybrid deep learning for detecting lung diseases from X-ray images

Rubaiyat Hossain Mondal, Artificial neural network based breast cancer screening: a comprehensive review

Diagnosis of breast cancer based on modern mammography using hybrid transfer learning, Multidimensional Systems and Signal Processing

Rubaiyat Hossain Mondal, Diagnosis of Polycystic Ovary Syndrome Using Machine Learning Algorithms

Automated gastric cancer detection and classification using machine learning

Performance of CNN for predicting cancerous lung nodules using LightGBM