key: cord-1006916-7come9dl authors: Guhathakurata, Soham; Kundu, Souvik; Chakraborty, Arpita; Banerjee, Jyoti Sekhar title: A novel approach to predict COVID-19 using support vector machine date: 2021-05-21 journal: Data Science for COVID-19 DOI: 10.1016/b978-0-12-824536-1.00014-9 sha: 21bc4d3f4bcf8495599c3bd9562bf7d938d2f807 doc_id: 1006916 cord_uid: 7come9dl
An unexpected outbreak of 2019 Coronavirus disease (COVID-19) in Wuhan, China, led to a massive catastrophe across the world. The majority of COVID-19 patients are diagnosed with pneumonia in the early stages. Over 2,200,000 confirmed cases have shown a wide range of symptoms, the most predominant being fever, cough, and shortness of breath. Using this predominant set of symptoms together with other critical symptoms, a prediction process has been devised in this paper to check whether a person is infected with COVID-19. Based on the impact of the symptoms, we apply the support vector machine classifier to classify the patient's condition as no infection, mild infection, or serious infection. We achieve an accuracy of 87% in predicting the cases.
Coronavirus disease (COVID-19), an infectious disease caused by severe acute respiratory syndrome Coronavirus 2 (SARS-CoV-2) [1], was first reported in December 2019 in Wuhan City, Hubei province, China. Since then, the disease has spread globally at an exponential rate, infecting more than 2,200,000 people and killing more than 153,000, and the numbers are still rising. Studies have shown that transmission takes place mainly via respiratory droplets and close contact. The virus can survive for 72 h on surfaces; as a result, a contaminated surface can become a source of transmission.
Those infected with the virus have shown a wide range of signs and symptoms, including fever (87.9%), dry cough (67.7%), fatigue (38%), shortness of breath (19%), sputum production (33.4%), persistent chest pain, headache (14%), sore throat (13.9%), chills (11%), nasal congestion (4.8%), nausea (5%), diarrhea (4%), hemoptysis (0.9%), and pink eyes and lips (0.8%) [1-3]. A person infected with COVID-19 generally starts to show signs and symptoms within 5-6 days of infection. The period from the first symptom of COVID-19 to death is estimated to range from 6 to 41 days [4]. This period is shorter for patients above 70 years old than for those under 70. Certain similarities have been found between the symptoms of COVID-19 and SARS-CoV [5] patients. People infected with COVID-19 may show mild illness such as common cold, fever, and dry cough, which are also common in SARS-CoV. Such patients generally recover within two weeks. However, upper respiratory tract symptoms like rhinorrhea, sneezing, and sore throat have been taken to indicate that COVID-19 targets the lower airway. Moreover, COVID-19 patients developed certain gastrointestinal symptoms, such as diarrhea, that were not as significant among patients infected with MERS-CoV or SARS-CoV.
The severity of COVID-19 varies. Those with severe manifestations such as dyspnea and high respiratory frequency may take three to six weeks to recover. Many of those who died of COVID-19 had progressed to a critical condition, which includes respiratory failure, septic shock, and multiple organ failure [3, 6]. Studies have shown that people with underlying conditions such as cardiovascular disease and hypertension [7] are more likely to get infected with COVID-19. According to reports by the National Health Commission of China (NHC) [8], many of the confirmed COVID-19 patients initially presented with heart palpitations and chest tightness.
As per the NHC report, 11.8% of the patients who died had substantial heart damage, with elevated levels of cardiac arrest during hospitalization. The incidence of cardiovascular symptoms is quite high in COVID-19 patients, which leads to immune system disorder during disease progression. In another report [9, 10], in which data on 1099 patients with laboratory-confirmed COVID-19 were extracted from 552 hospitals in 30 provinces, autonomous provinces, and municipalities in mainland China, 173 had severe disease with comorbidities of hypertension. Fever (43.8% on admission and 88.7% during hospitalization) and cough (67.8%) were found to be the most common symptoms. China CDC Weekly [11] reported that the severity of the symptoms was classified as mild, severe, or critical. Respiratory frequency above 30 per minute, blood oxygen saturation below 93%, respiratory failure, and septic shock were considered the marks of severe and critical cases.
Based on symptoms and exposure, clinical diagnosis of the suspected cases was carried out. Throat-swab samples were used to determine confirmed cases. Real-time reverse transcription polymerase chain reaction [6] performed on throat-swab specimens obtained from the upper respiratory tract of the patient has been the standard method of diagnosis [7]. However, there is no standard method for predicting the seriousness of the condition from the early and mid-period symptoms. Owing to the unpredictability of the infection and the continuously evolving nature of the virus, proper judgment of the infection from just a handful of symptoms is not possible. Our proposed method attempts to predict COVID-19 using a selected set of symptoms chosen for their severity and frequency. The key attributes in the feature set incline the prediction toward a particular class. The prediction falls into one of three classes: not infected, mildly infected, and severely infected.
Among all the symptoms of COVID-19 cases, only a handful have been chosen. These include the most common symptoms (fever, breathing rate, cough), which appear across 90% of the confirmed cases. Moreover, the reports mentioned above support our selection of hypertension, heart disease, chest pain, and acute respiratory syndrome as attributes of the dataset. We apply the support vector machine (SVM) classifier to classify the features/symptoms into the mentioned classes. We also perform a comparative study of popular supervised learning models using visual programming. The rest of the paper is organized as follows: Section 2 reviews related studies. Section 3 discusses the proposed COVID-19 detection methodology. Section 4 provides the experimental results and discussion. Section 5 presents the performance analysis of other supervised learning models using visual programming. Lastly, we present some concluding comments.
In the recent past, a lot of work in bioinformatics, face detection, text categorization, etc., has been done using the support vector machine algorithm [12-16]. A brief review is presented here. Liaqat Ali et al. [17] explained how SVM is used in the early detection of heart failure, which can help cardiologists improve the diagnosis process. Their system uses two separate models: the first eliminates irrelevant features, and the second serves as the predictive model. The results show that the proposed model performs better than other machine learning (ML) models, with an accuracy in the range of 57.85%-91.83%. Zihe Yang et al. [18] proposed an improved SVM-based learning model for the diagnosis of diabetes. The system transformed the original problem into an unconstrained optimization problem by applying a constraint reduction strategy.
By using the gradient descent algorithm, the SVM results improved significantly, outperforming other benchmark methods. Y. Lebrini et al. [19] applied the SVM classification method based on phenological metrics to identify the changes and the main agricultural classes in the area of interest. This system classified the main classes into irrigated annual crop, irrigated perennial crop, rainfed area, and fallow in order to control illegal pumping zones based on the main agricultural system classes. The proposed model reached an overall accuracy of 88% and F1-scores greater than 0.76. Joyati Chattopadhyay et al. [20] proposed facial expression recognition for humans using SVM classification. Based on the extracted facial features, they employed the SVM classifier to separate the facial expressions into six classes: happy, sad, disgust, angry, surprise, and fear. This system achieved an accuracy of 80% in detecting facial expressions. Sajja Tulasi Krishna and Hemantha Kumar Kalluri [21] proposed a system to predict cancer tumors in the lungs using an SVM classifier. They extracted features from CT scan images using the local binary pattern. Linear, polynomial, and radial basis function kernels were used for the classification, among which the radial basis function yielded the highest accuracy of 88.76%. J. S. Raikwal and Kanak Saxena [22] analyzed the performance of SVM and k-nearest neighbor (kNN) algorithms in classifying data and discovering data patterns to predict future disease. Their evaluation showed that the accuracy of kNN is higher than that of SVM for small datasets but drops as the dataset size increases. The evaluation time of SVM is also better than that of kNN for large datasets. Binh Thai Pham et al. 
[23] evaluated the predictive capability of SVM and Naïve Bayes Trees (NBT) methods for spatial prediction of landslides in a part of Uttarakhand state in India. It was observed that the SVM model outperforms the NBT model in classifying landslide and nonlandslide pixels. Huseyin Polat, Homay Danaei Mehr, and Aydin Cetin [24] implemented the SVM classification algorithm to diagnose chronic kidney disease. A feature selection method was applied to reduce the dimension of the dataset. Applying the SVM classifier with the feature selection method that uses the best-first search engine produced an accuracy of 98.5% in the diagnosis of chronic kidney disease, which is higher than that of the other ML algorithms. Ashfaq Ahmed K, Sultan Aljahdali, and Syed Naimatullah Hussain [25] used SVM and random forest (RF) to learn, classify, and compare cancer, liver, and heart disease data on different datasets. The accuracy of SVM varied with the kernel function; the results were much better with the radial basis function and were comparable with those of the random forest technique. Vivek Patel [26] introduced power quality disturbance classification using a software simulation of SVM in MATLAB. The mean, variance, and standard deviation of the signal were extracted as features and classified in simulation using SVM with the radial basis kernel function. The simulation results showed that Field Programmable Gate Array (FPGA)-based SVM classification is fast and yields high classification accuracy (CA).
The main difficulty in detecting COVID-19 from symptoms is the uncertainty of the data. As a result, no proper dataset is available to use as a reference. According to the records of COVID-19-infected patients, the majority were hospitalized with high fever, cough with sputum, and shortness of breath.
Patients with hypertension, cardiovascular disease, and high pulse rate [27] are quick to progress to the next stage once they get infected with COVID-19. Once the disease progresses to acute respiratory distress syndrome (ARDS), there could be respiratory failure, septic shock, and multiple organ failure [2, 6]. The multicriteria [28-35, 42-47] dataset has thus been created with the following symptoms as attributes (see Table 18.1): Based on these symptoms, our proposed approach determines the infected status. The outcome, i.e., infected status, is classified into three classes: not infected, mildly infected, and severely infected. The classes are mapped to numerical values as follows: Severely Infected = 1; Mildly Infected = 2; Not Infected = 3.
Cases assigned to the not infected class show symptoms that belong only to some other ordinary illness, which is quite natural for any human being. In certain instances, like the common cold, people can suffer from mild fever and dry cough, but those conditions alone are not sufficient to ascertain COVID-19 infection.
The mildly infected class signifies that the symptoms do not confirm COVID-19 with certainty but can lead to severe consequences if proper measures are not taken. In cases like mild fever coupled with a mild breathing problem, we can say that the patient might be infected with COVID-19, but again this prediction has no certainty.
Patients with more than two to three symptoms, each of which has crossed its normal limit, have tested positive for COVID-19 in the majority of cases. A patient who is suffering from high fever and a high breathing rate and has also developed acute respiratory syndrome is in a critical condition. The dataset is then passed on to the SVM classifier. The block diagram in Fig. 18.1 depicts the suggested methodology.
For this problem, SVM is chosen because it uses the kernel trick to map a low-dimensional input space to a high-dimensional space and thus converts a nonseparable problem into a separable one. We split the dataset into a training set and a test set in the ratio 7:3. Using the linear kernel, the SVM classifier separates the data linearly by means of a hyperplane. The classes are separated by parallel hyperplanes chosen so that the distance between them is as large as possible. We take the cost parameter C = 10. We are looking for a smaller-margin hyperplane that classifies the infected classes more accurately, with fewer mispredictions, since we are dealing with the very high-priority task of detecting COVID-19. The working principle of SVM for our paper is depicted in Fig. 18.2.
The dataset created for this study contains 200 records with eight attributes, which are described in Table 18.1. The authors are constrained to apply the machine-learning algorithm to a small dataset, as original data are not currently available because of the worldwide pandemic situation. The authors also believe that the proposed technique will perform better when provided with a real-time dataset. In this dataset (see Table 18.2), the output column, i.e., the infected column, takes integer values from one to three, which encode the three classes (not infected, mildly infected, and severely infected). A visualization of the data before applying SVM is shown in Fig. 18.3. Each row holds the symptom data for an individual patient. We replaced the string values (yes, no) of the attributes ARDS, chest pain, heart disease, and cough with sputum with the integers 1 and 0, respectively. For hypertension, the data have been mapped as no = 1, stage 1 = 2, stage 2 = 4. The dataset is then fed to the SVM classifier to classify the symptoms into the three classes discussed earlier. We have kept the kernel as "linear."
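The attribute encoding and the 7:3 split described above can be sketched in plain Python as follows. This is a minimal illustration, not the authors' code: the column names, the sample record, and the helper names `encode_record` and `split_70_30` are hypothetical, and the fixed shuffle seed is an assumption. Only the yes/no and hypertension mappings, the record count of 200, and the 7:3 ratio come from the text.

```python
import random

# Mappings stated in the text: yes/no -> 1/0; hypertension no=1, stage 1=2, stage 2=4.
YES_NO = {"yes": 1, "no": 0}
HYPERTENSION = {"no": 1, "stage 1": 2, "stage 2": 4}
# Hypothetical column names for the four yes/no attributes.
BINARY_COLUMNS = ("ards", "chest_pain", "heart_disease", "cough_with_sputum")


def encode_record(record):
    """Return a copy of one patient record with string values replaced by integers."""
    encoded = dict(record)
    for column in BINARY_COLUMNS:
        encoded[column] = YES_NO[record[column].strip().lower()]
    encoded["hypertension"] = HYPERTENSION[record["hypertension"].strip().lower()]
    return encoded


def split_70_30(records, seed=0):
    """Shuffle the records (seed is an assumption) and split them 7:3."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = (7 * len(shuffled)) // 10
    return shuffled[:cut], shuffled[cut:]


# Made-up record for illustration only; the values are not from the paper's dataset.
sample = {
    "temperature": 101.2, "breathing_rate": 24, "heart_beat_rate": 96,
    "ards": "No", "chest_pain": "Yes", "heart_disease": "No",
    "cough_with_sputum": "Yes", "hypertension": "Stage 1",
}
encoded_sample = encode_record(sample)
train, test = split_70_30([encoded_sample] * 200)
print(encoded_sample["chest_pain"], encoded_sample["hypertension"], len(train), len(test))
```

The encoded rows would then be handed to a linear-kernel SVM implementation with C = 10; that step is left out here because the text does not name the library used.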
The model has been trained using 70% of the data from the dataset. Fig. 18.4 shows the prediction plot based on temperature and breathing rate. The confusion matrix, which describes the performance of the classifier on test data for which the true labels are known, is shown in Fig. 18.5. We show scatter plots based on temperature, breathing rate, and heart beat rate because these three are the most common symptoms of COVID-19. Moreover, these three attributes have numeric values, while the rest have categorical values (yes/no). Our model has an accuracy of 87%. The classification report shows that our methodology has a high success rate in predicting severely infected cases, which is crucial for COVID-19 prediction. The scores for the other two classes are on the lower side because of the variable nature of COVID-19: people with no symptoms are also getting infected. We cannot predict with certainty that a person showing mild symptoms or no signs at all is not infected with COVID-19. Because of these uncertainties in the dataset, the accuracy is on the lower side (see Table 18.3).
In Table 18.3, the classifier shows higher precision for the severely infected class, which means that very few cases have been wrongly labeled as severely infected, while the not infected and mildly infected classes show lower precision. The classifier gives a perfect result in identifying the cases that belong to the severely infected class, since its recall = 1, but only a moderate result for the other two classes. The F1-score, which is the harmonic mean of precision and recall, is high for the severely infected class and somewhat lower for the not infected and mildly infected classes. The final column, support, is the number of true instances of each class. Our analysis of the COVID-19 dataset indicates that, among the supervised models considered, SVM works best in predicting COVID-19 cases, with the highest accuracy.
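To make the columns of the classification report concrete, the sketch below computes per-class precision, recall, F1-score, and support, plus overall accuracy, from a confusion matrix. The matrix values are invented for illustration and are not the paper's results, and `per_class_metrics` and `accuracy` are hypothetical helper names. F1 is computed as the harmonic mean of precision and recall.

```python
def per_class_metrics(confusion, labels):
    """Compute precision, recall, F1, and support for each class.

    confusion[i][j] counts samples whose true class is labels[i]
    and whose predicted class is labels[j].
    """
    report = {}
    for k, label in enumerate(labels):
        true_positive = confusion[k][k]
        predicted_as_k = sum(row[k] for row in confusion)  # column sum
        support = sum(confusion[k])                        # row sum
        precision = true_positive / predicted_as_k if predicted_as_k else 0.0
        recall = true_positive / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {
            "precision": precision, "recall": recall, "f1": f1, "support": support,
        }
    return report


def accuracy(confusion):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    return correct / total


labels = ["severely infected", "mildly infected", "not infected"]
# Made-up counts (rows: true class, columns: predicted class).
example = [
    [20, 0, 0],
    [1, 15, 4],
    [0, 5, 15],
]
report = per_class_metrics(example, labels)
for label in labels:
    print(label, report[label])
print("accuracy", accuracy(example))
```

In this made-up matrix every severely infected case is recovered, so that class has recall 1, while one mildly infected case is predicted as severe, which pulls its precision slightly below 1.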
To bolster our claim, we have performed a comparative study of popular supervised learning models. COVID-19 is so severe that the turnaround time for COVID-19 cases becomes one of the most important factors. Current data also show that the symptoms of COVID-19 have evolved over the months since its emergence and are still evolving. Therefore, new symptoms and data will further increase the number of predictors for COVID-19 data analysis and prediction. Keeping this in mind, we need a software tool to quickly analyze the data with a new set of predictors and assess their efficiency in predicting the disease. Thus, we propose a visual programming methodology using Orange [36], an open source ML and data visualization toolkit designed to reduce coding overhead and ease data analysis with ML models [38-41] for researchers from any background. To introduce this toolkit, we perform our comparative analysis of different machine-learning models using Orange. Fig. 18.6 depicts the block diagram of the methodology. First, the COVID-19 data are parsed and the features are ranked by their predictive scores. This step is essentially a dimensionality reduction, which reduces the number of features and keeps only those that correlate best with the target. Furthermore, linearly varying redundant features can also be identified using this "Rank" block, which reduces the overall computation time on large datasets and makes the methodology robust, fast, and scalable. For ranking, popular scoring methods [37] such as Info Gain, Gini, ReliefF, and FCBF are employed. Next, the top features are selected and fed into the ML models for training and prediction.
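As an illustration of the ranking step, the sketch below scores categorical features by information gain, one of the scoring methods named above, and sorts them from most to least informative. It is written in plain Python rather than with Orange's own API, and the toy rows, feature names, and helper names are invented for the example.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in Counter(values).values())


def information_gain(rows, feature, target):
    """Entropy of the target minus its expected entropy after splitting on feature."""
    base = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


def rank_features(rows, features, target):
    """Return (feature, score) pairs sorted from most to least informative."""
    scores = [(f, information_gain(rows, f, target)) for f in features]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy data, invented for illustration: "ards" determines the class, "chest_pain" does not.
rows = [
    {"ards": 1, "chest_pain": 1, "infected": "severe"},
    {"ards": 1, "chest_pain": 0, "infected": "severe"},
    {"ards": 0, "chest_pain": 1, "infected": "mild"},
    {"ards": 0, "chest_pain": 0, "infected": "mild"},
]
ranking = rank_features(rows, ["chest_pain", "ards"], "infected")
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Keeping only the first few entries of `ranking` corresponds to the "select top features" step; numeric attributes such as temperature would need to be discretized before this particular score can be applied.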
The "Test and Score" module evaluates the predictions and provides comparison metrics for the performance of various supervised machine-learning models, viz. kNN, Naïve Bayes, RF, AdaBoost, Binary Tree, and SVM. The evaluation results of the models are depicted in Fig. 18.6, and SVM outperforms all other models tested. The comparison is based on area under ROC, CA, F1 score, precision, and recall. The confusion matrices for all the models are summarized in Fig. 18.7, illustrating the superiority of SVM in predicting COVID-19 [48-51] efficiently.
This paper aims to develop a model that can predict whether a person is affected by COVID-19 using SVM classification. All the possible conditions of infection based on the symptoms have been meticulously examined, and the dataset has been framed accordingly. It is hard to achieve accurate prediction, since persons without any signs or symptoms of COVID-19 are also getting affected. The nature of the virus is very uncertain, and the symptoms are evolving and changing day by day. Keeping in mind the criticality of this infection, only those symptoms that are very common and critical have been kept in framing the dataset. The mildly infected class has been kept as a warning sign because the patient's symptoms incline toward COVID-19 infection. The not infected class does not signify that the patient is free from COVID-19 but, for the time being, signifies that they are free from immediate danger. Patients in the critical class should immediately receive proper medical treatment to overcome the disease. Furthermore, proper use of a visual programming toolkit for visualizing and analyzing the data may pave the way for an easier and faster COVID-19 data analysis scheme for researchers from multidisciplinary backgrounds.
Fig. 18.7 Confusion matrices of the respective supervised learning algorithms: (A) kNN, (B) Naïve Bayes, (C) Random Forest, (D) AdaBoost, (E) Binary Tree, (F) support vector machine.
Clinical features of patients infected with 2019 novel coronavirus in
Report of the WHO-China Joint Mission on Coronavirus Disease 2019 (COVID-19), World Health Organization
Updated understanding of the outbreak of 2019 novel coronavirus (2019-nCoV) in Wuhan
The epidemiology and pathogenesis of coronavirus disease (COVID-19) outbreak
Coronavirus disease 2019 (COVID-19): a perspective from China
Epidemiological and clinical characteristics of 99 cases of 2019 novel coronavirus pneumonia in Wuhan, China: a descriptive study
COVID-19 and the cardiovascular system
Are patients with hypertension and diabetes mellitus at increased risk for COVID-19 infection?
Clinical characteristics of coronavirus disease 2019 in China
The Epidemiological Characteristics of an Outbreak of 2019 Novel Coronavirus Diseases (COVID-19) - China
Impact of machine learning in various network security applications
A fuzzy AHP-based relay node selection protocol for wireless body area networks (WBAN)
The extent analysis based fuzzy AHP approach for relay selection in WBAN
WBAN: a smart approach to next generation e-healthcare system
An advance Q learning (AQL) approach for path planning and obstacle avoidance of a mobile robot
An optimized stacked support vector machines based expert system for the effective prediction of heart failure
Diagnosis of diabetes based on improved support vector machine and ensemble learning
Identifying agricultural systems using SVM classification approach based on phenological metrics in a semi-arid region of Morocco
Facial expression recognition for human computer interaction
Lung image classification to identify abnormal cells using radial basis kernel function of SVM
Performance evaluation of SVM and k-nearest neighbor algorithm over medical data set
Evaluation of predictive ability of support vector machines and naive Bayes trees methods for spatial prediction of landslides in Uttarakhand state (India) using GIS
Diagnosis of chronic kidney disease based on support vector machine by feature selection methods
Comparative prediction performance with support vector machine and random forest classification techniques
Classification of power system disturbances using support vector machine in FPGA
Features, evaluation and treatment coronavirus (COVID-19), in: Statpearls [internet
Fuzzy based relay selection for secondary transmission in cooperative cognitive radio networks
Relay node selection using analytical hierarchy process (AHP) for secondary transmission in multi-user cooperative cognitive radio systems
A decision framework of IT-based stream selection using analytical hierarchy process (AHP) for admission in technical institutions
Reliable best-relay selection for secondary transmission in co-operation based cognitive radio systems: a multi-criteria approach
A novel best relay selection protocol for cooperative cognitive radio systems using fuzzy AHP
Malicious node restricted quantized data fusion scheme for trustworthy spectrum sensing in cognitive radio networks
Non-uniform quantized data fusion rule for data rate saving and reducing control channel overhead for cooperative spectrum sensing in cognitive radio networks, Wireless Pers
Non-uniform quantized data fusion rule alleviating control channel overhead for cooperative spectrum sensing in cognitive radio networks
Machine Learning with Orange
Android things: a comprehensive solution from things to smart display and speaker
Analysis of implementation factors of 3D printer: the key enabling technology for making prototypes of the engineering design and manufacturing
An in-depth study of implementation issues of 3D printer
Application of machine learning in app-based cab booking system: a survey on Indian scenario
A fuzzy AHP approach to IT-based stream selection for admission in technical institutions in India
OPNET: a new paradigm for simulation of advanced communication systems
Fundamentals of software defined radio and cooperative spectrum sensing: a step ahead of cognitive radio networks
Modeling of software defined radio architecture & cognitive radio, the next generation dynamic and smart spectrum access technology
Cognitive Radio Technology Applications for Wireless and Mobile Ad Hoc Networks
A comparative study on cognitive radio implementation issues
South Asian countries are less fatal concerning COVID-19: a hybrid approach using machine learning and M-AHP
Go-COVID: an interactive cross-platform based dashboard for real-time tracking of COVID-19 using data analytics
South Asian Countries are less fatal concerning COVID-19: a fact-finding procedure integrating machine learning & multiple criteria decision-making (MCDM) technique
Smart farming & water saving based intelligent irrigation system implementation using IoT