key: cord-0581870-7sanu505
authors: Theerthagiri, Prasannavenkatesan; Vidya, J
title: Cardiovascular Disease Prediction using Recursive Feature Elimination and Gradient Boosting Classification Techniques
date: 2021-06-11
journal: nan
DOI: nan
sha: 775b2449117075172482282d040dde7af92a9b3d
doc_id: 581870
cord_uid: 7sanu505

Cardiovascular diseases (CVDs) are among the most common chronic illnesses that affect people's health. Early detection of CVDs can reduce mortality rates by preventing the disease or reducing its severity. Machine learning algorithms are a promising method for identifying risk factors. This paper proposes a recursive feature elimination-based gradient boosting (RFE-GB) algorithm to obtain accurate heart disease prediction. Patient health records with important CVD features were analyzed to evaluate the results. Several other machine learning methods were also used to build prediction models, and their results were compared with those of the proposed model. The results indicate that the combined recursive feature elimination and gradient boosting algorithm achieves the highest accuracy (89.7%). Further, with an area under the curve of 0.84, the proposed RFE-GB algorithm was found superior and obtained a substantial gain over the other techniques. Thus, the proposed RFE-GB algorithm can serve as a prominent model for CVD estimation and treatment.

As the effects of societal aging worsen, health monitoring has become increasingly important. A disease prediction framework can help medical experts anticipate heart ailments from the clinical information of patients. By implementing a prediction framework that uses advanced algorithms and investigates different health-related issues, it becomes possible to estimate more reliably the probability that a patient will be diagnosed with a health problem [1].
Heart monitoring is one of the most significant parts of healthcare monitoring. A heart disease prediction system can give medical professionals valuable insights when making decisions about a patient's heart condition. Aberrant cardiac rhythms can be caused by an abnormal site of origin or irregular conduction of the electrical signal; arrhythmia is the medical term for these conditions. Some arrhythmias can have serious consequences and even cause death [1]. Medical professionals may fail to reach an exact decision when diagnosing a patient's heart ailment; in such cases, a heart ailment prediction method based on machine learning algorithms can help to obtain precise outcomes [1, 2, 3]. Machine learning techniques have been widely applied in medicine, for example in disease prediction, disease categorization, and medical image recognition [4, 5, 6].

Gradient boosting (GB), a contemporary and efficient method, is presented and enhanced in this paper. The GB learning method derives from the gradient boosting decision tree. As an ensemble classifier, GB generalizes well. Furthermore, GB provides a regularisation term to control the model's complexity, which helps prevent overfitting. GB has outperformed competing methods in several machine learning disciplines [1, 7]. The performance of GB in classifying single cardiac diseases is therefore investigated. The goal of this research is to create a clinically useful classification system for cardiac disease. To achieve this goal, this work presents a hierarchical technique based on the weighted gradient boosting algorithm. Preprocessing is first applied to obtain usable cardiovascular patient datasets. Several types of features are then extracted, and recursive feature elimination is used to select features.
Finally, the feature vectors are fed into a hierarchical classifier, which produces the predicted labels.

The medical field is a natural application area for data mining because it holds a large volume of data, and feature selection and feature reduction are valuable in this setting. Feature selection is concerned with identifying a set of relevant features sufficient to learn the target concept [8, 9, 10]. A stochastic gradient boosting approach is used to choose features for hyperparameter optimization, and the features are clustered together to reduce the mean square error. Multiple experimental scenarios are examined, and the findings are compared with several earlier studies and typical ML algorithms to demonstrate the usefulness of the suggested technique. The suggested technique uses gradient boosting to classify cardiovascular diseases. Although the gradient boosting algorithm is well known, here it is applied with the weights of each feature of the dataset for heart disease prediction and classification. The novelty of the proposed approach lies in combining a hierarchical classifier with recursive feature elimination to choose the best features from among all the others.

The rest of this paper is laid out as follows. The next section provides background on past efforts and an analysis of their shortcomings. Section 3 outlines the proposed approach, covering preprocessing, feature selection, and the hierarchical classification approach. In Section 4, performance measurements are used to assess the proposed approach; that section also explains the findings and draws parallels with past research. Section 5 concludes by summarising the work and drawing conclusions.

Artificial intelligence and deep learning algorithms are extremely beneficial for using massive data to predict individual outcomes, especially when coupled with electronic health records (EHRs).
The study in [1] used machine learning to increase the prediction accuracy of traditional CVD risk variables in a large UK population. The effectiveness of machine learning techniques on longitudinal EHR data for ten-year cardiovascular event prediction was compared with a gold standard obtained through pooled cohort risk [11].

A classification approach with three basic steps was developed in [12]. In the preprocessing phase, a wavelet approach is used to filter the ECG signal, and fiducial points are then used to locate all heartbeats. Feature engineering extracts different types of features from the time and time-frequency domains, after which recursive feature elimination is used to select features. In the classification step, a hierarchical classifier based on the XGBoost classifier and a threshold produces the final results [12].

The authors of [13] devised a prediction approach based on physical examination markers to categorize hypertension patients. In the first stage, the important features are extracted from the patients' many clinical assessment indicators. In the second stage, these features are used to forecast the patients' outcomes. The authors then proposed a model that combines recursive feature elimination, cross-validation, and a prediction model. Extreme gradient boosting (XGBoost) is reported to forecast patient outcomes successfully by using the best feature subset [13].

The work in [14] proposed a wrapper gene selection strategy with a recursive feature elimination approach for efficient classification. An ensemble technique was applied over several gene selection strategies, and the top-ranking genes from each method were chosen as the final gene subset. Multiple gene selection techniques were combined in that study, and the ideal gene subset was obtained by prioritizing and ranking the most important genes picked by each gene selection approach.
Consequently, the authors concluded that selecting a more discriminative and compact gene subset yielded the best results [14].

The authors of [15] used machine learning algorithms to forecast a patient's stage of cardiac disease. They chose the optimal features using the stochastic gradient boosting technique and recursive feature elimination (RFE). The prediction model was built as an ensemble of weak prediction models, typically decision trees. It is constructed in a stage-wise fashion and generalized by optimizing an arbitrary differentiable loss function.

The authors of [16] presented an AutoML approach for automating the process of developing an AI model that performs well on any dataset. For cardiovascular disease prediction, the approach automates data preprocessing, feature extraction, hyperparameter tuning, and algorithm selection. The authors claimed that their AutoML model removed a significant technical hurdle, allowing doctors to employ AI approaches more widely.

In [17], recursive feature elimination with cross-validation (RFECV) and stability selection (SS) were used to identify the best features of the Single Photon Emission Computed Tomography (SPECT) and Statlog Heart Disease (STATLOG) datasets, and their results were compared. Both approaches were used to improve the performance of tree-based and probability-based machine learning techniques. In RFE, the feature with the lowest score is removed at each step; RFECV extends RFE by adjusting the number of selected features automatically. The SS method returns details about the properties of the output variable. According to the authors, this technique is most useful to professionals in the area when determining the treatment strategy [17].

To estimate response variables more accurately, the gradient boosting approach fits new models sequentially during learning.
The primary idea behind this technique is to construct new base learners that are maximally correlated with the negative gradient of the loss function associated with the whole ensemble [18]. Breiman [19] introduced random forest (RF), an ensemble learning method based on randomized decision trees. The main distinction between RF and decision trees is that, when splitting a node, RF looks for the best feature within a random subset of features, whereas a decision tree looks for the most significant feature overall. This produces a great deal of diversity, which generally leads to a better model. The naive Bayes (NB) classifier is based on Bayes' theorem and assumes that every pair of features is independent given the class. It uses probability theory to find the most probable category, and it is appropriate when the input has high dimensionality [20].

For the Cleveland and Statlog project heart datasets, the authors of [21] proposed a model that predicts heart disease classes based on feature selection. According to them, the random forest algorithm achieves good accuracy with the feature-selected (8 and 6 features) classification models. Sensitivity and specificity also scored higher in that study [21]. The gradient boosting decision tree was used in [22] to estimate blood pressure rates from human physiological data obtained by the EIMO device. Cross-validation was employed to pick suitable parameters and avoid overfitting. The authors also reported that, when considering the features of age, body fat, ratio, and height, this method showed higher accuracy and lower error rates than other algorithms [22]. The work in [23] proposed a framework for predicting risk factors of heart disease using several classifier algorithms and reported that the support vector machine performs best in terms of prediction accuracy, precision, sensitivity, and F1 score [23].
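
The gradient boosting idea reviewed in this section, in which each new base learner is fitted to the negative gradient of the loss [18], can be illustrated with a small sketch. The code below is not the authors' implementation: it assumes a squared-error loss (for which the negative gradient is simply the residual), a single numeric feature, and decision stumps as base learners. The function names `fit_stump`, `gradient_boost`, and `mse`, as well as the toy data, are hypothetical.

```python
# Minimal gradient boosting sketch for squared-error loss with decision stumps.
# Illustrative only: function names and toy data are assumptions, not the paper's code.

def fit_stump(xs, residuals):
    """Fit a one-split regression stump to the residuals (least squares)."""
    best = None
    # Candidate thresholds: every distinct value except the largest,
    # so that both sides of the split are non-empty.
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Return a predictor F(x) = F0 + learning_rate * sum of fitted stumps."""
    f0 = sum(ys) / len(ys)  # initial constant model
    learners = []

    def predict(x):
        return f0 + learning_rate * sum(h(x) for h in learners)

    for _ in range(n_rounds):
        # For squared error, the negative gradient equals the residual y - F(x).
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        learners.append(fit_stump(xs, residuals))
    return predict


def mse(model, xs, ys):
    """Mean square error of a model on the given data."""
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Toy one-dimensional data with a step between x = 4 and x = 5.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.2, 0.9, 1.1, 1.0, 3.1, 2.9, 3.0, 3.2]

baseline = gradient_boost(xs, ys, n_rounds=0)  # constant mean predictor
boosted = gradient_boost(xs, ys, n_rounds=50)

print("baseline MSE:", mse(baseline, xs, ys))
print("boosted MSE:", mse(boosted, xs, ys))
```

The learning rate shrinks each stump's contribution, so the training error falls gradually over the rounds rather than in a single step; this is the sequential fitting behaviour described in the text.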
This section describes the proposed recursive feature elimination and gradient boosting based machine learning classification technique, feature ranking, and the classification and prediction metrics used to evaluate the performance of the proposed model.

The performance of the RFE-GB algorithm is analyzed using the cardiovascular disease dataset retrieved from the Kaggle repository [24]. The dataset consists of seventy thousand patient records with eleven features and a target that classifies patients as CVD or non-CVD. The eleven attributes are gender, age, height, weight, systolic blood pressure, diastolic blood pressure, glucose, cholesterol, smoking behavior, physical activity, and alcohol intake. Table 1 summarizes the sample CVD dataset features and their values.

The recursive feature elimination algorithm selects the features of the training dataset that are most relevant for predicting the target variable. RFE is popular because it is good at identifying which features in a training dataset are more or less important in predicting the target variable. RFE is a wrapper-style feature selection algorithm that internally employs filter-based feature selection. It operates by searching for a subset of features in the training dataset, beginning with all features and successively removing features until the target number remains [25, 26, 27, 28]. This work builds a model with the predictors and computes an importance score for each predictor. The predictors with the lowest importance are removed; the model is then rebuilt and the scores are computed again. The number of predictor subsets to evaluate and their sizes are specified as a tuning parameter. The optimal subset can then be used to train the model.
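
The elimination loop described above can be sketched with scikit-learn. This is a hedged illustration rather than the authors' code: the paper does not name its software library, so the use of scikit-learn's `RFE` and `GradientBoostingClassifier` is an assumption, and the data here is a synthetic stand-in generated with `make_classification` (eleven features, binary target), not the Kaggle CVD dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in for the CVD data (assumption): 11 features, binary target.
X, y = make_classification(
    n_samples=500,
    n_features=11,
    n_informative=3,
    n_redundant=0,
    random_state=0,
)

# RFE repeatedly fits the estimator, ranks features by feature_importances_,
# and drops the weakest feature (step=1) until three features remain.
selector = RFE(
    estimator=GradientBoostingClassifier(n_estimators=50, random_state=0),
    n_features_to_select=3,
    step=1,
)
selector.fit(X, y)

# support_ is a boolean mask of kept features; ranking_ is 1 for kept features
# and grows with how early a feature was eliminated.
selected = [i for i, keep in enumerate(selector.support_) if keep]
print("selected feature indices:", selected)
print("ranking (1 = kept):", selector.ranking_.tolist())
```

On the real dataset, the selected indices would be mapped back to column names; the choice of three features to keep mirrors the number reported in the paper's conclusion, while the estimator settings are illustrative.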
Thus, the RFE algorithm yields a group of top-ranked features that can be considered for feature selection [29]. The dataset has been tested with several subsets of features. RFE selects the top-ranked features from the cardiovascular disease dataset to classify cardio and non-cardio patients with reduced errors. To rank the features, the ranking criterion is determined as the separating hyperplane with the largest margin. For a set of training samples, the decision function is given in Equation (1),

$$ f(x) = w \cdot x + b \qquad (1) $$

where $b$ is the bias term and $w$ is the weight vector, which can be obtained by using Equation (2),

$$ w = \sum_{i=1}^{n} \alpha_i y_i x_i \qquad (2) $$

where $\alpha_i$ are Lagrange multipliers, $x_i$ is the $i$-th training sample, $y_i \in \{-1, 1\}$ is its class label, and $i = 1, \ldots, n$.

[...] iteration, potentially affecting accuracy and causing the correlation bias problem [30]. Figure 2 depicts the feature selection and ranking strategy.

Therefore, gradient boosting will fit $h$ to the residual $y - F_m(x)$ of the current model $F_m$ and gives the classification result as to whether the patient is a CVD or non-CVD patient.

In this proposed work, 70% of the CVD dataset is used as training data and 30% as testing data. To measure the performance of the proposed RFE-GB model, the following metrics are considered: recall, F1-score, precision, confusion matrix, RMSE, AUC-ROC, Cohen's kappa, and MSE. During error analysis, cohorts of data with higher error rates are identified. Cohen's kappa score is well suited to multi-class and imbalanced-class problems. Its value is at most one, with zero indicating agreement no better than chance, and it is derived using Equation [...] Positives. Its way of calculation is given in Equation (7). The F1-score is the harmonic mean of recall and precision, and it is presented in Equation (8), where $TP$, $FP$, and $FN$ denote the numbers of true positives, false positives, and false negatives.

$$ \text{Precision} = \frac{TP}{TP + FP} $$

$$ \text{Recall} = \frac{TP}{TP + FN} $$

$$ \text{F1-score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (8) $$

The evaluation uses k-fold cross-validation, with k set to 10 in this work; it is therefore referred to as a 10-fold cross-validation resampling process. The 10-fold cross-validation approach is designed to reduce the prediction model's bias.
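
As a worked illustration of the evaluation metrics and the 10-fold resampling described above, the following sketch computes precision, recall, F1-score, and Cohen's kappa from binary predictions and generates k-fold index splits. It uses the standard textbook definitions (kappa as observed agreement corrected for chance agreement), which may differ in form from the paper's lost equations. The function names and the toy labels are hypothetical, and the folds are contiguous and unshuffled for simplicity.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels, where 1 marks a CVD case."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def classification_metrics(y_true, y_pred):
    """Compute precision, recall, F1-score, and Cohen's kappa.

    Assumes at least one predicted and one actual positive, and that
    chance agreement is below 1 (no guard against division by zero).
    """
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    n = tp + tn + fp + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    # Kappa: observed agreement corrected for agreement expected by chance.
    p_observed = (tp + tn) / n
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (p_observed - p_chance) / (1 - p_chance)
    return {"precision": precision, "recall": recall, "f1": f1, "kappa": kappa}


def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for k contiguous, unshuffled folds."""
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Toy labels (hypothetical): 1 = CVD, 0 = non-CVD.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)

# 10-fold split over a small stand-in for the 70,000 records.
folds = list(k_fold_indices(70, k=10))
print("fold sizes:", [len(test) for _, test in folds])
```

In practice the records would be shuffled (and usually stratified by class) before splitting; each fold then serves once as the held-out test set while the model is trained on the remaining nine.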
The effectiveness of machine learning prediction algorithms is often measured using a set of classification metrics. In this study, the prediction error rates are quantified using the mean square error (MSE), root mean square error (RMSE), and kappa score. The confusion matrix and the area under the receiver operating characteristic curve (ROC AUC) are used to analyze the true/false positive/negative rates of the predictions. The prediction performance of the machine learning algorithms is measured by prediction accuracy, precision, recall, and F1 score [18, 20, 23]. Importantly, this study determines the accuracy with which the above-mentioned machine learning algorithms predict whether or not a patient has cardiovascular disease. Depending on its hyperparameters, each classification model has a distinct disease prediction accuracy and efficiency compared with other prediction models. In this study, 70% of the dataset is used for training, and 30% of the data samples are used to test the classification methods. The proposed disease [...]

Further, the F1-score of the proposed RFE-GB algorithm is improved by 11% to 43%. Cohen's kappa score for the proposed and existing machine learning algorithms is depicted in Figure 6; as the graph shows, the proposed RFE-GB method has a higher kappa score than the traditional methods. Cohen's kappa score assures that the classification algorithm's [...] Bayes, extra trees, decision trees, and radial basis function, respectively. The confusion matrix for the various machine learning algorithms is shown in Figure 8.

Much of what is required to forecast and diagnose any disease effectively using machine learning remains worth researching. In this work, a recursive feature elimination-based gradient boosting algorithm has been proposed to select the most important features from the cardiovascular disease dataset. The RFE-GB algorithm selects an optimal set of three features, namely blood pressure, cholesterol, and physical activity, from the 12 features.
Adopting these three features, a gradient boosting ensemble approach has been developed to predict cardiovascular disease cases. The proposed RFE-GB algorithm has been evaluated with various metrics, and its performance has been compared with that of different machine learning algorithms. Among them, the proposed RFE-GB algorithm achieves 13.36% to 31.48% improved accuracy as compared to LDA 57, with a reduced MSE error rate of 0.1924 in predicting cardio disease cases. The proposed RFE-GB algorithm accurately estimates 88 percent of true positives and 84 percent of true negatives from 70,000 patient records, with an AUROC score of 84%. Based on these findings, the proposed RFE-GB algorithm appears to be capable of diagnosing and classifying cardiovascular disease patients.

[1] A hierarchical method based on weighted extreme gradient boosting in ECG heartbeat classification. Computer methods and programs in biomedicine
[2] Predictive Analysis of Heart Disease using Stochastic Gradient Boosting along with Recursive Feature Elimination
[3] Probable Forecasting of Epidemic COVID-19 in Using COCUDE Model
[4] A Machine-Learning-Based Prediction Method for Hypertension Outcomes Based on Medical Data
[5] Forecasting hyponatremia in hospitalized patients using multilayer perceptron and multivariate linear regression techniques
[6] Binary cross entropy with deep learning technique for image classification
[7] A heart sound classification method based on joint decision of extreme gradient boosting and deep neural network
[8] Elimination and Backward Selection of Features (P-Value Technique) In Prediction of Heart Disease by Using Machine Learning Algorithms
[9] High-performance in classification of heart disease using advanced supercomputing technique with cluster-based enhanced deep genetic algorithm
[10] Prediction of COVID-19 Possibilities using KNN Classification Algorithm
[11] Learning from Longitudinal Data in Electronic Health Record and Genetic Data to Improve Cardiovascular Event Prediction
[12] A hierarchical method based on weighted extreme gradient boosting in ECG heartbeat classification
[13] A Machine-Learning-Based Prediction Method for Hypertension Outcomes Based on Medical Data
[14] WERFE: A Gene Selection Algorithm Based on Recursive Feature Elimination and Ensemble Strategy
[15] Predictive Analysis of Heart Disease using Stochastic Gradient Boosting along with Recursive Feature Elimination
[16] Physician-Friendly Machine Learning: A Case Study with Cardiovascular Disease Risk Prediction
[17] A Study on Performance Improvement of Heart Disease Prediction by Attribute Selection Methods
[18] Greedy Function Approximation: A Gradient Boosting Machine
[19] Random forests
[20] Diagnosis and Classification of the Diabetes Using Machine Learning Algorithms
[21] Classification and Feature Selection Approaches by Machine Learning Techniques: Heart Disease Prediction
[22] Health Data Driven on Continuous Blood Pressure Prediction Based on Gradient Boosting Decision Tree Algorithm. Special Section On Data-Enabled Intelligence For Digital Health
[23] Ambient assisted living predictive model for cardiovascular disease prediction using supervised learning
[25] A dynamic recursive feature elimination framework (dRFE) to further refine a set of OMIC biomarkers
[26] Operator functional state classification using least-square support vector machine based recursive feature elimination technique
[27] Determination of optimal heart rate variability features based on SVM-recursive feature elimination for cumulative stress monitoring using ECG sensor
[28] Comparing different feature selection algorithms for cardiovascular disease prediction
[29] A decision support system for heart disease prediction based upon machine learning
[30] Feature selection and analysis on correlated gas sensor data with recursive feature elimination
[31] A gentle introduction to gradient boosting
[32] Prognostic Analysis of Hyponatremia for Diseased Patients Using Multilayer Perceptron Classification Technique