key: cord-0077930-xhvhvay6 authors: Karthick, K.; Aruna, S. K.; Samikannu, Ravi; Kuppusamy, Ramya; Teekaraman, Yuvaraja; Thelkar, Amruth Ramesh title: Implementation of a Heart Disease Risk Prediction Model Using Machine Learning date: 2022-05-02 journal: Comput Math Methods Med DOI: 10.1155/2022/6517716 sha: d6524f6e12f94a4885e01036d5c5949ffdcb6f3d doc_id: 77930 cord_uid: xhvhvay6
Cardiovascular disease prediction aids practitioners in making more accurate health decisions for their patients. Early detection can help people make lifestyle changes and, if necessary, ensure effective medical care. Machine learning (ML) is a plausible option for reducing and understanding the symptoms of heart disease. The chi-square statistical test is performed to select specific attributes from the Cleveland heart disease (HD) dataset. Support vector machine (SVM), Gaussian Naive Bayes, logistic regression, LightGBM, XGBoost, and random forest algorithms have been employed to develop a heart disease risk prediction model, obtaining accuracies of 80.32%, 78.68%, 80.32%, 77.04%, 73.77%, and 88.5%, respectively. Data visualizations have been generated to illustrate the relationships between the features. According to the findings of the experiments, the random forest algorithm achieves 88.5% accuracy during validation for 303 data instances with 13 selected features of the Cleveland HD dataset.
According to WHO data, heart disease is the leading cause of mortality globally, resulting in 17.9 million deaths annually [1]. The main behavioural risk factors for cardiovascular disease and stroke are unhealthy diet, lack of physical activity, smoking, and alcohol consumption [1]. A heart attack occurs when the heart's blood circulation is obstructed by plaque build-up in the arteries. A thrombus in an artery causes a stroke by impeding blood flow to the brain [2].
The symptoms are common to other illnesses and might be confused with indicators of ageing, making diagnosis difficult for practitioners. Accurate prediction and timely identification of cardiac disease are essential for improving the patient survival rate. Because of the increased collection of medical data, practitioners now have a great opportunity to improve healthcare diagnosis. ML plays a vital role in many applications such as text detection and recognition [3], early prediction [4], power quality disturbance detection [5], truck traffic classification [6], and agriculture [7]. ML has now become an essential tool in the healthcare sector to aid with patient diagnosis. The current methods for predicting and diagnosing cardiac disease depend mostly on practitioners' evaluation of a patient's medical history, signs, and physical assessment reports. Nowadays, patient information with clinical reports is widely accessible in healthcare databases, and its volume is rising rapidly. In this article, the UCI ML repository's Cleveland HD dataset was used to develop a prediction model for heart disease. The model is trained to learn patterns from the features already present in the dataset. Classification is an effective supervised ML approach for prediction and, when properly trained with adequate data, for identifying disease [8]. The primary goal of this work is to employ contemporary ML techniques to construct a predictive model for heart disease. SVM with radial basis function (RBF) kernel, Gaussian Naive Bayes, logistic regression, LightGBM, XGBoost, and random forest algorithms were applied to the Cleveland HD dataset, and the best performing prediction model for early diagnosis of heart disease was identified.
The accuracies of predictive models based on the Naive Bayes, random forest, PART, C4.5, and multilayer perceptron algorithms on an HD dataset were found to be in the range of 75.58%-83.17% [9]. Among these, the Naive Bayes algorithm has the highest accuracy at 83.17%, while the other algorithms have less than 80% accuracy [9]. Kumar et al. found that the random forest ML classifier had an 85 percent precision for cardiovascular disease [10]. Gudadhe et al. [11] described a framework for predicting heart disease using SVM and obtained an accuracy of 80.41%. Kahramanli and Allahverdi [12] established an artificial and fuzzy-based model that combines fuzzy and crisp values in health data and attained accuracies of 84.24% on the Pima Indian diabetes dataset and 86.8% on the Cleveland HD dataset. Various ML classification models [13-17] could be used to improve the intelligence of such systems. Olaniyi et al. [18] established a prediction model and achieved an accuracy of 85% using a feedforward multilayer perceptron (MLP) and 87.5% using SVM on the UCI ML datasets. Polat et al. [19] employed the k-nearest neighbour algorithm and an artificial immune recognition framework and achieved 87% accuracy on the Cleveland dataset. On the Cleveland dataset, Detrano et al. [20] achieved 77% accuracy using the logistic regression algorithm. Saw et al. [21] implemented an improved logistic regression classification model for a heart disease dataset. The fast decision tree and the C4.5 tree have been employed for HD prediction [22]. In the initial phase of that model, trees and features were extracted.
A genetic and fuzzy logic-based approach has been proposed [23]; it is a hybrid model that automatically generates rules using a fitness function, appropriate genetic operators, and a rule encoding method. In this article, SVM with RBF kernel, Gaussian Naive Bayes, logistic regression, LightGBM, XGBoost, and random forest algorithms were employed to evaluate the classification accuracy on the UCI ML repository's Cleveland HD dataset [24]. Data visualization has also been carried out to illustrate the relationships between the features.
3.1. Data. The UCI ML repository's Cleveland HD dataset was used in this investigation [24]. As indicated in Table 1, a subset of 13 attributes was used to predict heart disease with 303 data instances. Table 1 lists the attributes used in the proposed classification model together with their descriptions. The clinical variables considered essential are given in the attribute column of Table 1 and were chosen using the chi-square (chi2) feature selection method [25]. To develop the heart risk prediction model, the remaining 61 attributes of the dataset were excluded to improve the accuracy of the model. Except for 0, all target values from 1 to 4 were considered to indicate a risk of cardiovascular disease when developing the model. The classification model therefore has two classes, 0 and 1; the target values 1 to 4 were mapped to 1 during preprocessing.
3.2. Feature Selection. The statistical overview of the subset attributes is shown in Table 2 for the 303 instances. The "count" shows how many nonempty rows a feature has. The "mean" is the feature's average value, and "std" is its standard deviation. The "min" is the feature's minimum value. The 25%, 50%, and 75% entries are the percentiles (quartiles) of each feature. The maximum value of the attribute is indicated by "max."
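The preprocessing and chi-square selection described above can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the data frame is synthetic (its column names borrow a few Cleveland attribute names, and the `noise` column is invented), the sample size of 200 and `k=3` are hypothetical choices for brevity, and only scikit-learn's `SelectKBest` and `chi2` are taken from the text.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
n = 200  # hypothetical sample size, not the 303 Cleveland records

# Synthetic stand-in for the dataset; all values are illustrative only.
df = pd.DataFrame({
    "age": rng.integers(29, 78, n),
    "cp": rng.integers(0, 4, n),
    "chol": rng.integers(126, 565, n),
    "thalach": rng.integers(71, 203, n),
    "noise": rng.integers(0, 5, n),    # invented column with no meaning
    "target": rng.integers(0, 5, n),   # raw labels 0..4
})

# Map target values 1-4 to the single positive class 1; 0 stays 0.
df["target"] = (df["target"] > 0).astype(int)

# Summary statistics (count, mean, std, min, quartiles, max) per column.
print(df.describe())

X = df.drop(columns="target")
y = df["target"]

# chi2 requires nonnegative features; keep the k highest-scoring ones.
selector = SelectKBest(score_func=chi2, k=3)  # k=3 is an assumption
X_selected = selector.fit_transform(X, y)
selected_names = list(X.columns[selector.get_support()])
print("Selected features:", selected_names)
```

With real data, `k` would be set to the number of attributes to retain (13 in this study), and `selector.scores_` can be inspected to rank the attributes.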
Statistical tests are useful in determining which attributes have the strongest relationship with the output variable. The "SelectKBest" class in Python's scikit-learn library is used to select a specified number of attributes based on a statistical test. For the nonnegative features in this dataset, the chi-square (chi2) test was used to pick the 13 best features.
3.3. Dataset Visualization. The data visualization of features such as gender, chest pain category, and fasting blood sugar level of the Cleveland heart dataset is shown in Figure 1. According to this Cleveland dataset, males are more likely than females to have heart disease. The majority of individuals with cardiovascular disease fall into the asymptomatic chest pain category. Figure 2 depicts a heat map of the subset attributes, which serves as an instant visual summary. Thalassemia is a genetic disorder that causes people to have lower haemoglobin levels than normal. Haemoglobin allows erythrocytes to transport oxygen. Figure 3 illustrates the distribution of thalach, chol, and trestbps, and the number of people suffering from cardiovascular disease, by age. Cardiovascular disease is quite common in people over the age of 60, as well as in adults aged 41 to 60. However, it is uncommon in the 19-year to 40-year-old age category and extremely uncommon in the 0-year to 18-year-old age category. Figure 4 shows the correlation between attributes such as thalach and chol, age and target, age and ca, thalach and cp, and oldpeak and exang with respect to target. Figure 5 shows the pair plot, which is useful for quickly exploring distributions and relationships between the attributes. In adults, total cholesterol levels below 200 mg/dL are generally preferred. Levels in the range 200-239 mg/dL are regarded as borderline high, and levels of 240 mg/dL and above are regarded as high. An HDL cholesterol level below 40 mg/dL is considered a risk factor for HD.
A level of 41 mg/dL to 59 mg/dL is considered borderline low. An HDL level of 60 mg/dL or above is considered optimal.
To evaluate the heart disease risk prediction, six ML classifiers were used: SVM with RBF kernel, Gaussian Naive Bayes, logistic regression, LightGBM, XGBoost, and random forest.
4.1. Support Vector Machine. The SVM [26] classifier uses the RBF kernel, a function that turns a nonlinear problem into a linear problem in a higher-dimensional space. The RBF kernel in the SVM classification algorithm is defined as
$$K(x, x') = \exp\left(-\gamma \,\lVert x - x' \rVert^{2}\right)$$
where $\lVert x - x' \rVert^{2}$ is the squared Euclidean distance between two feature vectors and $\gamma$ is a scalar.
4.2. Gaussian Naive Bayes. Gaussian Naive Bayes is a classification algorithm in which, here, the 13 features are assumed to be stochastically independent for every class c, and the class-conditional likelihood used for prediction is given as
$$P(x_i \mid c_j) = \frac{1}{\sqrt{2\pi\sigma_{i,j}^{2}}} \exp\left(-\frac{(x_i - \mu_{i,j})^{2}}{2\sigma_{i,j}^{2}}\right)$$
where $\mu_{i,j}$ is the mean and $\sigma_{i,j}$ is the standard deviation (root-mean-square deviation) of feature $x_i$ within class $c_j$.
4.3. Logistic Regression. The logistic regression model estimates the probability of the positive class as
$$P(y_i = 1 \mid X_i) = \frac{1}{1 + \exp\left(-(\alpha + \beta^{T} X_i)\right)}$$
where $\alpha$ is the intercept, $\beta$ is the slope vector, and $D_n = \{(X_i, y_i),\ i = 1, 2, 3, \cdots, n\}$ is the set of $n$ independent observations, here the 303 data instances.
4.4. LightGBM. LightGBM [27] is a gradient-based boosting approach which makes use of tree-based learning methods. The pseudocode of the algorithm is given below.
4.5. XGBoost. The XGBoost algorithm is adopted from [28], and the pseudocode of the algorithm is given below.
4.6. Random Forest. The random forest [29] constructs multiple decision trees, and the pseudocode of the algorithm is given below.
The Cleveland HD dataset is split into training and testing sets with a ratio of 80 : 20. The classification model accuracy is evaluated using performance metrics derived from the confusion matrix and is expressed as
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
where TP stands for true positive, TN stands for true negative, FP stands for false positive, and FN stands for false negative. Table 3 gives the testing set and training set accuracy in % for all six classifier models. Figure 6 depicts the accuracy of all models graphically. Figures 7 and 8 show the confusion matrices and receiver operating characteristic (ROC) curves for all six ML classification models.
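The evaluation procedure described above (80 : 20 split, accuracy from the confusion matrix, and area under the ROC curve) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the data are generated with scikit-learn's `make_classification` instead of the Cleveland records, all hyperparameters and the `random_state` values are arbitrary, and LightGBM and XGBoost are left out so that the example depends only on scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic stand-in: 303 instances with 13 features (values are not real).
X, y = make_classification(n_samples=303, n_features=13,
                           n_informative=6, random_state=0)

# 80 : 20 split into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Four of the six classifiers; hyperparameters are assumptions.
models = {
    "SVM (RBF)": SVC(kernel="rbf", probability=True, random_state=0),
    "Gaussian NB": GaussianNB(),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Unpack the 2x2 confusion matrix in scikit-learn's row-major order.
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # AUC uses the predicted probability of the positive class.
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    results[name] = (accuracy, auc)
    print(f"{name}: accuracy={accuracy:.3f}, AUC={auc:.3f}")
```

Because the data here are synthetic, the printed numbers say nothing about the accuracies in Table 3; the sketch only shows how such figures are computed.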
The validation indicates that the random forest algorithm provides the best prediction accuracy. The test set prediction accuracy of the random forest algorithm is 88.5%, with an area under the ROC curve of 0.92, for the selected 13 attributes of the 303 data instances of the UCI ML repository's Cleveland HD dataset. The area under the curve (AUC) summarizes the receiver operating characteristic (ROC) curve and indicates a classifier's ability to differentiate between classes. The greater the AUC, the better the model is at discriminating between the positive and negative classes.
The six ML classification algorithms, namely, SVM with RBF kernel, Gaussian Naive Bayes, logistic regression, LightGBM, XGBoost, and random forest, were applied to the UCI ML repository's Cleveland HD dataset, and a prediction model has been developed for cardiovascular disease. The random forest algorithm provides the best accuracy at 88.5%, followed by SVM and logistic regression, which each provide 80.32% accuracy, for the 13 attributes selected using the chi-square test. In this classification model, a total of 303 data instances have been used. In future, various heart disease datasets from health data repositories can be combined, and the best performing classification model using contemporary machine learning models can be identified.
The dataset is available in a publicly accessible database.
Knowledge of signs and symptoms of heart attack and stroke among Singapore residents
Development and evaluation of the bootstrap resampling technique based statistical prediction model for Covid-19 real time data: a data driven approach
Estimation of reproduction number and early prediction of 2019 novel coronavirus disease (COVID-19) outbreak in India using statistical computing approach
Power quality disturbance detection using machine learning algorithm
Machine learning of truck traffic classification groups from weigh-in-motion data
Machine learning in agriculture domain: a state-of-art survey
Comparing different supervised machine learning algorithms for disease prediction
Improving the accuracy of prediction of heart disease risk based on ensemble classification techniques
Analysis and prediction of cardio vascular disease using machine learning classifiers
Decision support system for heart disease based on support vector machine and artificial neural network
Design of a hybrid system for the diabetes and heart diseases
InstaCovNet-19: a deep learning classification model for the detection of COVID-19 patients using chest X-ray
Machine learning: algorithms, real-world applications and research directions
Deep transfer learning based classification model for COVID-19 disease
A machine learning approach of predicting high potential archers by means of physical fitness indicators
Heart disease classification comparison among patients and normal subjects using machine learning and artificial neural network techniques
Heart diseases diagnosis using neural networks arbitration
Automatic detection of heart disease using an artificial immune recognition system (AIRS) with fuzzy resource allocation mechanism and k-nn (nearest neighbour) based weighting preprocessing
International application of a new probability algorithm for the diagnosis of coronary artery disease
Estimation of prediction for getting heart disease using logistic regression model of machine learning
Feature analysis of coronary artery heart disease data sets
Evolving rules using genetic fuzzy approach - an educational case study
The chi-square test: often used and more often misinterpreted
Support vector machine classification algorithm and its application
LightGBM: a highly efficient gradient boosting decision tree
Xgboost: a scalable tree boosting system
Random forest algorithm for the classification of neuroimaging data in Alzheimer's disease: a systematic review
The authors declare no conflict of interest.
Karthick was responsible for the conceptualization and data curation and wrote the original draft; Aruna was responsible for the investigation and methodology supervision; Ravi Samikannu carried out formal analysis; Ramya Kuppusamy and Yuvaraja Teekaraman wrote, reviewed, and edited the manuscript; Amruth Ramesh Thelkar carried out methodology validation.