key: cord-0032052-lyat4mh4
authors: Yuan, Xiaohan; Chen, Shuyu; Sun, Chuan; Yuwen, Lu
title: A novel early diagnostic framework for chronic diseases with class imbalance
date: 2022-05-21
journal: Sci Rep
DOI: 10.1038/s41598-022-12574-x
sha: e479588da2d8a312ec3afa61bf0319af9fa44aa4
doc_id: 32052
cord_uid: lyat4mh4

Chronic diseases are among the most severe health issues in the world because of their clinical presentation: a long onset cycle, insidious symptoms, and various complications. Recently, machine learning has become a promising technique to assist the early diagnosis of chronic diseases. However, existing works ignore the problems of feature hiding and imbalanced class distribution in chronic disease datasets. In this paper, we present a universal and efficient diagnostic framework that alleviates these two problems so that chronic diseases can be diagnosed in a timely and accurate manner. Specifically, we first propose a network-limited polynomial neural network (NLPNN) algorithm to efficiently capture high-level features hidden in chronic disease datasets; it acts as data augmentation in the feature space and also avoids over-fitting. Then, to alleviate the class imbalance problem, we further propose an attention-empowered NLPNN (AEPNN) algorithm to improve the diagnostic accuracy for sick cases; it acts as data augmentation in the sample space. We evaluate the proposed framework on nine public and two real chronic disease datasets (some with class imbalance). Extensive experimental results demonstrate that the proposed diagnostic algorithms outperform state-of-the-art machine learning algorithms in terms of accuracy, recall, F1, and G_mean. The proposed framework can help to diagnose chronic diseases accurately at an early stage.

• We study a universal and efficient diagnostic framework to make timely and accurate early diagnosis of chronic diseases with small-scale datasets.
• We propose an NLPNN algorithm that avoids over-fitting, efficiently captures high-level features hidden in chronic disease datasets, and achieves high classification accuracy.
• We further propose an AEPNN algorithm to solve the class imbalance problem, which greatly improves the recall of the diagnostic model; that is, it diagnoses sick cases more accurately.
• We evaluate and compare the proposed methods against other state-of-the-art methods using nine chronic disease datasets (some with class imbalance). Extensive experimental results demonstrate that the two proposed diagnostic models outperform state-of-the-art machine learning algorithms and achieve superior accuracy and recall.

The rest of the paper is organized as follows. We discuss related work in the "Related work" section. The "Diagnostic framework for chronic diseases" section presents the proposed algorithms, and experimental results are shown in the "Experimental results" section. Finally, the "Conclusion" section concludes this paper.

Early diagnosis of chronic diseases. Several machine learning algorithms have been proposed to diagnose specific chronic diseases 36-38. Heydari et al. 36 compared the performance of various machine learning classification algorithms in the early diagnosis of type 2 diabetes. The simulation results showed that the performance of classification techniques depends on the nature and complexity of the dataset. Khan et al. 37 developed a chronic disease risk prediction framework. To reduce the impact of outliers, Alirezaei et al. 38 combined K-means clustering, SVM, and a meta-heuristic algorithm to diagnose diabetes. However, these works ignored the influence of data distribution and structural changes on the generalization performance of the model.
Without changing the structure and distribution of the data, the authors in 13 proposed a diagnostic model, and the authors in 39 used a fused hierarchical neural network (FHNN) method for the stratified diagnosis of cardiovascular disease (CVD). However, the performance of FHNN mainly depends on the optimal choice of the sub-neural network. Tama et al. 20 comprehensively studied tree-based ensemble learning techniques for the early detection of diabetes and evaluated the performance differences between classification methods through statistical significance tests. Altan et al. 40 also compared various machine learning algorithms for the early diagnosis of chronic obstructive pulmonary disease and proposed a deep learning model that analyzes multi-channel lung sounds using statistical features of the Hilbert-Huang transform, achieving an accuracy, sensitivity, and specificity of 93.67%, 91%, and 96.33%, respectively.

Class imbalance. In medical datasets, class imbalance seriously affects the accuracy of classifiers 24, 27. In most cases, it directly leads to a high rate of misdiagnosis. This is because class imbalance in the training data makes learning harder, and the algorithm pays more attention to the majority class 41. However, the minority class in medical datasets (sick vs. healthy) is often more important from a data mining perspective, and it usually carries critical and useful knowledge. Many scholars have studied the class imbalance problem, and there are three main methods to alleviate it 42, 43.
(1) Data-level methods: in the data preprocessing stage, re-sampling is used to reduce the size of the majority class or increase the size of the minority class (or both) to balance the training set and eliminate the difference between class sizes.
(2) Algorithm-level methods: in the training phase, the learning algorithm is modified to be suitable for mining data with imbalanced distributions.
(3) Hybrid methods: the advantages of the first two methods are combined to alleviate the adverse effects of class imbalance on the results.

Statement: I confirm that all methods were performed in accordance with the relevant guidelines and regulations.

In this section, we propose a universal and efficient diagnostic framework for the timely and accurate diagnosis of chronic diseases. The proposed framework consists of the NLPNN algorithm and the AEPNN algorithm, which alleviate the problems of feature hiding and class imbalance, respectively.

Network-limited polynomial neural network. The PNN algorithm learns a high-level polynomial feature representation of the data through a multi-layer network architecture and outputs the features hierarchically 32, 33. Although the PNN algorithm has been proven to run in polynomial time, it has a limitation: the depth and width of the network cannot be controlled. Both are adaptive, and the network keeps deepening until the training error is zero 35. In the worst case, the network can deepen without bound, or its width can grow as large as the number of training samples $n$, which leads to severe over-fitting. Hence, we present an NLPNN algorithm for the early diagnosis of chronic diseases to avoid this issue. The structure of NLPNN is shown in Fig. 1a, and the details of the NLPNN algorithm applied to chronic disease diagnosis are described below.

For the early diagnosis of chronic diseases, we denote the labeled training dataset as $D = (X, y)$, where $X \in \mathbb{R}^{n \times d}$ is the set of $n$ samples with $d$ features, and $y = (y_1, y_2, \ldots, y_n)^T$ is an $n$-dimensional column vector with $y_i \in \{-1, 1\}$ for all $i = 1, 2, \ldots, n$. Here, $y_i = 1$ means that the $i$-th sample is labeled as a sick case, and $y_i = -1$ otherwise.
The $M$-order multivariate polynomial on the sample $x_i = (x_{i1}, \ldots, x_{id}) \in X$ is written as
$$p(x_i) = \sum_{j=0}^{M} \sum_{s \in S_j} w_s\, s(x_i), \tag{1}$$
where $S_j$ denotes the set of monomials of degree $j$ in the $d$ features, so that each monomial $s$ in the inner sum is of degree $j$, and $w_s$ is its coefficient. The value of each polynomial $p$ on the $n$ samples is represented by the linear projection
$$p \mapsto (p(x_1), \ldots, p(x_n))^T. \tag{2}$$
According to linear algebra, there are $n$ polynomials $p_1, \ldots, p_n$ such that the vectors $(p_i(x_1), \ldots, p_i(x_n))^T$, $i = 1, \ldots, n$, form a basis of $\mathbb{R}^n$. Therefore, there is a coefficient vector $\nu = (\nu_1, \ldots, \nu_n)$ such that $\sum_{i=1}^{n} \nu_i p_i(x_j) = y_j$ for every $y_j$ in $(y_1, \ldots, y_n)^T \in \mathbb{R}^n$.

The network layers of PNN are constructed by solving for this polynomial basis hierarchically, and each node computes a linear function or a weighted product of its inputs. We denote the $j$-th node of the $i$-th layer as $\eta^i_j(\cdot)$, which represents a feature (original or high-level) of the input data. For the first layer, the $j$-th node is the degree-1 polynomial (linear) function $\eta^1_j(x) = [1\ x]\,w_j$, and the vectors $(\eta^1_j(x_1), \ldots, \eta^1_j(x_n))^T$, $j = 1, \ldots, d+1$, are a basis of all values attained by degree-1 polynomials on the training dataset. They form the columns of the matrix $F^1 \in \mathbb{R}^{n \times (d+1)}$ with $F^1_{i,j} = \eta^1_j(x_i)$. At this point a single-layer network has been constructed, and its outputs span all values attained by linear functions on the training samples.

In principle, bases for polynomials of degree $2, 3, \ldots, M$ can be obtained in the same way. However, the basis of degree-$M$ multivariate polynomials is composed of $(d+1)^M$ vectors, so its size grows exponentially with the degree and quickly becomes computationally infeasible. The work in 35 indicates that any degree-$m$ polynomial can be written as
$$\sum_i g_i(x)\, h_i(x) + k(x), \tag{3}$$
where $g_i(x)$ and $h_i(x)$ are degree-1 and degree-$(m-1)$ polynomials, respectively, and $k(x)$ is a polynomial of degree not greater than $m-1$. Since all degree-1 polynomials are spanned by the nodes of the first layer of PNN, any degree-2 polynomial can be written as
$$\sum_i \Big(\sum_j \alpha^{(g_i)}_j \eta^1_j(x)\Big)\Big(\sum_r \alpha^{(h_i)}_r \eta^1_r(x)\Big) + \sum_j \alpha^{(k)}_j \eta^1_j(x), \tag{4}$$
where $\alpha^{(g_i)}_j$, $\alpha^{(h_i)}_r$, and $\alpha^{(k)}_j$ are scalar multipliers.
Equation (4) implies that the second layer of the network is built on the first layer. The matrix $[F^1\ \tilde{F}^2]$, formed by concatenating the columns of $F^1$ and $\tilde{F}^2$, spans all values attainable by degree-2 polynomials, where
$$\tilde{F}^2 = \big[(F^1_1 \circ F^1_1)\ \cdots\ (F^1_1 \circ F^1_{|F^1|})\ \cdots\ (F^1_{|F^1|} \circ F^1_{|F^1|})\big], \tag{5}$$
the symbol $\circ$ denotes the Hadamard product, $F_1$ refers to the first column of $F$, and $|F|$ refers to the number of columns of $F$. As for degree-1 polynomials, a column subset $F^2$ of $\tilde{F}^2$ must be found such that the columns of $[F^1\ F^2]$ are a basis of the columns of $[F^1\ \tilde{F}^2]$. The second layer of the PNN is built from the columns of $F^2$, each of which is the product of two nodes $\eta^1_i(\cdot)$ and $\eta^1_j(\cdot)$ of the first layer.

The next step is to repeat the above process, so that layers $m = 3, 4, \ldots, M$ of the network are constructed successively. The candidate matrix is written as
$$\tilde{F}^m = \big[(F^{m-1}_1 \circ F^1_1)\ \cdots\ (F^{m-1}_1 \circ F^1_{|F^1|})\ \cdots\ (F^{m-1}_{|F^{m-1}|} \circ F^1_{|F^1|})\big]. \tag{6}$$
We then find a linearly independent column subset $F^m$ of $\tilde{F}^m$ such that the columns of the matrix $[F\ F^m]$ are a basis of the columns of the augmented matrix $[F\ \tilde{F}^m]$, where the columns of $F = [F^1\ F^2\ \ldots\ F^{m-1}]$ span the values attained by all polynomials of degree at most $m-1$ over the training dataset. The conversion of $\tilde{F}^m$ to $F^m$ is achieved through a projection matrix $W \in \mathbb{R}^{|F^{m-1}| \times |F^1|}$ with entries $W_{i(s),j(s)} = \sqrt{n} / \|\tilde{F}^m_s\|$. Therefore, once the $M$-layer network of the PNN is constructed, all values attained by polynomials of degree at most $M$ over the training dataset are spanned by the columns of the matrix $F$. In effect, $F$ stores the high-level features of the input data: the deeper the layer, the higher the level of the feature.

For the implementation of NLPNN, however, we use a parameter $\Omega = (d+1, \ldots, d+1) \in \mathbb{Z}^M$ to pre-limit the depth and width of the network: the network consists of $M$ (that is, $|\Omega|$) non-output layers, and each layer has at most $d+1$ nodes.
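The layer-wise construction described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`first_layer`, `next_layer`, `build_features`) are hypothetical, the candidate columns are kept by a plain Gram-Schmidt pass in the order they are generated (a swapped-in, unsupervised stand-in for the supervised OLS selection), and the kept columns are orthonormalized rather than rescaled by $W$.

```python
import numpy as np

def first_layer(X, width):
    """Orthonormal basis (at most `width` columns) of the column space of [1 X]."""
    n = X.shape[0]
    A = np.hstack([np.ones((n, 1)), X])           # augmented data matrix [1 X]
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    rank = int(np.sum(s > 1e-10 * s[0]))          # drop numerically null directions
    return U[:, :min(width, rank)]

def next_layer(F, F_prev, F1, width, tol=1e-8):
    """Keep up to `width` Hadamard products that add a new direction to span(F)."""
    Q = F.copy()                                  # orthonormal columns found so far
    kept = []
    for i in range(F_prev.shape[1]):
        for j in range(F1.shape[1]):
            c = F_prev[:, i] * F1[:, j]           # one candidate column (Hadamard product)
            r = c - Q @ (Q.T @ c)                 # part of c outside span(Q)
            if np.linalg.norm(r) > tol * (np.linalg.norm(c) + 1e-12):
                r /= np.linalg.norm(r)
                kept.append(r)
                Q = np.column_stack([Q, r])
                if len(kept) == width:            # width limit: at most d + 1 nodes
                    return np.column_stack(kept)
    return np.column_stack(kept) if kept else np.empty((F.shape[0], 0))

def build_features(X, depth):
    """Stack `depth` layers of at most d + 1 nodes each; returns [F^1 ... F^depth]."""
    width = X.shape[1] + 1
    F1 = first_layer(X, width)
    layers = [F1]
    for _ in range(depth - 1):
        Fm = next_layer(np.hstack(layers), layers[-1], F1, width)
        if Fm.shape[1] == 0:                      # nothing new to span: stop early
            break
        layers.append(Fm)
    return np.hstack(layers)
```

Calling `build_features(X, depth=3)` on an `n × d` array returns a feature matrix with at most `3 * (d + 1)` orthonormal columns, which makes the effect of the depth and width limit visible: the number of features grows linearly with depth instead of exponentially.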
In the first non-output layer, we apply singular value decomposition to the augmented data matrix $[1\ X]$ to obtain its partial orthogonal basis, which forms the $d+1$ nodes (the first $d+1$ leading singular vectors are selected). In each subsequent non-output layer, a standard Orthogonal Least Squares (OLS) procedure greedily selects a partial orthogonal basis, namely the first $d+1$ features most relevant to the diagnosis of chronic disease according to the established high-level feature set $F^m$. Finally, a simple linear classifier $\nu^m$ with input data $F = [F^1\ F^2\ \ldots\ F^m]$ is trained, so there are $M$ linear classifiers in the output layer. Each linear classifier $\nu^m$ is trained by stochastic gradient descent, which solves an $L_2$-regularized problem of the form
$$\min_{\nu^m}\ \frac{1}{n}\sum_{i=1}^{n} \ell\big(F^m_{i\cdot}\,\nu^m,\ y_i\big) + \lambda_m \|\nu^m\|_2^2, \tag{8}$$
where $\ell(\cdot)$ is a hinge loss, $F^m_{i\cdot}$ represents the $i$-th row of the matrix $F^m$, and $\lambda_m \in \Lambda$ is the regularization factor. Then, for each value of the regularization factor, we check the network performance layer by layer on the validation dataset to find the optimal network layer and the best regularization factor. Finally, the classifier with the best validation performance over all layers and regularization factors is taken as the optimal linear classifier $\nu^*$, which is the output. The purpose of NLPNN is to adaptively find features related to diagnosis from data that has been augmented in its feature space. The detailed process of NLPNN is shown in Algorithm 1, which briefly describes the entire process from the establishment of the network layers to the acquisition of the output layer.

Algorithm 1
Input: $D = (X, y)$; $\Omega$; $\Lambda$.
Output: An optimal linear classifier $\nu^*$.
Initialization:
Pick a partial orthonormal basis $O_F$ of $F$'s columns based on the supervised OLS procedure;
Compute an orthonormal basis $O_y$ of $y$'s columns;

Attention-empowered NLPNN. Some chronic disease datasets suffer from the class imbalance problem, where sick cases are generally scarce compared to healthy cases.
However, correctly diagnosing the minority sick cases is vital in a healthcare system, because misdiagnosing a sick case costs far more than misdiagnosing a healthy case: the latter only requires further examination, whereas the former carries a life-threatening risk. During the training phase of NLPNN, the samples of each class in the imbalanced dataset are treated equally, so the trained model tends to be biased towards the majority class and to ignore the samples (sick cases) in the minority class. Thus, NLPNN does not perform well on class-imbalanced problems and can seriously misdiagnose minority sick cases. Furthermore, for the early diagnosis of chronic diseases, although we are more concerned with the accurate diagnosis of sick cases, we cannot ignore the overall diagnostic accuracy.

To alleviate the class imbalance problem, we empower the cases with attention (i.e., weights) and propose an AEPNN algorithm. AEPNN pays more attention to the cases misdiagnosed by NLPNN by changing the importance of these cases. Motivated by committee-based learning 25, AEPNN trains and combines multiple complementary NLPNN classifiers to further improve the performance of NLPNN under class imbalance. The structure of AEPNN is shown in Fig. 1b.

For the implementation of AEPNN, we first assign an identical initial weight $D_1(x) = 1/n$ to each sample $x$ in the training dataset. An NLPNN classifier $h_1$ is trained on the training dataset under the initial weight distribution $D_1$, and its error $\epsilon_1$ is fed back to the training samples so that their distribution is adjusted to $D_2(x)$. Then, the second NLPNN classifier $h_2$ is trained under the weight distribution $D_2$, in which the weights of the samples misdiagnosed by $h_1$ are increased so that $h_2$ pays more attention to them. This process is repeated until $h_L$ is trained after $L$ iterations.
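The reweighting loop described above follows the pattern of discrete AdaBoost, and a minimal sketch may make it concrete. Everything here is an illustration under assumptions rather than the paper's Algorithm 2: `fit_stump` is a hypothetical one-feature threshold rule standing in for an NLPNN classifier, `boost` is a hypothetical name, and the classifier weight and sample-weight update are the standard AdaBoost formulas.

```python
import numpy as np

def fit_stump(X, y, w):
    """Hypothetical stand-in for an NLPNN classifier: best weighted threshold rule."""
    best_err, best_rule = np.inf, None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] >= t, sign, -sign)
                err = w[pred != y].sum()          # weighted error under w
                if err < best_err:
                    best_err, best_rule = err, (j, t, sign)
    j, t, sign = best_rule
    return lambda Z: np.where(Z[:, j] >= t, sign, -sign)

def boost(X, y, fit_base, n_rounds):
    """Train up to `n_rounds` base classifiers on reweighted samples (labels in {-1, 1})."""
    n = X.shape[0]
    D = np.full(n, 1.0 / n)                       # D_1(x) = 1/n
    learners, alphas = [], []
    for _ in range(n_rounds):
        h = fit_base(X, y, D)
        miss = h(X) != y
        eps = float(D[miss].sum())                # weighted error of h_l under D_l
        if eps >= 0.5:                            # no better than chance: stop
            break
        eps = max(eps, 1e-12)                     # avoid log(0) for a perfect h_l
        alpha = 0.5 * np.log((1.0 - eps) / eps)   # classifier weight alpha_l
        learners.append(h)
        alphas.append(alpha)
        D = D * np.exp(np.where(miss, alpha, -alpha))  # raise weights of misdiagnosed samples
        D /= D.sum()                              # renormalise so D_{l+1} is a distribution

    def predict(Z):
        score = np.zeros(Z.shape[0])
        for a, h in zip(alphas, learners):
            score += a * h(Z)                     # weighted combination of classifiers
        return np.where(score >= 0, 1, -1)

    return predict, alphas
```

On a toy one-dimensional dataset where the sick cases occupy a narrow interval, a single threshold rule labels every sample as healthy, whereas three boosted rounds recover the interval; this is the same mechanism by which later classifiers are pushed towards the minority cases that earlier ones missed.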
Finally, the predicted label is obtained through the weighted combination of all NLPNN classifiers. The main process is shown in Algorithm 2. Specifically, we denote the true label of sample $x$ as $f(x)$ and the label predicted by the NLPNN classifier as $h(x)$. The loss function $\epsilon$ is defined as
$$\epsilon(h \mid \mathcal{D}) = \int_{x \sim \mathcal{D}} \mathbb{I}\big(h(x) \neq f(x)\big)\, p(x)\, dx, \tag{10}$$
where $p(x)$ represents the probability density function of $x$ following the data distribution $\mathcal{D}$. However, $\epsilon$ has poor mathematical properties (it is non-convex and discontinuous), which makes it very difficult to optimize directly. To optimize the loss more conveniently, we replace the loss function (10) with the convex and continuously differentiable exponential loss function
$$\ell_{\exp}(h \mid \mathcal{D}) = \mathbb{E}_{x \sim \mathcal{D}}\big[e^{-f(x) h(x)}\big]. \tag{11}$$
Lemma 1 proves that $\ell_{\exp}(h \mid \mathcal{D})$ is a consistent replacement for the loss function $\epsilon$, which means that (11) can replace (10) when updating the sample weights $D_l(x)$ and the classifier weights $\alpha_l$ in Algorithm 2.

Proof. Please see Appendix 1.

In Algorithm 2, $h_1$ is obtained by applying the NLPNN classifier to the initial sample distribution $D_1$. When $h_l$ is generated based on the distribution $D_l$, the weight $\alpha_l$ of the classifier $h_l$ is obtained by minimizing the exponential loss function $\ell_{\exp}(\alpha_l h_l \mid D_l)$. From Lemma 2, we know that $\alpha_l = \frac{1}{2} \ln \frac{1 - \epsilon_l}{\epsilon_l}$ minimizes this loss. The sample distribution is then updated as
$$D_{l+1}(x) = \frac{D_l(x)\, \exp\big(-\alpha_l f(x) h_l(x)\big)}{Z_l},$$
where $Z_l$ is the normalization factor that ensures that $D_{l+1}$ is a distribution. In summary, we iteratively optimize the exponential loss function by introducing two kinds of attention ($D$ and $\alpha$), which gives AEPNN its advantage on class-imbalanced datasets.

Some DNN models are not suitable for classification tasks on small-scale datasets because of over-fitting. However, the PNN-based deep learning algorithm performs well for the early diagnosis of chronic diseases on small-scale datasets, owing to its unique network structure. We select five state-of-the-art machine learning algorithms as the baseline algorithms, i.e.
SVM 44, LR 45, KNN 46, DT 47, and MLP 48. The datasets are described in Table 1.

Evaluation measurements. For the early diagnosis of chronic diseases, the generalization performance can be estimated on the test dataset. In addition to the area under the receiver operating characteristic curve (AUC), we select the following indicators to evaluate the proposed algorithm:
• $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$ is the ratio of the number of correctly predicted cases to the total number of samples.
• $\text{Specificity} = \frac{TN}{TN + FP}$ is the ratio of the number of correctly predicted healthy cases to the total number of healthy cases.
• $\text{Precision} = \frac{TP}{TP + FP}$ is the ratio of the number of correctly predicted sick cases to the total number of predicted sick cases.
• $\text{Recall} = \frac{TP}{TP + FN}$ is the ratio of the number of correctly predicted sick cases to the total number of sick cases.
• $\text{F1\_score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2\,TP}{N + TP - TN}$ is the harmonic mean of precision and recall.
Here TP, FP, TN, and FN denote the numbers of true positives, false positives, true negatives, and false negatives, respectively, and $N$ is the total number of samples.

We analyze the influence of the network depth $\Omega$ and the regularization factor $\lambda$ in the NLPNN model on the diagnostic performance for the eleven chronic diseases, where $\Omega \in \{2, 3, 4, 5\}$ (network layers plus output layer) and $\lambda \in \Lambda = \{10^{-3}, 10^{-2}, 10^{-1}, 10^{0}, 10^{1}\}$. To find the most suitable $\Omega$ and $\lambda$ visually, we combine them into a pair $(\Omega, \lambda)$ and establish a bijection between $(\Omega, \lambda)$ and an index in $\{1, 2, \ldots, 20\}$, described in Table 2. We use this index as the horizontal axis to draw the generalization performance curve of NLPNN against network depth and regularization factor. From Fig. 2, we can see that NLPNN has two advantages in the diagnosis of all chronic diseases: there is no over-fitting, and the training accuracy increases with the number of network layers (this can be observed at index values 1, 6, 11, ...).
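The mapping between hyperparameter pairs and the index on the horizontal axis can be sketched as a simple enumeration. Table 2 is not reproduced in this text, so the depth-major ordering below is an assumption, chosen because it is consistent with the remark that index values 1, 6, 11, ... correspond to increasing depth; the names `DEPTHS`, `LAMBDAS`, `GRID`, and `best_setting` are hypothetical.

```python
from itertools import product

DEPTHS = (2, 3, 4, 5)                         # candidate values of the depth (layers plus output layer)
LAMBDAS = (1e-3, 1e-2, 1e-1, 1e0, 1e1)        # candidate regularization factors

# Assumed layout (depth-major): indices 1..5 share depth 2, 6..10 share depth 3, and so on.
GRID = {index: pair for index, pair in enumerate(product(DEPTHS, LAMBDAS), start=1)}

def best_setting(val_scores):
    """Given {index: validation score}, return (index, depth, lambda) of the best score."""
    index = max(val_scores, key=val_scores.get)
    depth, lam = GRID[index]
    return index, depth, lam
```

Given validation scores keyed by index, `best_setting` returns the index together with the depth and regularization factor it stands for, which is the lookup the text performs when it reads the best setting off Fig. 2 and translates it back through Table 2.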
However, different settings affect the performance of the NLPNN algorithm, and the impact differs across chronic disease datasets. Figure 2a shows that NLPNN achieves 100% generalization performance on the CKD dataset when the index is 1 or 2. Then, as $\Omega$ increases and $\lambda$ changes, the test performance decreases somewhat, but it fluctuates within a range of 5%. This means that a shallow polynomial neural network model is enough to diagnose chronic kidney disease accurately. We can see from Fig. 2b, c, g and k that the setting has almost no effect on the diagnostic accuracy for diabetes and heart disease. For the diagnosis of hepatitis B (Fig. 2h), although the accuracy of the NLPNN model does not vary greatly, its specificity is unstable as the setting changes. The reason is that the Hep dataset has only 155 samples, and the negative samples account for only 24% of them. In addition, from Fig. 2 we can find the best output performance $P^*$ of NLPNN and the corresponding index on the eleven chronic disease datasets. Therefore, according to Table 2, we can find the network structure $\Omega^*$ and the regularization factor $\lambda^*$ with which NLPNN achieves the best performance, as shown in Table 3.

The generalization performance of the baseline algorithms and the NLPNN algorithm on the eleven chronic disease datasets is compared in Table 4, which lists the test performance results under a unified standard. In general, the diagnostic accuracy of NLPNN on the eleven chronic disease datasets is better than that of the baseline algorithms. In particular, for the diagnosis of chronic kidney disease and breast cancer, NLPNN achieves a generalization accuracy, recall, and F1_score of 1.0000, 1.0000, and 1.0000, respectively. In addition, NLPNN shows significant advantages in the diagnosis of hepatitis, where its generalization accuracy is about 10% better than that of the baseline algorithms (SVM: 0.8000, LR: 0.8333, KNN: 0.8000, DT: 0.8333, MLP: 0.8000).
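For reference, the measurements compared above can be computed directly from the confusion-matrix counts. The sketch below is illustrative only: the function name `diagnostic_metrics` is hypothetical, returning 0.0 for an undefined ratio is a convention assumed here, and G_mean is taken as the geometric mean of recall and specificity.

```python
import math

def _ratio(num, den):
    """Return num / den, or 0.0 when the ratio is undefined (assumed convention)."""
    return num / den if den else 0.0

def diagnostic_metrics(tp, fp, tn, fn):
    """Evaluation measurements from counts; the sick class is the positive class."""
    n = tp + fp + tn + fn
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)
    specificity = _ratio(tn, tn + fp)
    return {
        "accuracy": _ratio(tp + tn, n),
        "specificity": specificity,
        "precision": precision,
        "recall": recall,
        "f1": _ratio(2 * precision * recall, precision + recall),
        "g_mean": math.sqrt(recall * specificity),   # geometric mean of the two class-wise rates
    }
```

With, say, 8 true positives, 2 false positives, 85 true negatives, and 5 false negatives, the accuracy is 0.93 while the recall is only 8/13, which illustrates why accuracy alone can hide missed sick cases on imbalanced data. The same counts also confirm the identity F1 = 2TP / (N + TP − TN), since 2 · 8 / (100 + 8 − 85) = 16/23.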
It is worth noting that in the diagnosis of chronic kidney disease and breast cancer, the NLPNN model is an "ideal model" with an AUC value of 1 (Fig. 3a, i).

In this paper, we pay attention not only to the overall accuracy of the model in the diagnosis of chronic diseases but also, and more so, to whether the model can accurately diagnose sick cases (positive samples). That is, we want the recall of the model to be as high as possible while the overall accuracy remains high. For the T2DM, CVD, Fra_Heart, and Pri_diab datasets, we observe that the ratio of the number of correctly predicted sick cases to the total number of sick cases is low; that is, the recall is low. The reason is that these datasets are class-imbalanced. To address this problem, the AEPNN algorithm (Algorithm 2) is proposed in the "Diagnostic framework for chronic diseases" section. Because the NLPNN algorithm is a strong classifier, AEPNN does not need many individual classifiers; their number equals the number of iterations. The test performance changes as the number of training rounds of the NLPNN algorithm increases. Although the overall diagnostic accuracy decreases slightly, the diagnostic accuracy for sick cases improves significantly. As the final number of training rounds, we choose the number of iterations that maximizes the difference between the growth rate of recall and the rate of decrease of accuracy. Figures 4, 5, 6, 7 show the performance of the proposed algorithm on the Fra_Heart, T2DM, Pri_diab, and CVD datasets, respectively, at different iterations of NLPNN. Combining this analysis with Table 1, we can see that the higher the class imbalance ratio of the chronic disease data, the more AEPNN improves the recall. For the Fra_Heart dataset, the test performance is shown in Fig. 4a.
The performance growth rate is calculated relative to the performance obtained with a single NLPNN classifier. From Fig. 4b, we observe that the recall has a growth rate of close to 300% when the number of NLPNN classifiers is six, which is chosen as the best number of NLPNN classifiers for the diagnosis of heart disease. The most surprising result is the performance of AEPNN on the T2DM and Pri_diab datasets. As can be seen from Figs. 5a and 6a, when the number of NLPNN classifiers is greater than four, the recall improves significantly. When the number of NLPNN classifiers reaches ten, the growth rate of the recall approaches 4000% on the T2DM dataset and 6000% on the Pri_diab dataset. Figs. 5b and 6b also show that the growth rate of recall is much higher than the rate of decrease of accuracy. From Fig. 7, we can see that although the performance of AEPNN on the CVD dataset does not improve significantly, the growth rate of recall is still higher than the rate of decrease of accuracy. This indicates that the proposed algorithm is effective at improving recall. Its advantage is that it reduces the rate of missed diagnoses for sick cases, so that more patients with chronic diseases can be treated and the development of the disease controlled in time. We also quantitatively compare the generalization performance of the AEPNN and NLPNN algorithms by introducing $\text{G\_mean} = \sqrt{\text{Recall} \times \text{Specificity}}$, which is a powerful indicator of classification accuracy on class-imbalanced datasets 49. From Table 5, we can see that AEPNN can effectively improve the G_mean.

In this paper, we have investigated a universal learning algorithm based on PNN for the early diagnosis of chronic diseases. Five state-of-the-art baseline algorithms are selected for comparison with the NLPNN algorithm. Experimental results show that NLPNN achieves the best accuracy on the nine chronic disease datasets.
In particular, for the early diagnosis of chronic kidney disease and breast cancer, the generalization accuracy, recall, specificity, and AUC value of this model reach 1.000, 1.000, 1.000, and 1.000, respectively. Furthermore, an AEPNN algorithm is proposed to alleviate the class imbalance problem in chronic disease datasets. We aim to increase the probability that sick cases are accurately diagnosed, that is, to increase the recall of the model. Experiments on the four chronic disease datasets with class imbalance have confirmed the effectiveness of our model. The AEPNN model performs best on the Pri_diab dataset, which has a positive-negative sample ratio of 1:12.78, where the growth rate of its recall is close to 6000%. The proposed algorithm can effectively assist chronic disease experts in quickly screening patients with chronic diseases and can save patients the cost of further testing. It should be pointed out that although our algorithm performs better on small-scale datasets, the PNN-based model also shows great application potential on large-scale datasets, such as protein-protein interaction prediction and disease diagnosis based on medical images.

In future work, we will further investigate the PNN-based model in disease diagnosis. Although PNN can effectively capture hidden features in a parameter-free manner, it remains an open problem how to adaptively select the best hidden features from the network architecture of PNN to achieve competitive performance. Thus, we consider combining PNN with computational intelligence algorithms (such as monarch butterfly optimization (MBO), the earthworm optimization algorithm (EWA), and elephant herding optimization (EHO)) to improve the performance of disease diagnosis.

The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.
1. A novel class imbalance-oriented polynomial neural network algorithm for disease diagnosis
2. WHO reveals leading causes of death and disability worldwide
3. Clinical decision support systems for chronic diseases: A systematic literature review
4. Predicting Alzheimer's disease from spoken and written language using fusion-based stacked generalization
5. An improved SEIR model for reconstructing the dynamic transmission of COVID-19
6. A review of wearable and unobtrusive sensing technologies for chronic disease management
7. COVID-19: From an acute to chronic disease? Potential long-term health consequences
8. Economic burden of chronic obstructive pulmonary disease (COPD): A systematic literature review
9. Prevention. About Chronic Diseases
10. Post-structuring radiology reports of breast cancer patients for clinical quality assurance
11. A dual-modal attention-enhanced deep learning network for quantification of Parkinson's disease characteristics
12. Dsnet: Dual stack network for detecting diabetes mellitus and chronic kidney disease
13. Xgboost model for chronic kidney disease diagnosis
14. Automated diagnosis of coronary artery disease (CAD) patients using optimized SVM
15. Early diagnosis model of Alzheimer's disease based on sparse logistic regression
16. Prediction of heart disease using k-nearest neighbor and particle swarm optimization
17. A novel Gini index decision tree data mining method with neural network classifiers for prediction of heart disease
18. Prioritizing type 2 diabetes genes by weighted PageRank on bilayer heterogeneous networks
19. Random forest swarm optimization-based for heart diseases diagnosis
20. Tree-based classifier ensembles for early detection method of diabetes: An exploratory study
21. A tongue features fusion approach to predicting prediabetes and diabetes with machine learning
22. Multiple predictively equivalent risk models for handling missing data at time of prediction: With an application in severe hypoglycemia risk prediction for type 2 diabetes
23. Self-adaptive extreme learning machine
24. Imbalanced breast cancer classification using transfer learning
25. Biased random forest for dealing with the class imbalance problem
26. Detection of malicious code variants based on deep learning
27. Chronic kidney disease prediction on imbalanced data by multilayer perceptron: Chronic kidney disease prediction
28. Improved probabilistic neural networks with self-adaptive strategies for transformer fault diagnosis problem
29. Architecture evolution of convolutional neural network using Monarch butterfly optimization
30. Identifying facial phenotypes of genetic disorders using deep learning
31. Using deep neural network with small dataset to predict material defects
32. Multimodal neuroimaging feature learning with multimodal stacked deep polynomial networks for diagnosis of Alzheimer's disease
33. Protein-protein interactions prediction via multimodal deep polynomial network and regularized extreme learning machine
34. Deep polynomial neural networks
35. An algorithm for training polynomial networks
36. Comparison of various classification algorithms in the diagnosis of type 2 diabetes in Iran
37. Chronic disease prediction using administrative data and graph theory: The case of type 2 diabetes
38. A bi-objective hybrid optimization algorithm to reduce noise and data dimension in diabetes diagnosis using support vector machines
39. Fused hierarchical neural networks for cardiovascular disease diagnosis
40. Deep learning on computerized analysis of chronic obstructive pulmonary disease
41. Improved overlap-based undersampling for imbalanced dataset classification with application to epilepsy and Parkinson's disease
42. Learning from imbalanced data: Open challenges and future directions
43. Multi-class imbalanced big data classification on spark
44. Support vector machine
45. Logistic regression was as good as machine learning for predicting major chronic diseases
46. Efficient heart disease prediction system using k-nearest neighbor classification technique
47. Using decision trees to understand the influence of individual- and neighborhood-level factors on urban diabetes and asthma
48. Heart disease prediction using multilayer perceptron algorithm
49. RCSMOTE: Range-controlled synthetic minority over-sampling technique for handling the class imbalance problem

conceived and designed the analysis and revision of this paper. X.Y. and C.S. designed and developed the framework. C.S. participated in the design and implementation of the experiment. S.C. and L.Y. have access to the dataset and performed data analysis.

Part of this work was accepted by the IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Houston, TX, USA (virtually), December 9-12, 2021, which is cited as reference 1.

The authors declare no competing interests.

The online version contains supplementary material available at https://doi.org/10.1038/s41598-022-12574-x.

Correspondence and requests for materials should be addressed to S.C.

Reprints and permissions information is available at www.nature.com/reprints.

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.