key: cord-0910948-t79esoa4
authors: Desuky, Abeer S.; Hussain, Sadiq
title: An Improved Hybrid Approach for Handling Class Imbalance Problem
date: 2021-01-28
journal: Arab J Sci Eng
DOI: 10.1007/s13369-021-05347-7
sha: a1dfaafb70ee558217170913c1cc6b7c83d76bad
doc_id: 910948
cord_uid: t79esoa4

The class imbalance issue present in many real-world datasets biases classifiers toward the majority class and yields poor performance on the minority class. Such misclassifications can incur serious costs in disease diagnosis and other critical applications, so tackling the class imbalance issue is a hot topic for researchers. We present a novel hybrid approach for handling such datasets. We utilize the simulated annealing algorithm for undersampling and apply support vector machine, decision tree, k-nearest neighbor and discriminant analysis classifiers for the classification task. We validate our technique on 51 real-world datasets and compare it with other recent works. Our technique yields better efficacy than the existing techniques and can therefore be applied to imbalanced datasets to mitigate misclassification.

Machine learning is a well-known research domain in computer science that uses a variety of algorithms to extract useful information from huge volumes of raw data. These algorithms have been widely applied to subjects such as medical data analysis [1-4], class noise detection [5], image processing [6], sentiment analysis [7, 8], signal processing [9, 10], road accident analysis [11], social data mining [12] and many more. However, in most cases we do not have a balanced dataset. A balanced dataset helps machine learning algorithms learn the circumstances of the different classes; for this reason, dealing with imbalanced datasets is a challenging task in machine learning. In imbalanced data, samples are unevenly distributed among the classes: the minority class has notably fewer samples than the majority class. This situation is widespread in real applications such as disease diagnosis, network intrusion recognition and software flaw prediction. The bias arises because the learned classifier favours the majority population while ignoring the minority samples. Nevertheless, recognizing minority records with special properties is of utmost importance when dealing with imbalanced domains. For instance, in medical data analytics, wrongly classifying a COVID-19 patient (a minority sample) as a non-COVID-19 subject incurs a high or even unacceptable cost. Software security, financial fraud prediction and helicopter fault monitoring are similar examples. With great attention devoted to combating the class imbalance issue, various solutions have been devised. They can be categorised into two forms: data-level and algorithm-level methods. Data-level methods either reduce the number of majority records (undersampling), increase the number of minority records (oversampling), or integrate both to correct the imbalance. Algorithm-level techniques adjust the prevailing learning approaches to lessen the inductive bias toward the majority samples.
Data-level approaches are utilized more frequently than algorithm-level strategies because they do not depend on any specific classifier and can be integrated with other techniques, such as ensemble and active learning, to devise intricate hybrid methods. In our study, we use an undersampling method on the instances of the majority class. In cluster-oriented undersampling, majority class samples from a definite number of clusters are removed to balance the training dataset [13], and the underlying data distribution affects this distance-based elimination of instances. Weighted learning of infrequent instances makes ensemble learning methods an effective solution: Devi et al. [13] utilized the AdaBoost ensemble method to eliminate irrelevant majority class records from the clusters. Mohammed et al. [14] empirically examined two resampling approaches, undersampling and oversampling; they exploited several machine learning techniques with various hyperparameters and obtained superior outcomes for both resampling techniques. Liu et al. [15] coupled the Ensemble of Classifier Chains (ECC) with random undersampling to make ECC robust to class imbalance; chains of various sizes were constructed to increase the exploitation of majority class records, and binary models per label were also devised. If records overlap in addition to the imbalance, the learning task becomes trickier [4]. Vuttipittayamongkol et al. [16] introduced an undersampling approach that eliminates overlapped data points in binary datasets; their techniques recognize and remove majority class records from the region of overlapping, identifying possibly overlapped examples with four techniques that exploit neighbourhood searching under various criteria. The authors of [17] exploited an elimination threshold and soft clustering techniques to remove negative records in the overlapping area. Sarkar et al. [18] devised an ensemble learning-based undersampling method utilizing Extreme Gradient Boosting (XGBoost) and Support Vector Machine (SVM); they validated it on a steel plant accident dataset, and the outcome showed that their novel method effectively resolved the class imbalance problem. In imbalanced settings, much of the scientific literature utilizes discriminant analysis (DA), decision trees (DT), SVM and the k-nearest neighbor rule (k-NN). Bejaoui et al. [19] presented an improved regularized quadratic discriminant analysis (R-QDA) that uses a modified bias and two properly selected regularization parameters to avoid the undesirable characteristics of R-QDA in imbalanced scenarios and hence improve the classifier's results; the presented classifier relies on a random matrix theory-based analysis of its performance when the numbers of features and samples grow large simultaneously. Jian et al. [20] proposed a novel contribution sampling method based on the contributions of the support and non-support vectors to classification. Dubey et al. [21] devised a modified k-NN technique that handles the class distribution in a broader region around the query example; they empirically validated their method on several real-world datasets. Liu et al. [22] introduced a novel decision tree method that generates statistically significant rules and is insensitive and robust to class sizes.
They employed the metric used in C4.5, information gain with respect to the confidence of a rule, to make the decision trees robust. Simulated annealing (SA) has also been used in various machine learning applications. Tóth et al. [23] utilized SA for fast optimization of the parameters of an object recognizer ensemble over huge image databases. Yang et al. [24] proposed a novel variant of monarch butterfly optimization (MBO) with SA, termed SAMBO, in which the migration operator and the butterfly adjusting operator are exploited by the SA method; the experiments were carried out on 14 continuous nonlinear functions. Camelo et al. [25] empirically demonstrated the use of the Simulated Annealing and Fast Simulated Annealing metaheuristics to optimize the hyper-parameters of a Multilayer Perceptron (MLP); the model optimized two parameters, the configuration of the neural network layers and its neuron weights. We devise an improved hybrid method to handle imbalanced settings and hence improve overall performance. To the best of our knowledge, a simulated annealing strategy with machine learning classifiers is utilized here for the first time to balance imbalanced datasets. Optimization is the process of achieving the best solution for a problem (here, selecting an optimal subset of majority class instances); in this work, the simulated annealing optimization technique improves the value of the objective function (classification performance). We adopt undersampling using simulated annealing with different classifiers, viz. discriminant analysis, SVM, decision tree and k-NN. Unlike most other optimization algorithms, SA applies a cooling strategy that escapes local optima while searching the solution space and converges toward the global optimum. Moreover, using the F-score metric as the objective function helps select the examples that improve the overall accuracy for both the majority and the minority class. We evaluate our technique on several real-world datasets, including UCI and KEEL datasets. The performance of our technique is comparable to many existing methods in this domain, and it outperformed different techniques in terms of F-score, accuracy, AUC and G-mean. We also perform Wilcoxon's signed-rank test, and the p-values indicate a significant difference between the proposed method and state-of-the-art pre-processing techniques. Our hybrid technique is simple, efficient and easy to implement; hence, it can be applied to balance many real-world datasets, having been tested on 51 benchmark datasets. Sampling methods in combination with ensemble classification techniques have demonstrated their efficacy in real-world problems, especially in resolving the class imbalance issue. Tsai et al. [26] devised a novel undersampling technique that integrates instance selection and clustering analysis: analogous data records of the majority cohort are assembled into subgroups by the clustering technique, and misleading data samples are filtered out of the subclasses by the instance selection method. The classification problem with class-imbalanced data in the medical domain has attracted many researchers. Most prevailing techniques categorize samples into the majority class, which results in bias and inadequate recognition of the minority class. Zhu et al. [27] proposed a new method called class weights random forest to tackle this issue.
Their technique could detect both minority and majority class samples with high accuracy and hence improved the overall performance of the classification algorithm. Li et al. [28] presented a unified pre-processing method utilizing stochastic swarm heuristics to jointly optimize the mixture of the two classes by gradually reconstructing the training dataset; their method exhibited competitive performance in comparison with popular techniques. Li et al. [29] devised a new hybrid approach, dubbed ant colony optimization resampling (ACOR), to tackle the class imbalance issue. ACOR consists of two stages: first, a particular oversampling method is employed to rebalance the imbalanced dataset; next, ant colony optimization is applied to find an optimal (or near-optimal) subset of the balanced dataset. The benefit of this approach is that a well-formed training set can be obtained by the optimization technique while prevailing oversampling methods remain fully applicable. The evaluation metrics confirmed that ACOR achieved enhanced performance and yielded better outcomes than four popular oversampling methods. The analysis of medical data from electronic health records (EHRs) poses a great challenge due to its imbalanced and heterogeneous characteristics. Huda et al. [30] addressed these challenges using brain tumor images; they integrated ensemble-based classification and feature selection methods to demonstrate affordable and fast detection of the genetic variant of a brain tumor, hybridizing ensemble classification with feature selection to mitigate the effect of the imbalanced characteristics of medical data. Febriantono et al. [31] applied a cost-sensitive C5.0 decision tree to work out the imbalanced data issue in a multiclass setting: first, the C5.0 algorithm was used to build the decision tree model; afterwards, the minimum-cost model was obtained using cost-sensitive learning. The results on the testing dataset asserted that C5.0 performed better than its counterparts, the ID3 and C4.5 algorithms. Babu et al. [32] proposed a genetic algorithm (GA)-based error classification for imbalanced datasets. Principal component analysis (PCA) was utilized for error identification and dataset processing; their approach represented the errors present in a dataset in binary form, and error locations were identified through the GA. The GA-based approach successfully recognized the error locations and improved the processing time on the imbalanced dataset. The classical extreme learning machine (ELM) algorithm is unable to deliver good performance on imbalanced datasets. Ri et al. [33] defined a novel G-mean-based cost function for the ELM optimization problem in imbalanced data learning; they tested their methodology on 11 multi-class and 58 binary datasets with diverse imbalance ratios, and their approach outperformed the classical ELM and yielded competitive performance in comparison to prevailing methods. Susan et al. [34] applied a new hybrid technique for learning from imbalanced datasets by undersampling the majority cohort's samples and oversampling the minority cohort's, utilizing different intelligent versions of oversampling methods; the balanced datasets were then fed to a decision tree classifier. Empirical experiments proved the efficiency of their technique, as higher accuracies were achieved compared to the baseline techniques. El-Shafeiy et al.
[35] carried out a study of class imbalance in the domain of medicine. They applied random forests (RF) with oversampling and undersampling strategies, integrating decision trees over subgroups of the dataset; their RF-based techniques yielded enhancements on imbalanced medical datasets. Yang et al. [36] devised an integrated scheme combining the weight functions and weight constant of cost-sensitive learning techniques with the regularized risk minimization approach. Their results showed that their methods could efficiently mitigate the misclassification cost while taking care of the privacy requirement; their empirical evidence revealed that the selection of weight functions and weight constant did not affect the Fisher-consistency property, but the performance of the classifiers was strongly influenced by the interaction with privacy-preserving levels. Abnormal state detection and feature extraction are the key issues for class-imbalanced thermal signals. Wang et al. [37] designed an improved framework that incorporates hidden information and prior knowledge under class imbalance for sintering state recognition: they fused hidden information and prior knowledge to devise a cascaded stacked autoencoder model for discriminative feature extraction from imbalanced records, and they also presented a data-dependent kernel modification optimal margin distribution machine (ddKMODM) as the sintering state recognition model. SA is a simple and well-known metaheuristic for global optimization problems whose objective function can be evaluated via computer simulation [36]; it is widely used to tackle real-world issues. Simulated annealing was proposed for combinatorial optimization by Kirkpatrick et al. [37] in the early 1980s. The physical process involves raising the temperature of a solid and then lowering its energy state in two stages:
• bring the solid to a very elevated temperature until the structure "melts";
• cool the solid according to a very specific temperature-declining scheme to attain a solid state of minimum energy.
At the liquid stage the particles are arbitrarily arranged. A long cooling time and a high initial temperature facilitate reaching the minimum-energy phase; if these conditions do not hold, the solid can end up in a metastable position with non-minimal energy. Abrupt cooling of the solid is termed hardening. The Metropolis algorithm, which is utilized in the SA method, produces a sequence of solutions in the state space S. To implement it, an equivalence is set between a multi-particle physical system and the optimization issue as follows:
• the potential states of the solid are described by the solutions;
• the energy of the solid is represented by the function to be minimized.
After that, the control parameter (temperature) is initialized; the objective function and the control parameter are expressed in units of the same type. The perturbation mechanism of the Metropolis technique corresponds to the principle of generating a neighbour, and the Metropolis condition corresponds to the principle of acceptance. Replacing the current solution with a neighbouring solution is called a transition, which is executed in a generation phase and an acceptance phase. Let g_y be the value of the temperature parameter and X_y the total number of transitions produced at some iteration y in the sequel; the norm of SA is then described as follows.
A neighbourhood structure, a mechanism for generating a solution in the neighbourhood, and the points of the state space are assumed to be provided by the user. The principle of acceptance is described as follows:

Definition 1. Let p, q be two points of the state space and (R, f_n) an instance of a combinatorial minimization problem. The probability of accepting solution q from the present solution p is given by the acceptance condition

$$P\{\text{accept } q\} = \begin{cases} 1, & \text{if } f_n(q) \le f_n(p) \\ \exp\!\left(\dfrac{f_n(p) - f_n(q)}{g_y}\right), & \text{otherwise} \end{cases} \tag{1}$$

The capacity to accept transitions that worsen the objective function is one of the prime characteristics of SA.

The public datasets employed in our experiments are listed in Table 1; they come from the UCI and KEEL repositories. The class imbalance degree is defined as

$$\text{Imbalance degree (imb)} = \frac{N}{P} \tag{2}$$

where positive and negative records are denoted P and N, respectively [16]. One can differentiate these datasets with respect to imbalance degree, number of features and number of instances. The model was chosen in the training phase with tenfold cross-validation. The novel hybrid technique is described below. Our undersampling method uses SA with each classifier, maximizing the F-score (Eq. 7) to derive the best possible subset of majority class records from the training set. The steps are as follows:

1. Split the data into training (50%), validation (25%) and testing (25%) sets.
2. Send the training and validation sets to the SA algorithm, which uses the classifier's F-score as its objective function.
3. The population consists of vectors of zeros and ones; each vector has the same size as the number of majority examples in the training set. A one means the corresponding example stays in the training set; a zero means the corresponding example is removed from the set.
4. Each classifier is trained and evaluated on the validation set, and SA uses the F-score as its objective function.
5. Train each classifier on the undersampled training subset produced by SA, and test the model on the testing subset.

The evaluation metrics used are accuracy (Eq. 3), G-mean (Eq. 8), AUC and F-score. The optimal subset of majority instances selection problem is defined as follows: given a set of majority instances G = {G_1, G_2, G_3, …, G_m} and a cost function C that maps a subset to a value s (0 ≤ s), find the subset for which the value of the cost function is minimized. Like most optimization procedures, SA needs an initial solution [38]; a feasible solution is randomly selected and marked as the initial solution. Binary vectors differing from the current solution in one bit are used as the neighbouring solutions. The cost function is one of the most significant factors for evaluating individual solutions and hence critical in a heuristic optimization method like SA; the basic concept used in this paper is to take the F-score of classification obtained with the majority instances represented by the given solution. While exploring the solution space and evading local optima to locate the best possible solutions, SA adopts a cooling strategy, which dictates how the search proceeds; its parameters are the initial temperature, the termination condition and the temperature-declining function. Setting the starting temperature very high allows adequate transitions. The temperature-declining function multiplies the temperature by a constant x. If the temperature falls below a particular value, 0.0001, the method terminates; that particular value was obtained by executing many trials. At first an initial solution is chosen at random and considered the optimal solution, and its cost is computed with the cost function. As long as the temperature Temp does not satisfy the terminating criterion, a neighbouring solution of the current optimal solution is chosen and its cost determined. If the cost of the newly chosen neighbouring solution is less than or equal to the cost of the current optimal solution, it replaces the current optimal solution. If the cost of the neighbouring solution is higher, a random value s is drawn from the range (0, 1) and the worse solution is accepted only if s is below the acceptance probability of Definition 1. The temperature Temp is then reduced and the whole strategy continues until Temp satisfies the terminating condition, as sketched below.
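To make the loop concrete, the following is a minimal illustrative sketch, not the authors' implementation (the paper's experiments were run in MATLAB). It assumes a scikit-learn-style classifier; the function name sa_undersample, the initial temperature t0, the geometric cooling constant x = 0.95 and the seed are our own illustrative choices.

```python
import numpy as np
from sklearn.metrics import f1_score

def sa_undersample(clf, X_maj, y_maj, X_min, y_min, X_val, y_val,
                   t0=1.0, t_min=1e-4, x=0.95, seed=0):
    """Select a binary keep/drop mask over majority examples via simulated annealing."""
    rng = np.random.default_rng(seed)

    def cost(mask):
        # Train on the kept majority examples plus all minority examples,
        # then evaluate the F-score on the validation set. SA minimizes, so negate it.
        X_tr = np.vstack([X_maj[mask == 1], X_min])
        y_tr = np.concatenate([y_maj[mask == 1], y_min])
        clf.fit(X_tr, y_tr)
        return -f1_score(y_val, clf.predict(X_val))

    # Random feasible initial solution: one bit per majority example
    best = rng.integers(0, 2, size=len(X_maj))
    best_cost = cost(best)
    temp = t0
    while temp > t_min:                       # termination threshold (0.0001 in the paper)
        nb = best.copy()
        nb[rng.integers(len(nb))] ^= 1        # one-bit-difference neighbour
        nb_cost = cost(nb)
        # Metropolis rule: always accept improvements; accept worse moves with prob. exp(-delta/T)
        if nb_cost <= best_cost or rng.random() < np.exp((best_cost - nb_cost) / temp):
            best, best_cost = nb, nb_cost
        temp *= x                             # geometric cooling: T <- x * T
    return best
```

Step 5 of the procedure would then retrain each classifier on X_maj[mask == 1] together with the minority examples and score it on the held-out test split.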
Let TP, TN denote true positives and true negatives, while FP, FN denote false positives and false negatives, respectively; N_TP denotes the number of true positives, and so on. Accuracy, precision, TNR, TPR, F-measure and G-mean are calculated as follows:

$$\text{Accuracy} = \frac{N_{TP} + N_{TN}}{N_{TP} + N_{TN} + N_{FP} + N_{FN}} \tag{3}$$

$$\text{Precision} = \frac{N_{TP}}{N_{TP} + N_{FP}} \tag{4}$$

$$\text{TNR} = \frac{N_{TN}}{N_{TN} + N_{FP}} \tag{5}$$

$$\text{TPR} = \frac{N_{TP}}{N_{TP} + N_{FN}} \tag{6}$$

$$\text{F-measure} = \frac{2 \times \text{Precision} \times \text{TPR}}{\text{Precision} + \text{TPR}} \tag{7}$$

$$\text{G-mean} = \sqrt{\text{TPR} \times \text{TNR}} \tag{8}$$

AUC: Receiver Operating Characteristic (ROC) curves were initially devised in signal detection theory for radio signals; more recently, the data mining and machine learning communities have used ROC for model evaluation. For a binary classification problem, the ROC curve plots the true positive rate as a function of the false positive rate. The AUC is the area under the ROC curve and is closely associated with the ranking quality of the classification.
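For reference, these quantities can be computed directly from confusion-matrix counts. The following short Python helper (our own naming, not from the paper) mirrors Eqs. (3)-(8):

```python
import numpy as np

def imbalance_metrics(tp, tn, fp, fn):
    """Compute the evaluation metrics of Eqs. (3)-(8) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)           # Eq. (3)
    precision = tp / (tp + fp)                           # Eq. (4)
    tnr = tn / (tn + fp)                                 # Eq. (5), specificity
    tpr = tp / (tp + fn)                                 # Eq. (6), recall / sensitivity
    f_measure = 2 * precision * tpr / (precision + tpr)  # Eq. (7)
    g_mean = np.sqrt(tpr * tnr)                          # Eq. (8)
    return accuracy, precision, tnr, tpr, f_measure, g_mean
```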
To assess the performance of the proposed method, the experiments were conducted on the MATLAB R2020a platform on a laptop equipped with a 2.20 GHz Core i7 processor and 6 GB of RAM. Our experiments were performed on 51 real-world datasets. First, each original dataset is divided into three subsets of 50%, 25% and 25% for training, validation and testing, respectively; each subset has the same percentage of majority class and minority class examples as the other two. The training and validation sets are fed to the undersampling phase, where the training data is divided into majority-class and minority-class groups and the best examples are selected from the majority class based on the F-score fitness function. To emphasize the efficiency of our method, we include the best results obtained in [16] as a baseline for comparison. In [16], 24 of the 51 datasets were used to assess different classification methods through four undersampling techniques based on neighbourhood searching (NB-based), which utilize the k-NN rule to select and remove majority class examples from the potential region of overlapping. RF (Random Forest) and SVM (Support Vector Machine) were used in [16] for learning, and their results were compared with several state-of-the-art pre-processing techniques that rebalance datasets before applying the learning algorithm, such as SMOTE (Synthetic Minority Over-sampling Technique) [40], kmUnder (k-means undersampling) [29], OBU [41], BLSMOTE [42] and ENN [43]. Columns "NB-SVM" and "NB-RF" in Tables 2 and 3 present the best G-mean and F-score values, respectively, selected for each classifier (SVM and RF) among the four NB-based methods, while the subsequent columns present the best value for each classifier after applying the state-of-the-art pre-processing techniques. Bold values in Tables 2 and 3 represent the best value in each row, while italics represent the second and third best values in the same row. As can be noticed from Table 2, our SA-based classifiers achieved overall superior G-mean performance over the other methods. While the best pre-processing technique achieved the best G-mean for only 4 of the 24 datasets used, our method achieved the best G-mean for 14 datasets, matching the number achieved by the methods proposed in [16]. Our method also performs better on the dataset (Glass5) that recorded zero G-mean with most pre-processing methods. These enhancements in G-mean indicate that our proposed method achieves a better balance in classification accuracy between the majority and minority classes and is not affected by the class distribution that hampers other state-of-the-art methods. Table 3 presents another comparison with [16], based on the F-score measure. The F-score is a good measure of the trade-off between the accuracy of the positive class and the errors on the negative class. It is very useful for appreciating classifier performance, especially for providing more insight when the G-mean metric is competitive between two different methods, as in our case, where the number of best G-mean values equals that of [16]. It can be noticed from Table 3 that our proposed method ranks top in F-score, providing a significantly higher number of best F-score values than [16] and than all pre-processing methods. This superiority in F-score shows that our method has enhanced the trade-off between specificity and sensitivity; that is, it reduces both false positives and false negatives relative to state-of-the-art methods. Accuracy is also an important measure of classifier performance, so we cannot ignore it. As accuracy was not reported in [16], we compared the classification accuracy of our proposed method with another recent work [44]. The researchers in that work introduced a hybrid approach to handle imbalanced data using oversampling and instance-selection undersampling algorithms, relying on clustering to select instances from the majority class with an agent-based population learning algorithm. Their experiment mainly aimed to prove that their proposal performed better than traditional learning, where machine learning techniques are applied to the original imbalanced data; accordingly, they compared the accuracy of their proposed technique, Agent-based Over- and Undersampling for Imbalanced Data (AOUSID), with the classification accuracy obtained by 7 other techniques. Three of these techniques were undersampling methods introduced by other researchers, while the rest were traditional machine learning algorithms. Table 4 presents the results obtained by our proposal and the best results presented in [44] in a comparison based on classification accuracy. Only 3 of the 7 techniques used in [44], besides their proposal (AOUSID), are listed in Table 4, because only these 4 techniques obtained the best results on the imbalanced data used. These three techniques are:
• AISAID, an algorithm introduced by [45] for solving the imbalance problem by applying an instance selection procedure to resample the majority class;
• the traditional ML algorithms C4.5 [44] and k-nearest neighbor (k-NN) [46].
The experiment conducted by [44] used 11 datasets from the KEEL dataset repository [47]. We used the 9 binary-class datasets in our comparison.
Data descriptions of these 9 datasets are also listed in Table 1. It can be noticed from the results in Table 4 that, in comparison to the other algorithms, our proposed method delivers competitive results: the SA algorithm with the 4 classifiers performs best on almost all the imbalanced datasets when compared with the best results obtained by the other undersampling techniques and the traditional machine learning algorithms. To validate our proposal further, another comparison was conducted using 28 datasets with different overlapping degrees, features with outliers, noisy samples and multiclass problems (more details about the dataset characteristics are available in [48]). These datasets were used in [48] to compare the authors' proposal, named RCSMOTE, with state-of-the-art SMOTE-based over-sampling techniques. RCSMOTE (Range-Controlled SMOTE) is an improved SMOTE method that over-samples borderline samples (within a safe range) after distinguishing them from noisy minority class samples [49]. Table 5 shows the comparison between our proposed method and the RCSMOTE method based on AUC and G-mean values. The results in Table 5 demonstrate that the proposed method outperforms RCSMOTE on both AUC and G-mean in 17 of the 28 datasets, whereas RCSMOTE is superior in just 5. Wilcoxon's signed-rank test [51] is also applied as a statistical test to compare the performance of the proposed method with all the other pre-processing techniques involved in the comparisons performed in our experiments. The p-values associated with these comparisons indicate the degree of difference between the methods; the difference is considered significant if the p-value is lower than 0.05. Table 6 shows the results of Wilcoxon's test for the three performed comparisons. The small p-values indicate highly significant differences between the proposed method and almost all other pre-processing methods, particularly for the G-mean values, which clearly show the improvement in performance obtained with the proposed method. G-mean is the square root of the product of the class-wise sensitivities (sensitivity for positive examples and specificity for negative examples), so this measure tries to maximize the accuracy of both classes in balance.
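As an illustration of how such a paired test is run, the snippet below applies scipy's Wilcoxon signed-rank test to two hypothetical per-dataset G-mean arrays; the score values are placeholders for illustration only, not the paper's results:

```python
from scipy.stats import wilcoxon

# Hypothetical paired per-dataset G-mean scores (placeholders, not the paper's results)
proposed = [0.91, 0.87, 0.95, 0.78, 0.88, 0.93, 0.81, 0.90]
baseline = [0.85, 0.84, 0.90, 0.70, 0.86, 0.88, 0.77, 0.83]

# Paired, two-sided Wilcoxon signed-rank test over the same datasets
stat, p_value = wilcoxon(proposed, baseline)
print(f"statistic={stat:.2f}, p-value={p_value:.4f}")
# The difference is considered significant when p-value < 0.05
```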
All the preceding results indicate the efficiency of the proposed method in improving classification performance on the datasets used. Using the F-score measure as the objective function in the SA technique helped improve the classification accuracy for both minority and majority classes, since the F-score combines recall and precision into a single score and thus expresses both properties with a single measure. Using SA optimization itself helps avoid falling into local optimum traps and converges toward the global optimum solution. Finally, applying SA with different classifiers allows the proposal to deal with the diversity and variation of the datasets.

Real-world imbalanced datasets exhibit erroneous classification results and a bias toward the majority class, and researchers have devised various techniques to tackle the imbalance issue. We introduce a modified hybrid strategy to take care of this problem: we use simulated annealing to pick the best possible subset of majority class records (rows of data). Afterwards, k-NN, DA, SVM and DT classifiers are utilized to assess the efficiency of our technique. We evaluate our empirical results against two recent works [16, 44], exploring 51 real datasets from different data repositories in our experiments. In [16], 24 datasets were used; out of these 24, our method outperforms the method proposed in [16] on 14 datasets and yields comparable performance on the rest, with G-mean and F-score as the evaluation metrics for this comparison. Accuracy was not considered in [16], so to evaluate our findings using accuracy, the comparison is done with [44]. In both cases, our approach proves its efficacy and hence can be applied in real-world settings where the dataset is imbalanced. The proposed technique is further validated against the method presented in [49] in terms of AUC and G-mean; our technique showed superiority on 17 datasets, whereas the RCSMOTE method yielded better results on only 5 of the 28 datasets. We also perform Wilcoxon's signed-rank test, and the results demonstrate a significant difference between the proposed method and the other pre-processing techniques.

References
NE-nu-SVC: a new nested ensemble clinical decision support system for effective diagnosis of coronary artery disease
Performance improvement of decision trees for diagnosis of coronary artery disease using multi filtering approach
Hybrid particle swarm optimization for rule discovery in the diagnosis of coronary artery disease
Association between work-related features and coronary artery disease: a heterogeneous hybrid feature selection integrated with balancing approach
A mixed solution-based high agreement filtering method for class noise detection in binary classification
Face recognition with triangular fuzzy set-based local cross patterns in wavelet domain
Energy choices in Alaska: mining people's perception and attitudes from geotagged tweets
A novel method for sentiment classification of drug reviews using fusion of deep and machine learning techniques. Knowl.-Based Syst
Novel methodology for cardiac arrhythmias classification based on long-duration ECG signal fragments analysis
Automated detection of presymptomatic conditions in Spinocerebellar Ataxia type 2 using Monte Carlo dropout and deep neural network techniques with electrooculogram signals
Mohammed, I.A.: Performance evaluation of various data mining algorithms on road traffic accident dataset
Mining social media and DBpedia data using gephi and R
A boosting-aided adaptive cluster-based undersampling approach for treatment of class imbalance problem
Machine learning with oversampling and undersampling techniques: overview study and experimental results
Dealing with class imbalance in classifier chains via random undersampling. Knowl.-Based Syst
Neighbourhood-based undersampling approach for handling imbalanced and overlapped data
Improved overlap-based undersampling for imbalanced dataset classification with application to Epilepsy and Parkinson's disease
An ensemble learning-based undersampling technique for handling class-imbalance problem
Improved design of quadratic discriminant analysis classifier in unbalanced settings
A new sampling method for classifying imbalanced data based on support vector machine ensemble
Class based weighted k-nearest neighbor over imbalance dataset
A robust decision tree algorithm for imbalanced data sets
Efficient sampling-based energy function evaluation for ensemble optimization using simulated annealing
Improving monarch butterfly optimization through simulated annealing strategy
Multilayer perceptron optimization through simulated annealing and fast simulated annealing
Under-sampling class imbalanced datasets by combining clustering analysis and instance selection
Class weights random forest algorithm for processing class imbalanced medical data
Adaptive multi-objective swarm crossover optimization for imbalanced data classification
ACO resampling: enhancing the performance of oversampling methods for class imbalance classification
A hybrid feature selection with ensemble classification for imbalanced healthcare data: a case study for brain tumor diagnosis
Classification of multiclass imbalanced data using cost-sensitive decision tree C5.0
Genetic algorithm-based PCA classification for imbalanced dataset
G-mean based extreme learning machine for imbalance learning
Hybrid of intelligent minority oversampling and PSO-based intelligent majority undersampling for learning from imbalanced datasets
Medical imbalanced data classification based on random forests
Privacy-preserving cost-sensitive learning
A sintering state recognition framework to integrate prior knowledge and hidden information considering class imbalance
Simulated annealing: from basics to applications
Optimization by simulated annealing
A feature selection approach based on simulated annealing for detecting various denial of service attacks
SMOTE: synthetic minority over-sampling technique
Clustering-based undersampling in class-imbalanced data
Overlap-based undersampling for improving imbalanced data classification
An approach to imbalanced data classification based on instance selection and over-sampling
Asymptotic properties of nearest neighbor rules using edited data
Cluster-based instance selection for the imbalanced data classification
C4.5: Programs for Machine Learning
KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework
RCSMOTE: range-controlled synthetic minority over-sampling technique for handling the class imbalance problem
Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning
Statistical comparisons of classifiers over multiple data sets