2018 International Conference on Sensor Network and Computer Engineering (ICSNCE 2018)

An Ensemble Learning Method for Text Classification Based on Heterogeneous Classifiers

Fan Huimin
School of Computer Science and Engineering, Xi'an Technological University, Xi'an, 710021, China
e-mail: 492896361@qq.com

Li Pengpeng
School of Computer Science and Engineering, Xi'an Technological University, Xi'an, 710021, China
e-mail: m18295715879@163.com

Zhao Yingze
School of Marxism, Xi'an Jiaotong University, Xi'an, China
e-mail: yingze1013@163.com

Li Danyang
School of Computer Science and Engineering, Xi'an Technological University, Xi'an, 710021, China
e-mail: 821563942@qq.com

Abstract—Ensemble learning can improve the accuracy of classification algorithms and has been widely used. Traditional ensemble learning methods such as bagging and boosting are built on homogeneous base classifiers and obtain diversity among the base classifiers only through sample perturbation. However, heterogeneous base classifiers tend to be more diverse, and perturbation from multiple angles tends to yield a wider variety of base classifiers. This paper presents an ensemble learning method for text classification based on heterogeneous base classifiers with multi-angle perturbation, and validates the effectiveness of the method through experiments.

Keywords-Machine Learning; Ensemble Learning; Text Classification

I. INTRODUCTION

The main idea of ensemble learning is to generate multiple learners through certain rules and then adopt some combination strategy to make the final decision[1]. In general, the multiple learners in ensemble learning are homogeneous "weak learners": many such learners are generated by perturbing the training sample set, and a strong learner is obtained after combination. As research on ensemble learning has deepened, a broader definition has gradually been accepted by scholars: ensemble learning refers to any method that learns and combines a collection of multiple classifiers, without restriction on the nature of the classifiers. Nevertheless, research on ensembles of homogeneous classifiers remains the most common, and such ensembles are usually perturbed from only a single angle, such as the training set[2][3]. The random forest algorithm adds perturbation of the classification attributes to the traditional bagging algorithm and thereby obtains a better classification effect[4]. This shows that multi-angle perturbation can produce base learners with larger differences, so that the ensemble model has higher classification accuracy. In addition, research shows that ensembles built on heterogeneous base classifiers have stronger base-learner diversity, so the classification model has higher classification accuracy and better generalization performance[5][6]. Therefore, this paper combines these two factors and designs an ensemble learning method for text classification based on heterogeneous base classifiers with multi-angle perturbation.

II. ENSEMBLE LEARNING

"Is weak learning equivalent to strong learning?" is a theoretical question raised by Kearns and Valiant in 1989, and the Boosting algorithm arose from the proof of this question. A number of variants were subsequently derived from Boosting, including Gradient Boosting, LPBoost and others.
Because boosting trains its classifiers serially, the training process consumes more resources and is less efficient. Whether comparable performance can be obtained with fewer classifiers is therefore a question of interest to researchers. The work of Zhou Zhihua et al. on "selective ensemble"[7][8] helps to address this problem: selective ensemble keeps only the base classifiers with good classification results when building the ensemble. This idea makes the construction of the ensemble model more efficient without changing the original algorithm used to train the base classifiers. In recent years, selective ensemble methods based on clustering, selection, optimization and other strategies have also been developed.

The theoretical basis of ensemble learning shows that weak learnability and strong learnability are equivalent, so we can look for ways to convert weak learners into strong learners rather than searching directly for strong learners that are hard to find. The representative ensemble learning methods at present are boosting and bagging. The traditional Bagging and Boosting algorithms, as well as the many algorithms derived from them, are ensemble learning methods based on homogeneous base classifiers, and they obtain diversity only through sample perturbation, whereas multi-angle perturbation and heterogeneous classifiers can further improve classification accuracy. This paper first trains and combines homogeneous base classifiers and compares the changes in accuracy between the base classifiers and the ensemble model; it then builds an ensemble of k-nearest neighbor, Bayesian and logistic regression classifiers for text classification, and compares this heterogeneous ensemble with the homogeneous Bagging algorithm in terms of the KW diversity measure and classification accuracy.

III. ENSEMBLE LEARNING MODEL BASED ON HETEROGENEOUS BASE CLASSIFIERS

To obtain an ensemble learning model with higher accuracy, as many base classifiers as possible with both good classification results and high diversity should be obtained. From the perspective of diversity, we can select combinations of many "attributes" from the variable factors in the classification process, where "attribute" refers to anything that causes the classification result of the algorithm to change. From the general process of text classification, the feature selection algorithm, the feature dimension, the choice of classifier and the classifier parameters can all serve as sources of classifier diversity. For each classification model, its algorithm parameters, feature selection algorithm and feature dimension are perturbed. In this paper, several kinds of classifiers are combined, and an ensemble learning model based on heterogeneous base classifiers with multi-angle perturbation is designed.

The inputs to the model training process are a feature selection algorithm set S, a feature dimension set N, a classifier set C, an adjustable parameter set A, and a set (dictionary) V of candidate parameter values. The training steps are as follows (a minimal code sketch of this loop is given after the steps):
Step 1: Pre-process the sample set.
Step 2: For each feature selection algorithm and each feature dimension, perform feature selection and add the result to the feature selection result list L.
Step 3: Perform Step 4 for each classifier.
Step 4: For each candidate parameter value of the classifier and each result in the list L, train the classifier and save it to the classifier list C-output.
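As an illustration only, the following minimal Python sketch shows one way Steps 2-4 could be realized with scikit-learn components. The names (feature_selectors, dims, classifiers, train_ensemble) and the listed parameter values are assumptions for illustration, not the authors' implementation; scikit-learn ships chi-square and mutual-information scorers, while an information-gain scorer would have to be written separately.

```python
from itertools import product

from sklearn.base import clone
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Illustrative perturbation sources: feature-selection algorithms S,
# feature dimensions N, and classifiers C with candidate parameters already set.
feature_selectors = {"CHI": chi2, "MI": mutual_info_classif}
dims = [400, 450, 500]
classifiers = [
    KNeighborsClassifier(n_neighbors=5),
    KNeighborsClassifier(n_neighbors=10),
    KNeighborsClassifier(n_neighbors=15),
    MultinomialNB(),
    LogisticRegression(solver="sag", max_iter=1000),
]

def train_ensemble(X_train, y_train):
    """Steps 2-4: train one base model per (selector, dimension, classifier) combination."""
    c_output = []  # the classifier list C-output
    for (_, score_fn), k, clf in product(feature_selectors.items(), dims, classifiers):
        # Step 2: feature selection for this algorithm and dimension;
        # Steps 3-4: train a fresh copy of the classifier on the selected features.
        model = make_pipeline(SelectKBest(score_fn, k=k), clone(clf))
        model.fit(X_train, y_train)
        c_output.append(model)
    return c_output
```

Under these assumptions, X_train would be a non-negative term-count or tf-idf matrix, as required by the chi-square scorer and the multinomial Bayesian classifier.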
The output of the model is the classifier list C-output. The testing process of the model is as follows: after the samples to be tested are pre-processed and vectorized, the series of trained classification models is used to predict each sample, yielding multiple classification results, and a majority-voting combination strategy produces the final classification result.

The feature selection algorithm, the feature dimension and the classifier all serve as sources of diversity for the base classifiers. In this paper, the feature selection algorithm can be chi-square statistics, information gain or the mutual information algorithm. Classifier perturbation is realized by training Bayesian, k-nearest neighbor and logistic regression classifiers. Since the parameters of each classifier are also variable, they can likewise be used as perturbation variables.

IV. EXPERIMENT ANALYSIS

The experiment uses the whole-network news dataset from Sogou Labs. We randomly selected 600 news documents from the five categories of finance, education, automotive, entertainment and women, and used the body text together with its category label as the experimental text dataset (a balanced dataset). In the experiments, 80% of the data is used as the training set and the rest as the test set.

A. The impact of changes in feature dimension

[Figure 1. Accuracy of the ensemble model and the single-classifier models as the feature dimension varies]

With the increase of the feature dimension, the accuracy of every model rises. When the number of features is small, the accuracy of the ensemble model is lower only than that of the information gain model; when the number of features exceeds 300, the ensemble model performs best. This shows that the classification effect of the ensemble model is not always better than that of a single classifier: when the feature dimension is small, the accuracy of the ensemble model is lower than that of the information gain model. In the results obtained on the experimental data of this paper, once the feature dimension exceeds 400, the accuracy of every model tends to be stable and the accuracy of the ensemble learning model is always higher than that of any single-classifier model.

B. The effect of the feature selection algorithm and the classifier

TABLE I. EXPERIMENT ON THE PERTURBATION OF THE FEATURE SELECTION ALGORITHM

  Type                 Feature selection algorithm   Classifier            Accuracy/%
  Base classifier      CHI                           KNN                   79.6
  Base classifier      IG                            KNN                   83.8
  Base classifier      MI                            KNN                   88.8
  Ensemble classifier  All three above               KNN                   89.6
  Base classifier      CHI                           Bayesian              77.8
  Base classifier      IG                            Bayesian              90.2
  Base classifier      MI                            Bayesian              82.4
  Ensemble classifier  All three above               Bayesian              86.2
  Base classifier      CHI                           Logistic regression   79.4
  Base classifier      IG                            Logistic regression   91.4
  Base classifier      MI                            Logistic regression   87.6
  Ensemble classifier  All three above               Logistic regression   89.2

It can be seen from the experimental results that, under otherwise identical conditions, the classification results of a classifier combined with the different feature selection algorithms differ considerably. In other words, the diversity among base classifiers obtained by perturbing the feature selection algorithm is strong, so using a variety of feature selection algorithms can serve as one source of base classifier diversity.
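The ensemble rows in Table 1 combine the three base classifiers above them with the majority-voting strategy described at the start of this section. A minimal sketch of that voting step is given below; the helper name predict_by_vote is hypothetical and the sketch assumes the classifier list produced by the training loop above.

```python
import numpy as np

def predict_by_vote(c_output, X_test):
    """Majority voting over the trained base classifiers (hypothetical helper)."""
    # One row of predictions per base classifier: shape (n_models, n_samples).
    votes = np.array([model.predict(X_test) for model in c_output])
    final = []
    for column in votes.T:  # one column per test sample
        labels, counts = np.unique(column, return_counts=True)
        final.append(labels[np.argmax(counts)])  # the most-voted label wins
    return np.array(final)
```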
As can be seen from Table 1, the accuracy of a single classifier varies considerably with the choice of chi-square statistics, information gain or mutual information, and the accuracy of the ensemble classifier is higher than, or close to, that of the best single classifier in each group. Perturbing the classifier itself also causes large differences in accuracy, so classifier perturbation can likewise serve as one source of classifier diversity.

C. Effect of classifier parameters

Because different settings of the base classifier parameters lead to differences between the trained models, experiments were designed to further examine the accuracy of the base learners under perturbation of the classifier parameters. The experimental results are shown in Tables 2-4.

TABLE II. PARAMETER PERTURBATION EXPERIMENT OF THE K-NEAREST NEIGHBOR CLASSIFIER

  Type                 K    Classifier   Accuracy/%
  Base classifier      5    KNN          85.2
  Base classifier      10   KNN          85.6
  Base classifier      15   KNN          84.6
  Base classifier      20   KNN          81.2
  Base classifier      25   KNN          78.6
  Base classifier      30   KNN          78.0
  Ensemble classifier  -    -            81.6

TABLE III. PARAMETER PERTURBATION EXPERIMENT OF THE BAYESIAN CLASSIFIER

  Type                 Type of classifier   Classifier            Accuracy/%
  Base classifier      Multinomial          Bayesian classifier   89.6
  Base classifier      Gaussian             Bayesian classifier   91.0
  Base classifier      Bernoulli            Bayesian classifier   84.8
  Ensemble classifier  -                    -                     93.2

TABLE IV. PARAMETER PERTURBATION EXPERIMENT OF THE LOGISTIC REGRESSION CLASSIFIER

  Type                 Classification scheme   Loss function optimization method   Classifier            Accuracy/%
  Base classifier      One-vs-rest             liblinear                           Logistic regression   90.6
  Base classifier      One-vs-rest             newton-cg                           Logistic regression   90.6
  Base classifier      One-vs-rest             lbfgs                               Logistic regression   90.6
  Base classifier      One-vs-rest             sag                                 Logistic regression   90.8
  Base classifier      Multinomial             newton-cg                           Logistic regression   90.8
  Base classifier      Multinomial             lbfgs                               Logistic regression   90.8
  Base classifier      Multinomial             sag                                 Logistic regression   90.8
  Ensemble classifier  -                       -                                   -                     90.6

Comparing the three groups of experiments in Tables 2-4, the k-nearest neighbor classifier shows strong diversity among its base classifiers when the value of K is perturbed, and the Bayesian classifier shows the same when the "classifier type" parameter is perturbed; the base classifiers with higher classification accuracy are therefore candidates for the ensemble. The logistic regression classifier, however, is insensitive to the "classification scheme" and "loss function optimization method" parameters: the accuracy of its base classifiers is almost constant and their diversity is low, so only one logistic regression classifier needs to be selected for the multi-angle perturbation ensemble model.

D. Multi-angle perturbation

Through the above three groups of experiments, we screened out the base classifier parameters that yield strong diversity. From the data obtained in these experiments, the KW diversity measure among the homogeneous base classifiers making up each ensemble can be calculated, as shown in Table 5.

TABLE V. BASE CLASSIFIER DIVERSITY MEASURE (KW VALUE)

  Ensemble learning model                          Perturbed variable                        KW value
  KNN                                              Feature selection algorithm               0.06
  Bayesian classifier                              Feature selection algorithm               0.05
  Logistic regression classifier                   Feature selection algorithm               0.04
  CHI                                              Classifier                                0.07
  IG                                               Classifier                                0.03
  MI                                               Classifier                                0.05
  KNN                                              K value                                   0.04
  Bayesian classifier                              Classifier type                           0.04
  Logistic regression classifier                   Classification and optimization method    0
  Heterogeneous base classifiers                   Multi-angle perturbation                  0.07

The range of KW values is [0, 1]. When KW is 0 or 1, the base classifiers behave identically and there is no diversity among them; when KW is 0.25, the diversity of the base classifiers is highest. As can be seen from Table 5, the ensemble built on heterogeneous base classifiers with multi-angle perturbation is among the most diverse, and its KW value matches or exceeds that of the other ensemble learning models.
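The paper does not state the exact formula used for KW, so the sketch below assumes the standard Kohavi-Wolpert variance (whose maximum under this definition is 0.25 when the base classifiers split evenly); a different normalization would scale the values in Table 5 differently. The function name kw_variance and its arguments are illustrative.

```python
import numpy as np

def kw_variance(c_output, X_test, y_test):
    """Kohavi-Wolpert variance of a classifier list: the mean of l*(L - l) / L^2,
    where L is the number of base classifiers and l is the number of base
    classifiers that predict a given sample correctly. Larger values mean
    more disagreement among the base classifiers."""
    preds = np.array([model.predict(X_test) for model in c_output])  # shape (L, N)
    L = preds.shape[0]
    correct = (preds == np.asarray(y_test)).sum(axis=0)              # l for every sample
    return float(np.mean(correct * (L - correct)) / L ** 2)
```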
Using the ensemble method described above, the feature selection algorithms, feature dimensions, classifiers and their parameters are taken as inputs, all the base classifiers are combined, and an ensemble model based on heterogeneous base classifiers with multi-angle perturbation is obtained. The parameters of this multi-angle perturbation ensemble learning model are summarized in Table 6.

TABLE VI. MODEL PARAMETERS

  Variable                      Value / classifier               Classifier parameter values
  Feature selection algorithm   CHI, IG, MI                      -
  Feature dimension             400, 450, 500                    -
  Classifier                    Bayesian classifier              Type: Gaussian, Bernoulli, Multinomial
  Classifier                    KNN                              K = 5, 10, 15
  Classifier                    Logistic regression classifier   Classification scheme: one-vs-rest; optimization method: sag

The parameters shown in Table 6 are used as inputs to train the ensemble learning model designed in this paper, and the model is compared with Bagging text classification models that use only sample perturbation. The experimental results are shown in Table 7.

TABLE VII. COMPARISON BETWEEN THE PROPOSED MODEL AND THE BAGGING MODELS

  Model                            Perturbed variable         Classifier            KW value   Accuracy/%
  Bagging                          Sample perturbation        KNN                   0.10       83.0
  Bagging                          Sample perturbation        Bayesian              0.03       85.4
  Bagging                          Sample perturbation        Logistic regression   0.06       83.0
  Heterogeneous classifier model   Multi-angle perturbation   -                     0.07       92.0

The experimental results show that the Bagging algorithm based on the k-nearest neighbor classifier has a higher KW value, that is, its base classifiers are diverse, but its accuracy is low. The Bagging algorithms based on the Bayesian and logistic regression classifiers have both low KW values and low accuracy, that is, their base classifiers have little diversity. The ensemble learning model based on heterogeneous base classifiers with multi-angle perturbation designed in this paper achieves the highest classification accuracy together with strong diversity among its base classifiers.

V. CONCLUSION

This paper analyzes the procedures of the Bagging and Boosting algorithms and notes that both are ensemble learning strategies based on homogeneous classifiers, while research on ensemble learning with heterogeneous base classifiers is still limited. We therefore design an ensemble learning model based on heterogeneous base classifiers with multi-angle perturbation, perturbing the heterogeneous classifiers from multiple angles and combining them. The experimental results show that the proposed model achieves higher classification accuracy and rich base classifier diversity, which provides a basis for further research on the integration of heterogeneous classifiers.

REFERENCES

[1] Lai J H. Ensemble Learning for Text Classification[J]. 2017.
[2] Wang G, Sun J, Ma J, et al. Sentiment classification: The contribution of ensemble learning[J]. Decision Support Systems, 2014, 57: 77-93.
[3] Xia R, Zong C, Li S. Ensemble of feature sets and classification algorithms for sentiment classification[J]. Information Sciences, 2011, 181(6): 1138-1152.
[4] Jia J, Liu Z, Xiao X, et al. pSuc-Lys: Predict lysine succinylation sites in proteins with PseAAC and ensemble random forest approach[J]. Journal of Theoretical Biology, 2016, 394: 223-230.
[5] Rodriguez J J, Kuncheva L I, Alonso C J. Rotation forest: A new classifier ensemble method[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2006, 28(10): 1619-1630.
[6] Wu Z, Lin W, Zhang Z, et al. An ensemble random forest algorithm for insurance big data analysis[C]//2017 IEEE International Conference on Computational Science and Engineering (CSE) and Embedded and Ubiquitous Computing (EUC). IEEE, 2017, 1: 531-536.
[7] Li N, Jiang Y, Zhou Z H. Multi-label selective ensemble[C]//International Workshop on Multiple Classifier Systems. Springer, Cham, 2015: 76-88.
[8] Qian C, Yu Y, Zhou Z H. Pareto ensemble pruning[C]//AAAI. 2015: 2935-2941.