Expert Systems with Applications 33 (2007) 1–5
doi:10.1016/j.eswa.2006.04.001

A novel feature selection algorithm for text categorization

Wenqian Shang a,*, Houkuan Huang a, Haibin Zhu b, Yongmin Lin a, Youli Qu a, Zhihai Wang a

a School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, PR China
b Department of Computer Science, Nipissing University, North Bay, Ont., Canada P1B 8L7

* Corresponding author. E-mail addresses: shangwenqian@hotmail.com (W. Shang), haibinz@npissingu.ca (H. Zhu).

Abstract

With the development of the web, large numbers of documents are available on the Internet, and digital libraries, news sources and internal company data keep growing. Automatic text categorization therefore becomes increasingly important for dealing with such massive data. A major problem of text categorization, however, is the high dimensionality of the feature space. Many methods for text feature selection already exist. To improve the performance of text categorization, we present another method for text feature selection. Our study is based on Gini index theory, and we design a novel Gini index algorithm to reduce the high dimensionality of the feature space. A new Gini index measure function is constructed and adapted to text categorization. The experimental results show that our improved Gini index performs better than other feature selection methods.
© 2006 Elsevier Ltd. All rights reserved.

Keywords: Text feature selection; Text categorization; Gini index; kNN classifier; Text preprocessing

1. Introduction

With the advance of the WWW (world wide web), text categorization has become a key technology for processing and organizing large numbers of documents. More and more methods based on statistical theory and machine learning have been applied to text categorization in recent years, for example, k-nearest neighbor (kNN) (Cover & Hart, 1967; Yang, 1997; Yang & Lin, 1999; Tan, 2005), Naive Bayes (Lewis, 1998), decision trees (Lewis & Ringuette, 1994), support vector machines (SVM) (Joachims, 1998), linear least squares fit, neural networks, SWAP-1, and Rocchio.

A major problem of text categorization is the high dimensionality of the feature space. Many learning algorithms cannot cope with such high dimensionality. Moreover, most of these dimensions are not relevant to text categorization, and noisy features can even hurt the precision of the classifier. Hence, we need to select representative features from the original feature space (i.e., feature selection) to reduce the dimensionality of the feature space and improve the efficiency and precision of the classifier. Current feature selection methods are based on statistical theory and machine learning. Some well-known methods are information gain, expected cross entropy, the weight of evidence of text, odds ratio, term frequency, mutual information, CHI (Yang & Pedersen, 1997; Mladenic & Grobelnik, 2003; Mladenic & Grobelnik, 1999) and so on. In this paper, we do not discuss these methods in detail. We present a new text feature selection method: the Gini index. The Gini index was originally used in decision trees for splitting attributes and achieved good categorization precision. However, it is rarely used for feature selection in text categorization.
Shankar and Karypis discuss how to use the Gini index for text feature selection and weight adjustment. They mainly pay attention to weight adjustment, their method is limited to centroid-based classifiers, and their iterative procedure is time-consuming. Our method is very different from theirs. By analyzing the principles of the Gini index and of text features in depth, we construct a new Gini index measure function and use it to select features in the original feature space. It fits not only centroid classifiers but also other classifiers. The experiments show that its quality is comparable with other text feature selection methods, while its computational complexity is lower and its speed is higher.

The rest of this paper is organized as follows. Section 2 describes the classical Gini index algorithm. Section 3 gives the improved Gini index algorithm. Section 4 describes the classifiers used in the experiments to compare the Gini index with the other text feature selection methods. Section 5 presents the experimental results and their analysis. The last section gives the conclusion.

2. Classical Gini index algorithm

The Gini index is an impurity-based splitting method. It fits nominal attributes, binary attributes, continuous numerical values, etc. It was put forward by Breiman, Friedman, and Olshen (1984) and is widely used in decision tree algorithms such as CART, SLIQ, SPRINT and Intelligent Miner. The main idea of the Gini index algorithm is as follows.

Suppose S is a set of s samples, and these samples belong to m different classes (C_i, i = 1, ..., m). According to the class differences, we can divide S into m subsets (S_i, i = 1, ..., m). Suppose S_i is the sample set that belongs to class C_i and s_i is the number of samples in S_i; then the Gini index of set S is

Gini(S) = 1 - \sum_{i=1}^{m} P_i^2,   (1)

where P_i is the probability that any sample belongs to C_i, estimated by s_i/s. The minimum of Gini(S) is 0, reached when all the members of the set belong to the same class; this denotes that the attribute yields the maximum useful information. When the samples in the set are distributed uniformly over the classes, Gini(S) reaches its maximum; this denotes that the attribute yields the minimum useful information. If the set is divided into n subsets, the Gini index after splitting is

Gini_split(S) = \sum_{j=1}^{n} \frac{s_j}{s} Gini(S_j).   (2)

The split with the minimum Gini_split is selected. In other words, for every attribute, after traversing all possible segmentation methods, the attribute that provides the minimum Gini index is selected as the splitting criterion of the node, whether it is the root node or a sub-node.
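To make the splitting criterion concrete, the following minimal sketch computes formulas (1) and (2) from class labels. It is only an illustration under our own naming and toy data, not code from the original work.

```python
from collections import Counter

def gini(labels):
    """Gini index of a sample set, formula (1): 1 - sum_i P_i^2."""
    s = len(labels)
    if s == 0:
        return 0.0
    return 1.0 - sum((count / s) ** 2 for count in Counter(labels).values())

def gini_split(partitions):
    """Gini index after splitting, formula (2): size-weighted sum of subset Gini values."""
    s = sum(len(part) for part in partitions)
    return sum(len(part) / s * gini(part) for part in partitions)

# Toy example: a binary attribute splits 10 samples into two subsets.
left = ["sports", "sports", "sports", "politics"]                              # attribute value = 0
right = ["politics", "politics", "politics", "politics", "sports", "sports"]   # attribute value = 1
print(gini(left + right))          # impurity before the split
print(gini_split([left, right]))   # impurity after the split; the smaller, the better the attribute
```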
3. The improved Gini index algorithm

Applying the Gini index theory described above directly to text feature selection, we can construct the formula

Gini(W) = P(W)\left(1 - \sum_i P(C_i|W)^2\right) + P(\overline{W})\left(1 - \sum_i P(C_i|\overline{W})^2\right).   (3)

After analyzing and comparing the merits and demerits of the existing text feature selection measure functions, we improve formula (3) to

Gini_Text(W) = \sum_i P(W|C_i)^2 P(C_i|W)^2.   (4)

Why do we amend formula (3) to formula (4)? The reasons include the following three aspects:

(1) The original form of the Gini index measures the impurity of an attribute with respect to categorization: the smaller the impurity, the better the attribute. If we adopt the form Gini(S) = \sum_{i=1}^{m} P_i^2 instead, it measures the purity of the attribute with respect to categorization: the bigger the purity, the better the attribute. In this paper, we adopt the purity form, which is better suited to text feature selection. The papers (Gupta, Somayajulu, Arora, & Vasudha, 1998; Shankar & Karypis) also adopt the purity form.

(2) Other authors emphasize that text feature selection should incline towards high-frequency words, that is, they include the P(W) factor in the formula. Experiments show that words that do not appear in a document also contribute to judging the class of the text, but this contribution is far less significant than the cost of considering the words that do not appear, especially when the distribution of classes and feature values is highly unbalanced. Yang and Pedersen (1997) and Mladenic and Grobelnik (1999) compare and analyze the merits and demerits of many feature measure functions in their papers. Their experiments show that the demerit of information gain is that it considers the words that do not appear, while the demerit of mutual information is that it does not consider the effect of the P(W) factor, which leads it to select rare words. Expected cross entropy and the weight of evidence of text overcome these demerits, hence their results are better. Therefore, when we construct the new Gini index measure function, we discard the factor expressing words that do not appear.

(3) Suppose W1 appears only in documents of class C1 and appears in every document of C1, and W2 appears only in documents of class C2 and appears in every document of C2; then W1 and W2 are equally important features. But because P(C_i) ≠ P(C_j), computing with Gini_Text(W) = P(W) \sum_i P(C_i|W)^2 gives Gini_Text(W1) ≠ Gini_Text(W2), which is not consistent with domain knowledge. So we replace P(W) with P(W|C_i)^2 to account for the unbalanced class distribution. In formula (4), if W appears only in documents of class C_i and appears in every document of C_i, Gini_Text(W) reaches its maximum, namely Gini_Text(W) = 1. This is consistent with domain knowledge. Without the term P(W|C_i)^2, P(C_i|W)^2 is just the posterior probability when feature W appears, according to the Bayes decision theory of minimum error rate, and when the documents in which W appears are distributed evenly over the classes Gini_Text(W) would reach its minimum. But a text feature is special: it takes only two values, appearing in a document or not, and according to domain knowledge we omit the case that a feature does not appear in the documents. The classes in the training set are usually unbalanced, so it would be arbitrary to conclude that Gini_Text(W) is the minimum. Hence, when we construct the new Gini index measure function, we also consider feature W's conditional probability, combining the posterior probability and the conditional probability into a single measure function to suppress the effect of unbalanced classes.
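As an illustration of how formula (4) can be used for feature selection, the following minimal sketch scores every term by Gini_Text(W) over a labeled training collection and keeps the highest-scoring terms. The function names and the representation of documents as token lists are our own assumptions, not the authors' implementation.

```python
from collections import Counter, defaultdict

def gini_text_scores(docs, labels):
    """Score each term W by Gini_Text(W) = sum_i P(W|C_i)^2 * P(C_i|W)^2, formula (4).
    docs: list of token lists; labels: list of class labels, one per document."""
    docs_per_class = Counter(labels)               # number of documents in each class C_i
    df_per_class = defaultdict(Counter)            # term -> class -> document frequency
    for tokens, label in zip(docs, labels):
        for term in set(tokens):                   # presence/absence only, not term frequency
            df_per_class[term][label] += 1
    scores = {}
    for term, class_df in df_per_class.items():
        df_total = sum(class_df.values())          # documents containing the term
        score = 0.0
        for c, df in class_df.items():
            p_w_given_c = df / docs_per_class[c]   # P(W|C_i)
            p_c_given_w = df / df_total            # P(C_i|W)
            score += (p_w_given_c ** 2) * (p_c_given_w ** 2)
        scores[term] = score
    return scores

def select_features(docs, labels, k):
    """Return the k terms with the highest Gini_Text score."""
    scores = gini_text_scores(docs, labels)
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Tiny hypothetical example.
docs = [["oil", "price", "rise"], ["match", "goal", "score"], ["oil", "export"], ["goal", "team"]]
labels = ["economy", "sports", "economy", "sports"]
print(select_features(docs, labels, k=3))
```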
4. Classifiers in the experiments

In order to evaluate the new feature selection algorithm, we use three classifiers, SVM (support vector machine), kNN and fkNN, to show that our new Gini index algorithm is effective with different classifiers. The classifiers can be described as follows.

4.1. kNN classifier

The kNN algorithm searches for the k documents (called neighbors) in the training set that have the maximal similarity (cosine similarity) to the test document. According to the classes these neighbors are affiliated with, it grades the test document's candidate classes. The similarity between each neighbor document and the test document is taken as the weight of that neighbor's class. The decision function can be defined as

\mu_j(X) = \sum_{i=1}^{k} \mu_j(X_i) \, sim(X, X_i),   (5)

where μ_j(X_i) ∈ {0, 1} shows whether X_i belongs to ω_j (μ_j(X_i) = 1) or not (μ_j(X_i) = 0), and sim(X, X_i) denotes the similarity between the training document X_i and the test document X. The decision rule is: if μ_j(X) = max_i μ_i(X), then X ∈ ω_j.

4.2. fkNN classifier

The kNN algorithm of Section 4.1 cannot achieve good categorization performance when the class distribution is unbalanced. Hence, we adopt fuzzy theory to improve the kNN algorithm as follows; the reasons for this improvement can be found in Shang, Huang, Zhu, and Lin (2005):

\mu_j(X) = \frac{\sum_{i=1}^{k} \mu_j(X_i) \, sim(X, X_i) \, \frac{1}{(1 - sim(X, X_i))^{2/(b-1)}}}{\sum_{i=1}^{k} \frac{1}{(1 - sim(X, X_i))^{2/(b-1)}}},   (6)

where j = 1, 2, ..., c and μ_j(X_i) is the membership of the known sample X_i to class j: if X_i belongs to class j the value is 1, otherwise 0. From this formula, we can see that the membership in effect uses the distance of each neighbor to the sample being classified to weight its contribution. The parameter b adjusts the strength of the distance weighting; in this paper we set b = 2. The fuzzy k-nearest-neighbor decision rule is then: if μ_j(X) = max_i μ_i(X), then X ∈ ω_j.

4.3. SVM classifier

SVM was put forward by Vapnik (1995) to solve two-class categorization problems. Here we adopt a linear SVM and use the one-versus-rest method to classify the documents; a detailed description can be found in Vapnik (1995).
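The two decision rules above can be sketched as follows, assuming the k nearest neighbors of a test document have already been retrieved together with their cosine similarities. This is a minimal illustration under our own naming; the small epsilon guarding against a similarity of exactly 1 is our addition and is not part of formula (6).

```python
from collections import defaultdict

def knn_decide(neighbors):
    """kNN decision, formula (5): score each class by the summed cosine similarity
    of the neighbors belonging to it. neighbors: list of (similarity, label) pairs."""
    scores = defaultdict(float)
    for sim, label in neighbors:
        scores[label] += sim
    return max(scores, key=scores.get)

def fknn_decide(neighbors, b=2.0, eps=1e-9):
    """Fuzzy kNN decision, formula (6): each neighbor is additionally weighted by
    1 / (1 - sim)^(2/(b-1)), so closer neighbors count more."""
    num = defaultdict(float)
    denom = 0.0
    for sim, label in neighbors:
        w = 1.0 / ((1.0 - sim + eps) ** (2.0 / (b - 1.0)))  # eps guards sim == 1
        num[label] += sim * w
        denom += w
    memberships = {label: v / denom for label, v in num.items()}
    return max(memberships, key=memberships.get)

# Example: five nearest neighbors of a hypothetical test document.
neighbors = [(0.92, "sports"), (0.88, "sports"), (0.85, "politics"),
             (0.83, "politics"), (0.80, "politics")]
print(knn_decide(neighbors), fknn_decide(neighbors))  # the two rules may disagree
```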
5. Experiments

5.1. Data collections

We use two corpora in this study: the Reuters-21578 collection and a data set from the International Database Center, Department of Computing and Information Technology, Fudan University, China.

For the Reuters-21578 data set, we adopt the top ten classes, with 7053 documents in the training set and 2726 documents in the test set. The class distribution is unbalanced: the largest class has 2875 documents, occupying 40.762% of the training set, while the smallest class has 170 documents, occupying 2.41% of the training set.

For the second data set, we use 3148 documents as training samples and 3522 documents as test samples. The training samples are divided into document sets A and B. In document set A the class distribution is unbalanced: there are 619 political documents, occupying 34.43% of training set A, while there are only 59 energy-source documents, occupying 3.28% of training set A. In training set B the class distribution is balanced: every class has 150 documents.

5.2. Experimental settings

For every classifier, in the text preprocessing phase we use information gain, expected cross entropy, the weight of evidence of text and CHI for comparison with our improved Gini index algorithm. The measure functions are defined as follows.

Information gain:

Inf_Gain(W) = P(W) \sum_{i=1}^{m} P(C_i|W) \log_2 \frac{P(C_i|W)}{P(C_i)} + P(\overline{W}) \sum_{i=1}^{m} P(C_i|\overline{W}) \log_2 \frac{P(C_i|\overline{W})}{P(C_i)}   (7)

Expected cross entropy:

Cross_Entropy(W) = P(W) \sum_{i=1}^{m} P(C_i|W) \log_2 \frac{P(C_i|W)}{P(C_i)}   (8)

CHI (\chi^2):

\chi^2(W) = \sum_{i=1}^{m} P(C_i) \cdot \frac{N (A_1 A_4 - A_2 A_3)^2}{(A_1 + A_3)(A_2 + A_4)(A_1 + A_2)(A_3 + A_4)}   (9)

Weight of evidence of text:

Weight_of_Evid(W) = P(W) \cdot \sum_{i=1}^{m} P(C_i) \left| \log \frac{P(C_i|W)(1 - P(C_i))}{P(C_i)(1 - P(C_i|W))} \right|   (10)

After selecting the feature subset with the above measure functions, we use TF–IDF to weight the features:

w_{ik} = \frac{tf_{ik} \cdot \log(N/n_i)}{\sqrt{\sum_{j=1}^{M} [tf_{jk} \cdot \log(N/n_j)]^2}}   (11)

In the experiments we set k = 45 on Reuters-21578, k = 10 on document set A, and k = 35 on document set B.

5.3. Performance measure

To evaluate the performance of a text classifier, we use the F1 measure put forward by Rijsbergen (1979). This measure combines recall and precision as follows:

Recall = number of correct positive predictions / number of positive examples

Precision = number of correct positive predictions / number of positive predictions

F_1 = \frac{2 \cdot Recall \cdot Precision}{Recall + Precision}
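The tables in the next subsection report Macro-F1 and Micro-F1, which aggregate the per-class F1 above in the two standard ways. The sketch below shows one way to compute both from per-class contingency counts; it is an illustrative sketch with hypothetical counts, not the authors' evaluation code.

```python
def f1(tp, fp, fn):
    """Per-class F1 from true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def macro_micro_f1(counts):
    """counts: dict mapping class -> (tp, fp, fn).
    Macro-F1 averages the per-class F1 scores; Micro-F1 pools the counts first."""
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return macro, f1(tp, fp, fn)

# Hypothetical counts for two classes.
counts = {"acq": (90, 10, 5), "earn": (40, 5, 20)}
print(macro_micro_f1(counts))
```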
5.4. The experimental results and analysis

The experimental results on Reuters-21578 are shown in Table 1. From this table, we can see that with SVM and fkNN the Gini index gets the best categorization performance. We can also notice that all five measure functions perform well: with SVM the Micro-F1 difference between the best and the worst is 0.366%, with kNN it is 0.294%, and with fkNN it is 0.477%. With kNN, the Macro-F1 of the Gini index is only inferior to information gain, and its Micro-F1 is only inferior to CHI.

Table 1
The performance of the five feature selection measure functions on the Reuters-21578 top 10 classes

Measure function    SVM                    kNN                    fkNN
                    Macro-F1   Micro-F1    Macro-F1   Micro-F1    Macro-F1   Micro-F1
Gini index          69.940     88.591      66.584     85.620      67.999     86.537
Inf Gain            69.436     88.445      66.860     85.326      67.032     86.134
Cross Entropy       69.436     88.445      66.579     85.326      67.518     86.207
CHI                 67.739     88.225      66.404     85.761      66.846     86.060
Weight of Evid      68.731     88.481      66.766     85.180      67.509     86.280

The experimental results on the second data set are shown in Tables 2 and 3. From Table 2, we can see that with SVM the Gini index is only inferior to CHI and exceeds information gain; with kNN, the Macro-F1 of the Gini index is only inferior to CHI but its Micro-F1 is the best; with fkNN, the Gini index is only inferior to the weight of evidence of text.

Table 2
The performance of the five feature selection measure functions on training set A

Measure function    SVM                    kNN                    fkNN
                    Macro-F1   Micro-F1    Macro-F1   Micro-F1    Macro-F1   Micro-F1
Gini index          91.577     90.941      84.176     83.043      84.763     83.856
Inf Gain            91.531     90.708      83.318     81.301      84.346     82.811
Cross Entropy       91.481     90.708      83.318     81.301      84.216     82.578
CHI                 91.640     91.057      84.491     82.811      85.256     84.008
Weight of Evid      91.407     90.825      84.073     82.927      85.867     85.017

From Table 3, we can find that with SVM the Gini index is only inferior to information gain and the weight of evidence of text; with kNN, the Macro-F1 of the Gini index is only inferior to information gain but its Micro-F1 is the best; with fkNN, the Macro-F1 of the Gini index is only inferior to information gain but its Micro-F1 is the best.

Table 3
The performance of the five feature selection measure functions on training set B

Measure function    SVM                    kNN                    fkNN
                    Macro-F1   Micro-F1    Macro-F1   Micro-F1    Macro-F1   Micro-F1
Gini index          91.421     91.222      86.272     85.222      87.006     86.556
Inf Gain            91.799     91.556      86.326     85.222      87.305     86.556
Cross Entropy       91.419     91.222      85.764     85.111      86.999     86.444
CHI                 91.238     91.000      85.770     85.000      86.898     86.444
Weight of Evid      91.799     91.556      85.914     85.111      87.138     86.444

In summary, on some data sets our improved Gini index achieves the best categorization performance, and on the others it is only slightly inferior to the best of the other measure functions. As a whole, the Gini index shows good categorization performance. From formulas (7)–(10) we can also see that the computation of the Gini index is simpler than that of the other feature selection methods: it has no logarithm computations, only simple multiplications.

6. Conclusion

In this paper, we studied text feature selection based on the Gini index and compared its performance with the other feature selection methods in text categorization. The experiments show that our improved Gini index has better performance and simpler computation than the other feature selection methods. It is a promising method for text feature selection. In the future, we will improve this method further and study how to select different feature selection methods for different data sets.

Acknowledgement

This research is partly supported by the Beijing Jiaotong University Science Foundation under Grant 2004RC008.

References

Breiman, L., Friedman, J. H., Olshen, R. A., et al. (1984). Classification and regression trees. Monterey, CA: Wadsworth International Group.
Cover, T. M., & Hart, P. E. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, IT-13(1), 21–27.
Gupta, S. K., Somayajulu, D. V. L. N., Arora, J. K., & Vasudha, B. (1998). Scalable classifiers with dynamic pruning. In Proceedings of the 9th international workshop on database and expert systems applications (pp. 246–251). Washington, DC, USA: IEEE Computer Society.
Joachims, T. (1998). Text categorization with support vector machines: learning with many relevant features. In Proceedings of the 10th European conference on machine learning (pp. 137–142). New York: Springer.
Lewis, D. D. (1998). Naïve (Bayes) at forty: the independence assumption in information retrieval. In Proceedings of the 10th European conference on machine learning (pp. 4–15). New York: Springer.
Lewis, D. D., & Ringuette, M. (1994). Comparison of two learning algorithms for text categorization. In Proceedings of the third annual symposium on document analysis and information retrieval (pp. 81–93). Las Vegas, NV, USA.
Mladenic, D., & Grobelnik, M. (1999). Feature selection for unbalanced class distribution and Naïve Bayes. In Proceedings of the 16th international conference on machine learning (pp. 258–267). San Francisco.
Mladenic, D., & Grobelnik, M. (2003). Feature selection on hierarchy of web documents. Decision Support Systems, 35(1), 45–87.
Rijsbergen, V. (1979). Information retrieval. London: Butterworth.
Shang, W., Huang, H., Zhu, H., & Lin, Y. (2005). An improved kNN algorithm—Fuzzy kNN. In Proceedings of the international conference on computational intelligence and security (pp. 741–746). Xi'an, China.
Shankar, S., & Karypis, G. A feature weight adjustment algorithm for document categorization. Available from: http://www.cs.umm.edu/~karypis.
Tan, S. (2005). Neighbor-weighted K-nearest neighbor for unbalanced text corpus. Expert Systems with Applications, 28(4), 667–671.
Vapnik, V. (1995). The nature of statistical learning theory. Springer.
Yang, Y. (1997). An evaluation of statistical approaches to text categorization. Information Retrieval, 1(1), 76–88.
Yang, Y., & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. In Proceedings of the 14th international conference on machine learning (pp. 412–420). Nashville, USA.
Yang, Y., & Lin, X. (1999). A re-examination of text categorization methods. In Proceedings of the 22nd annual international ACM SIGIR conference on research and development in information retrieval (pp. 42–49). New York: ACM Press.