Expert Systems with Applications 33 (2007) 1–5
doi:10.1016/j.eswa.2006.04.001

A novel feature selection algorithm for text categorization

Wenqian Shang a,*, Houkuan Huang a, Haibin Zhu b, Yongmin Lin a, Youli Qu a, Zhihai Wang a

a School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, PR China
b Department of Computer Science, Nipissing University, North Bay, Ont., Canada P1B 8L7

* Corresponding author. E-mail addresses: shangwenqian@hotmail.com (W. Shang), haibinz@npissingu.ca (H. Zhu).

Abstract

With the development of the web, large numbers of documents are available on the Internet, and digital libraries, news sources and internal company data keep growing. Automatic text categorization therefore becomes increasingly important for dealing with such massive data. A major problem of text categorization, however, is the high dimensionality of the feature space. Many methods for text feature selection already exist. To improve the performance of text categorization, we present another method for text feature selection. Our study is based on Gini index theory, and we design a novel Gini index algorithm to reduce the high dimensionality of the feature space. A new Gini index measure function is constructed and adapted to text categorization. The experimental results show that our improved Gini index performs better than other feature selection methods.
© 2006 Elsevier Ltd. All rights reserved.

Keywords: Text feature selection; Text categorization; Gini index; kNN classifier; Text preprocessing

1. Introduction

With the advance of the WWW (world wide web), text categorization has become a key technology for processing and organizing large numbers of documents. More and more methods based on statistical theory and machine learning have been applied to text categorization in recent years, for example, k-nearest neighbor (kNN) (Cover & Hart, 1967; Yang, 1997; Yang & Lin, 1999; Tan, 2005), Naive Bayes (Lewis, 1998), decision trees (Lewis & Ringuette, 1994), support vector machines (SVM) (Joachims, 1998), linear least squares fit, neural networks, SWAP-1, and Rocchio.

A major problem of text categorization is the high dimensionality of the feature space. Many learning algorithms cannot cope with such high dimensionality. Moreover, most of these dimensions are not relevant to text categorization, and noisy features can even hurt the precision of the classifier. Hence, we need to select representative features from the original feature space (i.e., feature selection) to reduce the dimensionality of the feature space and improve the efficiency and precision of the classifier. Current feature selection methods are based on statistical theory and machine learning. Some well-known methods are information gain, expected cross entropy, the weight of evidence of text, odds ratio, term frequency, mutual information, CHI (Yang & Pedersen, 1997; Mladenic & Grobelnik, 2003; Mladenic & Grobelnik, 1999) and so on. In this paper, we do not discuss these methods in detail. We present a new text feature selection method: the Gini index. The Gini index was originally used in decision trees for splitting attributes and achieved good categorization precision. However, it is rarely used for feature selection in text categorization.
Shankar and Karypis discuss how to use the Gini index for text feature selection and weight adjustment. They mainly pay attention to weight adjustment, their method is limited to centroid-based classifiers, and their iterative procedure is time-consuming. Our method is very different from theirs. By analyzing the principles of the Gini index and of text features in depth, we construct a new Gini index measure function and use it to select features in the original feature space. It fits not only centroid classifiers but also other classifiers. The experiments show that its quality is comparable with other text feature selection methods, while its computational complexity is lower and its speed is higher.

The rest of this paper is organized as follows. Section 2 describes the classical Gini index algorithm. Section 3 gives the improved Gini index algorithm. Section 4 describes the classifiers used in the experiments to compare the Gini index with the other text feature selection methods. Section 5 presents the experimental results and their analysis. The last section gives the conclusion.

2. Classical Gini index algorithm

The Gini index is an impurity-based splitting method. It fits nominal attributes, binary attributes, continuous numerical values, etc. It was put forward by Breiman, Friedman, and Olshen (1984) and is widely used in decision tree algorithms such as CART, SLIQ, SPRINT and Intelligent Miner. The main idea of the Gini index algorithm is as follows.

Suppose S is a set of s samples, and these samples belong to m different classes (C_i, i = 1, ..., m). According to the class differences, we can divide S into m subsets (S_i, i = 1, ..., m). Suppose S_i is the sample set that belongs to class C_i and s_i is the number of samples in S_i; then the Gini index of set S is

Gini(S) = 1 - \sum_{i=1}^{m} P_i^2,   (1)

where P_i is the probability that any sample belongs to C_i, estimated by s_i/s. The minimum of Gini(S) is 0, reached when all the members of the set belong to the same class; this denotes that the attribute yields the maximum useful information. When the samples in the set are distributed uniformly over the classes, Gini(S) reaches its maximum; this denotes that the attribute yields the minimum useful information. If the set is divided into n subsets, the Gini index after splitting is

Gini_split(S) = \sum_{j=1}^{n} \frac{s_j}{s} Gini(S_j).   (2)

The split with the minimum Gini_split is selected. In other words, for every attribute, after traversing all possible segmentation methods, the attribute that provides the minimum Gini index is selected as the splitting criterion of the node, whether it is the root node or a sub-node.
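To make the splitting criterion concrete, the following minimal sketch computes formulas (1) and (2) from class labels. It is only an illustration under our own naming and toy data, not code from the original work.

```python
from collections import Counter

def gini(labels):
    """Gini index of a sample set, formula (1): 1 - sum_i P_i^2."""
    s = len(labels)
    if s == 0:
        return 0.0
    return 1.0 - sum((count / s) ** 2 for count in Counter(labels).values())

def gini_split(partitions):
    """Gini index after splitting, formula (2): size-weighted sum of subset Gini values."""
    s = sum(len(part) for part in partitions)
    return sum(len(part) / s * gini(part) for part in partitions)

# Toy example: a binary attribute splits 10 samples into two subsets.
left = ["sports", "sports", "sports", "politics"]                              # attribute value = 0
right = ["politics", "politics", "politics", "politics", "sports", "sports"]   # attribute value = 1
print(gini(left + right))          # impurity before the split
print(gini_split([left, right]))   # impurity after the split; the smaller, the better the attribute
```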
3. The improved Gini index algorithm

Applying the Gini index theory described above directly to text feature selection, we can construct the formula

Gini(W) = P(W)\left(1 - \sum_i P(C_i|W)^2\right) + P(\overline{W})\left(1 - \sum_i P(C_i|\overline{W})^2\right).   (3)

After analyzing and comparing the merits and demerits of the existing text feature selection measure functions, we improve formula (3) to

Gini_Text(W) = \sum_i P(W|C_i)^2 P(C_i|W)^2.   (4)

Why do we amend formula (3) to formula (4)? The reasons include the following three aspects:

(1) The original form of the Gini index measures the impurity of an attribute with respect to categorization: the smaller the impurity, the better the attribute. If we adopt the form Gini(S) = \sum_{i=1}^{m} P_i^2 instead, it measures the purity of the attribute with respect to categorization: the bigger the purity, the better the attribute. In this paper, we adopt the purity form, which is better suited to text feature selection. The papers (Gupta, Somayajulu, Arora, & Vasudha, 1998; Shankar & Karypis) also adopt the purity form.

(2) Other authors emphasize that text feature selection should incline towards high-frequency words, that is, they include the P(W) factor in the formula. Experiments show that words that do not appear in a document also contribute to judging the class of the text, but this contribution is far less significant than the cost of considering the words that do not appear, especially when the distribution of classes and feature values is highly unbalanced. Yang and Pedersen (1997) and Mladenic and Grobelnik (1999) compare and analyze the merits and demerits of many feature measure functions in their papers. Their experiments show that the demerit of information gain is that it considers the words that do not appear, while the demerit of mutual information is that it does not consider the effect of the P(W) factor, which leads it to select rare words. Expected cross entropy and the weight of evidence of text overcome these demerits, hence their results are better. Therefore, when we construct the new Gini index measure function, we discard the factor expressing words that do not appear.

(3) Suppose W1 appears only in documents of class C1 and appears in every document of C1, and W2 appears only in documents of class C2 and appears in every document of C2; then W1 and W2 are equally important features. But because P(C_i) ≠ P(C_j), computing with Gini_Text(W) = P(W) \sum_i P(C_i|W)^2 gives Gini_Text(W1) ≠ Gini_Text(W2), which is not consistent with domain knowledge. So we replace P(W) with P(W|C_i)^2 to account for the unbalanced class distribution. In formula (4), if W appears only in documents of class C_i and appears in every document of C_i, Gini_Text(W) reaches its maximum, namely Gini_Text(W) = 1. This is consistent with domain knowledge. Without the term P(W|C_i)^2, P(C_i|W)^2 is just the posterior probability when feature W appears, according to the Bayes decision theory of minimum error rate, and when the documents in which W appears are distributed evenly over the classes Gini_Text(W) would reach its minimum. But a text feature is special: it takes only two values, appearing in a document or not, and according to domain knowledge we omit the case that a feature does not appear in the documents. The classes in the training set are usually unbalanced, so it would be arbitrary to conclude that Gini_Text(W) is the minimum. Hence, when we construct the new Gini index measure function, we also consider feature W's conditional probability, combining the posterior probability and the conditional probability into a single measure function to suppress the effect of unbalanced classes.
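As an illustration of how formula (4) can be used for feature selection, the following minimal sketch scores every term by Gini_Text(W) over a labeled training collection and keeps the highest-scoring terms. The function names and the representation of documents as token lists are our own assumptions, not the authors' implementation.

```python
from collections import Counter, defaultdict

def gini_text_scores(docs, labels):
    """Score each term W by Gini_Text(W) = sum_i P(W|C_i)^2 * P(C_i|W)^2, formula (4).
    docs: list of token lists; labels: list of class labels, one per document."""
    docs_per_class = Counter(labels)               # number of documents in each class C_i
    df_per_class = defaultdict(Counter)            # term -> class -> document frequency
    for tokens, label in zip(docs, labels):
        for term in set(tokens):                   # presence/absence only, not term frequency
            df_per_class[term][label] += 1
    scores = {}
    for term, class_df in df_per_class.items():
        df_total = sum(class_df.values())          # documents containing the term
        score = 0.0
        for c, df in class_df.items():
            p_w_given_c = df / docs_per_class[c]   # P(W|C_i)
            p_c_given_w = df / df_total            # P(C_i|W)
            score += (p_w_given_c ** 2) * (p_c_given_w ** 2)
        scores[term] = score
    return scores

def select_features(docs, labels, k):
    """Return the k terms with the highest Gini_Text score."""
    scores = gini_text_scores(docs, labels)
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Tiny hypothetical example.
docs = [["oil", "price", "rise"], ["match", "goal", "score"], ["oil", "export"], ["goal", "team"]]
labels = ["economy", "sports", "economy", "sports"]
print(select_features(docs, labels, k=3))
```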
4. Classifiers in the experiments

In order to evaluate the new feature selection algorithm, we use three classifiers, SVM (support vector machine), kNN and fkNN, to show that our new Gini index algorithm is effective with different classifiers. The classifiers can be described as follows.

4.1. kNN classifier

The kNN algorithm searches for the k documents (called neighbors) in the training set that have the maximal similarity (cosine similarity) to the test document. According to the classes these neighbors are affiliated with, it grades the test document's candidate classes. The similarity between each neighbor document and the test document is taken as the weight of that neighbor's class. The decision function can be defined as

\mu_j(X) = \sum_{i=1}^{k} \mu_j(X_i) \, sim(X, X_i),   (5)

where μ_j(X_i) ∈ {0, 1} shows whether X_i belongs to ω_j (μ_j(X_i) = 1) or not (μ_j(X_i) = 0), and sim(X, X_i) denotes the similarity between the training document X_i and the test document X. The decision rule is: if μ_j(X) = max_i μ_i(X), then X ∈ ω_j.

4.2. fkNN classifier

The kNN algorithm of Section 4.1 cannot achieve good categorization performance when the class distribution is unbalanced. Hence, we adopt fuzzy theory to improve the kNN algorithm as follows; the reasons for this improvement can be found in Shang, Huang, Zhu, and Lin (2005):

\mu_j(X) = \frac{\sum_{i=1}^{k} \mu_j(X_i) \, sim(X, X_i) \, \frac{1}{(1 - sim(X, X_i))^{2/(b-1)}}}{\sum_{i=1}^{k} \frac{1}{(1 - sim(X, X_i))^{2/(b-1)}}},   (6)

where j = 1, 2, ..., c and μ_j(X_i) is the membership of the known sample X_i to class j: if X_i belongs to class j the value is 1, otherwise 0. From this formula, we can see that the membership in effect uses the distance of each neighbor to the sample being classified to weight its contribution. The parameter b adjusts the strength of the distance weighting; in this paper we set b = 2. The fuzzy k-nearest-neighbor decision rule is then: if μ_j(X) = max_i μ_i(X), then X ∈ ω_j.

4.3. SVM classifier

SVM was put forward by Vapnik (1995) to solve two-class categorization problems. Here we adopt a linear SVM and use the one-versus-rest method to classify the documents; a detailed description can be found in Vapnik (1995).
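The two decision rules above can be sketched as follows, assuming the k nearest neighbors of a test document have already been retrieved together with their cosine similarities. This is a minimal illustration under our own naming; the small epsilon guarding against a similarity of exactly 1 is our addition and is not part of formula (6).

```python
from collections import defaultdict

def knn_decide(neighbors):
    """kNN decision, formula (5): score each class by the summed cosine similarity
    of the neighbors belonging to it. neighbors: list of (similarity, label) pairs."""
    scores = defaultdict(float)
    for sim, label in neighbors:
        scores[label] += sim
    return max(scores, key=scores.get)

def fknn_decide(neighbors, b=2.0, eps=1e-9):
    """Fuzzy kNN decision, formula (6): each neighbor is additionally weighted by
    1 / (1 - sim)^(2/(b-1)), so closer neighbors count more."""
    num = defaultdict(float)
    denom = 0.0
    for sim, label in neighbors:
        w = 1.0 / ((1.0 - sim + eps) ** (2.0 / (b - 1.0)))  # eps guards sim == 1
        num[label] += sim * w
        denom += w
    memberships = {label: v / denom for label, v in num.items()}
    return max(memberships, key=memberships.get)

# Example: five nearest neighbors of a hypothetical test document.
neighbors = [(0.92, "sports"), (0.88, "sports"), (0.85, "politics"),
             (0.83, "politics"), (0.80, "politics")]
print(knn_decide(neighbors), fknn_decide(neighbors))  # the two rules may disagree
```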
5. Experiments

5.1. Data collections

We use two corpora in this study: the Reuters-21578 collection and a data set from the International Database Center, Department of Computing and Information Technology, Fudan University, China.

For the Reuters-21578 data set, we adopt the top ten classes, with 7053 documents in the training set and 2726 documents in the test set. The class distribution is unbalanced: the largest class has 2875 documents, occupying 40.762% of the training set, while the smallest class has 170 documents, occupying 2.41% of the training set.

For the second data set, we use 3148 documents as training samples and 3522 documents as test samples. The training samples are divided into document sets A and B. In document set A the class distribution is unbalanced: there are 619 political documents, occupying 34.43% of training set A, while there are only 59 energy-source documents, occupying 3.28% of training set A. In training set B the class distribution is balanced: every class has 150 documents.

5.2. Experimental settings

For every classifier, in the text preprocessing phase we use information gain, expected cross entropy, the weight of evidence of text and CHI for comparison with our improved Gini index algorithm. The measure functions are defined as follows.

Information gain:

Inf_Gain(W) = P(W) \sum_{i=1}^{m} P(C_i|W) \log_2 \frac{P(C_i|W)}{P(C_i)} + P(\overline{W}) \sum_{i=1}^{m} P(C_i|\overline{W}) \log_2 \frac{P(C_i|\overline{W})}{P(C_i)}   (7)

Expected cross entropy:

Cross_Entropy(W) = P(W) \sum_{i=1}^{m} P(C_i|W) \log_2 \frac{P(C_i|W)}{P(C_i)}   (8)

CHI (\chi^2):

\chi^2(W) = \sum_{i=1}^{m} P(C_i) \cdot \frac{N (A_1 A_4 - A_2 A_3)^2}{(A_1 + A_3)(A_2 + A_4)(A_1 + A_2)(A_3 + A_4)}   (9)

Weight of evidence of text:

Weight_of_Evid(W) = P(W) \cdot \sum_{i=1}^{m} P(C_i) \left| \log \frac{P(C_i|W)(1 - P(C_i))}{P(C_i)(1 - P(C_i|W))} \right|   (10)

After selecting the feature subset with the above measure functions, we use TF–IDF to weight the features:

w_{ik} = \frac{tf_{ik} \cdot \log(N/n_i)}{\sqrt{\sum_{j=1}^{M} [tf_{jk} \cdot \log(N/n_j)]^2}}   (11)

In the experiments we set k = 45 on Reuters-21578, k = 10 on document set A, and k = 35 on document set B.

5.3. Performance measure

To evaluate the performance of a text classifier, we use the F1 measure put forward by Rijsbergen (1979). This measure combines recall and precision as follows:

Recall = number of correct positive predictions / number of positive examples

Precision = number of correct positive predictions / number of positive predictions

F_1 = \frac{2 \cdot Recall \cdot Precision}{Recall + Precision}
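The tables in the next subsection report Macro-F1 and Micro-F1, which aggregate the per-class F1 above in the two standard ways. The sketch below shows one way to compute both from per-class contingency counts; it is an illustrative sketch with hypothetical counts, not the authors' evaluation code.

```python
def f1(tp, fp, fn):
    """Per-class F1 from true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def macro_micro_f1(counts):
    """counts: dict mapping class -> (tp, fp, fn).
    Macro-F1 averages the per-class F1 scores; Micro-F1 pools the counts first."""
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return macro, f1(tp, fp, fn)

# Hypothetical counts for two classes.
counts = {"acq": (90, 10, 5), "earn": (40, 5, 20)}
print(macro_micro_f1(counts))
```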
5.4. The experimental results and analysis

The experimental results on Reuters-21578 are shown in Table 1. From this table, we can see that with SVM and fkNN the Gini index gets the best categorization performance. We can also notice that all five measure functions perform well: with SVM the Micro-F1 difference between the best and the worst is 0.366%, with kNN it is 0.294%, and with fkNN it is 0.477%. With kNN, the Macro-F1 of the Gini index is only inferior to information gain, and its Micro-F1 is only inferior to CHI.

Table 1
The performance of the five feature selection measure functions on the Reuters-21578 top 10 classes

Measure function    SVM                    kNN                    fkNN
                    Macro-F1   Micro-F1    Macro-F1   Micro-F1    Macro-F1   Micro-F1
Gini index          69.940     88.591      66.584     85.620      67.999     86.537
Inf Gain            69.436     88.445      66.860     85.326      67.032     86.134
Cross Entropy       69.436     88.445      66.579     85.326      67.518     86.207
CHI                 67.739     88.225      66.404     85.761      66.846     86.060
Weight of Evid      68.731     88.481      66.766     85.180      67.509     86.280

The experimental results on the second data set are shown in Tables 2 and 3. From Table 2, we can see that with SVM the Gini index is only inferior to CHI and exceeds information gain; with kNN, the Macro-F1 of the Gini index is only inferior to CHI but its Micro-F1 is the best; with fkNN, the Gini index is only inferior to the weight of evidence of text.

Table 2
The performance of the five feature selection measure functions on training set A

Measure function    SVM                    kNN                    fkNN
                    Macro-F1   Micro-F1    Macro-F1   Micro-F1    Macro-F1   Micro-F1
Gini index          91.577     90.941      84.176     83.043      84.763     83.856
Inf Gain            91.531     90.708      83.318     81.301      84.346     82.811
Cross Entropy       91.481     90.708      83.318     81.301      84.216     82.578
CHI                 91.640     91.057      84.491     82.811      85.256     84.008
Weight of Evid      91.407     90.825      84.073     82.927      85.867     85.017

From Table 3, we can find that with SVM the Gini index is only inferior to information gain and the weight of evidence of text; with kNN, the Macro-F1 of the Gini index is only inferior to information gain but its Micro-F1 is the best; with fkNN, the Macro-F1 of the Gini index is only inferior to information gain but its Micro-F1 is the best.

Table 3
The performance of the five feature selection measure functions on training set B

Measure function    SVM                    kNN                    fkNN
                    Macro-F1   Micro-F1    Macro-F1   Micro-F1    Macro-F1   Micro-F1
Gini index          91.421     91.222      86.272     85.222      87.006     86.556
Inf Gain            91.799     91.556      86.326     85.222      87.305     86.556
Cross Entropy       91.419     91.222      85.764     85.111      86.999     86.444
CHI                 91.238     91.000      85.770     85.000      86.898     86.444
Weight of Evid      91.799     91.556      85.914     85.111      87.138     86.444

In summary, on some data sets our improved Gini index achieves the best categorization performance, and on the others it is only slightly inferior to the best of the other measure functions. As a whole, the Gini index shows good categorization performance. From formulas (7)–(10) we can also see that the computation of the Gini index is simpler than that of the other feature selection methods: it has no logarithm computations, only simple multiplications.

6. Conclusion

In this paper, we studied text feature selection based on the Gini index and compared its performance with the other feature selection methods in text categorization. The experiments show that our improved Gini index has better performance and simpler computation than the other feature selection methods. It is a promising method for text feature selection. In the future, we will improve this method further and study how to select different feature selection methods for different data sets.

Acknowledgement

This research is partly supported by the Beijing Jiaotong University Science Foundation under Grant 2004RC008.

References

Breiman, L., Friedman, J. H., Olshen, R. A., et al. (1984). Classification and regression trees. Monterey, CA: Wadsworth International Group.
Cover, T. M., & Hart, P. E. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, IT-13(1), 21–27.
Gupta, S. K., Somayajulu, D. V. L. N., Arora, J. K., & Vasudha, B. (1998). Scalable classifiers with dynamic pruning. In Proceedings of the 9th international workshop on database and expert systems applications (pp. 246–251). Washington, DC, USA: IEEE Computer Society.
Joachims, T. (1998). Text categorization with support vector machines: learning with many relevant features. In Proceedings of the 10th European conference on machine learning (pp. 137–142). New York: Springer.
Lewis, D. D. (1998). Naïve (Bayes) at forty: the independence assumption in information retrieval. In Proceedings of the 10th European conference on machine learning (pp. 4–15). New York: Springer.
Lewis, D. D., & Ringuette, M. (1994). Comparison of two learning algorithms for text categorization. In Proceedings of the third annual symposium on document analysis and information retrieval (pp. 81–93). Las Vegas, NV, USA.
Mladenic, D., & Grobelnik, M. (1999). Feature selection for unbalanced class distribution and Naïve Bayes. In Proceedings of the 16th international conference on machine learning (pp. 258–267). San Francisco.
Mladenic, D., & Grobelnik, M. (2003). Feature selection on hierarchy of web documents. Decision Support Systems, 35(1), 45–87.
Rijsbergen, V. (1979). Information retrieval. London: Butterworth.
Shang, W., Huang, H., Zhu, H., & Lin, Y. (2005). An improved kNN algorithm—Fuzzy kNN. In Proceedings of the international conference on computational intelligence and security (pp. 741–746). Xi'an, China.
Shankar, S., & Karypis, G. A feature weight adjustment algorithm for document categorization. Available from: http://www.cs.umm.edu/~karypis.
Tan, S. (2005). Neighbor-weighted K-nearest neighbor for unbalanced text corpus. Expert Systems with Applications, 28(4), 667–671.
Vapnik, V. (1995). The nature of statistical learning theory. Springer.
Yang, Y. (1997). An evaluation of statistical approaches to text categorization. Information Retrieval, 1(1), 76–88.
Yang, Y., & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. In Proceedings of the 14th international conference on machine learning (pp. 412–420). Nashville, USA.
Yang, Y., & Lin, X. (1999). A re-examination of text categorization methods. In Proceedings of the 22nd annual international ACM SIGIR conference on research and development in information retrieval (pp. 42–49). New York: ACM Press.