Imbalanced text classification: A term weighting approach

Ying Liu a,*, Han Tong Loh b, Aixin Sun c

a Department of Industrial and Systems Engineering, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong SAR, China
b Department of Mechanical Engineering, National University of Singapore, 9 Engineering Drive 1, Singapore 117576, Singapore
c School of Computer Engineering, Nanyang Technological University, Nanyang Avenue, Singapore 639798, Singapore

* Corresponding author. Tel.: +852 34003782. E-mail address: mfyliu@polyu.edu.hk (Y. Liu).

Abstract

The natural distribution of textual data used in text classification is often imbalanced. Categories with fewer examples are under-represented, and their classifiers often perform far below satisfactory. We tackle this problem using a simple probability based term weighting scheme to better distinguish documents in minor categories. The new scheme directly utilizes two critical information ratios, i.e. relevance indicators. Such relevance indicators are nicely supported by probability estimates which embody the category membership. Our experimental study, using both Support Vector Machines and Naïve Bayes classifiers and an extensive comparison with other classic weighting schemes over two benchmark data sets, including Reuters-21578, shows significant improvement for minor categories, while the performance for major categories is not jeopardized. Our approach suggests a simple and effective solution to boost the performance of text classification over skewed data sets.

Keywords: Text classification; Imbalanced data; Term weighting scheme

1. Introduction

1.1. Motivation

Learning from imbalanced data has emerged as a new challenge to the machine learning (ML), data mining (DM) and text mining (TM) communities. Two recent workshops, in 2000 (Japkowicz, 2000) and 2003 (Chawla, Japkowicz, & Kolcz, 2003) at the AAAI and ICML conferences, respectively, and a special issue of ACM SIGKDD Explorations (Chawla, Japkowicz, & Kolcz, 2004) were dedicated to this topic, which has attracted growing interest and attention among researchers and practitioners seeking solutions for handling imbalanced data. An excellent review of the state of the art is given by Weiss (2004).

The data imbalance problem often occurs in classification and clustering scenarios when some classes possess many more examples than others. As pointed out by Chawla et al. (2004), when standard classification algorithms are applied to such skewed data, they tend to be overwhelmed by the major categories and to ignore the minor ones. There are two main reasons why such uneven cases arise. One is the intrinsic nature of certain events, e.g. credit fraud, cancer detection, network intrusion, and earthquake prediction (Chawla et al., 2004). These rare events form a distinct category but occupy only a very small portion of the entire example space. The other is the expense of collecting learning examples, together with legal or privacy constraints.
In our previous study of building a manufacturing centered technical paper corpus (Liu & Loh, 2007), due to the costly effort demanded for human labeling and the diverse interests in the papers, we naturally ended up with a skewed collection.

Automatic text classification (TC) has recently witnessed booming interest, due to the increased availability of documents in digital form and the ensuing need to organize them (Sebastiani, 2002). In TC tasks, given that most test collections are composed of documents belonging to multiple classes, performance is usually reported in terms of micro-averaged and macro-averaged scores (Sebastiani, 2002; Yang & Liu, 1999). Macro-averaging gives equal weight to the scores generated from each individual category. In comparison, micro-averaging tends to be dominated by the categories with more positive training instances. Because many of the test corpora used in TC are either naturally skewed or artificially imbalanced, especially in the binary and so called "one-against-all" settings, classifiers often perform far less than satisfactorily for minor categories (Lewis, Yang, Rose, & Li, 2004; Sebastiani, 2002; Yang & Liu, 1999). Therefore, micro-averaging mostly yields much better results than macro-averaging does.

1.2. Related work

There have been several strategies for handling imbalanced data sets in TC. Here, we focus only on the approaches adopted in TC and group them based on their primary intent. The first approach is based on sampling strategies. Yang (1996) tested two sampling methods, i.e. proportion-enforced sampling and completeness-driven sampling. Her empirical study using the ExpNet system shows that a global sampling strategy which favors common categories over rare categories is critical for the success of TC based on a statistical learning approach. Without such global control, the global optimal performance will be compromised and the learning efficiency can be substantially decreased. Nickerson, Japkowicz, and Milios (2001) provide a guided sampling approach based on a clustering algorithm called Principal Direction Divisive Partitioning to deal with the between-class imbalance problem. It has shown improvement over existing methods of equalizing class imbalances, especially when there is a large between-class imbalance together with severe imbalance in the relative densities of the subcomponents of each class. Liu's recent efforts (Liu, 2004) in testing different sampling strategies, i.e. under-sampling and over-sampling, and several classification algorithms, i.e. Naïve Bayes, k-Nearest Neighbors (kNN) and Support Vector Machines (SVMs), improve the understanding of the interactions among sampling method, classifier and performance measurement.

The second major effort emphasizes cost sensitive learning (Dietterich, Margineantu, Provost, & Turney, 2000; Elkan, 2001; Weiss & Provost, 2003). In many real scenarios like risk management and medical diagnosis, wrong decisions are associated with very different costs. A wrong prediction of the nonexistence of cancer, i.e. a false negative, may lead to death, while the wrong prediction of cancer existence, i.e. a false positive, only results in unnecessary anxiety and extra medical tests.
In view of this, assigning different cost factors to false negatives and false positives will lead to better performance with respect to positive (rare) classes (Chawla et al., 2004). Brank, Grobelnik, Milic-Frayling, and Mladenic (2003) have reported their work on cost sensitive learning using SVMs in TC. They obtain better results with methods that directly modify the score threshold. They further propose a method based on the conditional class distributions for SVM scores that works well when only very few training examples are available.

The recognition based approach, i.e. one-class learning, provides another class of solutions (Japkowicz, Myers, & Gluck, 1995). One-class learning aims to create the decision model based on the examples of the target category alone, which is different from the typical discriminative approach, i.e. the two-class setting. Manevitz and Yousef (2002) have applied one-class SVMs to TC. Raskutti and Kowalczyk (2004) claim that one-class learning is particularly helpful when data are extremely skewed and composed of many irrelevant features and very high dimensionality.

Feature selection is often considered an important step in reducing the high dimensionality of the feature space in TC and many other problems in image processing and bioinformatics. However, its unique contribution in identifying the most salient features to boost the performance of minor categories was not stressed until some recent work (Mladenic & Grobelnik, 1999). Yang and Pedersen (1997) have given a detailed evaluation of several feature selection schemes. We note the marked difference between micro-averaged and macro-averaged values due to the poor performance over rare categories. Forman (2003) has done a very comprehensive study of various schemes for TC on a wide range of commonly used test corpora, and has recommended the best pairs among different combinations of selection schemes and evaluation measures. Recent efforts from Zheng, Wu, and Srihari (2004) advance the understanding of feature selection in TC. They show the merits and great potential of explicitly combining positive and negative features in a nearly optimal fashion according to the imbalanced data.

Some recent work that simply adapts existing machine learning techniques, without even directly targeting the issue of class imbalance, has shown great potential with respect to the data imbalance problem. Castillo and Serrano (2004) and Fan, Yu, and Wang (2004) have reported success using an ensemble approach, e.g. voting and boosting, to handle skewed data distributions. Challenged by real industry data with a huge number of records and an extremely skewed data distribution, Fan's work shows that the ensemble approach is capable of improving the performance on rare classes. In their approaches, a set of weak classifiers using various learning algorithms is built over minor categories. The final decision is reached based on the combination of outcomes from different classifiers. Another promising approach which receives less attention falls into the category of semi-supervised learning or weakly supervised learning (Blum & Mitchell, 1998; Ghani, 2002; Goldman & Zhou, 2000; Lewis & Gale, 1994; Liu, Dai, Li, Lee, & Yu, 2003; Nigam, 2001; Yu, Zhai, & Han, 2003; Zelikovitz & Hirsh, 2000).
The basic idea is to identify more positive examples from a large amount of unknown data. These approaches are especially viable when unlabeled data are steadily available. The last effort attacking the imbalance problem uses parameter tuning in kNN (Baoli, Qin, & Shiwen, 2004). The authors set k dynamically according to the data distribution, granting a large k to a minor category.

In this paper, we tackle the data imbalance problem in text classification from a different angle. We present a new approach assigning better weights to the features from minor categories. After a brief review of the classic term weighting scheme, e.g. tfidf, in Section 2, and inspired by the analysis of various feature selection methods in Section 3, we introduce in Section 4 a simple probability based term weighting scheme which directly utilizes two critical information ratios, i.e. relevance indicators. These relevance indicators are nicely supported by the probability estimates which embody the category membership. The setup of the experimental study is explained in Section 5. We carry out the evaluation and comparison of our new scheme with many other weighting forms over two skewed data sets, and we report the experimental findings and discuss their performance in Section 6. Section 7 concludes and highlights some future work.

2. Term weighting scheme

Text classification (TC) is the task of categorizing documents into predefined thematic categories. In particular, it aims to find the mapping $n$ from a set of documents $D: \{d_1, \ldots, d_i\}$ to a set of thematic categories $C: \{C_1, \ldots, C_j\}$, i.e. $n: D \to C$. In its current practice, which is dominated by supervised learning, the construction of a text classifier is often conducted in two main phases (Debole & Sebastiani, 2003; Sebastiani, 2002):

- Document indexing – the creation of numeric representations of documents:
  - Term selection – to select a subset of terms from all terms occurring in the collection to represent the documents in a better way, either to speed up computation or to achieve better effectiveness in classification.
  - Term weighting – to assign a numeric value to each term to weight its contribution, which helps a document stand out from others.
- Classifier induction – the building of a classifier by learning from the numeric representations of documents.

In information retrieval and machine learning, term weighting has long been formulated as term frequency times inverse document frequency, i.e. tfidf (Baeza-Yates & Ribeiro-Neto, 1999; Salton & Buckley, 1988; Salton & McGill, 1983; van-Rijsbergen, 1979). The more popular "ltc" form (Baeza-Yates & Ribeiro-Neto, 1999; Salton & Buckley, 1988; Salton & McGill, 1983) is given by

$\mathrm{tfidf}(t_i, d_j) = \mathrm{tf}(t_i, d_j) \times \log\frac{N}{N(t_i)}$  (1)
and its normalized version is

$w_{i,j} = \dfrac{\mathrm{tfidf}(t_i, d_j)}{\sqrt{\sum_{k=1}^{|T|} \mathrm{tfidf}(t_k, d_j)^2}},$  (2)

where $N$ and $|T|$ denote the total number of documents and unique terms contained in the collection, respectively, $N(t_i)$ represents the number of documents in the collection in which term $t_i$ occurs at least once, and

$\mathrm{tf}(t_i, d_j) = \begin{cases} 1 + \log(n(t_i, d_j)), & \text{if } n(t_i, d_j) > 0, \\ 0, & \text{otherwise,} \end{cases}$  (3)

where $n(t_i, d_j)$ is the number of times that term $t_i$ occurs in document $d_j$. In practice, the summation in Eq. (2) only concerns the terms occurring in document $d_j$.

The significance of the classic term weighting schemes in Eqs. (1) and (2) is that they embody three fundamental assumptions about term frequency distribution in a collection of documents (Debole & Sebastiani, 2003; Sebastiani, 2002). These assumptions are:

- Rare terms are no less important than frequent terms – the idf assumption.
- Multiple appearances of a term in a document are no less important than a single appearance – the tf assumption.
- For the same quantity of term matching, long documents are no less important than short documents – the normalization assumption.

Because of these, the "ltc" form and its normalized version have been extensively studied by many researchers and have shown good performance over a number of different data sets (Sebastiani, 2002). Therefore, they have become the default choice in TC.

3. Inspiration from feature selection

Feature selection serves as a key procedure to reduce the dimensionality of the input data space and save computational cost. It has been integrated as a default step in many learning algorithms, such as artificial neural networks, k-nearest neighbors, decision trees, etc. In the machine learning research community, the tradeoff between the computational constraints imposed by the high dimensionality of the input data space and the richness of information available to maximally identify each individual object is well known. The ability of feature selection to capture the salient information by selecting the most important attributes, and thus make the computing tasks tractable, has been shown in information retrieval and machine learning research (Forman, 2003; Ng, Goh, & Low, 1997; Ruiz & Srinivasan, 2002; Yang & Pedersen, 1997). Furthermore, feature selection is also beneficial since it tends to reduce the over-fitting problem, in which the trained objects are tuned to fit very well the data upon which they have been built, but perform poorly when applied to unseen data (Sebastiani, 2002).
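To make the classic scheme of Section 2 concrete before turning to feature selection, the following is a minimal Python sketch of Eqs. (1)–(3) with the cosine normalization of Eq. (2); the function name and toy documents are illustrative, and this is not the indexing code used in the paper's experiments.

```python
import math
from collections import Counter

def ltc_weights(docs):
    """Normalized 'ltc' tfidf vectors (Eqs. (1)-(3)) for tokenized documents.
    Returns one {term: weight} dict per document."""
    N = len(docs)                         # total number of documents
    df = Counter()                        # N(t_i): document frequency
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)             # n(t_i, d_j)
        # Eq. (3): tf = 1 + log n for n > 0; Eq. (1): tf * log(N / N(t_i))
        w = {t: (1 + math.log(n)) * math.log(N / df[t])
             for t, n in counts.items()}
        # Eq. (2): cosine normalization over the terms occurring in d_j
        norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
        vectors.append({t: v / norm for t, v in w.items()})
    return vectors

# Toy usage with three tiny "documents"
print(ltc_weights([["laser", "weld"], ["laser", "cut"], ["cast", "mold"]]))
```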
In TC, several feature selection methods have been intensively studied to distill the important terms while keeping the dimensionality small. Table 1 shows the main functions of several popular feature selection methods. These methods evolved from either information theory or the linear algebra literature (Sebastiani, 2002; Yang & Pedersen, 1997).

Table 1
Several feature selection methods and their functions

| Feature selection method | Mathematical form |
|---|---|
| Information gain | $P(t_k, c_i)\log\frac{P(t_k, c_i)}{P(t_k)P(c_i)} + P(\bar{t}_k, c_i)\log\frac{P(\bar{t}_k, c_i)}{P(\bar{t}_k)P(c_i)}$ |
| Mutual information | $\log\frac{P(t_k, c_i)}{P(t_k)P(c_i)}$ |
| Chi-square | $\frac{N[P(t_k, c_i)P(\bar{t}_k, \bar{c}_i) - P(t_k, \bar{c}_i)P(\bar{t}_k, c_i)]^2}{P(t_k)P(\bar{t}_k)P(c_i)P(\bar{c}_i)}$ |
| Odds ratio | $\log\frac{P(t_k \mid c_i)(1 - P(t_k \mid \bar{c}_i))}{(1 - P(t_k \mid c_i))P(t_k \mid \bar{c}_i)}$ |

$t_k$ denotes a term; $c_i$ stands for a category; $P(t_k, c_i)$ denotes the probability of documents from category $c_i$ where term $t_k$ occurs at least once; $P(t_k, \bar{c}_i)$ denotes the probability of documents not from category $c_i$ where term $t_k$ occurs at least once; $P(\bar{t}_k, c_i)$ denotes the probability of documents from category $c_i$ where term $t_k$ does not occur; $P(\bar{t}_k, \bar{c}_i)$ denotes the probability of documents not from category $c_i$ where term $t_k$ does not occur.

Basically, there are two distinct ways to rank and assess the features, i.e. globally and locally. Global feature selection aims to select features which are good across all categories. Local feature selection intends to differentiate those terms that are more distinguishable for certain categories only. The sense of either 'global' or 'local' does not have much effect on the choice of method itself, but it does affect the performance of classifiers built for different categories. In TC, the main purpose is to decide whether a document belongs to a specific category. Obviously, we prefer salient features which are unique from one category to another, i.e. a 'local' approach. Ideally, the salient feature set from one category does not have any items overlapping with those from other categories. If this cannot be avoided, then how to better represent them becomes an issue.

While many previous works have shown the relative strengths and merits of these methods (Forman, 2003; Ng et al., 1997; Ruiz & Srinivasan, 2002; Sebastiani, 2002; Yang & Pedersen, 1997), our experience with feature selection over a number of standard and ad hoc data sets shows that the performance of such methods can be highly dependent on the data. This is partly due to the lack of understanding of different data sets in a quantitative way, and it needs further research.
From our previous study of all feature selection methods and what has been reported in the literature (Yang & Pedersen, 1997), we note that when these methods are applied to text classification for term selection purposes, they basically utilize the four fundamental information elements shown in Table 2.

Table 2
Fundamental information elements used for feature selection in text classification

|  | $c_i$ | $\bar{c}_i$ |
|---|---|---|
| $t_k$ | A | B |
| $\bar{t}_k$ | C | D |

A denotes the number of documents belonging to category $c_i$ where term $t_k$ occurs at least once; B denotes the number of documents not belonging to category $c_i$ where term $t_k$ occurs at least once; C denotes the number of documents belonging to category $c_i$ where term $t_k$ does not occur; D denotes the number of documents not belonging to category $c_i$ where term $t_k$ does not occur.

These four information elements are used to estimate the probabilities listed in Table 1. Table 3 shows the functions in Table 1 as represented by the four information elements A, B, C and D.

Table 3
Feature selection methods and their formulations as represented by the information elements in Table 2

| Method | Mathematical form represented by information elements |
|---|---|
| Information gain | $-\frac{A+C}{N}\log\frac{A+C}{N} + \frac{A}{N}\log\frac{A}{A+B} + \frac{C}{N}\log\frac{C}{C+D}$ |
| Mutual information | $\log\frac{AN}{(A+B)(A+C)}$ |
| Chi-square | $\frac{N(AD - BC)^2}{(A+C)(B+D)(A+B)(C+D)}$ |
| Odds ratio | $\log\frac{AD}{BC}$ |

4. A probability based term weighting scheme

4.1. Revisiting tfidf

As stated before, while many researchers take term weighting schemes in the tfidf form to represent the three aforementioned assumptions, we understand tfidf in a much simpler manner, i.e.

- Local weight – the tf term, normalized or not, specifies the weight of $t_k$ within a specific document, basically estimated from the frequency or relative frequency of $t_k$ within this document.
- Global weight – the idf term, normalized or not, defines the contribution of $t_k$ to a specific document in a global sense.

If we temporarily set aside how tfidf is defined and focus on the core problem, i.e. whether this document is from this category, we realize that a set of terms is needed to represent the documents effectively and a reference framework is required to make the comparison possible. As previous research shows that tf is very important (Leopold & Kindermann, 2002; Salton & Buckley, 1988; Sebastiani, 2002) and that using tf alone can already achieve good performance, we retain the tf term. Now, let us consider idf, i.e. the global weighting of $t_k$.

The conjecture is that if term selection can effectively differentiate a set of terms $t_k$ out of all terms $t$ to represent category $c_i$, then it is desirable to transform that difference into numeric values for further processing. Our approach is to replace the idf term with a value that reflects the term's strength in representing a specific category. Since this procedure is performed jointly with the category membership, the weights of $t_k$ are category specific. Therefore, the only problem left is how to compute such values.

4.2. Probability based term weights

We decide to compute those term values using the most direct information, e.g. A, B and C, and to combine them in a sensible way which differs from existing feature selection measures. From Table 2, we note two important ratios which directly indicate a term's relevance with respect to a specific category, i.e. A/B and A/C:

- A/B: if term $t_k$ is highly relevant to category $c_i$ only, which basically indicates that $t_k$ is a good feature to represent category $c_i$, then the value of A/B tends to be higher.
- A/C: given two terms $t_k$, $t_l$ and a category $c_i$, the term with the higher value of A/C will be the better feature to represent $c_i$, since a larger portion of its occurrences falls within category $c_i$.

In the remainder of this paper, we call A/B and A/C relevance indicators, since these two ratios immediately indicate a term's strength in representing a category; a toy illustration of the counts and indicators is sketched below.
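The following sketch shows how the four counts of Table 2 and the two relevance indicators could be computed, assuming documents are represented as term sets with a Boolean label for membership in $c_i$; the helper names and the smoothing constant are our own illustrative choices, not part of the paper.

```python
def contingency_counts(docs, labels, term):
    """Table 2 elements for one term and one category.
    docs: list of term sets; labels: True if the document belongs to c_i."""
    A = sum(1 for d, y in zip(docs, labels) if y and term in d)
    B = sum(1 for d, y in zip(docs, labels) if not y and term in d)
    C = sum(1 for d, y in zip(docs, labels) if y and term not in d)
    D = sum(1 for d, y in zip(docs, labels) if not y and term not in d)
    return A, B, C, D

def relevance_indicators(A, B, C, smooth=1e-6):
    """A/B and A/C; the tiny constant avoids division by zero when a term
    never occurs outside c_i (B = 0) or occurs in every c_i document (C = 0)."""
    return A / (B + smooth), A / (C + smooth)

docs = [{"laser", "weld"}, {"laser", "cut"}, {"cast", "mold"}]
labels = [True, True, False]          # the first two documents belong to c_i
A, B, C, D = contingency_counts(docs, labels, "laser")
print(relevance_indicators(A, B, C))  # "laser": A=2, B=0, C=0 -> both ratios large
```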
In fact, these two indicators are nicely supported by probability estimates. For instance, A/B can be rewritten as (A/N)/(B/N), where N is the total number of documents, A/N is the probability estimate of documents from category $c_i$ where term $t_k$ occurs at least once, and B/N is the probability estimate of documents not from category $c_i$ where term $t_k$ occurs at least once. In this manner, A/B can be interpreted as a relevance indicator of term $t_k$ with respect to category $c_i$: the higher the ratio, the more strongly term $t_k$ is related to category $c_i$. A similar analysis applies to A/C. This ratio reflects the expectation that a term is deemed more relevant if it occurs in a larger portion of the documents from category $c_i$ than other terms do.

Since the computation of both A/B and A/C is intrinsically connected with the probability estimates of category membership, we propose a new term weighting factor which utilizes the two relevance indicators to replace idf in the classic tfidf weighting scheme. Considering the probability foundation of A/B and A/C, the most immediate choice is to take the product of the two ratios. Therefore, the proposed weighting scheme is formulated as

$\mathrm{tf} \times \log\left(1 + \frac{A}{B}\cdot\frac{A}{C}\right).$  (4)

5. Experiment setup

Two data sets were tested in our experiments, i.e. MCV1 and Reuters-21578. MCV1 is an archive of 1434 English language manufacturing related engineering papers which we gathered by courtesy of the Society of Manufacturing Engineers (SME). It combines all engineering technical papers published by SME from 1998 to 2000. All documents were manually classified (Liu & Loh, 2007). There are a total of 18 major categories in MCV1. Fig. 1 gives the class distribution in MCV1.

Fig. 1. Class distribution in MCV1.

Reuters-21578 is a widely used benchmark collection (Sebastiani, 2002). We followed Sun's approach (Sun, Lim, Ng, & Srivastava, 2004) in generating the category information. Fig. 2 gives the class distribution of the Reuters data set used in our experiments. Unlike Sun et al. (2004), we did not randomly sample negative examples from categories not belonging to any of the categories in our data set; instead, we treated examples not from the target category in our data set as negatives.

Fig. 2. Class distribution in Reuters-21578.

We compared our probability based term weighting scheme with a number of other well established weighting schemes, e.g. TFIDF, 'ltc' and normalized 'ltc', on MCV1 and Reuters-21578. We also carried out benchmarking experiments between our scheme and many other feature selection methods, e.g. chi-square (ChiS), correlation coefficient (CC), odds ratio (OddsR), and information gain (IG), by replacing the idf term with the feature selection value in the classic tfidf weighting scheme. These schemes are thus largely formulated in the form tf × (feature value) (TFFV). Table 4 shows all eight weighting schemes tested in our experiments and their mathematical formulations. Please note that the majority of the TFFV schemes are composed of two parts, i.e. the normalized term frequency, $\mathrm{tf}(t_i, d_j)/\max[\mathrm{tf}(d_j)]$, and the term's feature value, e.g. $N(AD - BC)^2/[(A+C)(B+D)(A+B)(C+D)]$ in the chi-square scheme, where $\mathrm{tf}(t_i, d_j)$ is the frequency of term $t_i$ in document $d_j$ and $\max[\mathrm{tf}(d_j)]$ is the maximum frequency of a term in document $d_j$. The only different ones are the TFIDF weighting, the 'ltc' form and the normalized 'ltc' form, as specified in Table 4.

Table 4
All eight weighting schemes tested in the experiments and their mathematical formulations, where the normalized term frequency ntf is defined as $\mathrm{tf}(t_i, d_j)/\max[\mathrm{tf}(d_j)]$

| Weighting scheme | Name | Mathematical formulation |
|---|---|---|
| tf × chi-square | ChiS | $\mathrm{ntf} \times \frac{N(AD - BC)^2}{(A+C)(B+D)(A+B)(C+D)}$ |
| tf × correlation coef. | CC | $\mathrm{ntf} \times \frac{\sqrt{N}(AD - BC)}{\sqrt{(A+C)(B+D)(A+B)(C+D)}}$ |
| tf × odds ratio | OddsR | $\mathrm{ntf} \times \log\frac{AD}{BC}$ |
| tf × info gain | IG | $\mathrm{ntf} \times \left(\frac{A}{N}\log\frac{AN}{(A+B)(A+C)} + \frac{C}{N}\log\frac{CN}{(C+D)(A+C)}\right)$ |
| TFIDF | TFIDF | $\mathrm{ntf} \times \log\frac{N}{N(t_i)}$ |
| tfidf ltc | ltc | $\mathrm{tf}(t_i, d_j) \times \log\frac{N}{N(t_i)}$ |
| Normalized ltc | nltc | $\frac{\mathrm{tfidf}_{ltc}}{\sqrt{\sum \mathrm{tfidf}_{ltc}^2}}$ |
| Probability based | Prob. | $\mathrm{ntf} \times \log\left(1 + \frac{A}{B}\cdot\frac{A}{C}\right)$ |
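As an illustration of Eq. (4), i.e. the Prob. row of Table 4, the sketch below reuses the hypothetical `contingency_counts` helper and the toy `docs`/`labels` from the previous sketch, normalizing term frequency as $\mathrm{ntf} = \mathrm{tf}(t_i, d_j)/\max[\mathrm{tf}(d_j)]$; it is a sketch of the scheme under those assumptions, not the code used in the experiments.

```python
import math
from collections import Counter

def prob_weights(doc_terms, docs, labels, smooth=1e-6):
    """Weight one tokenized document for one category c_i using
    ntf * log(1 + (A/B)(A/C)), i.e. Eq. (4)."""
    counts = Counter(doc_terms)
    max_tf = max(counts.values())                    # max[tf(d_j)]
    weights = {}
    for term, tf in counts.items():
        A, B, C, _ = contingency_counts(docs, labels, term)
        ratio = (A / (B + smooth)) * (A / (C + smooth))
        weights[term] = (tf / max_tf) * math.log(1 + ratio)
    return weights

# Category specific weights for a new document against the toy category c_i
print(prob_weights(["laser", "laser", "weld"], docs, labels))
```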
Two popular classification algorithms were tested, i.e. Complement Naïve Bayes (CompNB) (Rennie, Shih, Teevan, & Karger, 2003) and the Support Vector Machine (SVM) (Vapnik, 1999). CompNB has recently been reported to significantly improve the performance of Naïve Bayes over a number of well known data sets, including Reuters-21578 and 20 Newsgroups. Various correction steps are adopted in CompNB, e.g. data transformation, better handling of word occurrence dependencies and so on. In our experiments, we used the implementation in the Weka 3.5.3 developer version (Witten & Frank, 2005). For SVM, we chose the well known implementation SVMLight (Joachims, 1998, 2001). A linear kernel was adopted, since previous work has shown its effectiveness in TC (Dumais & Chen, 2000; Joachims, 1998).

As for performance measurement, precision, recall and their harmonic combination, i.e. the F1-value, were calculated (Baeza-Yates & Ribeiro-Neto, 1999; van-Rijsbergen, 1979). Performance was assessed based on fivefold cross validation. Since we are very concerned about the performance of every category, we report the overall performance in a macro-averaged manner, i.e. macro-averaged F1, to avoid the bias against minor categories that micro-averaged scores carry on imbalanced data (Sebastiani, 2002; Yang & Liu, 1999).

The major standard text preprocessing steps were applied in our experiments, including tokenization, stop word and punctuation removal, and stemming. However, feature selection was skipped, and all terms left after stop word and punctuation removal and stemming were kept as features.
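Because every comparison that follows is reported in macro-averaged terms, a brief generic sketch of the measure may help: per-category precision, recall and F1 are computed first and then averaged with equal weight. The counts and the function name are illustrative, not taken from the study.

```python
def macro_scores(per_category):
    """per_category: list of (tp, fp, fn) tuples, one per category.
    Returns macro-averaged precision, recall and F1."""
    ps, rs, fs = [], [], []
    for tp, fp, fn in per_category:
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        ps.append(p); rs.append(r); fs.append(f)
    n = len(per_category)
    return sum(ps) / n, sum(rs) / n, sum(fs) / n

# A major category doing well and a minor one doing poorly drag the
# macro-averaged F1 down, an effect that micro-averaging would largely hide.
print(macro_scores([(90, 10, 10), (2, 1, 8)]))
```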
6. Experimental results and discussion

6.1. Overall performance

Fig. 3 shows the overall performance of the eight weighting schemes tested over MCV1 and Reuters-21578 using SVM and CompNB, reported in terms of macro-averaged F1-values.

Fig. 3. The macro-averaged F1-values of eight weighting schemes tested over MCV1 and Reuters-21578 using both SVM and CompNB.

Our first observation is that all TFFV weighting schemes, e.g. tf × chi-square, tf × information gain and our probability based one, outperform the classic ones, i.e. the TFIDF, 'ltc', and normalized 'ltc' schemes. TFIDF's performance on Reuters-21578 is in line with the literature (Sun et al., 2004; Yang & Liu, 1999). This demonstrates the overall effectiveness of TFFV based schemes. In general, the performance patterns of the eight weighting schemes on MCV1 and Reuters-21578 using the two classification algorithms match very well. For example, our probability based term weighting scheme always takes the lead among all eight schemes, including the TFFV ones, and the normalized 'ltc' always performs the worst. When compared to TFIDF, the prevailing choice for term weighting in TC, our weighting strategy improves the overall performance by 6% to more than 12%, as shown in Table 5. We also observe that when our scheme is adopted, CompNB delivers a result on Reuters-21578 which is very close to the best that SVM can achieve using the TFIDF scheme. This demonstrates the great potential of CompNB as a state-of-the-art classifier.

Table 5
Macro-averaged F1-values of TFIDF and probability based term weights on MCV1 and Reuters-21578

| Classifier | MCV1 TFIDF | MCV1 Prob. | 21578 TFIDF | 21578 Prob. |
|---|---|---|---|---|
| SVM | 0.6729 | 0.7553 | 0.8381 | 0.8918 |
| CompNB | 0.4517 | 0.5653 | 0.6940 | 0.8120 |

Among the three global based classic weighting schemes, i.e. TFIDF, 'ltc', and the normalized 'ltc' form, none generates comparable results over either MCV1 or Reuters-21578. A close look into their performance reveals that classifiers built for minor categories, e.g. composite manufacturing, electronic manufacturing and others in MCV1, or rice, natgas, cocoa and others in Reuters-21578, do not produce satisfactory results, and this largely drags down the overall performance. Among all TFFVs, surprisingly, odds ratio does not perform as expected, even though the literature mentions odds ratio as one of the leading feature selection methods for TC (Ruiz & Srinivasan, 2002; Sebastiani, 2002). This implies that it is always worthwhile to reassess the strength of a term selection method on a new data set, even if it tends to perform well.

6.2. Gains for minor categories

As shown in Figs. 1 and 2, both MCV1 and Reuters-21578 are skewed data sets. While MCV1 possesses 18 categories with one major category occupying up to 25% of
Note, this imbalance situation is even worse when the training examples are arranged in the so called ‘‘one- against-all” setting for the induction of classifiers, i.e. examples from the target category (a minor category in this case) are considered as positive while examples from the rest categories are all deemed as negative. Nevertheless, given the nature of TC is to answer whether the document belongs to this particular category or not, the ‘‘one-against- all” setting is still the prevailing approach in TC, owing much to the fact that it dramatically reduces the number of classifiers to be induced. Since the proposed probability based weighting scheme is the best in the benchmarking test over both MCV1 and Reuters-21578, we intend to examine why this is the case. Therefore, we plot its performances in detail against TFIDF in Figs. 4 and 5, respectively. This is largely because TFIDF is the best among the three classic approaches as well as the default choice for TC in its cur- rent research and application (Sebastiani, 2002). A close examination of Figs. 4 and 5 shows that the probability based scheme produces much better results ing scheme tested over MCV1 using both SVM and CompNB. sification: A term weighting approach, Expert Systems with Ap- Fig. 5. F1 scores of TFIDF and the probability based term weighting scheme tested over Reuters-21578 using both SVM and CompNB. Y. Liu et al. / Expert Systems with Applications xxx (2007) xxx–xxx 9 ARTICLE IN PRESS over minor categories in both MCV1 and Reuters-21578, regardless of classifiers used. For all minor categories shown in both figures, we observed a sharp increase of per- formance occurs when the system’s weighting method switch from TFIDF to the probability one. Table 6 reveals more insights with respect to the system performance. In general, we observe that using the proba- bility based term weighting scheme can greatly enhance the systems’ recalls. Although it falls slightly below TFIDF in terms of precision using SVM, it still improves the systems’ precisions in CompNB, far superior to those TFIDF can deliver. For SVM, while the averaged precision of TFIDF in MCV1 is 0.8355 which is about 5% higher than the prob- ability’s, the averaged recall of TFIDF is 0.6006 only, far less than the probability’s 0.7443. The case with Reuters- 21578 is even more impressive. While the averaged preci- sion of TFIDF is 0.8982 which is only 1.8% higher than the other, the averaged recall of probability based scheme reaches 0.9080, in contrast to TFIDF’s 0.7935. Overall, the probability based weighting scheme surpasses TFIDF in terms of F1-values over both data sets. Table 6 Macro-averaged precision and recall of TFIDF and probability based term weights on MCV1 and Reuters-21578 Data Classifier precision recall TFIDF Prob. TFIDF Prob. MCV1 SVM 0.8355 0.7857 0.6006 0.7443 CompNB 0.4342 0.6765 0.4788 0.5739 21578 SVM 0.8982 0.8803 0.7935 0.9080 CompNB 0.5671 0.7418 0.9678 0.9128 Please cite this article in press as: Liu, Y. et al., Imbalanced text clas plications (2007), doi:10.1016/j.eswa.2007.10.042 6.3. Significance test To determine whether the performance improvement gained by the probability based scheme and other TFFVs over these two imbalanced data sets are significant, we per- formed the macro-sign test (S-test) and macro-t-test (T- test) on the paired F1-values. 
As pointed out by Yang and Liu (1999), on the one hand, the S-test may be more robust in reducing the influence of outliers, but at the risk of being insensitive, or not sufficiently sensitive, in performance comparison because it ignores the absolute difference between F1-values; on the other hand, the T-test is sensitive to the absolute values but can be overly sensitive when F1-values are highly unstable, e.g. for the minor categories. Therefore, we adopt both tests here to give a comprehensive understanding of the performance improvement.

Since TFIDF performs better than the other two classic approaches on both data sets, we choose it as the representative of its peers. For both the S-test and the T-test, we conduct two sets of tests over the two data sets, respectively. One tests all TFFV schemes, including the probability based one, against TFIDF; the other tests the probability based scheme against the others. The first aims to assess the goodness of schemes in the form of TFFVs, while the second tests whether the probability based scheme does generate better results. Table 7 summarizes the p-values of the S-test for the TFFV schemes against TFIDF and for the probability based one against the others over the two data sets. We consider two F1-values to be the same if their difference is not more than 0.01, i.e. 1%. Table 8 summarizes the t-values of the T-test for the identical comparison settings, where α = 0.001.

Table 7
p-Values of pairwise S-test on MCV1 and Reuters-21578, where two F1-values are considered the same if their difference is not more than 0.01

| Test | TFIDF | ChiS | CC | OddsR | IG | Prob. |
|---|---|---|---|---|---|---|
| MCV1, SVM: XX vs. TFIDF | – | 4.813E−02 | 2.090E−03 | 5.923E−02 | 1.544E−02 | 6.561E−04 |
| MCV1, SVM: Prob. vs. XX | 6.561E−04 | 6.363E−03 | 3.841E−02 | 6.561E−04 | 1.051E−01 | – |
| MCV1, CompNB: XX vs. TFIDF | – | 4.813E−02 | 4.813E−02 | 2.403E−01 | 4.813E−02 | 2.452E−02 |
| MCV1, CompNB: Prob. vs. XX | 2.452E−02 | 4.813E−02 | 2.452E−02 | 6.363E−03 | 1.544E−02 | – |
| Reuters-21578, SVM: XX vs. TFIDF | – | 3.174E−03 | 3.174E−03 | 1.938E−01 | 5.859E−03 | 3.174E−03 |
| Reuters-21578, SVM: Prob. vs. XX | 3.174E−03 | 2.744E−01 | 1.938E−01 | 1.929E−02 | 3.872E−01 | – |
| Reuters-21578, CompNB: XX vs. TFIDF | – | 3.271E−02 | 3.271E−02 | 1.133E−01 | 7.300E−02 | 3.271E−02 |
| Reuters-21578, CompNB: Prob. vs. XX | 3.271E−02 | 1.133E−01 | 1.133E−01 | 5.859E−03 | 1.334E−01 | – |

Table 8
t-Values of pairwise T-test on MCV1 and Reuters-21578, where α = 0.001

| Test | TFIDF | ChiS | CC | OddsR | IG | Prob. |
|---|---|---|---|---|---|---|
| MCV1, SVM (t-critical = 3.354): XX vs. TFIDF | – | 1.963E+01 | 2.151E+01 | 8.343E+00 | 2.588E+01 | 3.017E+01 |
| MCV1, SVM (t-critical = 3.354): Prob. vs. XX | 3.017E+01 | 1.347E+01 | 1.069E+01 | 2.400E+01 | 6.571E+00 | – |
| MCV1, CompNB (t-critical = 3.354): XX vs. TFIDF | – | 2.135E+01 | 2.049E+01 | 4.597E+00 | 2.419E+01 | 3.127E+01 |
| MCV1, CompNB (t-critical = 3.354): Prob. vs. XX | 3.127E+01 | 9.468E+00 | 1.043E+01 | 2.649E+01 | 8.192E+00 | – |
| Reuters-21578, SVM (t-critical = 3.467): XX vs. TFIDF | – | 2.516E+01 | 1.957E+01 | 1.692E+00 | 2.435E+01 | 2.889E+01 |
| Reuters-21578, SVM (t-critical = 3.467): Prob. vs. XX | 2.889E+01 | 3.682E+00 | 8.587E+00 | 2.262E+01 | 3.993E+00 | – |
| Reuters-21578, CompNB (t-critical = 3.467): XX vs. TFIDF | – | 2.157E+01 | 1.946E+01 | 5.318E+00 | 2.130E+01 | 3.064E+01 |
| Reuters-21578, CompNB (t-critical = 3.467): Prob. vs. XX | 3.064E+01 | 4.926E+00 | 9.127E+00 | 2.167E+01 | 6.128E+00 | – |

From the results we can summarize the strengths of the different schemes. Considering the merits evaluated based on TFFVs against TFIDF, the TFFVs have shown that they are the better approach to handling imbalanced data. Among the various TFFVs, our proposed scheme claims the leading performance on both MCV1 and Reuters-21578, regardless of the classifier used. However, the approach based on the odds ratio is not much superior to TFIDF.
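For reference, a minimal sketch of the macro sign test (S-test) as we read it from Yang and Liu (1999): paired per-category F1-values are compared, pairs differing by no more than 0.01 are treated as ties and dropped, and a one-sided binomial p-value is computed. The function name is illustrative, and this is not the exact script used to produce Table 7.

```python
from math import comb

def macro_sign_test(f1_a, f1_b, tie=0.01):
    """One-sided S-test: does scheme A beat scheme B?
    f1_a, f1_b: paired per-category F1-values; ties within `tie` are dropped."""
    wins = sum(1 for a, b in zip(f1_a, f1_b) if a - b > tie)
    losses = sum(1 for a, b in zip(f1_a, f1_b) if b - a > tie)
    n = wins + losses
    if n == 0:
        return 1.0
    # P(X >= wins) under Binomial(n, 0.5)
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n

# Toy example: scheme A wins 9 of 10 decidable categories -> p ~ 0.011
print(macro_sign_test([0.9] * 9 + [0.5], [0.6] * 9 + [0.8]))
```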
With respect to the evaluation based on the merits of the probability based scheme against the others, it is not surprising to see that the new scheme still takes the lead. It manages to perform better than the other TFFV approaches, e.g. information gain and chi-square, when the absolute difference of F1-values is considered. Finally, the results for information gain, chi-square and correlation coefficient in our tests are compatible with those in the literature (Forman, 2003; Yang & Pedersen, 1997). In general, the more minor categories a data set possesses, the more the overall performance can be elevated when the probability based weighting scheme is chosen.

7. Conclusion and future work

Handling imbalanced data in TC has become an emerging challenge. In this paper, we introduce a new weighting paradigm, generally formulated as tf × (feature value) (TFFV), to replace the classic TFIDF based approaches. We propose a probability based term weighting scheme, which directly makes use of two critical information ratios, as a new way to compute a term's weight. These two ratios are deemed to possess the most salient information reflecting a term's strength in associating with a category. Their computation does not impose any extra cost compared to the conventional feature selection methods. Our experimental study and extensive comparisons based on two imbalanced data sets, MCV1 and Reuters-21578, show the merits of TFFV based approaches and their ability to handle imbalanced data. Among the various TFFVs, our probability based scheme offers the best overall performance on both data sets regardless of the classifier used. Our approach suggests an effective solution to improve the performance of imbalanced TC.

Starting from the work reported in this paper, there are a few immediate tasks awaiting us. Since the probability based scheme is derived from an understanding of feature selection, the quantity A/B × A/C can itself be considered a new feature selection method that reflects the relevance of terms with respect to different thematic categories. It is interesting to further explore its joint application with other algorithms in TC. As for the slight decrease of precision noted, we intend to remedy the situation by replacing the linear kernel with a string kernel in SVM (Lodhi, Saunders, Shawe-Taylor, Cristianini, & Watkins, 2002). Another challenge we face is handling situations where the critical information needed, e.g. A and B, cannot easily be secured, e.g. in text clustering. One potential direction is to infer these critical values from a small collection of labeled data and then to test how robust these values, and this probability based approach, can be; what strategies we can propose to accommodate the variation of term occurrence in unlabeled documents; and how to modify the critical values accordingly. The whole idea falls into the emerging paradigm of semi-supervised learning. We will report our study when the results become more solid.

Acknowledgement

The work described in this paper was partially supported by a grant from the Research Grants Council of the Hong Kong Polytechnic University, Hong Kong Special Administrative Region, China (Project No. G-YF59).
References

Baeza-Yates, R., & Ribeiro-Neto, B. (1999). Modern information retrieval. Boston, MA, USA: Addison-Wesley.
Baoli, L., Qin, L., & Shiwen, Y. (2004). An adaptive k-nearest neighbor text categorization strategy. ACM Transactions on Asian Language Information Processing (TALIP), 3(4), 215–226.
Blum, A., & Mitchell, T. (1998). Combining labeled and unlabeled data with co-training. In COLT: Proceedings of the workshop on computational learning theory (pp. 92–100).
Brank, J., Grobelnik, M., Milic-Frayling, N., & Mladenic, D. (2003). Training text classifiers with SVM on very few positive examples. MSR-TR-2003-34.
Castillo, M. D. D., & Serrano, J. I. (2004). A multistrategy approach for digital text categorization from imbalanced documents. ACM SIGKDD Explorations Newsletter, 6(1) [Special issue on learning from imbalanced datasets].
Chawla, N., Japkowicz, N., & Kolcz, A. (Eds.). (2003). Proceedings of the ICML'2003 workshop on learning from imbalanced data sets.
Chawla, N., Japkowicz, N., & Kolcz, A. (Eds.). (2004). ACM SIGKDD Explorations Newsletter, 6(1) [Special issue on learning from imbalanced data sets].
Debole, F., & Sebastiani, F. (2003). Supervised term weighting for automated text categorization. In Proceedings of the 2003 ACM symposium on applied computing (pp. 784–788). Melbourne, Florida, USA.
Dietterich, T., Margineantu, D., Provost, F., & Turney, P. (Eds.). (2000). Proceedings of the ICML'2000 workshop on cost-sensitive learning.
Dumais, S., & Chen, H. (2000). Hierarchical classification of Web content. In Proceedings of the 23rd annual international ACM SIGIR conference on research and development in information retrieval (SIGIR2000) (pp. 256–263). Athens, Greece.
Elkan, C. (2001). The foundations of cost-sensitive learning. In Proceedings of the 17th international joint conference on artificial intelligence (IJCAI'01) (pp. 973–978).
Fan, W., Yu, P. S., & Wang, H. (2004). Mining extremely skewed trading anomalies. In Advances in database technology – EDBT 2004: Ninth international conference on extending database technology (pp. 801–810). Heraklion, Crete, Greece.
Forman, G. (2003). An extensive empirical study of feature selection metrics for text classification. The Journal of Machine Learning Research, 3, 1289–1305 [Special issue on variable and feature selection].
Ghani, R. (2002). Combining labeled and unlabeled data for multiclass text categorization. In International conference on machine learning (ICML 2002), Sydney, Australia.
Goldman, S., & Zhou, Y. (2000). Enhancing supervised learning with unlabeled data. In Proceedings of the 17th international conference on machine learning (pp. 327–334). San Francisco, California, USA.
Japkowicz, N. (Ed.). (2000). Proceedings of the AAAI'2000 workshop on learning from imbalanced data sets. AAAI Tech Report WS-00-05, AAAI.
Japkowicz, N., Myers, C., & Gluck, M. A. (1995). A novelty detection approach to classification. In Proceedings of the 14th international joint conference on artificial intelligence (IJCAI-95) (pp. 518–523).
Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. In Machine learning: ECML-98, 10th European conference on machine learning (pp. 137–142). Berlin, Germany.
Joachims, T. (2001). A statistical learning model of text classification with support vector machines.
In Proceedings of the 24th annual international ACM SIGIR conference on research and development in information retrieval (pp. 128–136). New Orleans, Louisiana, United States.
Leopold, E., & Kindermann, J. (2002). Text categorization with support vector machines – How to represent texts in input space. Machine Learning, 46(1–3), 423–444.
Lewis, D. D., & Gale, W. A. (1994). A sequential algorithm for training text classifiers. In Proceedings of SIGIR-94, 17th ACM international conference on research and development in information retrieval (pp. 3–12). Dublin, Ireland.
Lewis, D. D., Yang, Y., Rose, T. G., & Li, F. (2004). RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5, 361–397.
Liu, A. Y. C. (2004). The effect of oversampling and undersampling on classifying imbalanced text datasets. Masters thesis, University of Texas at Austin.
Liu, Y., & Loh, H. T. (2007). Corpus building for corporate knowledge discovery and management: A case study of manufacturing. In Proceedings of the 11th international conference on knowledge-based and intelligent information and engineering systems, KES'07, Lecture notes in artificial intelligence, LNAI, Vol. 4692 (pp. 542–550). Vietri sul Mare, Italy.
Liu, B., Dai, Y., Li, X., Lee, W. S., & Yu, P. (2003). Building text classifiers using positive and unlabeled examples. In Proceedings of the third IEEE international conference on data mining (ICDM'03), Melbourne, Florida.
Lodhi, H., Saunders, C., Shawe-Taylor, J., Cristianini, N., & Watkins, C. (2002). Text classification using string kernels. The Journal of Machine Learning Research, 2, 419–444.
Manevitz, L. M., & Yousef, M. (2002). One-class SVMs for document classification. The Journal of Machine Learning Research, 2, 139–154.
Mladenic, D., & Grobelnik, M. (1999). Feature selection for unbalanced class distribution and Naive Bayes. In Proceedings of the 16th international conference on machine learning, ICML'99 (pp. 258–267).
Ng, H. T., Goh, W. B., & Low, K. L. (1997). Feature selection, perceptron learning, and a usability case study for text categorization. In ACM SIGIR Forum, Proceedings of the 20th annual international ACM SIGIR conference on research and development in information retrieval (pp. 67–73). Philadelphia, Pennsylvania, United States.
Nickerson, A., Japkowicz, N., & Milios, E. (2001). Using unsupervised learning to guide re-sampling in imbalanced data sets. In Proceedings of the eighth international workshop on AI and statistics (pp. 261–265).
Nigam, K. P. (2001). Using unlabeled data to improve text classification. PhD thesis, Carnegie Mellon University.
Raskutti, B., & Kowalczyk, A. (2004). Extreme re-balancing for SVMs: A case study. ACM SIGKDD Explorations Newsletter, 6(1), 60–69 [Special issue on learning from imbalanced datasets].
Rennie, J. D. M., Shih, L., Teevan, J., & Karger, D. R. (2003). Tackling the poor assumptions of Naive Bayes text classifiers. In Proceedings of the 20th international conference on machine learning (pp. 616–623). Washington, DC, USA.
Ruiz, M. E., & Srinivasan, P. (2002). Hierarchical text categorization using neural networks. Information Retrieval, 5(1), 87–118.
Salton, G., & Buckley, C. (1988). Term weighting approaches in automatic text retrieval. Information Processing and Management, 24(5), 513–523.
Salton, G., & McGill, M. J. (1983).
Introduction to modern information retrieval. New York, USA: McGraw-Hill.
Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys (CSUR), 34(1), 1–47.
Sun, A., Lim, E.-P., Ng, W.-K., & Srivastava, J. (2004). Blocking reduction strategies in hierarchical text classification. IEEE Transactions on Knowledge and Data Engineering (TKDE), 16(10), 1305–1308.
van-Rijsbergen, C. J. (1979). Information retrieval (2nd ed.). London, UK: Butterworths.
Vapnik, V. N. (1999). The nature of statistical learning theory (2nd ed.). New York: Springer-Verlag.
Weiss, G. M. (2004). Mining with rarity: A unifying framework. ACM SIGKDD Explorations Newsletter, 6(1), 7–19 [Special issue on learning from imbalanced datasets].
Weiss, G. M., & Provost, F. (2003). Learning when training data are costly: The effect of class distribution on tree induction. Journal of Artificial Intelligence Research, 19, 315–354.
Witten, I. H., & Frank, E. (2005). Data mining: Practical machine learning tools and techniques (2nd ed.). San Francisco, CA, USA: Morgan Kaufmann.
Yang, Y. (1996). Sampling strategies and learning efficiency in text categorization. In Proceedings of the AAAI spring symposium on machine learning in information access (pp. 88–95).
Yang, Y., & Liu, X. (1999). A re-examination of text categorization methods. In Proceedings of the 22nd annual international ACM SIGIR conference on research and development in information retrieval (pp. 42–49). Berkeley, California, United States.
Yang, Y., & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. In Proceedings of ICML-97, 14th international conference on machine learning (pp. 412–420).
Yu, H., Zhai, C., & Han, J. (2003). Text classification from positive and unlabeled documents. In Proceedings of the 12th international conference on information and knowledge management (CIKM 2003) (pp. 232–239). New Orleans, LA, USA.
Zelikovitz, S., & Hirsh, H. (2000). Improving short text classification using unlabeled background knowledge. In Proceedings of the 17th international conference on machine learning (ICML2000).
Zheng, Z., Wu, X., & Srihari, R. (2004). Feature selection for text categorization on imbalanced data. ACM SIGKDD Explorations Newsletter, 6(1), 80–89 [Special issue on learning from imbalanced datasets].