Expert Systems with Applications 34 (2008) 2858–2869
doi:10.1016/j.eswa.2007.05.037
Available online at www.sciencedirect.com
www.elsevier.com/locate/eswa

An efficient bit-based feature selection method

Wei-Chou Chen a, Shian-Shyong Tseng a, Tzung-Pei Hong b,*

a Department of Computer and Information Science, National Chiao Tung University, Hsinchu 300, Taiwan, ROC
b Department of Electrical Engineering, National University of Kaohsiung, Kaohsiung 811, Taiwan, ROC
* Corresponding author. E-mail addresses: sstseng@cis.nctu.edu.tw (S.-S. Tseng), tphong@nuk.edu.tw (T.-P. Hong).

Abstract

Feature selection is about finding useful (relevant) features to describe an application domain. Selecting relevant and sufficient features to effectively represent and index a given dataset is an important task for solving classification and clustering problems intelligently. This task is, however, quite difficult to carry out since it usually requires a very time-consuming search to obtain the desired features. This paper proposes a bit-based feature selection method to find the smallest feature set that can represent the indexes of a given dataset. The proposed approach originates from the bitmap indexing and rough set techniques. It consists of two phases. In the first phase, the given dataset is transformed into a bitmap indexing matrix with some additional data information. In the second phase, a set of relevant and sufficient features is selected and used to represent the classification indexes of the given dataset. After the relevant and sufficient features are selected, they can be judged by domain experts, and the final feature set of the given dataset is thus proposed. Finally, experimental results on different datasets also show the efficiency and accuracy of the proposed approach.
© 2007 Elsevier Ltd. All rights reserved.

Keywords: Feature selection; Bitmap indexing; Rough set; Classification; Clustering

1. Introduction

Feature selection is about finding useful (relevant) features to describe an application domain. Selecting relevant and sufficient features to effectively represent and index a given dataset is an important task for solving classification and clustering problems intelligently. This task is, however, quite difficult to carry out since it usually requires an exhaustive search to obtain the desired features. In the past, some approaches have been proposed to solve the feature selection problem (Almullim et al., 1991; Chen, Tseng, Chen, & Jiang, 2000, 2002, 2003; Doak, 1992; Gonzalez & Perez, 2001; Huang & Tseng, 2004; John et al., 1994; Lee, Chen, Chen, & Jou, 1997; Liu et al., 1996; Liu & Setiono, 1998; Quinlan, 1986; Yu, 2001). These approaches can roughly be classified into the following two strategies:

1. Optimal strategy: This kind of approach considers all the subsets of a given feature set (Almullim et al., 1991; Schlimmer et al., 1993; Wu, 1999). Some searching techniques, such as branch and bound, may be adopted to reduce the search space. For example, Liu et al. proposed a special feature selector (Liu et al., 1996), which randomly produced feature subsets according to the Las Vegas algorithm (Brassard, 1996). It thus searched the entire solution space and was guaranteed to find an optimal feature set.
2. Heuristic strategy: This kind of approach prunes the search space according to some heuristics.
The results obtained by these approaches are usually not optimal, but they can be obtained within a short time (Zhong, Dong, & Ohsuga, 2001). There are three typical heuristic approaches for feature selection: forward selection, backward selection and bi-directional selection. The forward selection approach initializes the desired feature set as null and then adds features into it until the results are satisfactory (Miao, 1999; Skowron & Rauszer, 1992; Yu, 2001). The backward selection approach initializes the desired feature set as all the given features and then removes unnecessary features from it (Choubey, 1998; Yu, 2001). The bi-directional selection approach initializes the desired feature set as a partial feature set, and then either puts good features into it or eliminates bad features from it (Doak, 1992).

In the past, we proposed a bitwise indexing method based on a given feature set to accelerate case matching in CBR (Chen, Tseng, Chang, & Jiang, 2001, 2002). In this paper, we further investigate the determination of the appropriate feature set. We propose a two-phase feature selection approach to discover significant feature sets from a given database table and use them for further investigation. The proposed feature selection approach originates from the bitmap indexing (O'Neil & Quass, 1997; Wu et al., 1998) and rough set techniques (Pawlak, 1982, 1991). Naturally, it is designed to discover optimal feature sets for a given dataset since the proposed method originates from the rough set theory. Experimental results also show the efficiency and accuracy of the proposed approach.

This paper is organized as follows. Some related works are reviewed in Section 2. The proposed feature selection method and some corresponding definitions and algorithms are stated in Section 3. The time and space complexities of the proposed algorithms are analyzed in Section 4. Experimental results are shown in Section 5. Conclusions are finally given in Section 6.

2. Review of feature selection and rough sets

Feature selection is about finding useful (relevant) features to describe an application domain. The problem of feature selection can formally be defined as selecting a minimum set of features M′ from the original M features, where M′ ≠ M, such that the class distribution of the M′ features is as similar as possible to that of the M features (Last, Kandel, & Maimon, 2001). Generally speaking, the function of feature selection is divided into three parts: (1) simplifying data description, (2) reducing the task of data collection, and (3) improving the quality of problem solving. The benefits of a simple representation are abundant, such as easier understanding of problems and better, faster decision making. In the field of data collection, having fewer features means that less data need to be collected. As we know, collecting data is never an easy job in many applications because it can be time-consuming and costly. Regarding the quality of problem solving, the more features a problem has, the more complex it is to process. Problem solving can be improved by filtering out the irrelevant features that may confuse the original problem, thus yielding better performance.
There are many discussions about feature selection and many existing methods to assist it, such as GA technology (Raymer, Punch, Goodman, & Kuhn, 2000), entropy measures (Huang & Tseng, 2004), and rough set theory (Tseng, Jothishankar, & Wu, 2004; Yu, 2001). Next, the rough set theory is briefly reviewed. The rough set theory, proposed by Pawlak in 1982 (Pawlak, 1982), can serve as a new mathematical tool for dealing with data classification problems. It adopts the concept of equivalence classes to partition training instances according to some criteria. Two kinds of partitions are formed in the mining process: lower approximations and upper approximations. Rough sets can also be used for feature reduction. The features that do not contribute to the classification of the given training data are removed. The concepts of equivalence classes and approximations are quite suitable for generating the bit-based class vectors and record vectors, which can then be directly and efficiently transformed into the bitwise indexing matrices for CBR systems (Yang et al., 2000). This paper thus adopts these concepts to solve the feature selection problem.

3. The proposed bitmap-based feature selection method

As mentioned above, we previously proposed a heuristic feature selection approach, called the bitmap-based feature selection method with discernibility matrix (Chen, Yang, & Tseng, 2002), to find a nearly optimal feature set. However, finding optimal solutions for feature selection is still needed in some applications. Although some exhaustive search methods can guarantee the optimality of the selected feature sets, their computation cost may be very high.

In this section, we thus consider finding an optimal solution via the rough set techniques and the bit-based indexing method for feature selection. The proposed approach encodes a given dataset into a bit vector matrix and uses bit-processing operations on it to reduce the computation time. The proposed approach consists of several main steps, as shown in Fig. 1.

Fig. 1. The flowchart of the proposed feature selection approach.

There are two phases in the proposed algorithm: the bitmap indexing phase and the feature selection phase. In the bitmap indexing phase, the given dataset is transformed into a bitmap indexing matrix with some additional data information. In the feature selection phase, a set of relevant and sufficient features is selected and used to represent the dataset. The details of the two phases are described in the following sub-sections.

3.1. Problem definitions

Let T denote a target table in a database, R denote the set of n records in T, and C denote the set of m features in T. R can then be represented as {R1, R2, ..., Rn}, where Ri is the ith record. C can be represented as {C1, C2, ..., Cm}, where Cj is the jth feature. The first m−1 elements in C are condition features and the last one, Cm, is a decision feature. Let Vj denote the domain of Cj. Vj can then be represented as {Vj1, Vj2, ..., Vjrj}, where each element is a possible value of Cj and rj is the number of possible values of Cj. Let Vj(i) denote the value of Cj in record Ri, Vj(i) ≠ null. Table 1 shows an example of a target table T with ten records R = {R1, R2, ..., R10} and five features C = {C1, C2, C3, C4, C5}.
Table 1
An example of a target table

      C1  C2  C3  C4  C5
R1    M   L   3   M   1
R2    M   L   1   H   1
R3    L   L   1   M   1
R4    L   R   3   M   2
R5    M   R   2   M   2
R6    L   R   3   L   3
R7    H   R   3   L   3
R8    H   N   3   L   3
R9    H   N   2   H   2
R10   H   N   2   H   1

Table 2
The record vectors and class vectors from Table 1

Feature  Feature-value  Record vector  Class vector
C1       V11            1100100000     110
         V12            0011010000     111
         V13            0000001111     111
C2       V21            1110000000     100
         V22            0001111000     011
         V23            0000000111     111
C3       V31            1001011100     111
         V32            0110000000     100
         V33            0000100011     110
C4       V41            1011100000     110
         V42            0100000011     110
         V43            0000011100     001
C5       V51            1110000001     100
         V52            0001100010     010
         V53            0000011100     001

C5 is a decision feature and the others are condition features.

The purpose of this paper is to find one of the smallest feature sets that can effectively index the given table. The definitions and algorithms used in the bitmap indexing phase and in the feature selection phase are described below.

3.2. Bitmap indexing phase

In this phase, the target table is first transformed into a bitmap indexing matrix with some additional classification information. Let bi denote a bit in a bit vector. Let ONEk denote the bit string of length k with all bits set to 1, ZEROk the one with all bits set to 0, and UNIQUEk one with exactly one bit set to 1 and the others set to 0. A record vector, which is used to keep the information of the records with a specific value of a feature, is defined below.

Definition 1 (Record vector). A record vector RVjk is a bit string b1, b2, ..., bn, with bi set to 1 if Vj(i) = Vjk and set to 0 otherwise, where 1 ≤ j ≤ m, 1 ≤ k ≤ rj, and 1 ≤ i ≤ n.

RVjk thus keeps the information of the records with the kth possible value of the feature Cj. For example in Table 1, C1 has three possible values {M, L, H}. The record vector for C1 = M is 1100100000 since the first, second and fifth records have this feature value. Similarly, the record vector for C1 = L is 0011010000 and for C1 = H is 0000001111. All the record vectors are shown in the third column of Table 2. A class vector, which is used to keep the information of the classes (values of the decision feature) with a specific value of a feature, is defined below.

Definition 2 (Class vector). A class vector CVjk is a bit string b1, b2, ..., brm, with bi set to 1 if RVjk ∩ RVmi ≠ ZEROn, and set to 0 otherwise, where rm is the number of possible values of Cm and n is the number of records in R.

Here, the "AND" bitwise operator is used for the intersection in Definition 2. CVjk thus keeps the information of the classes related to the kth possible value of the feature Cj. For example in Table 2, the record vector (RV11) for C1 = M is 1100100000 and the one (RV51) for C5 = 1 is 1110000001. Since the bitwise intersection of 1100100000 and 1110000001 is 1100000000, not equal to ZERO10, the first bit in CV11 is thus 1. Similarly, the second and third bits in CV11 are 1 and 0, from the intersection results of RV11 with RV52 and with RV53, respectively. The class vector CV11 is thus 110. All the class vectors are shown in the fourth column of Table 2. Formally, a class vector CVjk can be obtained by the following FindClassVector algorithm.

Algorithm 1 (FindClassVector).
Input: Record vector RVjk.
Output: Class vector CVjk.
Step 1: Set CVjk to ZEROrm.
Step 2: For each i, 1 ≤ i ≤ rm, set the ith bit of CVjk to 1 if RVjk ∩ RVmi ≠ ZEROn; otherwise, set it to 0.
Step 3: Return CVjk.

Definition 3 (Feature-value vector). A feature-value vector Fjk is the concatenation of RVjk and CVjk.
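As a concrete illustration of Definitions 1–3 and the FindClassVector algorithm, the following minimal Python sketch rebuilds RV11, CV11 and F11 from the Table 1 data. It is only our illustration of the definitions, not the authors' implementation; the function and variable names are ours, and bit vectors are kept as strings of '0'/'1' so that they can be compared directly with the worked example.

```python
# A minimal sketch (ours) of Definitions 1-3 and Algorithm 1, using the Table 1 data.
records = [
    ("M", "L", 3, "M", 1), ("M", "L", 1, "H", 1), ("L", "L", 1, "M", 1),
    ("L", "R", 3, "M", 2), ("M", "R", 2, "M", 2), ("L", "R", 3, "L", 3),
    ("H", "R", 3, "L", 3), ("H", "N", 3, "L", 3), ("H", "N", 2, "H", 2),
    ("H", "N", 2, "H", 1),
]
n, m = len(records), len(records[0])          # n records, m features (C_m is the decision feature)

def record_vector(j, value):
    """RV_jk: bit i is 1 iff record R_i has the given value for feature C_j (j is 0-based here)."""
    return "".join("1" if r[j] == value else "0" for r in records)

def intersect(u, v):
    """Bitwise AND of two bit strings of equal length."""
    return "".join("1" if a == "1" and b == "1" else "0" for a, b in zip(u, v))

# Record vectors of the decision feature C5, in the order used in Table 2 (values 1, 2, 3).
decision_values = sorted(set(r[m - 1] for r in records))
decision_rvs = [record_vector(m - 1, v) for v in decision_values]

def class_vector(rv):
    """Algorithm 1 (FindClassVector): bit i is 1 iff RV intersects RV_mi."""
    return "".join("1" if "1" in intersect(rv, rv_mi) else "0" for rv_mi in decision_rvs)

rv11 = record_vector(0, "M")                  # RV_11 for C1 = M
cv11 = class_vector(rv11)                     # CV_11
print(rv11, cv11, rv11 + cv11)                # 1100100000 110 1100100000110 (the vector F_11)
```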
For example, the feature-value vector F11 in Table 2 is 1100100000110, which is RV11 concatenated with CV11. All the feature-value vectors for a feature are then collected together as a feature matrix. This is defined below.

Definition 4 (A feature matrix for a feature). A feature matrix Mj for the feature Cj is denoted [Fj1; Fj2; ...; Fjrj], where rj is the number of possible values in Cj.

For example, the feature matrix M1 in Table 2 is shown as follows:

M1 = [ 1100100000110
       0011010000111
       0000001111111 ]

The last three bits of each row (underlined in the original typeset paper) are the class vectors. From the definition of the feature matrix, it is easily derived that applying the bitwise operator "OR" on all the record vectors in a feature matrix gives the ONEn vector, and applying the bitwise operator "AND" on any two record vectors in a feature matrix gives the ZEROn vector. Note that the "OR" and "AND" operators are defined as executing the "OR" and "AND" operations on all corresponding bits of the given two bit vectors. Furthermore, since every record carries exactly one value of each feature, applying the bitwise operator "XOR" on all the record vectors in a feature matrix also gives the ONEn vector. Take M1 as an example. The result of 1100100000 OR 0011010000 OR 0000001111 is 1111111111. The result of 1100100000 AND 0011010000 is 0000000000. The result of 1100100000 XOR 0011010000 XOR 0000001111 is 1111111111.

Definition 5 (A feature matrix for a table T). A feature matrix M for a table T is denoted [M1; M2; ...; Mm], where m is the number of features in T.

For example, the matrix composed of the bit strings from columns 3 and 4 of Table 2 is the feature matrix for the data given in Table 1. The feature matrix for a table is then input to the feature selection phase to find relevant and sufficient features.

3.3. Feature selection phase

In this phase, we want to find a set of relevant and sufficient features to represent the given dataset. It is further divided into several stages. First, a feature-based spanning tree is built for cleansing the bitmap indexing matrix. Records with noisy information are thus identified and filtered out according to the spanning tree. The cleansed, noise-free bitmap indexing matrix can then be used to determine the optimal feature set for classification and clustering problems.

3.3.1. Creating cleansing tree

Before the feature selection phase is executed, the correctness of the target table needs to be verified. If there are some records in the target table with the same values of all condition features, but with different values of the decision feature, they are treated as noise records and are filtered out from the target table. Intuitively, every two records can be compared to find the inconsistent records in the target table; the time complexity of doing so is O(n²m), where n is the number of records and m is the number of features. Below, we propose the concept of a cleansing tree to decrease the time complexity to O(nmj), where j is the maximum number of possible values of a feature; note that n is usually much larger than j in general classification and clustering problems. The formation of a cleansing tree depends on the given feature order. We thus have the following definition.

Definition 6 (Spanned feature order). A spanned feature order O is a permutation consisting of all the condition features in a target table T.

For example in Table 1, ⟨C1, C2, C3, C4⟩ can be a spanned feature order. When a spanned feature order is given, a cleansing tree can then be built according to it.
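Before turning to the cleansing tree, a short sketch (ours, not from the paper) checks the OR/AND/XOR properties of a feature matrix stated above on M1; it represents record vectors as Python integers so that the bitwise operators map directly onto |, & and ^.

```python
# Consistency check (our sketch) of the feature-matrix properties for C1 of Table 2:
# OR of all record vectors gives ONE_n, AND of any two gives ZERO_n, and XOR of all
# of them gives ONE_n, because each record carries exactly one value of C1.
from functools import reduce

n = 10
rvs = [0b1100100000, 0b0011010000, 0b0000001111]     # RV_11, RV_12, RV_13 as integers

one_n = (1 << n) - 1                                 # ONE_10 = 1111111111
assert reduce(lambda a, b: a | b, rvs) == one_n      # OR  -> ONE_n
assert all(rvs[i] & rvs[j] == 0                      # AND -> ZERO_n for any two vectors
           for i in range(len(rvs)) for j in range(i + 1, len(rvs)))
assert reduce(lambda a, b: a ^ b, rvs) == one_n      # XOR -> ONE_n
print("feature matrix properties hold for C1")
```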
The definition of a cleansing tree is first given below.

Definition 7 (Cleansing tree). A cleansing tree Ctree is a tree with a root denoted root[Ctree]. Every node x in the tree corresponds to a feature value. A node y is the parent of a node x if the feature of y precedes the feature of x in the given spanned feature order. A node z is the sibling of a node x if they have the same feature, but different values.

The structure of a cleansing tree is shown in Fig. 2. Its maximum height is m−1, where m is the number of features in a decision table T. Each node x has three pointers, p[x], left-child[x] and right-sibling[x], respectively pointing to its parent node, its leftmost child node and its first right sibling node. It also contains two additional fields, record[x] and class[x], which hold the record vector and class vector associated with x. If node x has no child, then left-child[x] = NIL; if node x is the rightmost child of its parent, then right-sibling[x] = NIL.

Fig. 2. The structure of a cleansing tree.

As mentioned above, records may have the same values of all condition features, but different values of the decision feature. These records are called inconsistent. Inconsistent records can be found while the cleansing tree is built. The building algorithm uses a valid mask vector to identify the consistent records. The valid mask vector is defined as follows.

Definition 8 (Valid mask vector). A valid mask vector ValidMask for a target table T is a bit string b1, b2, ..., bn, with bi set to 1 if the ith record Ri is not inconsistent with other records, and set to 0 otherwise.

The cleansing tree for a given spanned feature order can be built by the following CreateCleansingTree algorithm. The ValidMask is initially set to ONEn, and is modified along with the execution of the CreateCleansingTree algorithm.

Algorithm 2 (CreateCleansingTree).
Input: A feature matrix M, the valid mask ValidMask and a spanned feature order O.
Output: The valid mask ValidMask.
Step 1: Create an empty node x and set it as the root node.
Step 2: Initialize record[x] = ONEn, class[x] = ONErm and depth = 0, where the variable depth is used to represent the depth of the node x in the cleansing tree.
Step 3: Set px = x, where px is used to keep the current parent node.
Step 4: If class[x] is not equal to UNIQUErm and depth is not equal to m−1, do Step 5 to build the child nodes of node x; otherwise, go to Step 7.
Step 5: Let Cj be the current feature in the spanned feature order to be considered. For each feature-value vector Fjk in the feature matrix Mj for Cj, if (record[px] AND RVjk) ≠ ZEROn, do the following sub-steps:
Step 5.1: Create an empty node y.
Step 5.2: If left_child[x] = NIL, consider y as a child node of x and set p[y] = x and left_child[x] = y; otherwise, consider y as a sibling node of x and set p[y] = p[x] and right_sibling[x] = y.
Step 5.3: Set record[y] = (record[p[y]] AND RVjk) and class[y] = (class[p[y]] AND CVjk).
Step 5.4: If depth = m−1 and class[y] ≠ UNIQUErm, set ValidMask = (record[y] XOR ValidMask).
Step 5.5: Set x = y.
Step 6: If left_child[px] ≠ NIL, set x = left_child[px], depth = depth + 1 and go to Step 3. Otherwise, do the next step.
Step 7: If right_sibling[x] ≠ NIL, set x = right_sibling[x] and go to Step 3; otherwise, set x = p[x] and do the next step.
Step 8: If x ≠ root[Ctree], go to Step 7; otherwise, return ValidMask and stop the algorithm.

For example, the cleansing tree for the data in Table 1 with the spanned feature order ⟨C1, C2, C3, C4⟩ is built as shown in Fig. 3. At first, the root node is generated and all the bits in record[root] and class[root] are set to 1. Since class[root] is not equal to UNIQUE3 and the current depth is 0, not equal to m−1, the next step is executed to build the child nodes of the root. The first feature C1 in the spanned feature order is considered. Since it has three possible values and (record[root] AND RV1k), k = 1 to 3, is not equal to ZERO10, three nodes, represented as nodes 1, 2 and 3, are created as the children of the root. Since node 1, the left-child node of the root, is not NIL, it is then processed to generate its child nodes in the same way. Nodes 4 and 5 are then created for the second feature C2 in the spanned feature order. Since class[node 4] is already equal to UNIQUE3, the sibling of node 4, which is node 5, is then considered. Since class[node 5] is also equal to UNIQUE3, the sibling of node 5 is then considered. But since node 5 has no sibling, its parent node, node 1, is considered. The sibling of node 1, which is node 2, is then processed. The same procedure is executed until the whole cleansing tree is generated. The numbers at the left of the nodes in Fig. 3 indicate the order in which they were built.

Fig. 3. Cleansing tree with spanned feature order ⟨C1, C2, C3, C4⟩.

In node 15, more than one bit of the class vector is "1" (the bits for classes 1 and 2). It means that the records covered by its record vector belong to more than one class. These records are inconsistent, since their values of all condition features are the same but their values of the decision feature are different. In this example, the ninth and tenth records are inconsistent. The ValidMask is thus modified from "1111111111" to "1111111100".

3.3.2. Finding appropriate spanned order

In the above example, the spanned feature order O is set as ⟨C1, C2, C3, C4⟩. Different orders will apparently affect the performance of the cleansing spanning trees built. A cleansing spanning tree with a better spanned feature order can reduce the space and time complexities. In the past, there were some famous tree structures for classification, such as the decision-tree approach (Quinlan, 1986), which was based on the entropy theory to select the next best feature. In order to reduce the computational complexity of evaluating the spanning order of features, the following heuristics are thus proposed:

H1: The more '1' bits the record vector for a feature value has, the more weight the feature value has.
H2: The more '1' bits the class vector for a feature value has, the less weight the feature value has.

These two heuristics show the relationship between feature values and classes. If a feature value appears in many records that share a single class, the weight of this feature value is relatively high. These heuristics can be used to save computation time when compared to using the entropy theory.
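As a quick illustration of these heuristics, the sketch below (ours, with hypothetical helper names) scores each condition feature of Table 2 by summing Count(RV)/[Count(CV)]² over its values, which is the weighting used in Step 2 of the FindSpanOrder algorithm given next; it reproduces the numbers of Table 3 and the new spanned order found in this subsection.

```python
# Sketch (ours) of the H1/H2 weighting: each feature value contributes
# Count(RV) / Count(CV)^2, so values covering many records (H1) but few classes (H2)
# push their feature's weight up.
table2 = {                                    # (record vector, class vector) pairs from Table 2
    "C1": [("1100100000", "110"), ("0011010000", "111"), ("0000001111", "111")],
    "C2": [("1110000000", "100"), ("0001111000", "011"), ("0000000111", "111")],
    "C3": [("1001011100", "111"), ("0110000000", "100"), ("0000100011", "110")],
    "C4": [("1011100000", "110"), ("0100000011", "110"), ("0000011100", "001")],
}

def weight(feature_values):
    return sum(rv.count("1") / cv.count("1") ** 2 for rv, cv in feature_values)

weights = {f: round(weight(v), 2) for f, v in table2.items()}
order = sorted(weights, key=weights.get, reverse=True)
print(weights)   # {'C1': 1.53, 'C2': 4.33, 'C3': 3.31, 'C4': 4.75}  (cf. Table 3)
print(order)     # ['C4', 'C2', 'C3', 'C1'] -- the spanned order found in Section 3.3.2
```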
The following FindSpanOrder algorithm is thus proposed to determine the spanned feature sequence O of all condition features by evaluating the feature weights according to the above heuristics.

Algorithm 3 (FindSpanOrder).
Input: A feature matrix M for a table T.
Output: A spanned feature order O.
Step 1: Initialize weightj = 0, where 1 ≤ j ≤ m−1.
Step 2: For each Mj in M, set

    weightj = Σ (k = 1 to rj) Count(RVjk) / [Count(CVjk)]²,

where the function Count(x) counts the number of '1' bits in x.
Step 3: Order the features in O in descending order of the weight values.
Step 4: Return O.

For example, according to the feature matrix in Table 2, the weight of each feature is calculated as shown in Table 3. The new spanned feature sequence O determined by the above algorithm is thus ⟨C4, C2, C3, C1⟩, instead of the original order ⟨C1, C2, C3, C4⟩. The cleansing tree generated on the new order is shown in Fig. 4. As we can see, the cleansing tree with the new feature order O = ⟨C4, C2, C3, C1⟩ in Fig. 4 is much smaller than that in Fig. 3. The number of nodes has decreased from 15 to 9. Therefore, the computational time for generating and traversing the spanning tree can be greatly reduced.

3.3.3. Cleansing feature matrix

After the cleansing tree is built, the ValidMask may not be ONEn since inconsistent records may exist. The ValidMask is then used by the following CleansingFeatureMatrix algorithm to remove the inconsistent records from the feature matrix.

Algorithm 4 (CleansingFeatureMatrix).
Input: A feature matrix M for a table T and a valid mask vector ValidMask.
Output: A cleansed feature matrix M.
Step 1: For each feature-value vector Fij in M, do the following sub-steps:
Step 1.1: Set RVij = RVij AND ValidMask.
Step 1.2: Set CVij = FindClassVector(RVij).
Step 2: Return M.

For example, the ValidMask is set to "1111111100" after the cleansing tree for Table 1 is built. Since the ninth and tenth bits of the ValidMask are 0, the CleansingFeatureMatrix algorithm sets these two bits of all the record vectors in Table 2 to 0. The class vector of each feature value is then recalculated by the FindClassVector algorithm according to its new record vector. The revised feature matrix is shown in Table 4.

3.3.4. Some definitions on feature sets

To effectively distinguish the classes from the feature values, we must extend the concepts related to a single feature to feature sets. The following definitions are thus needed.

Definition 9 (Power of a feature set). C^s is called the s-power of a feature set C if each element in C^s is composed of s distinct condition features from C, 1 ≤ s ≤ m−1. Thus, we have C^1 = C.

For example, the power set C^1 for the data in Table 1 is {{C1}, {C2}, {C3}, {C4}}. The power set C^2 is {{C1,C2}, {C1,C3}, {C1,C4}, {C2,C3}, {C2,C4}, {C3,C4}}. Let |C^s| denote the cardinality of C^s. Then |C^s| = C(m−1, s), the number of ways of choosing s condition features from the m−1 available ones. Let C^s_j denote the jth element in C^s, 1 ≤ j ≤ |C^s|. C^s_j is then a feature set. Also let V^s_j denote the domain of C^s_j, r^s_j denote the number of possible values in V^s_j, and V^s_jk denote the kth feature value of C^s_j. Each feature set can be represented by a name vector, defined below.

Definition 10 (Name vector of a feature set). The name vector NV^s_j of a feature set C^s_j is a bit string b1, b2, ..., bm−1, with bi set to 1 if feature Ci is included in C^s_j and set to 0 otherwise.

For the above example, C^1_1 denotes the first element in C^1, which is {C1}.
The name vector NV^1_1 is then 1000 since only C1 is included in C^1_1. For another example, C^2 = {{C1,C2}, {C1,C3}, {C1,C4}, {C2,C3}, {C2,C4}, {C3,C4}}. C^2_1 denotes the first element in C^2, which is {C1,C2}. The name vector NV^2_1 is then 1100 since C1 and C2 are included in C^2_1.

Table 3
Calculating the weight of each feature

Feature  Weight                   Old order  New order
C1       3/4 + 3/9 + 4/9 = 1.53   1          4
C2       3/1 + 4/4 + 3/9 = 4.33   2          2
C3       5/9 + 2/1 + 3/4 = 3.31   3          3
C4       4/4 + 3/4 + 3/1 = 4.75   4          1

Fig. 4. The cleansing tree generated on the new order ⟨C4, C2, C3, C1⟩.

Similar to a single feature, some terms related to a feature set are defined below.

Definition 11 (Record vector of a feature set). A record vector RV^s_jk of a feature-set value V^s_jk is a bit string b1, b2, ..., bn, with bi set to 1 if V^s_j(i) = V^s_jk and set to 0 otherwise, where 1 ≤ j ≤ |C^s| and 1 ≤ k ≤ r^s_j.

RV^s_jk thus keeps the information of the records with the kth possible value of the feature set C^s_j. A class vector, which is used to keep the information of the classes (values of the decision feature) with a specific value of a feature set, is defined below.

Definition 12 (Class vector of a feature set). A class vector CV^s_jk of a feature-set value V^s_jk is a bit string b1, b2, ..., brm, with bi set to 1 if RV^s_jk ∩ RVmi ≠ ZEROn, and set to 0 otherwise, where rm is the number of possible values of Cm and n is the number of records in R.

CV^s_jk thus keeps the information of the classes related to the kth possible value of the feature set C^s_j. A feature-value vector of a feature set is defined below.

Definition 13 (Feature-value vector of a feature set). A feature-value vector F^s_jk is composed of RV^s_jk and CV^s_jk.

Definition 14 (A feature matrix for a feature set). A feature matrix M^s_j for the feature set C^s_j is denoted [F^s_j1; F^s_j2; ...; F^s_jr^s_j], where 1 ≤ j ≤ |C^s| and r^s_j is the number of possible values in C^s_j.

Definition 15 (s-feature matrix for a table T). An s-feature matrix M^s for a table T is denoted [M^s_1; M^s_2; ...; M^s_|C^s|], where 1 ≤ s ≤ m−1.

3.3.5. Selecting a feature set

In this sub-section, two algorithms are proposed to find the desired feature set. The first algorithm, named the SelectingFeatureSet algorithm, is used to find a feature set from a given s-feature matrix. If there exists a feature set which is sufficient to decide all the records in the given dataset, the feature set is returned and the feature selection procedure stops. Otherwise, s is incremented and the SelectingFeatureSet algorithm is executed again. The second algorithm, named the CalculatingNextMatrix algorithm, derives the new feature matrix from the previous feature matrix. The SelectingFeatureSet algorithm is described as follows.

Algorithm 5 (SelectingFeatureSet).
Input: An s-feature matrix M^s for a table T.
Output: A selected feature set FS.
Step 1: Initialize FS = ∅ and j = 1.
Step 2: If j ≤ |C^s|, do the next step; otherwise go to Step 7.
Step 3: Set k = 1, where k is used to keep the index of the value currently processed in the feature set C^s_j.
Step 4: If k ≤ r^s_j, do the next step; otherwise go to Step 6.
Step 5: If CV^s_jk ≠ UNIQUErm, set j = j + 1 and go to Step 2; otherwise set k = k + 1 and go to Step 4.
Step 6: Set FS = C^s_j; that is, for each i from 1 to m−1, set FS = FS ∪ {Ci} if the ith bit of the name vector NV^s_j for the feature set C^s_j is equal to 1.
Step 7: Return FS.

Take the data in Table 1 as an example to illustrate the above algorithm. s is set to 1 at the beginning. The 1-feature matrix M^1 for the data is the same as the feature matrix M found before. The SelectingFeatureSet algorithm examines the 1-feature sets one by one. The first element M^1_1, which corresponds to {C1}, is processed first. The class vector CV^1_11 for the first feature value V^1_11 is 110, which is not equal to UNIQUE3. Using the feature set {C1} can thus not completely distinguish the classes. The other elements in the 1-feature matrix M^1 are then processed in a similar way. In this example, no element is chosen, and ∅ is returned. It means no single feature can completely distinguish the classes. s is then incremented, and the SelectingFeatureSet algorithm is executed on the new s-feature matrix. The new feature matrix can easily be derived from the previous feature matrix by the following CalculatingNextMatrix algorithm.

Table 4
The cleansed feature matrix of Table 2

Feature  Feature-value  Record vector  Class vector
C1       V11            1100100000     110
         V12            0011010000     111
         V13            0000001100     001
C2       V21            1110000000     100
         V22            0001111000     011
         V23            0000000100     001
C3       V31            1001011100     111
         V32            0110000000     100
         V33            0000100000     010
C4       V41            1011100000     110
         V42            0100000000     100
         V43            0000011100     001
C5       V51            1110000000     100
         V52            0001100000     010
         V53            0000011100     001

Algorithm 6 (CalculatingNextMatrix).
Input: An s-feature matrix M^s for a table T.
Output: An (s + 1)-feature matrix M^(s+1) for the table T.
Step 1: For each j, j = 1 to |C^s| − 1, do the following steps.
Step 2: For each l, l = (j mod m) + 1 to m, do the following sub-steps.
Step 2.1: Set NV^(s+1)_j = NV^s_j OR NV^1_l.
Step 2.2: Set the temporary counter k to 1.
Step 2.3: For each feature-value vector F^s_jx in M^s_j, 1 ≤ x ≤ |C^s_j|, do the following sub-steps:
Step 2.3.1: For each feature-value vector F^1_ly in M^1_l, 1 ≤ y ≤ |C^1_l|, do the following sub-steps:
Step 2.3.1.1: Set RV^(s+1)_jk = RV^s_jx AND RV^1_ly.
Step 2.3.1.2: Set CV^(s+1)_jk = CV^s_jx AND CV^1_ly.
Step 2.3.1.3: If CV^(s+1)_jk ≠ UNIQUErm, set CV^(s+1)_jk = FindClassVector(RV^(s+1)_jk).
Step 2.3.1.4: Set k = k + 1.
Step 3: Return the (s + 1)-feature matrix M^(s+1).

For example, the 2-feature matrix M^2 for the data in Table 1 is generated from the 1-feature matrix M^1 as follows. The name vector for the feature set C^2_1 is first calculated:

NV^2_1 = NV^1_1 OR NV^1_2 = 1000 OR 0100 = 1100.

The feature-value vector F^2_11 in M^2_1 is then calculated. The record vector is found as follows:

RV^2_11 = RV^1_11 AND RV^1_21 = 1100100000 AND 1110000000 = 1100000000.

The class vector is found as follows:

CV^2_11 = CV^1_11 AND CV^1_21 = 110 AND 100 = 100.

In a similar way, all the feature-value vectors in the 2-feature matrix M^2 can be found. The results are shown in Table 5. Note that in Step 2.3.1.2, the class vector derived by the bitwise "AND" operator denotes only the "possible" class distribution.
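A minimal sketch (ours, with hypothetical helper names) of this combination step and its quick check is given below; it reproduces the case F^2_21, i.e. the pair (C1 = M, C3 = 3) from the cleansed Table 4, which is examined in the text that follows.

```python
# Sketch (ours) of Steps 2.3.1.1-2.3.1.3: a 2-feature value is obtained by ANDing the
# record vectors of two single-feature values; ANDing the class vectors only gives a
# "possible" class distribution, so FindClassVector is rerun when it is not UNIQUE.

def bits_and(u, v):
    return "".join("1" if a == b == "1" else "0" for a, b in zip(u, v))

# Record vectors of the decision feature C5 after cleansing (Table 4).
decision_rvs = ["1110000000", "0001100000", "0000011100"]

def find_class_vector(rv):                    # Algorithm 1 applied to the cleansed data
    return "".join("1" if "1" in bits_and(rv, rv_mi) else "0" for rv_mi in decision_rvs)

# C1 = M (RV_11, CV_11) combined with C3 = 3 (RV_31, CV_31), both taken from Table 4.
rv, cv = "1100100000", "110"
rv2 = bits_and(rv, "1001011100")              # RV of the pair (C1 = M, C3 = 3)
cv2 = bits_and(cv, "111")                     # quick check: only the "possible" classes
if cv2.count("1") != 1:                       # not UNIQUE -> recompute exactly
    cv2 = find_class_vector(rv2)
print(rv2, cv2)                               # 1000000000 100
```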
For example, the feature-value vector F^2_21 consists of RV^2_21 = 1000000000 and CV^2_21 = 110 after Step 2.3.1.2. Since each record belongs to only one class, this class vector is not correct; in fact, the class vector CV^2_21 = 100. Step 2.3.1.2 is used only as a quick check. If CV^(s+1)_jk ≠ UNIQUErm, then the FindClassVector algorithm is run in Step 2.3.1.3 to find the correct class vector.

Table 5
The 2-feature matrix M^2 found by the CalculatingNextMatrix algorithm

Feature set  Feature-set value  Name vector  Record vector  Class vector
C^2_1        V^2_11             1100         1100000000     100
             V^2_12             1100         0000100000     010
             V^2_13             1100         0010000000     100
             V^2_14             1100         0001010000     011
             V^2_15             1100         0000001000     001
             V^2_16             1100         0000000100     001
C^2_2        V^2_21             1010         1000000000     100
             V^2_22             1010         0100000000     100
             V^2_23             1010         0000100000     010
             V^2_24             1010         0001010000     011
             V^2_25             1010         0010000000     100
             V^2_26             1010         0000001100     001
C^2_3        V^2_31             1001         1000100000     110
             V^2_32             1001         0100000000     100
             V^2_33             1001         0011000000     110
             V^2_34             1001         0000010000     001
             V^2_35             1001         0000001100     001
C^2_4        V^2_41             0110         1000000000     100
             V^2_42             0110         0110000000     100
             V^2_43             0110         0001011000     011
             V^2_44             0110         0000100000     010
             V^2_45             0110         0000000100     001
C^2_5        V^2_51             0101         1010000000     100
             V^2_52             0101         0100000000     100
             V^2_53             0101         0001100000     010
             V^2_54             0101         0000011000     001
             V^2_55             0101         0000000100     001
C^2_6        V^2_61             0011         1001000000     110
             V^2_62             0011         0000011100     001
             V^2_63             0011         0010000000     100
             V^2_64             0011         0100000000     100
             V^2_65             0011         0000100000     010

After the new feature matrix is derived, the SelectingFeatureSet algorithm is executed again to find an appropriate feature set. For the above example, the 2-feature matrix M^2 is then input to the SelectingFeatureSet algorithm, and the feature set FS = {C2,C4} is found and returned as the solution.

After the above method is executed, the feature set FS for classifying the given dataset T is generated. FS may be over-fitting or under-fitting for the problem since it is derived only according to the current dataset. These features are then evaluated and modified by domain experts. They thus serve as candidates that give the experts a good initial standpoint.

4. Time and space complexity analysis

The time and space complexities of the proposed algorithms are analyzed in this section. Let n be the number of records, m be the number of features and c be the number of classes. Also define i as the maximum possible number of features in a feature set, j as the maximum number of possible values of a feature, and s as the number of iterations. The time and space complexities of each step in the FindClassVector algorithm are shown in Table 6.

Table 6
The time and space complexities of the FindClassVector algorithm

Step no.  Time complexity  Space complexity
Step 1    O(1)             O(c)
Step 2    O(jc)            O(jc)
Step 3    O(1)             O(c)
Total     O(jc)            O(jc)

The time and space complexities of each step in the CreateCleansingTree algorithm are shown in Table 7. Note that the maximum number of nodes in a Ctree is n.

Table 7
The time and space complexities of the CreateCleansingTree algorithm

Step no.  Time complexity  Space complexity
Step 1    O(1)             O(1)
Step 2    O(1)             O(1)
Step 3    O(1)             O(1)
Step 4    O(nmj)           O(n)
Step 5    O(mj)            O(n)
Step 6    O(1)             O(1)
Step 7    O(1)             O(1)
Total     O(nmj)           O(n)

The time and space complexities of each step in the FindSpanOrder algorithm are shown in Table 8, and those of each step in the CleansingFeatureMatrix algorithm are shown in Table 9.

Table 8
The time and space complexities of the FindSpanOrder algorithm
Step no.  Time complexity       Space complexity
Step 1    O(m)                  O(m)
Step 2    O(cm)                 O(cm)
Step 3    O(c lg c)             O(c)
Step 4    O(1)                  O(1)
Total     O(max(cm, c lg c))    O(cm)

Table 9
The time and space complexities of the CleansingFeatureMatrix algorithm

Step no.  Time complexity  Space complexity
Step 1    O(mj)            O(mj)
Step 2    O(1)             O(1)
Total     O(mj)            O(mj)

The time and space complexities of each step in the SelectingFeatureSet algorithm are shown in Table 10, and those of each step in the CalculatingNextMatrix algorithm are shown in Table 11.

Table 10
The time and space complexities of the SelectingFeatureSet algorithm

Step no.  Time complexity  Space complexity
Step 1    O(1)             O(1)
Step 2    O(m^s j^s)       O(1)
Step 3    O(1)             O(1)
Step 4    O(j^s)           O(1)
Step 5    O(1)             O(1)
Step 6    O(c)             O(c)
Step 7    O(1)             O(1)
Total     O(m^s j^s)       O(c)

Table 11
The time and space complexities of the CalculatingNextMatrix algorithm

Step no.  Time complexity  Space complexity
Step 1    O(m^s j^s)       O(m^s j^s)
Step 2    O(mj)            O(mj)
Step 3    O(1)             O(j)
Total     O(m^s j^s)       O(m^s j^s)

5. Experiments

To evaluate the performance of the proposed method, we compare it with other feature selection methods. Our target machine has a Pentium III 1 GHz processor and runs the Microsoft Windows 2000 multithreaded OS, with a 512 KB L2 cache and 256 MB of shared memory.

Several datasets from the UCI Repository (Quinlan, 1986) are used for the experiments. These datasets have different characteristics. Some have known relevant features (such as the Monk datasets), some have many classes (such as SoybeanL), and some have many instances (such as Mushroom). In addition, a large real dataset about endowment insurances from a world-wide financial group is used to examine the usability of the proposed method. Experimental results show that the proposed method can discover the desired feature sets and can thus help the enterprise build a CBR system for the loan promotion function of its customer relationship management system. The insurance dataset uses 27 condition features to describe the states of 3 different insurance types. Different types of attribute values, including date/time, numeric and symbolic data, exist; they are all transformed into the symbolic type by some clustering methods. Six of the features have missing values.

The characteristics of the above datasets are summarized in Table 12.

Table 12
The datasets used in the experiments

Database name  Class no.  Condition feature no.  Record no.  Missing features
Monk1          2          6                      124         No
Monk2          2          6                      169         No
Monk3          2          6                      122         No
Vote           2          16                     300         No
Mushroom       2          22                     8124        Yes
SoybeanL       19         35                     683         Yes
Insurance      3          27                     35000       Yes

In the experiments, the accuracy, the number of selected features, and the computation time are compared between our method and the traditional rough set method. The accuracy is measured by the classification results on the target table. If the selected feature set can solve the problem without any error, 100% accuracy is reached; otherwise the accuracy is calculated as the number of correctly classified records over the total number of records. Experimental results show that both methods reach 100% accuracy.

We then compare the feature sets found by the two approaches. The results are shown in Table 13. Obviously, the accuracy of all datasets is 100% since both methods discover minimal feature sets. Note that there may be more than one solution for the selected features. In Table 13, only the first selected feature set (in alphabetical order) is listed.
It is easily seen that the selected feature sets of our proposed approach and the traditional rough set approach are the same except for the SoybeanL problem, which requires too much computation time with the traditional rough set approach. The numbers of selected features found by the two approaches are shown in Table 14. Both methods get the same numbers for all problems except SoybeanL.

Finally, the computation time is compared. The datasets are first loaded into memory from the hard disk and the processing times are then measured. The time is rounded to 0 if the real time is less than 0.001 seconds. The results are shown in Table 15. Consistent with our expectation, the proposed approach is much faster than the traditional rough set approach. Especially for the Insurance data, our approach needs only about 40 min, but the traditional rough set approach needs much more computation time.

Table 13
The selected feature sets found by the two approaches

Dataset    Traditional rough set approach   Bitmap-based approach                                    Accuracy (%)
Monk1      C1, C2, C5                       C1, C2, C5                                               100
Monk2      C1–C6                            C1–C6                                                    100
Monk3      C1, C2, C4, C5                   C1, C2, C4, C5                                           100
Vote       C1–C4, C9, C11, C13, C16         C1–C4, C9, C11, C13, C16                                 100
Mushroom   C3, C4, C11, C20                 C3, C4, C11, C20                                         100
SoybeanL   Needs too much computation time  C14, C20, C26, C27, C29, C30, C31, C32, C33, C34, C35    100
Insurance  C4, C15, C17, C20, C22, C25      C4, C15, C17, C20, C22, C25                              100

Table 14
The number of selected features found by the two approaches

Dataset    Traditional RS  Bitmap-based
Monk1      3               3
Monk2      6               6
Monk3      4               4
Vote       8               8
Mushroom   4               4
SoybeanL   11              11
Insurance  6               6

Table 15
The CPU times needed by the two approaches (in seconds)

Dataset    Traditional RS  Bitmap-based
Monk1      0.07            0
Monk2      0.351           0.01
Monk3      0.141           0
Vote       428.19          1.923
Mushroom   4911.32         27.91
SoybeanL   >1000000        247805
Insurance  468656          2435.66

6. Conclusion and future work

In this paper, we have proposed a bit-based feature selection approach to discover optimal feature sets for a given table (dataset). In this approach, the feature values are first encoded into bitmap indices so that optimal solutions can be searched for efficiently. The corresponding indexing and selection algorithms are also described in detail for implementing the proposed approach. Experimental results on different datasets have also shown the efficiency and accuracy of the proposed approach.

The traditional rough set approach has two very time-consuming parts: the combination of features and the comparison of upper/lower approximations. In this paper, we use single-clock-cycle bitwise operations to shorten the computation time of the comparison part. Moreover, the workload in the combination part is greatly reduced since new levels of combinations can be generated from the previous ones. The bitwise operations are also used to speed up the combination generation. The proposed feature selection approach also adopts appropriate meta-data structures to take advantage of the computational power of the bitwise operations.

The feature selection problem is generally an NP-complete problem. Although the proposed approach can process a larger number of features than the traditional rough set approach, it still becomes unmanageable, especially when the number of features is huge or when the number of possible values of the features is large.
In the future, we will continue to investigate and design efficient heuristic approaches for managing huge numbers of features and possible values. We will also attempt to integrate different feature selection approaches so that an appropriate one can be selected automatically for optimal or near-optimal solutions according to the characteristics of the given datasets.

References

Almullim, H. et al. (1991). Learning with many irrelevant features. In Proceedings of the ninth national conference on artificial intelligence, pp. 547–552.
Brassard, G. et al. (1996). Fundamentals of algorithmics. New Jersey: Prentice Hall.
Chen, W. C., Tseng, S. S., Chen, J. H., & Jiang, M. F. (2000). A framework of feature selection for the case-based reasoning. In Proceedings of the IEEE international conference on systems, man, and cybernetics, CD-ROM.
Chen, W. C., Tseng, S. S., Chang, L. P., & Jiang, M. F. (2001). A similarity indexing method for the data warehousing – bit-wise indexing method. In Lecture Notes in Artificial Intelligence (Vol. 2035), pp. 525–537.
Chen, W. C., Tseng, S. S., Chang, L. P., & Hong, T. P. (2002). A parallelized indexing method for large-scale case-based reasoning. Expert Systems with Applications, 23(2), 95–102.
Chen, W. C., Yang, M. C., & Tseng, S. S. (2002). A high-speed feature selection method for large dimensional data set. In Proceedings of the international computer symposium, CD-ROM.
Chen, W. C., Yang, M. C., & Tseng, S. S. (2003). The bitmap-based feature selection method. In 18th ACM symposium on applied computing (SAC), data mining track, CD-ROM.
Choubey, S. K. et al. (1998). On feature selection and effective classifiers. Journal of ASIS, 49(5), 423–434.
Doak, J. (1992). An evaluation of feature selection methods and their application to computer security. Technical Report, University of California.
Gonzalez, A., & Perez, R. (2001). Selection of relevant features in a fuzzy genetic learning. IEEE Transactions on SMC-Part B, 31(3), 417–425.
Huang, C. C., & Tseng, B. (2004). Rough set approach to case-based reasoning. Expert Systems with Applications, 26(3), 369–385.
John, G. H. et al. (1994). Irrelevant features and the subset selection problem. In Proceedings of the 11th international conference on machine learning, pp. 121–129.
Last, M., Kandel, A., & Maimon, O. (2001). Information-theoretic algorithm for feature selection. Pattern Recognition Letters, 22, 799–811.
Lee, H. M., Chen, C. M., Chen, J. M., & Jou, Y. L. (1997). An efficient fuzzy classifier with feature selection based on fuzzy entropy. IEEE Transactions on SMC-Part B, 27(2), 426–432.
Liu, H. et al. (1996). A probabilistic approach to feature selection – a filter solution. In Proceedings of the 13th international conference on machine learning, pp. 319–327.
Liu, H., & Setiono, R. (1998). Incremental feature selection. Applied Intelligence, 9, 217–230.
Miao, D. Q. et al. (1999). A heuristic algorithm of reduction for knowledge. Journal of Computer Research and Development, 36(6), 681–684.
O'Neil, P., & Quass, D. (1997). Improved query performance with variant indexes. ACM SIGMOD Record, 26(2), 38–49.
Pawlak, Z. (1982). Rough sets. International Journal of Computer and Information Sciences, 341–356.
Pawlak, Z. (1991). Rough sets: Theoretical aspects of reasoning about data. Boston: Kluwer Academic Publishers.
Quinlan, J. (1986). Induction of decision trees. Machine Learning, 1(1), 81–106.
Raymer, M. L., Punch, W. F., Goodman, E. D., & Kuhn, L. A. (2000).
Dimensionality reduction using genetic algorithm. IEEE Transactions on Evolutionary Computation, 4(2), 164–171.
Schlimmer, J. C. et al. (1993). Efficiently inducing determinations: A complete and systematic search algorithm that uses optimal pruning. In Proceedings of the 10th international conference on machine learning, pp. 284–290.
Skowron, A., & Rauszer, C. (1992). The discernibility matrices and functions in information systems. Intelligent Decision Support, 331–362.
Tseng, B., Jothishankar, M. C., & Wu, T. (2004). Quality control problem in printed circuit board manufacturing – An extended rough set theory approach. Journal of Manufacturing Systems, 23(1), 56–72.
Wu, F. B. et al. (1999). Control and Decision, 14(3), 206–211.
Wu, M. C., & Alejandro, P. B. (1998). Encoded bitmap indexing for data warehouses. In Proceedings of IEEE data engineering, pp. 220–230.
Yang, Y., & Chiam, T. C. (2000). Rule discovery based on rough set theory. In Proceedings of the third international conference on FUSION (Vol. 1), pp. TuC4_11–TuC4_16.
Yu, H. et al. (2001). Rough set based knowledge reduction algorithms. Computer Science, 28(5), 31–34.
Zhong, N., Dong, J., & Ohsuga, S. (2001). Using rough sets with heuristics for feature selection. Journal of Intelligent Systems, 16, 199–214.