Improved multiclass feature selection via list combination

Javier Izetta, Pablo F. Verdes, Pablo M. Granitto*

CIFASIS, French Argentine International Center for Information and Systems Sciences, UNR-CONICET, Bv. 27 de Febrero 210 Bis, 2000 Rosario, Argentina

Expert Systems With Applications (2017), doi: 10.1016/j.eswa.2017.06.043

Highlights

• We introduce new SVM-RFE feature selection methods for multiclass problems.
• We use binary decomposition followed by strategies to combine lists of features.
• We discuss statistical approaches and voting theory methods.
• One-vs-One methods give better results than One-vs-All methods.
• The new K-First method is the most effective at selecting relevant features.

Abstract

Feature selection is a crucial machine learning technique aimed at reducing the dimensionality of the input space. By discarding useless or redundant variables, it not only improves model performance but also facilitates interpretability. The well-known Support Vector Machines-Recursive Feature Elimination (SVM-RFE) algorithm provides good performance with moderate computational effort, in particular for wide datasets. When using SVM-RFE on a multiclass classification problem, the usual strategy is to decompose it into a series of binary ones, and to generate an importance statistic for each feature on each binary problem. These importances are then averaged over the set of binary problems to synthesize a single value for feature ranking. In some cases, however, this procedure can lead to poor selections. In this paper we discuss six new strategies, based on list combination, designed to yield improved selections starting from the importances given by the binary problems. We evaluate them on artificial and real-world datasets, using both One-vs-One (OVO) and One-vs-All (OVA) strategies. Our results suggest that the OVO decomposition is most effective for feature selection on multiclass problems. We also find that in most situations the new K-First strategy can find better subsets of features than the traditional weight average approach.

*Corresponding author. Email addresses: izetta@cifasis-conicet.gov.ar (Javier Izetta), verdes@cifasis-conicet.gov.ar (Pablo F. Verdes), granitto@cifasis-conicet.gov.ar (Pablo M. Granitto)
Keywords: Feature selection, Multiclass problems, Support Vector Machines

1. Introduction

Many important problems in Machine Learning, as well as in in-silico Chemistry (Raies & Bajic, 2016), Biology and "high-throughput" technologies (Golub et al., 1999; Leek et al., 2010) or text processing (Forman, 2003; Uysal, 2016), share the property of involving many more features than measured samples (Guyon & Elisseeff, 2003). The datasets associated with these problems are, unsurprisingly, called "wide". Usually, most of these variables carry relatively little importance for the problem at hand. Furthermore, in some cases they interfere with the learning process instead of helping it, a scenario usually referred to as the "curse of dimensionality".

Feature selection is an important Machine Learning pre-processing technique aimed at coping with this curse (Kohavi & John, 1997). Its main goal is to find a small subset of the measured variables that improves, or at least does not degrade, the performance of the modeling method applied to the dataset. But feature selection methods do not only avoid the curse of dimensionality: they also allow for a considerable reduction in model complexity, easier visualization and, in particular, a better interpretation of the data under analysis and of the developed models (Liu et al., 2005).

Several methods have been introduced in recent years, from general ones like wrappers (Kohavi & John, 1997) and filters (Kira & Rendell, 1992) to very specific ones developed for SVM (Weston et al., 2000; Nguyen & De la Torre, 2010) and RVM (Mohsenzadeh et al., 2013, 2016) classifiers. Amongst other methods in the field (Hua et al., 2009), the well-known Recursive Feature Elimination (RFE) algorithm provides good performance with moderate computational effort (Guyon et al., 2002) on wide datasets. The original and most popular version of this method uses a linear Support Vector Machine (SVM) (Vapnik, 2013) to select the candidate features to be eliminated. According to the SVM-RFE algorithm, the importance of an input variable i is directly correlated with the corresponding component $w_i$ of the vector $w$ defining the separating hyperplane. The method is widely used in Bioinformatics (Guyon et al., 2002; Statnikov et al., 2005). Alternative RFE methods using other classifiers have also been introduced in the literature (Granitto et al., 2006; You et al., 2014).

Typical feature selection algorithms, like the original version of RFE, are designed for binary classification problems. Multiclass problems have received much less attention, both because of their increased difficulty and because some classifiers involved in the selection process are designed to solve binary problems. Most methods available for feature selection on multiclass problems are simple extensions of base methods.
For example, RFE can be associated with a multiclass classifier like Random Forest (Breiman, 2001; Granitto et al., 2006).

Although SVM was originally developed to deal only with binary problems, it was extended to solve multiclass problems directly in different manners (Weston & Watkins, 1999; Crammer & Singer, 2001; Hsu & Lin, 2002), but with only modest success, attributed mainly to the increased complexity of the solutions. On the other hand, in recent years several methods were developed to solve a multiclass problem using an appropriate combination of binary classifiers (Allwein et al., 2000; Hsu & Lin, 2002). The most common strategy for multiclass SVM is known as "One-vs-One" (OVO). According to this approach, a classification problem with c classes is replaced with M = c(c-1)/2 reduced binary ones, each consisting of discriminating between a pair of classes. In order to classify a new example, it is passed through all binary classifiers and the class receiving the most votes is selected. Another useful strategy is "One-vs-All" (OVA). In this second case, a problem with c classes is replaced with M = c reduced binary problems, each consisting of discriminating a single class from all remaining ones.

Therefore, the most usual approach to implementing a multiclass SVM-RFE method is to directly apply the RFE algorithm over an OVO or OVA multiclass SVM (Ramaswamy et al., 2001; Duan et al., 2007; Zhou & Tuck, 2007). The pioneering work of Ramaswamy et al. (2001) proposed the OVA solution, but also compared results with the OVO strategy. Duan et al. (2007) and Zhou & Tuck (2007) developed slight variations of the method, always considering both OVA and OVO implementations. Zhou & Tuck (2007) also considered solutions to the RFE problem using a direct multiclass implementation.

Interestingly, the solutions to the multiclass SVM-RFE problem that we have just described involve an important decision about the feature selection process which is usually neglected: they rank features by simply averaging components over the binary problems. For an input variable i they use $\langle |w_{ij}| \rangle_j$, the mean importance over all binary problems j, as the corresponding importance. As we discuss in the next section, this strategy can lead to sub-optimal selections in many cases. Once the original multiclass problem has been divided into multiple binary ones, the feature selection problem can be treated in a similar way. Then, a possible solution is to cast the multiclass feature selection problem as the problem of selecting candidate features from multiple lists (Jurman et al., 2008), each list corresponding to a different binary sub-problem.

Similar solutions have been studied in related fields. In Bioinformatics, for example, Haury et al. (2011) discussed the combination of multiple lists of genes obtained from bootstrap samples of the same gene-expression dataset. Zhou & Dickerson (2014) and Zhou & Wang (2016) proposed the use of class-dependent features (different features for each binary problem) for biomarker discovery. Dittman et al. (2013) showed that combining multiple lists in binary classification problems can improve feature selection results. In a short work on text categorization, Neumayer et al.
(2011) suggested that the combination of rankings generated by diverse methods can improve on the results of any single method. Kanth & Saraswathi (2015) used class-dependent features for speech emotion recognition, but kept independent features for each class rather than producing a final unique list.

In this work we discuss in depth the combination of multiple lists in feature selection for multiclass classification problems. We first introduce a simple mathematical framework for multiple lists. Using this framework, we propose diverse strategies to produce improved selections of feature subsets with SVM-RFE. Also, we use specifically designed artificial datasets and real-world examples to evaluate them extensively, using both the OVO and OVA strategies.

The rest of this article is organized as follows: in Section 2, we describe the feature selection methods introduced in this work. In Section 3 we evaluate these methods on artificial datasets, and in Section 4 on real-world ones. Section 5 analyzes their computational burden. Finally, we draw our conclusions in Section 6.

2. List combination methods for SVM-RFE

The RFE selection method is a recursive process that ranks variables according to a given importance measure. At each iteration of the algorithm, the importance of each feature is calculated and the least relevant one is removed (in practice, not one but a group of low-relevance features is usually removed at each iteration to speed up the process). Recursion is needed because the relative importance of each feature can change substantially when evaluated over a different subset of features during the stepwise elimination process, in particular for highly correlated features. The inverse order in which features are eliminated is used to create a final ranking. Then, the feature selection process itself reduces to taking the first n features from this ranking.

In the original binary version of SVM-RFE (Guyon et al., 2002), the projection of $w$ (the normal vector to the SVM decision hyperplane) in the direction of feature i, $w_i$, is used as the importance measure. The method was efficiently extended to multiclass problems, employing the well-known OVO or OVA strategies to decompose the multiclass problem into a series of related binary ones (Ramaswamy et al., 2001; Duan et al., 2007; Zhou & Tuck, 2007). In both cases a set of M related binary problems is generated, each one solved by a vector $w_j$. For each binary problem j, the importance of feature i is given by the corresponding component $w_{ij}$.

In order to obtain a unique importance for each feature in this setup, the simplest solution is to average the absolute value of the components $|w_{ij}|$ over all related binary problems. We will call this method "Average" in the following. The Average solution is implemented, to the best of our knowledge, in all available RFE software packages, including the most popular amongst researchers (MATLAB, R and Python platforms).
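To make the baseline concrete, the following is a minimal sketch of one elimination step of the Average strategy, assuming scikit-learn's linear SVC (whose coef_ matrix holds one weight vector per OVO binary problem); the function name and the 10% elimination fraction are illustrative choices of ours, not the packages' actual internals.

    import numpy as np
    from sklearn.svm import SVC

    def average_rfe_step(X, y, keep_fraction=0.9):
        """One RFE elimination step under the Average strategy: rank
        features by the mean |w_ij| over all OVO binary problems and
        keep the top fraction. A sketch, not a package's exact code."""
        clf = SVC(kernel="linear", C=1.0).fit(X, y)
        # For c classes, coef_ has shape (c*(c-1)/2, n_features):
        # one weight vector w_j per One-vs-One binary problem j.
        importance = np.abs(clf.coef_).mean(axis=0)   # <|w_ij|>_j for each i
        n_keep = max(1, int(keep_fraction * X.shape[1]))
        keep = np.argsort(importance)[::-1][:n_keep]  # surviving feature indices
        return keep, importance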
However, the only real advantage of the Average strategy is its simplicity. Two main drawbacks of this approach should be taken into consideration but are usually ignored:

1. The first issue can be called the flattening problem. Consider, for example, a feature e which is able to separate class j from all remaining classes, but is uninformative in other cases. Component $w_{ej}$ will be large, but the components $w_{ek}$ with $k \neq j$ will be small, giving a low value for $\langle w_{ej} \rangle_j$. Consider now another feature d which gives a modest help in separating any class from the others, obtaining always moderate values of $w_{dj}$, and therefore a medium value for $\langle w_{dj} \rangle_j$. The Average strategy will clearly rank the latter over the former, but in most scenarios it will be desirable to keep the first variable over the second.

2. The second issue with the Average solution refers to relative scales. The length of vector $w_j$ is different for each binary problem, as it depends on the margin of the solution, which can change considerably for classes that are relatively close or far away in feature space. Averaging components of vectors of different lengths can lead to the selection of sub-optimal subsets.

New strategies for feature selection able to overcome these drawbacks are needed. Here we propose to cast the problem as a selection of candidate features from multiple ranking lists (Jurman et al., 2008). We start by decomposing the multiclass problem into a set of M related binary problems (through the OVA or OVO strategies). The problem involves a set of p features, $F = \{f_1, f_2, \ldots, f_p\}$. SVM-RFE produces a ranking (an ordered list) for each individual problem using the components $w_{ij}$. An example is shown in Table 1.

    Ranking   List 1   List 2   ...   List M
    -----------------------------------------
    1         f2       f3       ...   f1
    2         f1       f7       ...   f3
    3         f5       f2       ...   f6
    ...       ...      ...      ...   ...
    p         f8       f4       ...   f7

Table 1: List of ranked features for each binary problem.

This set of lists can be arranged in a matrix (Table 2) where each row shows the position of each feature in the ranking produced for the binary problem shown in each column.

    Features   List 1      List 2     ...   List M
    ------------------------------------------------
    f1         2           5          ...   1
    f2         1           4          ...   7
    f3         4           1          ...   2
    ...        ...         pos_{i,j}  ...   ...
    fp         pos_{p,1}   ...        ...   pos_{p,M}

Table 2: Matrix showing the position of each feature in the ranking of each binary problem. Rows correspond to features and columns to binary problems.

We can now define a matrix of relative ranking positions as

$$ r_{i,j} = \frac{p - \mathrm{pos}_{i,j}}{p} = 1 - \frac{\mathrm{pos}_{i,j}}{p}, $$

where $r_{i,j}$ is the relative ranking of feature $f_i$ in the list corresponding to binary problem j, $\mathrm{pos}_{i,j}$ is the position of the same feature in the corresponding ranking (Table 2), and p is the total number of features in the problem. Notice that the values of $r_{i,j}$ belong to the unit interval [0, 1] and depend linearly on the ranking position (a value of $1 - 1/p$ must be interpreted as the first position in the ranking).

Important features should be reflected in high values of $r_{i,j}$ for some i, j, meaning that they are relevant to at least some of the binary classification problems. Two main strategies can be used to select those relevant features from this matrix; they are discussed in the following.
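Before turning to the individual methods, the framework itself is easy to state in code. The sketch below builds the position matrix of Table 2 and the relative rankings $r_{i,j}$ from a set of ranked lists; names such as relative_rankings are our own.

    import numpy as np

    def relative_rankings(ranked_lists, p):
        """ranked_lists: M arrays, each a permutation of the p feature
        indices ordered from most to least important (Table 1).
        Returns pos (Table 2, 1-based positions) and r = 1 - pos / p."""
        M = len(ranked_lists)
        pos = np.empty((p, M), dtype=int)
        for j, order in enumerate(ranked_lists):
            # feature order[k] sits at position k+1 in list j
            pos[np.asarray(order), j] = np.arange(1, p + 1)
        return pos, 1.0 - pos / p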
2.1. Methods based on relative ranking statistics

The first strategy consists of measuring an appropriate statistic for each feature over all binary problems, and then using it to elaborate a final ranking of features. We selected the following four methods.

2.1.1. Average-SD

In this method, feature ranking is given by the average value of the relative position over all binary problems:

$$ R_i = \langle r_{i,j} \rangle_j, $$

where $R_i$ is the score of feature $f_i$ in the final ordered list, used to select features in the multiclass problem. Ties are broken by the standard deviation (SD) of the relative position (higher is better). We show in the next section that features with higher SD are preferable over lower-SD ones, because a larger SD means that the feature has some better-than-average rankings.

Average-SD can be considered the base strategy for multiple lists. It can overcome the relative-scales problem of averaging weights, but it is not expected to solve the flattening problem.

2.1.2. Best Ranking

In this second approach we rank every feature according to the best relative ranking it reaches over the set of binary problems:

$$ R_i = \max_j (r_{i,j}). $$

Ties are broken by the mean value of the relative position over all problems. A similar method has been used to select the winning class in multiple classifier systems (Ho et al., 1994). This strategy can be viewed as an extreme case, considering for each feature just one of the multiple rankings it receives and disregarding the rest. On the other hand, it is the most aggressive in dealing with the flattening problem.

2.1.3. 3Q-SD

The third method orders features according to the 3rd quartile of the distribution of relative rankings:

$$ R_i = 3Q(\{r_{i,j}\}_j), $$

where the 3Q function returns the 3rd quartile of its argument. As in Average-SD, ties are broken by the SD. This approach is intermediate between the two previous ones, searching for features that reach a high relative position, but also considering the full distribution of relative rankings.

2.1.4. K-First

This method is adapted from a strategy to select relevant documents in information retrieval (Nuray & Can, 2006). The idea is to consider only features located in the top k positions of each individual list. We re-scale the relative rankings with a linear mapping reaching 0 for the (k+1)-th feature, and then take the average of this new relative importance:

$$ r'_{i,j} = \max\left(1 - \frac{\mathrm{pos}_{i,j}}{k},\ 0\right), \qquad R_i = \langle r'_{i,j} \rangle_j, $$

where $r'_{i,j}$ is the re-scaled relative weight of feature $f_i$ and k is the number of features to be considered from each list (k < p). As in the Best Ranking method, ties are broken by the mean value of the original relative ranking, $\langle r_{i,j} \rangle_j$. We discuss the choice of the parameter k in the next section. This strategy is aimed at searching for features which are highly relevant for some of the problems, but is not limited to searching for the most relevant features, as the Best Ranking method is. It can potentially overcome both drawbacks of Average: relative scaling and flattening.
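Given the matrices pos and r from the previous sketch, the four statistics reduce to a few lines each. The following sketch returns features ordered best-first, using NumPy's lexsort (whose last key is primary) to apply the tie-breaking rules described above.

    import numpy as np

    def order_by(score, tie_key):
        """Features sorted by score (descending), ties by tie_key (descending)."""
        return np.lexsort((-tie_key, -score))

    def average_sd(r):         # R_i = <r_ij>_j, ties: larger SD first
        return order_by(r.mean(axis=1), r.std(axis=1))

    def best_ranking(r):       # R_i = max_j r_ij, ties: mean relative position
        return order_by(r.max(axis=1), r.mean(axis=1))

    def three_q_sd(r):         # R_i = 3rd quartile of r_ij, ties: larger SD
        return order_by(np.percentile(r, 75, axis=1), r.std(axis=1))

    def k_first(pos, k):       # r'_ij = max(1 - pos_ij/k, 0), ties: <r_ij>_j
        p = pos.shape[0]
        r_prime = np.maximum(1.0 - pos / k, 0.0)
        return order_by(r_prime.mean(axis=1), (1.0 - pos / p).mean(axis=1))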
2.2. Methods based on voting theory

The second general strategy is related to voting theory (Saari, 2001; Young, 1988). In this setup we consider each binary problem as a voter producing a ranking over a set of p candidates. Multiple methods were developed over the years to solve the problem of combining elector preferences to find winning candidates; the most useful of them are known collectively as "Condorcet methods". We focused on two popular procedures as selection methods for relevant features over multiple lists.

2.2.1. Condorcet

The most basic Condorcet method is known as Copeland's method, or simply as the Condorcet method (we will use the latter name in this work). It confronts each pair of features on every list (all binary problems), and then counts the number of wins minus the number of defeats for each feature (Young, 1988). A feature wins over another if it is ranked higher in the considered list. The global difference between wins and defeats is used to rank features in the multiclass problem. Ties are broken by average relative rankings.

2.2.2. Schulze

This method, introduced informally by Schulze in 1997 and published later (Schulze, 2011), represents an improvement over previous Condorcet methods. It begins by counting wins and defeats over each pair of features and all lists, storing these numbers in a pairwise preference matrix. Then a graph is constructed, with features as nodes and the values in the matrix as weights. Finally, using a variant of the Floyd-Warshall algorithm, the strongest path over the graph is selected for each pair of features, and the strengths of these paths are used to compare features. The strength of a path is defined as that of its weakest link (i.e., the lowest value in the matrix of preferences). A path between two nodes is valid if there is a sequence of strictly decreasing weights connecting them (Schulze, 2011). Features with more wins upon strength comparison are ranked first. The method is expected to perform better than the basic Condorcet method, but the computational load involved is significant.
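Both procedures can also be written compactly on top of the position matrix (a smaller position means ranked higher). The sketch below is our own illustrative rendering of the two descriptions above, with the Schulze strongest paths computed by a Floyd-Warshall-style update; it is cubic in p, in line with the computational cost discussed in Section 5.

    import numpy as np

    def condorcet(pos):
        """Copeland/Condorcet: score = wins - defeats over all pairs and
        lists; ties broken by average relative ranking."""
        p, M = pos.shape
        wins = np.array([[np.sum(pos[a] < pos[b]) for b in range(p)]
                         for a in range(p)])
        score = (wins - wins.T).sum(axis=1)       # wins minus defeats per feature
        r_mean = (1.0 - pos / p).mean(axis=1)     # tie-break key
        return np.lexsort((-r_mean, -score))

    def schulze(pos):
        """Schulze: pairwise preferences -> strongest paths -> strength wins."""
        p = pos.shape[0]
        d = np.array([[np.sum(pos[a] < pos[b]) for b in range(p)]
                      for a in range(p)])
        s = np.where(d > d.T, d, 0)               # keep only winning preferences
        for k in range(p):                        # strongest (widest) paths
            s = np.maximum(s, np.minimum(s[:, k:k + 1], s[k:k + 1, :]))
        strength_wins = (s > s.T).sum(axis=1)     # pairwise strength comparisons won
        return np.argsort(-strength_wins, kind="stable")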
In all cases, each class is sampled from a Gaussian dis-262 tribution with diagonal covariance matrix. For each dataset we can identify,263 by construction, a group of relevant features that can discriminate amongst264 classes and another group of irrelevant features containing Gaussian noise.265 All noisy features have the same mean (0) and SD (1) for all classes. Each266 dataset is composed of 3000 points evenly distributed among classes. The267 number of noisy features is fixed at 500.268 In the first dataset, called Artificial-1, there is a group of 5 features that269 is relevant for each class, i.e., class-specific features. The set of 5 features270 together shift the class center away from other classes. All relevant features271 have the same importance for the problem. The SD of the Gaussian distri-272 butions corresponding to relevant features are always set to 0.5.273 A different situation arises when there are sets of features which are rel-274 evant to some of the classes (more than one) but not for all of them. We275 created a second classification problem, Artificial-2, to evaluate this chal-276 lenge. The dataset has 8 classes and 25 relevant features, all sampled from277 Gaussian distributions with a SD of 0.5. The first 5 features are relevant for278 the first 3 classes of the problem. The following 5 are relevant for classes 4279 12 ACCEPTED MANUSCRIPT A CC EP TE D M A N U SC RI PT Dataset Classes Relevant features Noisy features Artificial-1-3C 3 15 500 Artificial-1-4C 4 20 500 Artificial-1-5C 5 25 500 Artificial-1-8C 8 40 500 Artificial-1-16C 16 80 500 Artificial-2-8C 8 25 500 Artificial-3-3C 3 15 500 Artificial-3-8C 8 40 500 Artificial-3-16C 16 80 500 Table 3: Details of the artificial datasets used in this work. and 5 only, and are less relevant than the first 5. The rest of the features are280 relevant for a single class, 5 for each of the remaining 3 classes. These last281 features are less relevant than the first 10 features.282 Finally, we created a third problem, called Artificial-3, where all relevant283 features are equally useful for all classes at the same time. As in Artificial-1,284 there are 5 features for each class, all sampled from Gaussian distributions285 with a SD of 0.5.286 In all problems there is an overlap among classes, giving a nonzero Bayes287 error. We created five datasets for Artificial-1, with an increasing number288 of classes, and 3 datasets for Artificial-3 in the same way. Table 3 collects289 technical details of the datasets.290 291 3.3. Methodological setup292 3.3.1. K-First293 The K-First method is the only approach involving a parameter that294 needs to be set, k. The value of this parameter regulates the number of295 variables that receive a relative ranking. A very low value would make the296 method similar to Best Ranking, while a high one would turn the method297 into 3Q-SD (furthermore, k = p would convert the method into Average-SD).298 We evaluated several values of k (increasing fractions of p) over all ar-299 tificial datasets considered. Figure 1 shows the corresponding error curves300 as a function of the number of features selected by the method, for some301 representative problems. Error curves for all other artificial problems are302 13 ACCEPTED MANUSCRIPT A CC EP TE D M A N U SC RI PT similar to the reported ones (we include more figures in the Additional Ma-303 terial section). The vertical dotted lines show the correct number of relevant304 features for the problem, i.e. 
3.3. Methodological setup

3.3.1. K-First

The K-First method is the only approach involving a parameter that needs to be set, k. The value of this parameter regulates the number of variables that receive a relative ranking. A very low value would make the method similar to Best Ranking, while a high one would turn the method into 3Q-SD (and k = p would convert it into Average-SD).

We evaluated several values of k (increasing fractions of p) over all artificial datasets considered. Figure 1 shows the corresponding error curves as a function of the number of features selected by the method, for some representative problems. Error curves for all other artificial problems are similar to the reported ones (we include more figures in the Additional Material section). The vertical dotted lines show the correct number of relevant features for the problem, i.e., where the minimum of the curve should ideally be located. When possible, we show with a gray horizontal line the chance error level for the corresponding problem. The top row shows typical results for the Artificial-3 problem. In this case there are no class-specific features and, as a consequence, the results are almost independent of k. The bottom row shows results for the class-dependent problems, Artificial-1 and Artificial-2. In this case the results clearly depend on k. We found that a value of 10% of p gives consistently good results in all artificial cases considered here, and we therefore use this value for the rest of the paper.

[Figure 1 (plots omitted). Caption: Evaluation of different values of k for the K-First method (k = 2.5%, 5%, 10%, 20% and 40% of p). Each line shows average error rates as a function of the number of features selected by the corresponding method, with 1 SD error bars. (a) Artificial-3-3C (b) Artificial-3-8C (c) Artificial-1-16C (d) Artificial-2-8C (chance error 0.875).]

3.3.2. Average-SD and 3Q-SD

As noted before, these methods use the SD of the relative rankings as a tie-breaking criterion, considering larger values of SD as better than smaller ones. This is based on the assumption that a large SD is associated with high rankings for some of the binary problems, and that such behavior is able to highlight class-dependent features over flat ones. In order to confirm this, we compared for both methods, over a set of artificial problems, the use of maximum versus minimum SD to break ties. Figure 2 shows the corresponding results for some representative cases; they are similar in all other cases (some of which are shown in the Additional Material section). As this figure shows, using maximum values always leads to equal or better performance than using minimum ones.

[Figure 2 (plots omitted). Caption: Comparison of tie breaking using maximum or minimum SD for Average-SD and 3Q-SD. Details are similar to Figure 1. Chance error of 0.875 for both panels. (a) Artificial-1-8C. (b) Artificial-2-8C.]

3.3.3. OVA-SVM vs. OVO-SVM

We applied both the OVA and OVO strategies, combined with our feature selection methods, to all artificial datasets. We compared all results and found that the OVO strategy yields equal or superior performance in all cases. In Figure 3 we show some representative examples of this comparison, using the Artificial-1-8C and Artificial-2-8C datasets. The left column shows OVA results, while the OVO case is depicted in the right column. We use the same scale for the corresponding panels. We also included the Bayes error for both datasets as dotted horizontal lines, and the true number of relevant features as dotted vertical lines. More datasets are included in the Additional Material section.

It is interesting to note that the two methods most directly aimed at finding class-relevant features (K-First and Best Ranking) are the ones showing
the bigger gains under the OVO strategy. Probably the OVO strategy can filter some of the noisy features more efficiently than OVA, as it considers significantly more lists of features (M = c(c-1)/2 vs. M = c). After this comparison we only use OVO-SVM to evaluate our new methods.

[Figure 3 (plots omitted). Caption: OVA-SVM vs. OVO-SVM on two artificial problems. Details are similar to Figure 1. (a) RFE-OVA-SVM on Artificial-1-8C. (b) RFE-OVA-SVM on Artificial-2-8C. (c) RFE-OVO-SVM on Artificial-1-8C. (d) RFE-OVO-SVM on Artificial-2-8C.]

3.4. Evaluation of the methods on artificial datasets

Figure 4 shows the results for 4 versions of the Artificial-1 problem and 2 versions of the Artificial-3 problem. The remaining dataset from Artificial-1 is shown in panel (c) of Figure 3, and results for Artificial-2 in panel (d) of the same figure. Additional datasets are included in the Additional Material section. Overall, the artificial problems show that the K-First method is the most efficient one in finding subsets of features with low classification error, followed closely by the Best Ranking method. The Schulze method shows good performance on several datasets. The other 3 methods show similar results, though not as good as the first group.

On the Artificial-3 datasets (all relevant features are useful for all classes) the differences among methods are clearly smaller than on the other 2 problems (which have class-dependent features). Differences in performance increase with the number of classes for the Artificial-1 dataset.

Comparing the two methods based on voting theory, the low performance of Condorcet relative to Schulze is striking. Taking a closer look at the method, we noticed that Condorcet produces a large number of ties in the rankings, which are broken using average positions. This produces a bias towards features with good global average values instead of features highly relevant for a few lists.

Another interesting analysis that can be made with artificial datasets is the position occupied by the truly relevant features in the rankings produced by the different methods, as we know in advance which features are noisy and which ones are informative. A perfect method should rank all relevant features first, with all noisy features following. For each artificial problem, we analyzed the distribution of rankings given by each selection method to the set of relevant features and to the set of noisy features, and computed some descriptive statistics of those two distributions (Best, 1st quartile, Mean, 3rd quartile and Worst).
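The analysis itself is straightforward; a sketch, assuming final_order lists feature indices best-first and relevant is the set of planted informative features:

    import numpy as np

    def ranking_stats(final_order, relevant):
        """Best, 1st quartile, Mean, 3rd quartile and Worst of the 1-based
        positions assigned to the relevant and to the noisy features."""
        p = len(final_order)
        position = np.empty(p, dtype=int)
        position[np.asarray(final_order)] = np.arange(1, p + 1)
        rel = np.isin(np.arange(p), list(relevant))
        describe = lambda v: (v.min(), np.percentile(v, 25), v.mean(),
                              np.percentile(v, 75), v.max())
        return {"relevant": describe(position[rel]),
                "noisy": describe(position[~rel])}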
In Table 4 we show these statistics for the Artificial-1-8C dataset, which is representative of the results obtained on the other versions of this problem. All six methods rank relevant features at the first positions and noisy features at the last ones, but there are important differences. Looking at the Mean and 3rd quartile of the distributions, it is clear that K-First, Best Ranking and Schulze, in that order, are the most accurate in ranking most of the features according to their global relevance. These results confirm that the low error rates in the figures discussed before are directly related to a better feature selection by those methods.

    Relevant   3Q-SD   Av-SD   Best Rank.   K-First   Condorcet   Schulze
    ----------------------------------------------------------------------
    Best       1       1       1            1         1           1
    1st Q.     14      14      12           11        17          13
    Mean       70      65      33           22        79          48
    3rd Q.     98      86      41           31        110         61
    Worst      436     476     357          495       474         410

    Noisy      3Q-SD   Av-SD   Best Rank.   K-First   Condorcet   Schulze
    ----------------------------------------------------------------------
    Best       1       1       1            3         1           1
    1st Q.     160     161     165          165       159         163
    Mean       287     287     290          290       286         288
    3rd Q.     415     415     415          415       415         415
    Worst      540     540     540          540       540         540

Table 4: Statistics of the rankings given to relevant and noisy features by the diverse methods considered in this study for the Artificial-1-8C dataset. Values are rounded when needed.

In Table 5 we show the corresponding statistics for the Artificial-2-8C dataset. This is the most interesting problem, as it contains subsets of relevant features with diverse levels of relevance. In the table we separated the relevant features into 3 subsets. As can be observed in the table, K-First is the most effective strategy in separating the subsets of relevant features, and not only relevant from noisy features.
As it can390 be seen in the tables, it can give high rankings to noisy features more easily391 than the K-First method, as it bases the final ranking on a single value for392 each feature.393 4. Evaluation on real–world datasets394 We used 14 real–world datasets to evaluate our new methods. Details and395 origins of the datasets are collected on Table 7. We selected datasets from396 three different domains. The first 4 datasets collect mass spectrometry mea-397 surements of food products. All recorded peaks are present in the datasets.398 Some of the products under analysis can present class-specific features, re-399 flecting particularities of some products, such as origin or manufacturing400 18 ACCEPTED MANUSCRIPT A CC EP TE D M A N U SC RI PT 1 to 5 3Q-SD Av-SD Best Rank. K First Condorcet Schulze Best 1 1 1 1 1 1 1st Q. 2 2 2 2 2 2 Mean 22 5 9 3 15 4 3rd Q. 28 5 17 4 18 5 Worst 481 62 45 7 85 18 6 to 10 3Q-SD Av-SD Best Rank. K First Condorcet Schulze Best 1 1 1 1 1 1 1st Q. 17 7 6 7 13 6 Mean 80 24 13 8 50 9 3rd Q. 158 30 19 9 58 11 Worst 523 241 44 14 453 70 11 to 25 3Q-SD Av-SD Best Rank. K First Condorcet Schulze Best 1 2 1 8 1 4 1st Q. 35 19 12 14 29 15 Mean 161 87 22 18 101 65 3rd Q. 265 116 30 22 185 70 Worst 524 500 57 29 524 501 Noisy 3Q-SD Av-SD Best Rank. K First Condorcet Schulze Best 1 3 3 10 1 8 1st Q. 142 147 151 151 142 148 Mean 269 273 275 275 274 274 3rd Q. 399 400 400 400 399 400 Worst 525 525 525 525 525 525 Table 5: Statistics of the rankings given to relevant and noisy features by the diverse methods considered in this study for the Artificial-2-8C dataset. The relevant features are divided into three subsets, and ordered according to their relevance by construction. Values are rounded when needed. 19 ACCEPTED MANUSCRIPT A CC EP TE D M A N U SC RI PT Relevant 3Q-SD Av-SD Best Rank. K First Condorcet Schulze Best 1 1 1 1 1 1 1st Q. 11 11 11 11 11 11 Mean 27 29 41 26 28 29 3rd Q. 33 37 53 33 33 37 Worst 440 431 368 250 342 252 Noisy 3Q-SD Av-SD Best Rank. K First Condorcet Schulze Best 5 2 2 8 3 7 1st Q. 165 165 164 165 165 165 Mean 290 290 290 290 290 290 3rd Q. 415 415 415 415 415 415 Worst 540 540 540 540 540 540 Table 6: Statistics of the rankings given to relevant and noisy features by the diverse methods considered in this study for the Artificial-3-8C dataset. Values are rounded when needed. method. The following 6 datasets come from the UCI repository. These401 are more traditional datasets, with more samples than features and multiple402 classes, involving typical pattern recognition problems. Finally, we selected403 4 gene expression datasets from human tissues. These datasets were filtered404 by curators to obtain circa 1000 genes with high signal–to–noise ratio in each405 case.406 In order to compare our results against previous methods we implemented407 3 versions of MSVM-RFE, as described by Zhou & Tuck (2007). The first408 method uses the multiclass SVM developed by Crammer & Singer (2001).409 We will denote it ”Zhou C & S”. The second one uses the method of Weston410 & Watkins (1999), in the following denoted by ”Zhou W & W”. Finally, we411 implemented MSVM-RFE with the OVO decomposition of the traditional412 binary SVM. We will refer to this method as ”Zhou OVO”. 
5. Computational burden

We evaluated the computational burden of the 6 new methods as a function of the number of features and samples using the Artificial-1 dataset. In panel (a) of Figure 7 we show how the running time scales with the number of features in the problems, using a log scale for times. We include the 3 versions of the method by Zhou et al. as a comparison. It is clear from this figure that all but the Schulze method scale almost linearly with the number of features, with Condorcet and 3Q-SD being the slowest. Schulze is cubic in the number of features, as it involves a variant of the Floyd-Warshall algorithm to find the strongest paths in a graph. In panel (b) of the same figure we show the dependence on the number of samples in the dataset. All new methods, including Schulze, scale almost linearly with the number of samples. The two variants of Zhou's method using direct multiclass SVMs show power-law scaling with the number of samples.

6. Conclusions

In this work we discussed in depth the use of combinations of lists of features (instead of averaging individual importances) in SVM-RFE for feature selection on multiclass problems. Using an appropriate mathematical
Using an appropriate mathematical458 framework we introduced 6 different methods to produce the final ranking of459 features starting from a set of ranked features list produced by each binary460 problem. We evaluated them in a series of artificial and real world datasets.461 Our first conclusion is that the OVO strategy should be preferred over462 OVA for multiclass feature selection. Probably the higher number of binary463 problems in OVO helps in filtering out some noisy features that receive high464 rankings from just one or only a few binary problems, a similar beneficial465 effect to the use of ensembles in general.466 Our second conclusion is that, overall, the K-First method is the most467 consistent one in selecting subsets of relevant features that lead to smaller468 classification errors. The idea is well–known in the document retrieval liter-469 ature, only considers the top k values of each list, and adapts efficiently to470 feature selection. We showed with several artificial and real–world datasets471 that this new method is superior to the typical weights averaging that is472 implemented by default in all current Machine Learning libraries.473 Finally, two other methods also showed good results but present some474 drawbacks. The Best Ranking strategy is simple and efficient, but can lose475 performance on some problems, such as Artificial-2. Also, the use of a single476 value to characterize the behavior of a feature can give high rankings to noisy477 features by chance. The Schulze strategy, based on voting theory, shows a478 very good performance on some artificial datasets but does not compare well479 on real–world ones, and is by far the most complex and time–consuming480 strategy out of the six methods under evaluation.481 Overall, the new methods were designed for problems with class-specific482 features, which is where they show their best performance. As they employ483 the OVO strategy, they are also resistant to noisy features. Filtered domains,484 with lots of low-relevance features and little noise like our gene expression485 datasets, seem to represent a more challenging domain for our new methods.486 Work in progress includes a more extensive evaluation and the use of a487 penalty term to help discard correlated features.488 22 ACCEPTED MANUSCRIPT A CC EP TE D M A N U SC RI PT References489 Allwein, E. L., Schapire, R. E., & Singer, Y. (2000). Reducing multiclass490 to binary: A unifying approach for margin classifiers. Journal of machine491 learning research, 1 , 113–141.492 Breiman, L. (2001). Random forests. Machine learning , 45 , 5–32.493 Cappellin, L., Soukoulis, C., Aprea, E., Granitto, P., Dallabetta, N., Costa,494 F., Viola, R., Märk, T. D., Gasperi, F., & Biasioli, F. (2012). Ptr-495 tof-ms and data mining methods: a new tool for fruit metabolomics.496 Metabolomics , 8 , 761–770.497 Crammer, K., & Singer, Y. (2001). On the algorithmic implementation of498 multiclass kernel-based vector machines. Journal of machine learning re-499 search, 2 , 265–292.500 Del Pulgar, J. S., Soukoulis, C., Carrapiso, A., Cappellin, L., Granitto, P.,501 Aprea, E., Romano, A., Gasperi, F., & Biasioli, F. (2013). Effect of the502 pig rearing system on the final volatile profile of iberian dry-cured ham as503 detected by ptr-tof-ms. Meat science, 93 , 420–428.504 Dittman, D. J., Khoshgoftaar, T. M., Wald, R., & Napolitano, A. (2013).505 Classification performance of rank aggregation techniques for ensemble506 gene selection. 
In FLAIRS Conference.

Duan, K.-B., Rajapakse, J. C., & Nguyen, M. N. (2007). One-versus-one and one-versus-all multiclass SVM-RFE for gene selection in cancer classification. In European Conference on Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics (pp. 47–56). Springer.

Fabris, A., Biasioli, F., Granitto, P. M., Aprea, E., Cappellin, L., Schuhfried, E., Soukoulis, C., Märk, T. D., Gasperi, F., & Endrizzi, I. (2010). PTR-ToF-MS and data-mining methods for rapid characterisation of agro-industrial samples: influence of milk storage conditions on the volatile compounds profile of Trentingrana cheese. Journal of Mass Spectrometry, 45, 1065–1074.

Forman, G. (2003). An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research, 3, 1289–1305.

Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A. et al. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, 531–537.

Granitto, P. M., Furlanello, C., Biasioli, F., & Gasperi, F. (2006). Recursive feature elimination with random forest for PTR-MS analysis of agroindustrial products. Chemometrics and Intelligent Laboratory Systems, 83, 83–90.

Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3, 1157–1182.

Guyon, I., Weston, J., Barnhill, S., & Vapnik, V. (2002). Gene selection for cancer classification using support vector machines. Machine Learning, 46, 389–422.

Haury, A.-C., Gestraud, P., & Vert, J.-P. (2011). The influence of feature selection methods on accuracy, stability and interpretability of molecular signatures. PLoS ONE, 6, e28210.

Ho, T. K., Hull, J. J., & Srihari, S. N. (1994). Decision combination in multiple classifier systems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16, 66–75.

Hsu, C.-W., & Lin, C.-J. (2002). A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks, 13, 415–425.

Hua, J., Tembe, W. D., & Dougherty, E. R. (2009). Performance of feature-selection methods in the classification of high-dimension data. Pattern Recognition, 42, 409–424.

Jurman, G., Merler, S., Barla, A., Paoli, S., Galea, A., & Furlanello, C. (2008). Algebraic stability indicators for ranked lists in molecular profiling. Bioinformatics, 24, 258–264.

Kanth, N. R., & Saraswathi, S. (2015). Efficient speech emotion recognition using binary support vector machines & multiclass SVM. In Computational Intelligence and Computing Research (ICCIC), 2015 IEEE International Conference on (pp. 1–6). IEEE.

Kira, K., & Rendell, L. A. (1992). The feature selection problem: Traditional methods and a new algorithm. In AAAI (pp. 129–134). Volume 2.

Kohavi, R., & John, G. H. (1997). Wrappers for feature subset selection. Artificial Intelligence, 97, 273–324.

Leek, J. T., Scharpf, R. B., Bravo, H. C., Simcha, D., Langmead, B., Johnson, W. E., Geman, D., Baggerly, K., & Irizarry, R. A. (2010).
Tackling the widespread and critical impact of batch effects in high-throughput data. Nature Reviews Genetics, 11, 733–739.

Lichman, M. (2013). UCI machine learning repository.

Liu, H., Dougherty, E. R., Dy, J. G., Torkkola, K., Tuv, E., Peng, H., Ding, C., Long, F., Berens, M., Parsons, L. et al. (2005). Evolving feature selection. IEEE Intelligent Systems, 20, 64–76.

Mohsenzadeh, Y., Sheikhzadeh, H., & Nazari, S. (2016). Incremental relevance sample-feature machine: A fast marginal likelihood maximization approach for joint feature selection and classification. Pattern Recognition, 60, 835–848.

Mohsenzadeh, Y., Sheikhzadeh, H., Reza, A. M., Bathaee, N., & Kalayeh, M. M. (2013). The relevance sample-feature machine: A sparse Bayesian learning approach to joint feature-sample selection. IEEE Transactions on Cybernetics, 43, 2241–2254.

Monti, S., Tamayo, P., Mesirov, J. P., & Golub, T. R. (2003). Consensus clustering: A resampling-based method for class discovery and visualization of gene expression microarray data. Machine Learning, 52, 91–118.

Neumayer, R., Mayer, R., & Nørvåg, K. (2011). Combination of feature selection methods for text categorisation. In European Conference on Information Retrieval (pp. 763–766). Springer.

Nguyen, M. H., & De la Torre, F. (2010). Optimal feature selection for support vector machines. Pattern Recognition, 43, 584–591.

Nuray, R., & Can, F. (2006). Automatic ranking of information retrieval systems using data fusion. Information Processing & Management, 42, 595–614.

Raies, A. B., & Bajic, V. B. (2016). In silico toxicology: computational methods for the prediction of chemical toxicity. Wiley Interdisciplinary Reviews: Computational Molecular Science, 6, 147–172.

Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C.-H., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J. P. et al. (2001). Multiclass cancer diagnosis using tumor gene expression signatures. Proceedings of the National Academy of Sciences, 98, 15149–15154.

Saari, D. (2001). Chaotic Elections! A Mathematician Looks at Voting. American Mathematical Society.

Schulze, M. (2011). A new monotonic, clone-independent, reversal symmetric, and Condorcet-consistent single-winner election method. Social Choice and Welfare, 36, 267–303.

Statnikov, A., Aliferis, C. F., Tsamardinos, I., Hardin, D., & Levy, S. (2005). A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics, 21, 631–643.

Uysal, A. K. (2016). An improved global feature selection scheme for text classification. Expert Systems with Applications, 43, 82–92.

Vapnik, V. (2013). The Nature of Statistical Learning Theory. Springer Science & Business Media.

Weston, J., Mukherjee, S., Chapelle, O., Pontil, M., Poggio, T., & Vapnik, V. (2000). Feature selection for SVMs. In Proceedings of the 13th International Conference on Neural Information Processing Systems (pp. 647–653). MIT Press.

Weston, J., & Watkins, C. (1999). Support vector machines for multi-class pattern recognition. In ESANN (pp. 219–224). Volume 99.

You, W., Yang, Z., & Ji, G. (2014). Feature selection for high-dimensional multi-category data using PLS-based local recursive feature elimination.
Expert Systems with Applications, 41, 1463–1475.

Young, H. P. (1988). Condorcet's theory of voting. The American Political Science Review, (pp. 1231–1244).

Zhou, N., & Wang, L. (2016). Processing bio-medical data with class-dependent feature selection. In Advances in Neural Networks (pp. 303–310). Springer.

Zhou, W., & Dickerson, J. A. (2014). A novel class dependent feature selection method for cancer biomarker discovery. Computers in Biology and Medicine, 47, 66–75.

Zhou, X., & Tuck, D. P. (2007). MSVM-RFE: extensions of SVM-RFE for multiclass gene selection on DNA microarray data. Bioinformatics, 23, 1106–1114.

[Figure 4 (plots omitted). Caption: Error curves for the six methods on some artificial datasets. Details are similar to Figure 1. (a) Artificial-1-3C (chance error 0.666) (b) Artificial-1-4C (chance error 0.75) (c) Artificial-1-5C (d) Artificial-1-16C (e) Artificial-3-3C (f) Artificial-3-8C.]

    Dataset      F      S      C    Description
    -------------------------------------------------------------------------
    Apple        714    150    15   Mass spectrometry measurements over 15
                                    varieties of apple clones (Cappellin et
                                    al., 2012)
    Cheese       117    72     8    Mass spectrometry measurements over 8
                                    varieties of Italian cheese (Fabris et
                                    al., 2010)
    Ham          1338   382    11   Mass spectrometry measurements over 11
                                    varieties of Iberian hams (Del Pulgar
                                    et al., 2013)
    Strawberry   232    233    9    Mass spectrometry measurements over 9
                                    varieties of strawberries (Granitto et
                                    al., 2006)
    Multi-F      649    2000   10   Features of handwritten numerals
                                    extracted from a collection of Dutch
                                    utility maps (Lichman, 2013)
    Libras       90     360    15   Diverse hand movements from the
                                    Brazilian sign language (Lichman, 2013)
    Robot1       90     88     4    Robot Execution Failures Data Set, from
                                    UCI. Failures in approach to grasp
                                    position (Lichman, 2013)
    Robot3       90     47     4    Same as Robot1. Position of part after
                                    a transfer failure
    Robot4       90     117    3    Same as Robot1. Failures in approach to
                                    ungrasp position
    Robot5       90     164    5    Same as Robot1.
                                    Failures in motion with part
    Leukemia     985    248    6    Gene expression of bone marrow samples
                                    with 6 subtypes of leukemia (Monti et
                                    al., 2003)
    Lung         1000   197    4    Gene expression of lung tissues with 4
                                    cancer types (Monti et al., 2003)
    CNS          989    42     5    Gene expression of 5 tumor types of the
                                    central nervous system (Monti et al.,
                                    2003)
    Novartis     1000   103    4    Gene expression of tissue samples from
                                    4 distinct cancer types (Monti et al.,
                                    2003)

Table 7: Details of the 14 real-world datasets used in this work. Columns show the number of features (F), samples (S) and classes (C).

[Figure 5 (plots omitted). Caption: Results on some real-world datasets. Details are similar to Figure 1. (a) Apples. (b) Strawberry. (c) Ham. (d) Libras.]

    Group         3Q-SD  Av-SD  Best  K-First  Cond.  Schulze  Z. C&S  Z. W&W  Z. OVO
    -----------------------------------------------------------------------------------
    UCI-10 Feat   7      5      5     8        3      2        0       2       5
    MS-10 Feat    0      3      7     8        2      1        5       4       6
    GE-10 Feat    1      3      6     4        2      0        7       5       8
    UCI-20 Feat   5      2      6     8        2      0        3       7       3
    MS-20 Feat    0      2      7     8        3      1        4       5       6
    GE-20 Feat    1      6      5     4        2      0        8       3       7

Table 8: Rankings of methods (higher is better) counting the number of times that one method outperforms another on each domain and number of selected features.

[Figure 6 (plots omitted). Caption: Results on some real-world datasets. Details are similar to Figure 1. (a) Robot1. (b) Robot3. (c) Leukemia. (d) CNS.]
[Figure 7 (plots omitted). Caption: Comparison of running times for all methods evaluated in this work as a function of (a) the number of features and (b) the number of samples. Times (in seconds) are on a log scale.]