Improved multiclass feature selection via list combination

Javier Izetta, Pablo F. Verdes, Pablo M. Granitto*

CIFASIS, French Argentine International Center for Information and Systems Sciences, UNR-CONICET, Bv. 27 de Febrero 210 Bis, 2000 Rosario, Argentina

Expert Systems With Applications (2017), doi: 10.1016/j.eswa.2017.06.043

Highlights

• We introduce new SVM-RFE feature selection methods for multiclass problems.
• We use binary decomposition followed by strategies to combine lists of features.
• We discuss statistical approaches and voting theory methods.
• One-vs-One methods give better results than One-vs-All methods.
• The new K-First method is the most effective at selecting relevant features.

Abstract

Feature selection is a crucial machine learning technique aimed at reducing the dimensionality of the input space. By discarding useless or redundant variables, it not only improves model performance but also facilitates interpretability. The well-known Support Vector Machines-Recursive Feature Elimination (SVM-RFE) algorithm provides good performance with moderate computational effort, in particular for wide datasets. When using SVM-RFE on a multiclass classification problem, the usual strategy is to decompose it into a series of binary ones, and to generate an importance statistic for each feature on each binary problem. These importances are then averaged over the set of binary problems to synthesize a single value for feature ranking. In some cases, however, this procedure can lead to poor selections. In this paper we discuss six new strategies, based on list combination, designed to yield improved selections starting from the importances given by the binary problems. We evaluate them on artificial and real-world datasets, using both One-vs-One (OVO) and One-vs-All (OVA) strategies. Our results suggest that the OVO decomposition is most effective for feature selection on multiclass problems. We also find that in most situations the new K-First strategy can find better subsets of features than the traditional weight average approach.

*Corresponding author. Email addresses: izetta@cifasis-conicet.gov.ar (Javier Izetta), verdes@cifasis-conicet.gov.ar (Pablo F. Verdes), granitto@cifasis-conicet.gov.ar (Pablo M. Granitto)
Keywords: Feature selection, Multiclass problems, Support Vector Machines

1. Introduction

Many important problems in Machine Learning, as well as in in-silico Chemistry (Raies & Bajic, 2016), Biology and "high-throughput" technologies (Golub et al., 1999; Leek et al., 2010) or text processing (Forman, 2003; Uysal, 2016), share the property of involving many more features than measured samples (Guyon & Elisseeff, 2003). The datasets associated with these problems are, unsurprisingly, called "wide". Usually, most of these variables carry relatively little importance for the problem at hand. Furthermore, in some cases they interfere with the learning process instead of helping it, a scenario usually referred to as the "curse of dimensionality".

Feature selection is an important Machine Learning pre-processing technique aimed at coping with this curse (Kohavi & John, 1997). Its main goal is to find a small subset of the measured variables that improves, or at least does not degrade, the performance of the modeling method applied to the dataset. But feature selection methods do not only avoid the curse of dimensionality: they also allow for a considerable reduction in model complexity, easier visualization and, in particular, a better interpretation of the data under analysis and of the developed models (Liu et al., 2005).

Several methods have been introduced in recent years, from general ones like wrappers (Kohavi & John, 1997) and filters (Kira & Rendell, 1992) to very specific ones developed for SVM (Weston et al., 2000; Nguyen & De la Torre, 2010) and RVM (Mohsenzadeh et al., 2013, 2016) classifiers. Amongst other methods in the field (Hua et al., 2009), the well-known Recursive Feature Elimination (RFE) algorithm provides good performance with moderate computational effort (Guyon et al., 2002) on wide datasets. The original and most popular version of this method uses a linear Support Vector Machine (SVM) (Vapnik, 2013) to select the candidate features to be eliminated. According to the SVM-RFE algorithm, the importance of an input variable i is directly correlated with the corresponding component $w_i$ of the vector $w$ defining the separating hyperplane. The method is widely used in Bioinformatics (Guyon et al., 2002; Statnikov et al., 2005). Alternative RFE methods using other classifiers have also been introduced in the literature (Granitto et al., 2006; You et al., 2014).

Typical feature selection algorithms, like the original version of RFE, are designed for binary classification problems. Multiclass problems have received much less attention, both because of their increased difficulty and because some classifiers involved in the selection process are designed to solve binary problems. Most methods available for feature selection on multiclass problems are simple extensions of base methods.
For example, RFE can be associated with a multiclass classifier like Random Forest (Breiman, 2001; Granitto et al., 2006).

Although SVM was originally developed to deal only with binary problems, it was extended to solve multiclass problems directly in different manners (Weston & Watkins, 1999; Crammer & Singer, 2001; Hsu & Lin, 2002), but with only modest success, attributed mainly to the increased complexity of the solutions. On the other hand, in recent years several methods were developed to solve a multiclass problem using an appropriate combination of binary classifiers (Allwein et al., 2000; Hsu & Lin, 2002). The most common strategy for multiclass SVM is known as "One-vs-One" (OVO). According to this approach, a classification problem with c classes is replaced with M = c(c-1)/2 reduced binary ones, each consisting of discriminating between a pair of classes. In order to classify a new example, it is passed through all binary classifiers and the class receiving the most votes is selected. Another useful strategy is "One-vs-All" (OVA). In this second case, a problem with c classes is replaced with M = c reduced binary problems, each consisting of discriminating a single class from all remaining ones.

Therefore, the most usual approach to implementing a multiclass SVM-RFE method is to directly apply the RFE algorithm over an OVO or OVA multiclass SVM (Ramaswamy et al., 2001; Duan et al., 2007; Zhou & Tuck, 2007). The pioneering work of Ramaswamy et al. (2001) proposed the OVA solution, but also compared results with the OVO strategy. Duan et al. (2007) and Zhou & Tuck (2007) developed slight variations of the method, always considering both OVA and OVO implementations. Zhou & Tuck (2007) also considered solutions to the RFE problem using a direct multiclass implementation.

Interestingly, the solutions to the multiclass SVM-RFE problem that we have just described involve an important decision about the feature selection process which is usually neglected: they rank features by simply averaging components over the binary problems. For an input variable i they use $\langle |w_{ij}| \rangle_j$, the mean importance over all binary problems j, as the corresponding importance. As we discuss in the next section, this strategy can lead to sub-optimal selections in many cases. Once the original multiclass problem has been divided into multiple binary ones, the feature selection problem can be treated in a similar way. Then, a possible solution is to cast the multiclass feature selection problem as the problem of selecting candidate features from multiple lists (Jurman et al., 2008), each list corresponding to a different binary sub-problem.

Similar solutions have been studied in related fields. In Bioinformatics, for example, Haury et al. (2011) discussed the combination of multiple lists of genes obtained from bootstrap samples of the same gene-expression dataset. Zhou & Dickerson (2014) and Zhou & Wang (2016) proposed the use of class-dependent features (different features for each binary problem) for biomarker discovery. Dittman et al. (2013) showed that combining multiple lists in binary classification problems can improve feature selection results. In a short work on text categorization, Neumayer et al.
(2011) suggested that the combination of rankings generated by diverse methods can improve on the results of any single method. Kanth & Saraswathi (2015) used class-dependent features for speech emotion recognition, but kept independent features for each class rather than producing a final unique list.

In this work we discuss in depth the combination of multiple lists in feature selection for multiclass classification problems. We first introduce a simple mathematical framework for multiple lists. Using this framework, we propose diverse strategies to produce improved selections of feature subsets with SVM-RFE. Also, we use specifically designed artificial datasets and real-world examples to evaluate them extensively, using both the OVO and OVA strategies.

The rest of this article is organized as follows: in Section 2, we describe the feature selection methods introduced in this work. In Section 3 we evaluate these methods on artificial datasets, and in Section 4 on real-world ones. Section 5 analyzes their computational burden. Finally, we draw our conclusions in Section 6.

2. List combination methods for SVM-RFE

The RFE selection method is a recursive process that ranks variables according to a given importance measure. At each iteration of the algorithm, the importance of each feature is calculated and the least relevant one is removed (in practice, not one but a group of low-relevance features is usually removed at each iteration to speed up the process). Recursion is needed because the relative importance of each feature can change substantially when evaluated over a different subset of features during the stepwise elimination process, in particular for highly correlated features. The inverse order in which features are eliminated is used to create a final ranking. Then, the feature selection process itself reduces to taking the first n features from this ranking.

In the original binary version of SVM-RFE (Guyon et al., 2002), the projection of $w$ (the normal vector to the SVM decision hyperplane) in the direction of feature i, $w_i$, is used as the importance measure. The method was efficiently extended to multiclass problems, employing the well-known OVO or OVA strategies to decompose the multiclass problem into a series of related binary ones (Ramaswamy et al., 2001; Duan et al., 2007; Zhou & Tuck, 2007). In both cases a set of M related binary problems is generated, each one solved by a vector $w_j$. For each binary problem j, the importance of feature i is given by the corresponding component $w_{ij}$.

In order to obtain a unique importance for each feature in this setup, the simplest solution is to average the absolute value of the components $|w_{ij}|$ over all related binary problems. We will call this method "Average" in the following. The Average solution is implemented, to the best of our knowledge, in all available RFE software packages, including the most popular amongst researchers (MATLAB, R and Python platforms).
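To make the baseline concrete, the following is a minimal sketch of one elimination step of the Average strategy, assuming scikit-learn's linear SVC (whose coef_ matrix holds one weight vector per OVO binary problem); the function name and the 10% elimination fraction are illustrative choices of ours, not the packages' actual internals.

    import numpy as np
    from sklearn.svm import SVC

    def average_rfe_step(X, y, keep_fraction=0.9):
        """One RFE elimination step under the Average strategy: rank
        features by the mean |w_ij| over all OVO binary problems and
        keep the top fraction. A sketch, not a package's exact code."""
        clf = SVC(kernel="linear", C=1.0).fit(X, y)
        # For c classes, coef_ has shape (c*(c-1)/2, n_features):
        # one weight vector w_j per One-vs-One binary problem j.
        importance = np.abs(clf.coef_).mean(axis=0)   # <|w_ij|>_j for each i
        n_keep = max(1, int(keep_fraction * X.shape[1]))
        keep = np.argsort(importance)[::-1][:n_keep]  # surviving feature indices
        return keep, importance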
However, the only real advantage of the Average strategy is its simplicity. Two main drawbacks of this approach should be taken into consideration but are usually ignored:

1. The first issue can be called the flattening problem. Consider, for example, a feature e which is able to separate class j from all remaining classes, but is uninformative in other cases. Component $w_{ej}$ will be large, but the components $w_{ek}$ with $k \neq j$ will be small, giving a low value for $\langle w_{ej} \rangle_j$. Consider now another feature d which gives a modest help in separating any class from the others, obtaining always moderate values of $w_{dj}$, and therefore a medium value for $\langle w_{dj} \rangle_j$. The Average strategy will clearly rank the latter over the former, but in most scenarios it will be desirable to keep the first variable over the second.

2. The second issue with the Average solution refers to relative scales. The length of vector $w_j$ is different for each binary problem, as it depends on the margin of the solution, which can change considerably for classes that are relatively close or far away in feature space. Averaging components of vectors of different lengths can lead to the selection of sub-optimal subsets.

New strategies for feature selection able to overcome these drawbacks are needed. Here we propose to cast the problem as a selection of candidate features from multiple ranking lists (Jurman et al., 2008). We start by decomposing the multiclass problem into a set of M related binary problems (through the OVA or OVO strategies). The problem involves a set of p features, $F = \{f_1, f_2, \ldots, f_p\}$. SVM-RFE produces a ranking (an ordered list) for each individual problem using the components $w_{ij}$. An example is shown in Table 1.

    Ranking   List 1   List 2   ...   List M
    -----------------------------------------
    1         f2       f3       ...   f1
    2         f1       f7       ...   f3
    3         f5       f2       ...   f6
    ...       ...      ...      ...   ...
    p         f8       f4       ...   f7

Table 1: List of ranked features for each binary problem.

This set of lists can be arranged in a matrix (Table 2) where each row shows the position of each feature in the ranking produced for the binary problem shown in each column.

    Features   List 1      List 2     ...   List M
    ------------------------------------------------
    f1         2           5          ...   1
    f2         1           4          ...   7
    f3         4           1          ...   2
    ...        ...         pos_{i,j}  ...   ...
    fp         pos_{p,1}   ...        ...   pos_{p,M}

Table 2: Matrix showing the position of each feature in the ranking of each binary problem. Rows correspond to features and columns to binary problems.

We can now define a matrix of relative ranking positions as

$$ r_{i,j} = \frac{p - \mathrm{pos}_{i,j}}{p} = 1 - \frac{\mathrm{pos}_{i,j}}{p}, $$

where $r_{i,j}$ is the relative ranking of feature $f_i$ in the list corresponding to binary problem j, $\mathrm{pos}_{i,j}$ is the position of the same feature in the corresponding ranking (Table 2), and p is the total number of features in the problem. Notice that the values of $r_{i,j}$ belong to the unit interval [0, 1] and depend linearly on the ranking position (a value of $1 - 1/p$ must be interpreted as the first position in the ranking).

Important features should be reflected in high values of $r_{i,j}$ for some i, j, meaning that they are relevant to at least some of the binary classification problems. Two main strategies can be used to select those relevant features from this matrix; they are discussed in the following.
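Before turning to the individual methods, the framework itself is easy to state in code. The sketch below builds the position matrix of Table 2 and the relative rankings $r_{i,j}$ from a set of ranked lists; names such as relative_rankings are our own.

    import numpy as np

    def relative_rankings(ranked_lists, p):
        """ranked_lists: M arrays, each a permutation of the p feature
        indices ordered from most to least important (Table 1).
        Returns pos (Table 2, 1-based positions) and r = 1 - pos / p."""
        M = len(ranked_lists)
        pos = np.empty((p, M), dtype=int)
        for j, order in enumerate(ranked_lists):
            # feature order[k] sits at position k+1 in list j
            pos[np.asarray(order), j] = np.arange(1, p + 1)
        return pos, 1.0 - pos / p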
2.1. Methods based on relative ranking statistics

The first strategy consists of measuring an appropriate statistic for each feature over all binary problems, and then using it to elaborate a final ranking of features. We selected the following four methods.

2.1.1. Average-SD

In this method, feature ranking is given by the average value of the relative position over all binary problems:

$$ R_i = \langle r_{i,j} \rangle_j, $$

where $R_i$ is the score of feature $f_i$ in the final ordered list, used to select features in the multiclass problem. Ties are broken by the standard deviation (SD) of the relative position (higher is better). We show in the next section that features with higher SD are preferable over lower-SD ones, because a larger SD means that the feature has some better-than-average rankings.

Average-SD can be considered the base strategy for multiple lists. It can overcome the relative-scales problem of averaging weights, but it is not expected to solve the flattening problem.

2.1.2. Best Ranking

In this second approach we rank every feature according to the best relative ranking it reaches over the set of binary problems:

$$ R_i = \max_j (r_{i,j}). $$

Ties are broken by the mean value of the relative position over all problems. A similar method has been used to select the winning class in multiple classifier systems (Ho et al., 1994). This strategy can be viewed as an extreme case, considering for each feature just one of the multiple rankings it receives and disregarding the rest. On the other hand, it is the most aggressive in dealing with the flattening problem.

2.1.3. 3Q-SD

The third method orders features according to the 3rd quartile of the distribution of relative rankings:

$$ R_i = 3Q(\{r_{i,j}\}_j), $$

where the 3Q function returns the 3rd quartile of its argument. As in Average-SD, ties are broken by the SD. This approach is intermediate between the two previous ones, searching for features that reach a high relative position, but also considering the full distribution of relative rankings.

2.1.4. K-First

This method is adapted from a strategy to select relevant documents in information retrieval (Nuray & Can, 2006). The idea is to consider only features located in the top k positions of each individual list. We re-scale the relative rankings with a linear mapping reaching 0 for the (k+1)-th feature, and then take the average of this new relative importance:

$$ r'_{i,j} = \max\left(1 - \frac{\mathrm{pos}_{i,j}}{k},\ 0\right), \qquad R_i = \langle r'_{i,j} \rangle_j, $$

where $r'_{i,j}$ is the re-scaled relative weight of feature $f_i$ and k is the number of features to be considered from each list (k < p). As in the Best Ranking method, ties are broken by the mean value of the original relative ranking, $\langle r_{i,j} \rangle_j$. We discuss the choice of the parameter k in the next section. This strategy is aimed at searching for features which are highly relevant for some of the problems, but is not limited to searching for the most relevant features, as the Best Ranking method is. It can potentially overcome both drawbacks of Average: relative scaling and flattening.
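Given the matrices pos and r from the previous sketch, the four statistics reduce to a few lines each. The following sketch returns features ordered best-first, using NumPy's lexsort (whose last key is primary) to apply the tie-breaking rules described above.

    import numpy as np

    def order_by(score, tie_key):
        """Features sorted by score (descending), ties by tie_key (descending)."""
        return np.lexsort((-tie_key, -score))

    def average_sd(r):         # R_i = <r_ij>_j, ties: larger SD first
        return order_by(r.mean(axis=1), r.std(axis=1))

    def best_ranking(r):       # R_i = max_j r_ij, ties: mean relative position
        return order_by(r.max(axis=1), r.mean(axis=1))

    def three_q_sd(r):         # R_i = 3rd quartile of r_ij, ties: larger SD
        return order_by(np.percentile(r, 75, axis=1), r.std(axis=1))

    def k_first(pos, k):       # r'_ij = max(1 - pos_ij/k, 0), ties: <r_ij>_j
        p = pos.shape[0]
        r_prime = np.maximum(1.0 - pos / k, 0.0)
        return order_by(r_prime.mean(axis=1), (1.0 - pos / p).mean(axis=1))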
2.2. Methods based on voting theory

The second general strategy is related to voting theory (Saari, 2001; Young, 1988). In this setup we consider each binary problem as a voter producing a ranking over a set of p candidates. Multiple methods were developed over the years to solve the problem of combining elector preferences to find winning candidates; the most useful of them are known collectively as "Condorcet methods". We focused on two popular procedures as selection methods for relevant features over multiple lists.

2.2.1. Condorcet

The most basic Condorcet method is known as Copeland's method, or simply as the Condorcet method (we will use the latter name in this work). It confronts each pair of features on every list (all binary problems), and then counts the number of wins minus the number of defeats for each feature (Young, 1988). A feature wins over another if it is ranked higher in the considered list. The global difference between wins and defeats is used to rank features in the multiclass problem. Ties are broken by average relative rankings.

2.2.2. Schulze

This method, introduced informally by Schulze in 1997 and published later (Schulze, 2011), represents an improvement over previous Condorcet methods. It begins by counting wins and defeats over each pair of features and all lists, storing these numbers in a pairwise preference matrix. Then a graph is constructed, with features as nodes and the values in the matrix as weights. Finally, using a variant of the Floyd-Warshall algorithm, the strongest path over the graph is selected for each pair of features, and the strengths of these paths are used to compare features. The strength of a path is defined as that of its weakest link (i.e., the lowest value in the matrix of preferences). A path between two nodes is valid if there is a sequence of strictly decreasing weights connecting them (Schulze, 2011). Features with more wins upon strength comparison are ranked first. The method is expected to perform better than the basic Condorcet method, but the computational load involved is significant.
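Both procedures can also be written compactly on top of the position matrix (a smaller position means ranked higher). The sketch below is our own illustrative rendering of the two descriptions above, with the Schulze strongest paths computed by a Floyd-Warshall-style update; it is cubic in p, in line with the computational cost discussed in Section 5.

    import numpy as np

    def condorcet(pos):
        """Copeland/Condorcet: score = wins - defeats over all pairs and
        lists; ties broken by average relative ranking."""
        p, M = pos.shape
        wins = np.array([[np.sum(pos[a] < pos[b]) for b in range(p)]
                         for a in range(p)])
        score = (wins - wins.T).sum(axis=1)       # wins minus defeats per feature
        r_mean = (1.0 - pos / p).mean(axis=1)     # tie-break key
        return np.lexsort((-r_mean, -score))

    def schulze(pos):
        """Schulze: pairwise preferences -> strongest paths -> strength wins."""
        p = pos.shape[0]
        d = np.array([[np.sum(pos[a] < pos[b]) for b in range(p)]
                      for a in range(p)])
        s = np.where(d > d.T, d, 0)               # keep only winning preferences
        for k in range(p):                        # strongest (widest) paths
            s = np.maximum(s, np.minimum(s[:, k:k + 1], s[k:k + 1, :]))
        strength_wins = (s > s.T).sum(axis=1)     # pairwise strength comparisons won
        return np.argsort(-strength_wins, kind="stable")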
In all cases, each class is sampled from a Gaussian dis-262 tribution with diagonal covariance matrix. For each dataset we can identify,263 by construction, a group of relevant features that can discriminate amongst264 classes and another group of irrelevant features containing Gaussian noise.265 All noisy features have the same mean (0) and SD (1) for all classes. Each266 dataset is composed of 3000 points evenly distributed among classes. The267 number of noisy features is fixed at 500.268 In the first dataset, called Artificial-1, there is a group of 5 features that269 is relevant for each class, i.e., class-specific features. The set of 5 features270 together shift the class center away from other classes. All relevant features271 have the same importance for the problem. The SD of the Gaussian distri-272 butions corresponding to relevant features are always set to 0.5.273 A different situation arises when there are sets of features which are rel-274 evant to some of the classes (more than one) but not for all of them. We275 created a second classification problem, Artificial-2, to evaluate this chal-276 lenge. The dataset has 8 classes and 25 relevant features, all sampled from277 Gaussian distributions with a SD of 0.5. The first 5 features are relevant for278 the first 3 classes of the problem. The following 5 are relevant for classes 4279 12 ACCEPTED MANUSCRIPT A CC EP TE D M A N U SC RI PT Dataset Classes Relevant features Noisy features Artificial-1-3C 3 15 500 Artificial-1-4C 4 20 500 Artificial-1-5C 5 25 500 Artificial-1-8C 8 40 500 Artificial-1-16C 16 80 500 Artificial-2-8C 8 25 500 Artificial-3-3C 3 15 500 Artificial-3-8C 8 40 500 Artificial-3-16C 16 80 500 Table 3: Details of the artificial datasets used in this work. and 5 only, and are less relevant than the first 5. The rest of the features are280 relevant for a single class, 5 for each of the remaining 3 classes. These last281 features are less relevant than the first 10 features.282 Finally, we created a third problem, called Artificial-3, where all relevant283 features are equally useful for all classes at the same time. As in Artificial-1,284 there are 5 features for each class, all sampled from Gaussian distributions285 with a SD of 0.5.286 In all problems there is an overlap among classes, giving a nonzero Bayes287 error. We created five datasets for Artificial-1, with an increasing number288 of classes, and 3 datasets for Artificial-3 in the same way. Table 3 collects289 technical details of the datasets.290 291 3.3. Methodological setup292 3.3.1. K-First293 The K-First method is the only approach involving a parameter that294 needs to be set, k. The value of this parameter regulates the number of295 variables that receive a relative ranking. A very low value would make the296 method similar to Best Ranking, while a high one would turn the method297 into 3Q-SD (furthermore, k = p would convert the method into Average-SD).298 We evaluated several values of k (increasing fractions of p) over all ar-299 tificial datasets considered. Figure 1 shows the corresponding error curves300 as a function of the number of features selected by the method, for some301 representative problems. Error curves for all other artificial problems are302 13 ACCEPTED MANUSCRIPT A CC EP TE D M A N U SC RI PT similar to the reported ones (we include more figures in the Additional Ma-303 terial section). The vertical dotted lines show the correct number of relevant304 features for the problem, i.e. 
3.3. Methodological setup

3.3.1. K-First

The K-First method is the only approach involving a parameter that needs to be set, k. The value of this parameter regulates the number of variables that receive a relative ranking. A very low value would make the method similar to Best Ranking, while a high one would turn the method into 3Q-SD (and k = p would convert it into Average-SD).

We evaluated several values of k (increasing fractions of p) over all artificial datasets considered. Figure 1 shows the corresponding error curves as a function of the number of features selected by the method, for some representative problems. Error curves for all other artificial problems are similar to the reported ones (we include more figures in the Additional Material section). The vertical dotted lines show the correct number of relevant features for the problem, i.e., where the minimum of the curve should ideally be located. When possible, we show with a gray horizontal line the chance error level for the corresponding problem. The top row shows typical results for the Artificial-3 problem. In this case there are no class-specific features and, as a consequence, the results are almost independent of k. The bottom row shows results for the class-dependent problems, Artificial-1 and Artificial-2. In this case the results clearly depend on k. We found that a value of 10% of p gives consistently good results in all artificial cases considered here, and we therefore use this value for the rest of the paper.

[Figure 1 (plots omitted). Caption: Evaluation of different values of k for the K-First method (k = 2.5%, 5%, 10%, 20% and 40% of p). Each line shows average error rates as a function of the number of features selected by the corresponding method, with 1 SD error bars. (a) Artificial-3-3C (b) Artificial-3-8C (c) Artificial-1-16C (d) Artificial-2-8C (chance error 0.875).]

3.3.2. Average-SD and 3Q-SD

As noted before, these methods use the SD of the relative rankings as a tie-breaking criterion, considering larger values of SD as better than smaller ones. This is based on the assumption that a large SD is associated with high rankings for some of the binary problems, and that such behavior is able to highlight class-dependent features over flat ones. In order to confirm this, we compared for both methods, over a set of artificial problems, the use of maximum versus minimum SD to break ties. Figure 2 shows the corresponding results for some representative cases; they are similar in all other cases (some of which are shown in the Additional Material section). As this figure shows, using maximum values always leads to equal or better performance than using minimum ones.

[Figure 2 (plots omitted). Caption: Comparison of tie breaking using maximum or minimum SD for Average-SD and 3Q-SD. Details are similar to Figure 1. Chance error of 0.875 for both panels. (a) Artificial-1-8C. (b) Artificial-2-8C.]

3.3.3. OVA-SVM vs. OVO-SVM

We applied both the OVA and OVO strategies, combined with our feature selection methods, to all artificial datasets. We compared all results and found that the OVO strategy yields equal or superior performance in all cases. In Figure 3 we show some representative examples of this comparison, using the Artificial-1-8C and Artificial-2-8C datasets. The left column shows OVA results, while the OVO case is depicted in the right column. We use the same scale for the corresponding panels. We also included the Bayes error for both datasets as dotted horizontal lines, and the true number of relevant features as dotted vertical lines. More datasets are included in the Additional Material section.

It is interesting to note that the two methods most directly aimed at finding class-relevant features (K-First and Best Ranking) are the ones showing
the bigger gains under the OVO strategy. Probably the OVO strategy can filter some of the noisy features more efficiently than OVA, as it considers significantly more lists of features (M = c(c-1)/2 vs. M = c). After this comparison we only use OVO-SVM to evaluate our new methods.

[Figure 3 (plots omitted). Caption: OVA-SVM vs. OVO-SVM on two artificial problems. Details are similar to Figure 1. (a) RFE-OVA-SVM on Artificial-1-8C. (b) RFE-OVA-SVM on Artificial-2-8C. (c) RFE-OVO-SVM on Artificial-1-8C. (d) RFE-OVO-SVM on Artificial-2-8C.]

3.4. Evaluation of the methods on artificial datasets

Figure 4 shows the results for 4 versions of the Artificial-1 problem and 2 versions of the Artificial-3 problem. The remaining dataset from Artificial-1 is shown in panel (c) of Figure 3, and results for Artificial-2 in panel (d) of the same figure. Additional datasets are included in the Additional Material section. Overall, the artificial problems show that the K-First method is the most efficient one in finding subsets of features with low classification error, followed closely by the Best Ranking method. The Schulze method shows good performance on several datasets. The other 3 methods show similar results, though not as good as the first group.

On the Artificial-3 datasets (all relevant features are useful for all classes) the differences among methods are clearly smaller than on the other 2 problems (which have class-dependent features). Differences in performance increase with the number of classes for the Artificial-1 dataset.

Comparing the two methods based on voting theory, the low performance of Condorcet relative to Schulze is striking. Taking a closer look at the method, we noticed that Condorcet produces a large number of ties in the rankings, which are broken using average positions. This produces a bias towards features with good global average values instead of features highly relevant for a few lists.

Another interesting analysis that can be made with artificial datasets is the position occupied by the truly relevant features in the rankings produced by the different methods, as we know in advance which features are noisy and which ones are informative. A perfect method should rank all relevant features first, with all noisy features following. For each artificial problem, we analyzed the distribution of rankings given by each selection method to the set of relevant features and to the set of noisy features, and computed some descriptive statistics of those two distributions (Best, 1st quartile, Mean, 3rd quartile and Worst).
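The analysis itself is straightforward; a sketch, assuming final_order lists feature indices best-first and relevant is the set of planted informative features:

    import numpy as np

    def ranking_stats(final_order, relevant):
        """Best, 1st quartile, Mean, 3rd quartile and Worst of the 1-based
        positions assigned to the relevant and to the noisy features."""
        p = len(final_order)
        position = np.empty(p, dtype=int)
        position[np.asarray(final_order)] = np.arange(1, p + 1)
        rel = np.isin(np.arange(p), list(relevant))
        describe = lambda v: (v.min(), np.percentile(v, 25), v.mean(),
                              np.percentile(v, 75), v.max())
        return {"relevant": describe(position[rel]),
                "noisy": describe(position[~rel])}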
In Table 4 we show these statistics for the Artificial-1-8C dataset, which is representative of the results obtained on the other versions of this problem. All six methods rank relevant features at the first positions and noisy features at the last ones, but there are important differences. Looking at the Mean and 3rd quartile of the distributions, it is clear that K-First, Best Ranking and Schulze, in that order, are the most accurate in ranking most of the features according to their global relevance. These results confirm that the low error rates in the figures discussed before are directly related to a better feature selection by those methods.

    Relevant   3Q-SD   Av-SD   Best Rank.   K-First   Condorcet   Schulze
    ----------------------------------------------------------------------
    Best       1       1       1            1         1           1
    1st Q.     14      14      12           11        17          13
    Mean       70      65      33           22        79          48
    3rd Q.     98      86      41           31        110         61
    Worst      436     476     357          495       474         410

    Noisy      3Q-SD   Av-SD   Best Rank.   K-First   Condorcet   Schulze
    ----------------------------------------------------------------------
    Best       1       1       1            3         1           1
    1st Q.     160     161     165          165       159         163
    Mean       287     287     290          290       286         288
    3rd Q.     415     415     415          415       415         415
    Worst      540     540     540          540       540         540

Table 4: Statistics of the rankings given to relevant and noisy features by the diverse methods considered in this study for the Artificial-1-8C dataset. Values are rounded when needed.

In Table 5 we show the corresponding statistics for the Artificial-2-8C dataset. This is the most interesting problem, as it contains subsets of relevant features with diverse levels of relevance. In the table we separated the relevant features into 3 subsets. As can be observed in the table, K-First is the most effective strategy in separating the subsets of relevant features, and not only relevant from noisy features.
As it can390 be seen in the tables, it can give high rankings to noisy features more easily391 than the K-First method, as it bases the final ranking on a single value for392 each feature.393 4. Evaluation on real–world datasets394 We used 14 real–world datasets to evaluate our new methods. Details and395 origins of the datasets are collected on Table 7. We selected datasets from396 three different domains. The first 4 datasets collect mass spectrometry mea-397 surements of food products. All recorded peaks are present in the datasets.398 Some of the products under analysis can present class-specific features, re-399 flecting particularities of some products, such as origin or manufacturing400 18 ACCEPTED MANUSCRIPT A CC EP TE D M A N U SC RI PT 1 to 5 3Q-SD Av-SD Best Rank. K First Condorcet Schulze Best 1 1 1 1 1 1 1st Q. 2 2 2 2 2 2 Mean 22 5 9 3 15 4 3rd Q. 28 5 17 4 18 5 Worst 481 62 45 7 85 18 6 to 10 3Q-SD Av-SD Best Rank. K First Condorcet Schulze Best 1 1 1 1 1 1 1st Q. 17 7 6 7 13 6 Mean 80 24 13 8 50 9 3rd Q. 158 30 19 9 58 11 Worst 523 241 44 14 453 70 11 to 25 3Q-SD Av-SD Best Rank. K First Condorcet Schulze Best 1 2 1 8 1 4 1st Q. 35 19 12 14 29 15 Mean 161 87 22 18 101 65 3rd Q. 265 116 30 22 185 70 Worst 524 500 57 29 524 501 Noisy 3Q-SD Av-SD Best Rank. K First Condorcet Schulze Best 1 3 3 10 1 8 1st Q. 142 147 151 151 142 148 Mean 269 273 275 275 274 274 3rd Q. 399 400 400 400 399 400 Worst 525 525 525 525 525 525 Table 5: Statistics of the rankings given to relevant and noisy features by the diverse methods considered in this study for the Artificial-2-8C dataset. The relevant features are divided into three subsets, and ordered according to their relevance by construction. Values are rounded when needed. 19 ACCEPTED MANUSCRIPT A CC EP TE D M A N U SC RI PT Relevant 3Q-SD Av-SD Best Rank. K First Condorcet Schulze Best 1 1 1 1 1 1 1st Q. 11 11 11 11 11 11 Mean 27 29 41 26 28 29 3rd Q. 33 37 53 33 33 37 Worst 440 431 368 250 342 252 Noisy 3Q-SD Av-SD Best Rank. K First Condorcet Schulze Best 5 2 2 8 3 7 1st Q. 165 165 164 165 165 165 Mean 290 290 290 290 290 290 3rd Q. 415 415 415 415 415 415 Worst 540 540 540 540 540 540 Table 6: Statistics of the rankings given to relevant and noisy features by the diverse methods considered in this study for the Artificial-3-8C dataset. Values are rounded when needed. method. The following 6 datasets come from the UCI repository. These401 are more traditional datasets, with more samples than features and multiple402 classes, involving typical pattern recognition problems. Finally, we selected403 4 gene expression datasets from human tissues. These datasets were filtered404 by curators to obtain circa 1000 genes with high signal–to–noise ratio in each405 case.406 In order to compare our results against previous methods we implemented407 3 versions of MSVM-RFE, as described by Zhou & Tuck (2007). The first408 method uses the multiclass SVM developed by Crammer & Singer (2001).409 We will denote it ”Zhou C & S”. The second one uses the method of Weston410 & Watkins (1999), in the following denoted by ”Zhou W & W”. Finally, we411 implemented MSVM-RFE with the OVO decomposition of the traditional412 binary SVM. We will refer to this method as ”Zhou OVO”. 
5. Computational burden

We evaluated the computational burden of the 6 new methods as a function of the number of features and samples using the Artificial-1 dataset. In panel (a) of Figure 7 we show how the running time scales with the number of features in the problems, using a log scale for times. We include the 3 versions of the method by Zhou et al. as a comparison. It is clear from this figure that all but the Schulze method scale almost linearly with the number of features, with Condorcet and 3Q-SD being the slowest. Schulze is cubic in the number of features, as it involves a variant of the Floyd-Warshall algorithm to find the strongest paths in a graph. In panel (b) of the same figure we show the dependence on the number of samples in the dataset. All new methods, including Schulze, scale almost linearly with the number of samples. The two variants of Zhou's method using direct multiclass SVMs show power-law scaling with the number of samples.

6. Conclusions

In this work we discussed in depth the use of combinations of lists of features (instead of averaging individual importances) in SVM-RFE for feature selection on multiclass problems. Using an appropriate mathematical
Using an appropriate mathematical458 framework we introduced 6 different methods to produce the final ranking of459 features starting from a set of ranked features list produced by each binary460 problem. We evaluated them in a series of artificial and real world datasets.461 Our first conclusion is that the OVO strategy should be preferred over462 OVA for multiclass feature selection. Probably the higher number of binary463 problems in OVO helps in filtering out some noisy features that receive high464 rankings from just one or only a few binary problems, a similar beneficial465 effect to the use of ensembles in general.466 Our second conclusion is that, overall, the K-First method is the most467 consistent one in selecting subsets of relevant features that lead to smaller468 classification errors. The idea is well–known in the document retrieval liter-469 ature, only considers the top k values of each list, and adapts efficiently to470 feature selection. We showed with several artificial and real–world datasets471 that this new method is superior to the typical weights averaging that is472 implemented by default in all current Machine Learning libraries.473 Finally, two other methods also showed good results but present some474 drawbacks. The Best Ranking strategy is simple and efficient, but can lose475 performance on some problems, such as Artificial-2. Also, the use of a single476 value to characterize the behavior of a feature can give high rankings to noisy477 features by chance. The Schulze strategy, based on voting theory, shows a478 very good performance on some artificial datasets but does not compare well479 on real–world ones, and is by far the most complex and time–consuming480 strategy out of the six methods under evaluation.481 Overall, the new methods were designed for problems with class-specific482 features, which is where they show their best performance. As they employ483 the OVO strategy, they are also resistant to noisy features. Filtered domains,484 with lots of low-relevance features and little noise like our gene expression485 datasets, seem to represent a more challenging domain for our new methods.486 Work in progress includes a more extensive evaluation and the use of a487 penalty term to help discard correlated features.488 22 ACCEPTED MANUSCRIPT A CC EP TE D M A N U SC RI PT References489 Allwein, E. L., Schapire, R. E., & Singer, Y. (2000). Reducing multiclass490 to binary: A unifying approach for margin classifiers. Journal of machine491 learning research, 1 , 113–141.492 Breiman, L. (2001). Random forests. Machine learning , 45 , 5–32.493 Cappellin, L., Soukoulis, C., Aprea, E., Granitto, P., Dallabetta, N., Costa,494 F., Viola, R., Märk, T. D., Gasperi, F., & Biasioli, F. (2012). Ptr-495 tof-ms and data mining methods: a new tool for fruit metabolomics.496 Metabolomics , 8 , 761–770.497 Crammer, K., & Singer, Y. (2001). On the algorithmic implementation of498 multiclass kernel-based vector machines. Journal of machine learning re-499 search, 2 , 265–292.500 Del Pulgar, J. S., Soukoulis, C., Carrapiso, A., Cappellin, L., Granitto, P.,501 Aprea, E., Romano, A., Gasperi, F., & Biasioli, F. (2013). Effect of the502 pig rearing system on the final volatile profile of iberian dry-cured ham as503 detected by ptr-tof-ms. Meat science, 93 , 420–428.504 Dittman, D. J., Khoshgoftaar, T. M., Wald, R., & Napolitano, A. (2013).505 Classification performance of rank aggregation techniques for ensemble506 gene selection. 
In FLAIRS Conference.

Duan, K.-B., Rajapakse, J. C., & Nguyen, M. N. (2007). One-versus-one and one-versus-all multiclass SVM-RFE for gene selection in cancer classification. In European Conference on Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics (pp. 47–56). Springer.

Fabris, A., Biasioli, F., Granitto, P. M., Aprea, E., Cappellin, L., Schuhfried, E., Soukoulis, C., Märk, T. D., Gasperi, F., & Endrizzi, I. (2010). PTR-ToF-MS and data-mining methods for rapid characterisation of agro-industrial samples: influence of milk storage conditions on the volatile compounds profile of Trentingrana cheese. Journal of Mass Spectrometry, 45, 1065–1074.

Forman, G. (2003). An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research, 3, 1289–1305.

Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A. et al. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, 531–537.

Granitto, P. M., Furlanello, C., Biasioli, F., & Gasperi, F. (2006). Recursive feature elimination with random forest for PTR-MS analysis of agroindustrial products. Chemometrics and Intelligent Laboratory Systems, 83, 83–90.

Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3, 1157–1182.

Guyon, I., Weston, J., Barnhill, S., & Vapnik, V. (2002). Gene selection for cancer classification using support vector machines. Machine Learning, 46, 389–422.

Haury, A.-C., Gestraud, P., & Vert, J.-P. (2011). The influence of feature selection methods on accuracy, stability and interpretability of molecular signatures. PLoS ONE, 6, e28210.

Ho, T. K., Hull, J. J., & Srihari, S. N. (1994). Decision combination in multiple classifier systems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16, 66–75.

Hsu, C.-W., & Lin, C.-J. (2002). A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks, 13, 415–425.

Hua, J., Tembe, W. D., & Dougherty, E. R. (2009). Performance of feature-selection methods in the classification of high-dimension data. Pattern Recognition, 42, 409–424.

Jurman, G., Merler, S., Barla, A., Paoli, S., Galea, A., & Furlanello, C. (2008). Algebraic stability indicators for ranked lists in molecular profiling. Bioinformatics, 24, 258–264.

Kanth, N. R., & Saraswathi, S. (2015). Efficient speech emotion recognition using binary support vector machines & multiclass SVM. In Computational Intelligence and Computing Research (ICCIC), 2015 IEEE International Conference on (pp. 1–6). IEEE.

Kira, K., & Rendell, L. A. (1992). The feature selection problem: Traditional methods and a new algorithm. In AAAI (pp. 129–134). Volume 2.

Kohavi, R., & John, G. H. (1997). Wrappers for feature subset selection. Artificial Intelligence, 97, 273–324.

Leek, J. T., Scharpf, R. B., Bravo, H. C., Simcha, D., Langmead, B., Johnson, W. E., Geman, D., Baggerly, K., & Irizarry, R. A. (2010).
Tackling the widespread and critical impact of batch effects in high-throughput data. Nature Reviews Genetics, 11, 733–739.

Lichman, M. (2013). UCI machine learning repository.

Liu, H., Dougherty, E. R., Dy, J. G., Torkkola, K., Tuv, E., Peng, H., Ding, C., Long, F., Berens, M., Parsons, L. et al. (2005). Evolving feature selection. IEEE Intelligent Systems, 20, 64–76.

Mohsenzadeh, Y., Sheikhzadeh, H., & Nazari, S. (2016). Incremental relevance sample-feature machine: A fast marginal likelihood maximization approach for joint feature selection and classification. Pattern Recognition, 60, 835–848.

Mohsenzadeh, Y., Sheikhzadeh, H., Reza, A. M., Bathaee, N., & Kalayeh, M. M. (2013). The relevance sample-feature machine: A sparse Bayesian learning approach to joint feature-sample selection. IEEE Transactions on Cybernetics, 43, 2241–2254.

Monti, S., Tamayo, P., Mesirov, J. P., & Golub, T. R. (2003). Consensus clustering: A resampling-based method for class discovery and visualization of gene expression microarray data. Machine Learning, 52, 91–118.

Neumayer, R., Mayer, R., & Nørvåg, K. (2011). Combination of feature selection methods for text categorisation. In European Conference on Information Retrieval (pp. 763–766). Springer.

Nguyen, M. H., & De la Torre, F. (2010). Optimal feature selection for support vector machines. Pattern Recognition, 43, 584–591.

Nuray, R., & Can, F. (2006). Automatic ranking of information retrieval systems using data fusion. Information Processing & Management, 42, 595–614.

Raies, A. B., & Bajic, V. B. (2016). In silico toxicology: computational methods for the prediction of chemical toxicity. Wiley Interdisciplinary Reviews: Computational Molecular Science, 6, 147–172.

Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C.-H., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J. P. et al. (2001). Multiclass cancer diagnosis using tumor gene expression signatures. Proceedings of the National Academy of Sciences, 98, 15149–15154.

Saari, D. (2001). Chaotic Elections! A Mathematician Looks at Voting. American Mathematical Society.

Schulze, M. (2011). A new monotonic, clone-independent, reversal symmetric, and Condorcet-consistent single-winner election method. Social Choice and Welfare, 36, 267–303.

Statnikov, A., Aliferis, C. F., Tsamardinos, I., Hardin, D., & Levy, S. (2005). A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics, 21, 631–643.

Uysal, A. K. (2016). An improved global feature selection scheme for text classification. Expert Systems with Applications, 43, 82–92.

Vapnik, V. (2013). The Nature of Statistical Learning Theory. Springer Science & Business Media.

Weston, J., Mukherjee, S., Chapelle, O., Pontil, M., Poggio, T., & Vapnik, V. (2000). Feature selection for SVMs. In Proceedings of the 13th International Conference on Neural Information Processing Systems (pp. 647–653). MIT Press.

Weston, J., & Watkins, C. (1999). Support vector machines for multi-class pattern recognition. In ESANN (pp. 219–224). Volume 99.

You, W., Yang, Z., & Ji, G. (2014). Feature selection for high-dimensional multi-category data using PLS-based local recursive feature elimination.
Expert Systems with Applications, 41, 1463–1475.

Young, H. P. (1988). Condorcet's theory of voting. The American Political Science Review, (pp. 1231–1244).

Zhou, N., & Wang, L. (2016). Processing bio-medical data with class-dependent feature selection. In Advances in Neural Networks (pp. 303–310). Springer.

Zhou, W., & Dickerson, J. A. (2014). A novel class dependent feature selection method for cancer biomarker discovery. Computers in Biology and Medicine, 47, 66–75.

Zhou, X., & Tuck, D. P. (2007). MSVM-RFE: extensions of SVM-RFE for multiclass gene selection on DNA microarray data. Bioinformatics, 23, 1106–1114.

[Figure 4 (plots omitted). Caption: Error curves for the six methods on some artificial datasets. Details are similar to Figure 1. (a) Artificial-1-3C (chance error 0.666) (b) Artificial-1-4C (chance error 0.75) (c) Artificial-1-5C (d) Artificial-1-16C (e) Artificial-3-3C (f) Artificial-3-8C.]

    Dataset      F      S      C    Description
    -------------------------------------------------------------------------
    Apple        714    150    15   Mass spectrometry measurements over 15
                                    varieties of apple clones (Cappellin et
                                    al., 2012)
    Cheese       117    72     8    Mass spectrometry measurements over 8
                                    varieties of Italian cheese (Fabris et
                                    al., 2010)
    Ham          1338   382    11   Mass spectrometry measurements over 11
                                    varieties of Iberian hams (Del Pulgar
                                    et al., 2013)
    Strawberry   232    233    9    Mass spectrometry measurements over 9
                                    varieties of strawberries (Granitto et
                                    al., 2006)
    Multi-F      649    2000   10   Features of handwritten numerals
                                    extracted from a collection of Dutch
                                    utility maps (Lichman, 2013)
    Libras       90     360    15   Diverse hand movements from the
                                    Brazilian sign language (Lichman, 2013)
    Robot1       90     88     4    Robot Execution Failures Data Set, from
                                    UCI. Failures in approach to grasp
                                    position (Lichman, 2013)
    Robot3       90     47     4    Same as Robot1. Position of part after
                                    a transfer failure
    Robot4       90     117    3    Same as Robot1. Failures in approach to
                                    ungrasp position
    Robot5       90     164    5    Same as Robot1.
                                    Failures in motion with part
    Leukemia     985    248    6    Gene expression of bone marrow samples
                                    with 6 subtypes of leukemia (Monti et
                                    al., 2003)
    Lung         1000   197    4    Gene expression of lung tissues with 4
                                    cancer types (Monti et al., 2003)
    CNS          989    42     5    Gene expression of 5 tumor types of the
                                    central nervous system (Monti et al.,
                                    2003)
    Novartis     1000   103    4    Gene expression of tissue samples from
                                    4 distinct cancer types (Monti et al.,
                                    2003)

Table 7: Details of the 14 real-world datasets used in this work. Columns show the number of features (F), samples (S) and classes (C).

[Figure 5 (plots omitted). Caption: Results on some real-world datasets. Details are similar to Figure 1. (a) Apples. (b) Strawberry. (c) Ham. (d) Libras.]

    Group         3Q-SD  Av-SD  Best  K-First  Cond.  Schulze  Z. C&S  Z. W&W  Z. OVO
    -----------------------------------------------------------------------------------
    UCI-10 Feat   7      5      5     8        3      2        0       2       5
    MS-10 Feat    0      3      7     8        2      1        5       4       6
    GE-10 Feat    1      3      6     4        2      0        7       5       8
    UCI-20 Feat   5      2      6     8        2      0        3       7       3
    MS-20 Feat    0      2      7     8        3      1        4       5       6
    GE-20 Feat    1      6      5     4        2      0        8       3       7

Table 8: Rankings of methods (higher is better) counting the number of times that one method outperforms another on each domain and number of selected features.

[Figure 6 (plots omitted). Caption: Results on some real-world datasets. Details are similar to Figure 1. (a) Robot1. (b) Robot3. (c) Leukemia. (d) CNS.]
[Figure 7 (plots omitted). Caption: Comparison of running times for all methods evaluated in this work as a function of (a) the number of features and (b) the number of samples. Times (in seconds) are on a log scale.]