Label Ranking Forests

Cláudio Rebelo de Sá 1,3, Carlos Soares 2,3, Arno Knobbe 1, Paulo Cortez 4

1 LIACS, Universiteit Leiden, Netherlands
2 Faculdade de Engenharia, Universidade do Porto, Portugal
3 INESC TEC, Porto, Portugal
4 ALGORITMI Centre, Department of Information Systems, University of Minho, Portugal

Abstract

The problem of Label Ranking is receiving increasing attention from several research communities. The algorithms that have been developed or adapted to treat rankings of a fixed set of labels as the target object include several different types of decision trees (DT). One DT-based algorithm, which has been very successful in other tasks but has not yet been adapted for label ranking, is the Random Forests (RF) algorithm. RFs are an ensemble learning method that combines different trees obtained using different randomization techniques. In this work, we propose an ensemble of decision trees for Label Ranking, based on Random Forests, which we refer to as Label Ranking Forests (LRF). Two different algorithms that learn decision trees for label ranking are used to obtain the trees. We then compare and discuss the results of LRF with standalone decision tree approaches. The results indicate that the method is highly competitive.

1 Introduction

Label Ranking (LR) is an increasingly popular topic in the machine learning literature (Ribeiro et al., 2012; de Sá et al., 2011; Cheng & Hüllermeier, 2011; Cheng et al., 2012; Vembu & Gärtner, 2010). LR studies the problem of learning a mapping from instances to rankings over a finite number of predefined labels. It can be considered a natural generalization of the conventional classification problem, where the goal is to predict a single label instead of a ranking of all the labels (Cheng et al., 2009). Some applications of Label Ranking approaches are (Hüllermeier et al., 2008): meta-learning (Brazdil & Soares, 1999), where the goal is to predict a ranking of a set of algorithms according to the best expected accuracy on a given dataset; microarray analysis (Hüllermeier et al., 2008), to find patterns in genes from yeast on five different microarray experiments (spo, heat, dtt, cold and diau); and image categorization (Fürnkranz et al., 2008) of landscape pictures into several categories (beach, sunset, field, fall foliage, mountain, urban).

There are two main approaches to the problem of LR: methods that transform the ranking problem into multiple binary problems and methods that were developed or adapted to treat the rankings as target objects, without any transformation. An example of the former is ranking by pairwise comparisons (Hüllermeier et al., 2008). Examples of algorithms that were adapted to deal with rankings as the target objects include decision trees (Todorovski et al., 2002; Cheng et al., 2009), naive Bayes (Aiguzhinov et al., 2010) and k-Nearest Neighbor (Brazdil et al., 2003; Cheng et al., 2009). Some of the latter adaptations are based on statistical distributions of rankings (e.g., Cheng et al., 2010), while others are based on ranking distance measures (e.g., Todorovski et al., 2002; de Sá et al., 2011).

Tree-based models have been used in classification (Quinlan, 1986), regression (Breiman et al., 1984) and also label ranking (Todorovski et al., 2002; Cheng et al., 2009; de Sá et al., 2015) tasks.
These methods are popular for a number of reasons, one of which is that they clearly express information about the problem: their structure is relatively easy to interpret, even for people without a background in learning algorithms. In classification, combining the predictive power of an ensemble of trees often comes with significant accuracy improvements (Breiman, 2001). One of the earliest examples of ensemble methods is bagging, a contraction of bootstrap-aggregating (Breiman, 1996). In bagging, an ensemble of trees is generated and each one is learned on a random selection of examples from the training set. A popular ensemble method is Random Forests (Breiman, 2001), which combines different randomization techniques.

Considering the success of Random Forests in terms of improved accuracy for classification and regression problems (Biau, 2012), some approaches have been proposed to deal with different targets, such as bipartite rankings (Clémençon et al., 2013). Label Ranking Forests should also be seen as a potentially robust approach for LR. Adapting RF to Label Ranking is a straightforward process once decision trees have been adapted.

In this work, we propose an ensemble learning approach which we refer to as Label Ranking Forests (LRF). The proposed method is a natural adaptation of Random Forests for LR, combining the task-independent RF algorithm with the traditional algorithm for top-down induction of decision trees adapted for label ranking. The available adaptations of decision tree algorithms for LR include Label Ranking Trees (LRT) (Cheng et al., 2009), Ranking Trees (Rebelo et al., 2008) and Entropy-based Ranking Trees (de Sá et al., 2015). Considering that the trees of an ensemble, in most cases, predict distinct rankings, one should also take ranking aggregation methods into account.

This paper extends previous work (de Sá et al., 2015), in which we proposed a new version of decision trees for LR, called Entropy-based Ranking Trees, and empirically compared them to existing approaches. The main contribution of this paper is the new Label Ranking Forests algorithm, which is an adaptation of the RF ensemble method, using Entropy-based Ranking Trees as the base-level algorithm. The results indicate that LRF are competitive with state-of-the-art methods and improve the accuracy of standalone decision trees. An additional contribution is an extension of the original experimental study on Entropy-based Ranking Trees, by analyzing model complexity.

2 Label Ranking

In this section, we start by formalizing the problem of label ranking (Section 2.1). We then discuss the adaptation of the decision tree algorithm for label ranking (Section 2.2) and one such adaptation, Entropy-based Ranking Trees (Section 2.3).

2.1 Formalization

The Label Ranking (LR) task is similar to classification. In classification, given an instance x from the instance space X, the goal is to predict the label (or class) λ to which x belongs, from a pre-defined set L = {λ1, ..., λk}. In LR, the goal is to predict the ranking of the labels in L that is associated with x (Hüllermeier et al., 2008). A ranking can be represented as a total order over L defined on the permutation space Ω. A total order can be seen as a permutation π of the set {1, ..., k}, such that π(a) is the position of λa in π. As in classification, we do not assume the existence of a deterministic X → Ω mapping. Instead, every instance is associated with a probability distribution over Ω (Cheng et al., 2009).
This means that, for each x ∈ X, there exists a probability distribution P(·|x) such that, for every π ∈ Ω, P(π|x) is the probability that π is the ranking associated with x. The goal in LR is to learn the mapping X → Ω. The training data is a set of instances D = {⟨x_i, π_i⟩}, i = 1, ..., n, where x_i is a vector containing the values x_i^j, j = 1, ..., m, of m independent variables describing instance i, and π_i is the corresponding target ranking.

Given an instance x_i with label ranking π_i, and the ranking π̂_i predicted by an LR model, we can evaluate the accuracy of the prediction with loss functions on Ω. Some of these measures are based on the number of discordant label pairs:

    D(π, π̂) = #{(a, b) | π(a) > π(b) ∧ π̂(a) < π̂(b)}

When normalized to the interval [−1, 1], this function is equivalent to Kendall's τ coefficient, a correlation measure for which τ(π, π) = 1 and τ(π, π⁻¹) = −1, where π⁻¹ denotes the inverse order of π (e.g., π = (1, 2, 3, 4) and π⁻¹ = (4, 3, 2, 1)). The accuracy of a model can be estimated by averaging this coefficient over a set of examples. Other correlation measures, such as Spearman's rank correlation coefficient (Spearman, 1904), have also been used (Brazdil et al., 2003). Although we assume total orders, it may be the case that two labels are tied in the same rank (i.e., π_i(a) = π_i(b), a ≠ b). In this case, a variation of Kendall's τ, the τ_b coefficient (Agresti, 2010), can be used.
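To make the evaluation measure concrete, the following minimal sketch in R (the language used for our experiments in Section 4; the variable names are ours) computes Kendall's τ between rankings represented as vectors of ranks, using the cor function from the stats package:

  # Two rankings over four labels, represented as vectors of ranks:
  # pi_true places lambda_1 first and lambda_4 last.
  pi_true <- c(1, 2, 3, 4)
  pi_pred <- c(2, 1, 3, 4)                        # lambda_1 and lambda_2 swapped
  cor(pi_true, pi_pred, method = "kendall")       # 0.67: 1 discordant pair out of 6
  cor(pi_true, rev(pi_true), method = "kendall")  # -1: the inverse order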
2.2 Ranking Trees

Tree-based models have been used in classification (Quinlan, 1986), regression (Breiman et al., 1984), and label ranking (Todorovski et al., 2002; Cheng et al., 2009; de Sá et al., 2015) tasks. These methods are popular for a number of reasons, one of which is that they clearly express information about the problem: their structure is relatively easy to interpret, even for people without a background in learning algorithms. It is also possible to obtain information about the importance of the various attributes for the prediction, depending on how close to the root they are used.

The Top-Down Induction of Decision Trees (TDIDT) algorithm is commonly used for the induction of decision trees (Mitchell, 1997). It is a recursive partitioning algorithm that iteratively splits data into smaller subsets which are increasingly more homogeneous in terms of the target variable (Algorithm 1). A split is a test on one of the attributes that divides the dataset into two disjoint subsets. For instance, given a numerical attribute x2, a split could be x2 ≥ 5. Given a splitting criterion that represents the gain in purity obtained with a split, the algorithm chooses the split that optimizes its value in each iteration. In its simplest form, the TDIDT algorithm only stops when the nodes are pure, i.e., when the value of the target attribute is the same for all examples in the node. This usually causes the algorithm to overfit, i.e., to generate models that capture the noise in the data as well as the regularities that are of general usefulness. One approach to address this problem is to introduce a stopping criterion in the algorithm that tests whether the best split significantly improves the quality of the model. If not, the algorithm stops and returns a leaf node. The algorithm is executed recursively for the subsets of the data obtained based on the best split until the stopping criterion is met. A leaf node is represented by a value of the target attribute generated by a rule that resolves potential conflicts in the set of training examples that are in the node. That value is the prediction that will be made for new examples that fall into that node. In classification, the prediction rule is usually the most frequent class among the training examples.

Algorithm 1 TDIDT algorithm
  Input: Dataset D
  BestSplit = test on the attributes that optimizes the SPLITTING CRITERION
  if STOPPING CRITERION == TRUE then
    Determine leaf prediction based on the target values in D
    Return a leaf node with the corresponding LEAF PREDICTION
  else
    LeftSubtree = TDIDT(D_¬BestSplit)
    RightSubtree = TDIDT(D_BestSplit)
  end if

The adaptation of this algorithm for label ranking involves an appropriate choice of the splitting criterion, the stopping criterion and the prediction rule (Algorithm 1).

Splitting Criterion  The splitting criterion is a measure that quantifies the quality of a given partition of the data. It is usually applied to all the possible splits of the data that can be made with tests on the values of individual attributes. In Ranking Trees (RT), the goal is to obtain leaf nodes that contain examples with target rankings as similar to each other as possible. To assess the similarity between the rankings of a set of training examples, the mean correlation between them is calculated using Kendall's, Spearman's or any other ranking correlation coefficient. The quality of the split is given by the weighted mean correlation of the values obtained for the subsets, where the weight is given by the number of examples in each subset. For simplicity, ignoring the weights, the splitting criterion of ranking trees is illustrated both for nominal and numerical attributes in Table 1.

Table 1: Illustration of the splitting criterion

  Attribute   Condition=true        Condition=false
              values   rank corr.   values   rank corr.
  x1          a         0.3         {b,c}    -0.2
              b         0.2         {a,c}     0.1
              c         0.5         {a,b}     0.2
  x2          < 5      -0.1         ≥ 5       0.1

The nominal attribute x1 has three values (a, b and c). Therefore, three binary splits are possible. For the numerical attribute x2, a split can be made between every pair of consecutive values. In this case, the best split is x1 = c, with a mean correlation of 0.5, in comparison to a mean correlation of 0.2 for the remaining examples, i.e., the training examples for which x1 ∈ {a, b}.

Stopping Criterion  The stopping criterion is used to determine if it is worthwhile to make a split or if there is a significant risk of overfitting (Mitchell, 1997). A split should only be made if the similarity between examples in the subsets increases substantially. Let S_parent be the similarity between the examples in the parent node and S_split the weighted mean similarity in the subsets obtained with the best split. The stopping criterion is defined as follows (Rebelo et al., 2008):

    (1 + S_parent) ≥ γ (1 + S_split)     (1)

Note that the relevance of the increase in similarity is controlled by the γ parameter. A γ ≥ 1 does not ensure increased purity of the child nodes. On the other hand, small γ values require splits with a very large increase in purity, which means that the algorithm will stop the recursion early.
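The following R sketch illustrates both criteria, assuming that the target rankings of a node are stored as the rows of a matrix; the function names are ours and not part of an existing package:

  mean_corr <- function(R) {
    # Mean pairwise Kendall correlation among the rankings (rows) of R;
    # by convention, a node with fewer than two rankings is perfectly homogeneous.
    if (nrow(R) < 2) return(1)
    cors <- cor(t(R), method = "kendall")
    mean(cors[upper.tri(cors)])
  }

  split_quality <- function(R1, R2) {
    # Weighted mean similarity of the two subsets (the splitting criterion)
    n <- nrow(R1) + nrow(R2)
    (nrow(R1) * mean_corr(R1) + nrow(R2) * mean_corr(R2)) / n
  }

  keep_splitting <- function(S_parent, S_split, gamma = 0.98) {
    # TRUE while the stopping criterion of Equation (1) does not hold
    (1 + S_parent) < gamma * (1 + S_split)
  }

For instance, with the γ = 0.98 used in Section 4.2, a split is only accepted when 1 + S_split exceeds (1 + S_parent)/0.98, i.e., when it yields roughly a 2% relative increase in similarity.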
Prediction Rule  The prediction rule is a method to generate a prediction from the (possibly conflicting) target values of the training examples in a leaf node. In LR, the aggregation of rankings is not as straightforward as in other tasks (e.g., classification or regression) and is known as the ranking aggregation problem (Yasutake et al., 2012). It is a classical problem in the social choice literature (de Borda, 1781) but also in information retrieval tasks (Dwork et al., 2001). A consensus ranking minimizes the distance to all rankings (Kemeny & Snell, 1972). A simple approach, which we adopted in this work, is to compute the average ranking (Brazdil et al., 2003) of the predictions. It is calculated by averaging the rank of each label λj, π̄(j) = (Σ_i π_i(j)) / n. The predicted ranking π̂ is the ranking of the labels λj obtained based on the average ranks π̄(j). Table 2 illustrates the prediction rule used in this work.

Table 2: Illustration of the prediction rule

        λ1    λ2   λ3   λ4
  π1     1     3    2    4
  π2     2     1    4    3
  π̄     1.5    2    3   3.5
  π̂      1     2    3    4
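A minimal R sketch of this prediction rule, again assuming one target ranking per matrix row (the function name is ours):

  average_ranking <- function(R) {
    # Average the rank of each label over the rankings in the leaf,
    # then rank the averages to obtain the predicted total order.
    rank(colMeans(R))
  }

  R <- rbind(c(1, 3, 2, 4),   # pi_1 from Table 2
             c(2, 1, 4, 3))   # pi_2 from Table 2
  average_ranking(R)          # 1 2 3 4, matching the last row of Table 2

Note that rank uses ties.method = "average" by default, so tied average ranks would yield tied positions; a tie-breaking policy would be needed to guarantee a total order.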
2.3 Entropy Ranking Trees

Recently, we proposed an alternative approach to decision trees for ranking data, the Entropy-based Ranking Trees (ERT) (de Sá et al., 2015). ERT uses an adaptation of Information Gain (IG) (de Sá et al., 2016) to assess the splitting points and the Minimum Description Length Principle Cut (MDLPC) (Fayyad & Irani, 1993) as the stopping criterion. To explain this method, we start by presenting the IG for rankings measure and then the adapted splitting and stopping criteria.

Decision trees for classification, such as ID3 (Quinlan, 1986), use Information Gain (IG) as a splitting criterion to determine the best split points. IG is a statistical property that measures the reduction in entropy between the prior and the posterior state (Mitchell, 1997). In this case, we measure it in terms of the distribution of the target variable, before and after the split. In other words, considering a set S of size n_S, since entropy, H, is a measure of disorder, IG is basically how much uncertainty in S is eliminated after splitting on a numerical attribute x_a:

    IG(x_a, T; S) = H(S) − (|S1| / n_S) H(S1) − (|S2| / n_S) H(S2)

where |S1| and |S2| are the number of instances on the left side (S1) and on the right side (S2), respectively, of the cut point T in attribute x_a.

In cases where S is a set of rankings, we can use the entropy for rankings (de Sá et al., 2016), which is defined as:

    H_ranking(S) = Σ_{i=1}^{K} P(π_i, S) log(P(π_i, S)) log(k_t(S))     (2)

where P(π_i, S) is the proportion of rankings equal to π_i in S, K is the number of distinct rankings in S and k_t(S) is the average normalized Kendall τ (Kendall & Gibbons, 1970) distance in the subset S:

    k_t(S) = ( Σ_{i=1}^{K} Σ_{j=1}^{n_S} (τ(π_i, π_j) + 1) / 2 ) / (K × n_S)

As in Section 2.2, the leaves of the tree should not be forced to be pure. Instead, a stopping criterion should be used to avoid overfitting and be robust to noise in rankings. Given an entropy measure, the adaptation of the splitting and stopping criteria comes in a natural way. As shown in (de Sá et al., 2016), the MDLPC Criterion can be used as a stopping criterion with the adapted version of entropy, H_ranking. This entropy measure also works with partial orders; however, in this work, we only use total orders.
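The following R sketch is a direct transcription of Equation (2) under our reading of the formulas above (natural logarithms, τ computed with cor); it illustrates the measure and is not the reference implementation of de Sá et al. (2016):

  ranking_entropy <- function(R) {
    # R: matrix with one target ranking per row
    n <- nrow(R)
    P <- table(apply(R, 1, paste, collapse = " ")) / n  # proportions of distinct rankings
    U <- unique(R)                                      # the K distinct rankings
    s <- 0
    for (i in seq_len(nrow(U)))
      for (j in seq_len(n))
        s <- s + (cor(U[i, ], R[j, ], method = "kendall") + 1) / 2
    kt <- s / (nrow(U) * n)    # average normalized similarity k_t(S)
    sum(P * log(P)) * log(kt)  # both factors are <= 0, so H_ranking >= 0
  }

  ranking_entropy(rbind(c(1, 2, 3), c(1, 2, 3), c(2, 1, 3)))  # > 0
  ranking_entropy(rbind(c(1, 2, 3), c(1, 2, 3)))              # 0: a pure node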
3 Random Forests

Random Forests (RF) (Breiman, 2001) are an ensemble method originally proposed for classification and regression problems. It essentially consists of the generation of multiple decision trees obtained using different randomization techniques. The set of predictions made by each of these trees is then aggregated to obtain the prediction of the ensemble.

The RF algorithm is related to another popular ensemble method by the same author, Bagging (Breiman, 1996), which stands for bootstrap-aggregating. This is an ensemble method that takes a predefined number s of bootstrap samples (drawn with replacement) from the training data to construct s models. Given a new example, s predictions are generated, which are then aggregated, usually with the average or the mode, to obtain a combined prediction.

RF can be regarded as an extension of bagging. Given a forest size s and a training dataset D, a set of bootstrap samples {D′_1, ..., D′_s} is generated by sampling with replacement from D. A decision tree is learned from each D′_i, but it is grown in a slightly different way from the original algorithm: at each node, only a random subset of the m features can be used for splitting. In classification, the number of random features used in each split is usually √m and, in regression, log2 m. This results in what is usually referred to as random trees. As in bagging, each of the s random trees makes predictions on the test data, which are then combined using a suitable aggregation method.

One of the reasons for the popularity of RF lies in the fact that they have few parameters to tune and can be applied to various tasks (Scornet et al., 2014). They are simple to implement and usually give accurate results even with small sample sizes. Moreover, since they use s independent learners, training can be parallelized. Another reason that makes RF a popular approach is that it is possible to take advantage of the algorithm to assess variable importance (Genuer et al., 2010).

3.1 Label Ranking Forests

Considering the success of Random Forests in terms of improved accuracy for classification and regression problems, some approaches have been proposed to deal with different targets, such as bipartite rankings (Clémençon et al., 2013). Label Ranking Forests should also be seen as a potentially robust approach for LR. Adapting RF to Label Ranking is a straightforward process once decision trees have been adapted.

Thus, we propose a new ensemble LR algorithm, Label Ranking Forests, based on Random Forests. With this approach, we expect to increase the accuracy of label ranking tree methods. In classification and regression, the aggregation of predictions is done in a simple way, with the mode and the mean, respectively. However, as discussed in Section 2.2, the aggregation of rankings is not as straightforward. As in Ranking Trees, we use the average ranking (Brazdil et al., 2003) to aggregate the predictions. Given the similarity of the LR task to classification, the number of random features we use in each split is √m, the same value that is used in RF for classification.

When the algorithm is not able to find a good split on any of the √m selected features for the root node, it looks for a split on all the m features instead. This prevents the random feature selection mechanism from generating empty trees.
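The following R sketch outlines the resulting LRF procedure. It assumes hypothetical helpers ranking_tree (an RT or ERT learner restricted to mtry randomly chosen features per split) and predict_tree, which stand in for the tree algorithms of Section 2:

  label_ranking_forest <- function(X, Pi, s = 100) {
    # X: feature table; Pi: matrix with one target ranking per row
    m <- ncol(X)
    lapply(seq_len(s), function(b) {
      idx <- sample(nrow(X), replace = TRUE)       # bootstrap sample
      ranking_tree(X[idx, , drop = FALSE], Pi[idx, , drop = FALSE],
                   mtry = floor(sqrt(m)))          # random feature subset
    })
  }

  predict_lrf <- function(forest, x) {
    # One predicted ranking per tree, aggregated with the average ranking
    preds <- t(sapply(forest, function(tree) predict_tree(tree, x)))
    rank(colMeans(preds))
  }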
4 Empirical Study

In this section, we describe the empirical study carried out to investigate the performance of LRF and of the tree methods used at the base level. We start by describing the experimental setup (Section 4.1), then the results of the base-level algorithms (Section 4.2) and finally the results of the new algorithm (Section 4.3).

4.1 Experimental setup

The experiments are carried out on datasets from the KEBI Data Repository at the Philipps University of Marburg (Cheng et al., 2009) that are typically used in LR research (Table 3). They are based on classification and regression datasets, obtained using two different transformation methods: A) the target ranking is a permutation of the classes of the original target attribute, derived from the probabilities generated by a Naive Bayes classifier; B) the target ranking is derived for each example from the order of the values of a set of numerical variables, which are then no longer used as independent variables. A few basic statistics of the datasets used in our experiments are presented in Table 3. Although these are somewhat artificial datasets, they are quite useful as benchmarks for LR algorithms.

Table 3: Summary of the KEBI datasets

  Dataset      type   #examples   #labels   #attributes    Uπ
  authorship    A         841        4          70           2%
  bodyfat       B         252        7           7          94%
  calhousing    B      20,640        4           4          0.1%
  cpu-small     B       8,192        5           6           1%
  elevators     B      16,599        9           9           1%
  fried         B      40,769        5           9          0.3%
  glass         A         214        6           9          14%
  housing       B         506        6           6          22%
  iris          A         150        3           4           3%
  pendigits     A      10,992       10          16          19%
  segment       A       2,310        7          18           6%
  stock         B         950        5           5           5%
  vehicle       A         846        4          18           2%
  vowel         A         528       11          10          56%
  wine          A         178        3          13           3%
  wisconsin     B         194       16          16         100%

A simple measure of the diversity of the target rankings is the Unique Rankings' Proportion, Uπ, the proportion of distinct target rankings in a given dataset (Table 3). As a practical example, the iris dataset has 5 distinct rankings for 150 instances, which yields Uπ = 5/150 ≈ 3%. This means that all 150 rankings are duplicates of these 5.

The code for all the experiments presented in this paper has been written in R (R Development Core Team, 2010) and is available at https://github.com/rebelosa/labelrankingforests. The generalization performance of the LR methods was estimated using a methodology that has been used previously for this purpose (Hüllermeier et al., 2008). The evaluation measure is Kendall's τ and the performance of the methods was estimated using ten-fold cross-validation.

4.2 Results with Label Ranking Trees

We evaluate the two variants of ranking trees described earlier: Ranking Trees (RT) and Entropy-based Ranking Trees (ERT) (Sections 2.2 and 2.3). The RT algorithm has a parameter γ that can affect the accuracy of the model. Based on previous results, we use γ = 0.98 for RT (de Sá et al., 2015). Table 4 presents the results obtained by the two decision tree approaches, RT and ERT, in comparison to the results for Label Ranking Trees (LRT), which are reproduced from the original paper (Cheng et al., 2009). We note that we have no information about the depth of the trees obtained with the latter, so that information is omitted from Table 4. Even though LRT performs best in most of the cases presented, both RT and ERT are also competitive methods.

Table 4: Results obtained for Ranking Trees on KEBI datasets (the mean accuracy is given in terms of Kendall's τ; the best mean accuracy values are in bold)

               RT                 ERT                LRT
               mean      mean     mean      mean     mean
  Dataset      accuracy  depth    accuracy  depth    accuracy
  authorship   .883       8.0     .889       4.0     .882
  bodyfat      .111      11.9     .182       2.7     .117
  calhousing   .182       1.0     .291      11.6     .324
  cpu-small    .458      17.2     .437       6.1     .447
  elevators    .746      18.9     .757       7.9     .760
  fried        .797      20.2     .774      13.2     .890
  glass        .871       8.2     .854       3.0     .883
  housing      .794      12.9     .704       3.4     .797
  iris         .963       4.3     .853       2.0     .947
  pendigits    .871      14.0     .838       5.9     .935
  segment      .929      12.0     .901       5.0     .949
  stock        .897      10.8     .859       5.0     .895
  vehicle      .817      11.0     .787       4.1     .827
  vowel        .833      12.5     .598       3.6     .794
  wine         .905       4.0     .906       2.0     .882
  wisconsin    .334      10.0     .337       2.3     .343
  average      .712      11.1     .685       5.1     .730

Figure 1 shows how much smaller ERT trees are, in general. By generating smaller trees, ERT provides more interpretable models when compared with RT. An exception is the calhousing dataset, where ERT generates larger trees. However, in this case, the increase in size is justified by a reasonable increase in accuracy (Table 4).

Figure 1: Comparison of the average depth of the trees obtained with RT (blue) and ERT (red) on KEBI datasets

To compare different ranking methods, we use a combination of Friedman's test and Dunn's Multiple Comparison Procedure (Neave & Worthington, 1992), which has been used before for this purpose (Brazdil et al., 2003). First, we run Friedman's test to check whether the results are different or not, with the following hypotheses:

  H0: The distributions of Kendall's τ are equal
  H1: The distributions of Kendall's τ are not equal

Using the Friedman test (implemented in the stats package (R Development Core Team, 2010)), we obtained a p-value < 1%, which shows strong evidence against H0. This means that there is a high probability that the three methods have different performance. Thus, we tested which of the three methods are different from one another with Dunn's Multiple Comparison Procedure (Neave & Worthington, 1992). Using the R package dunn.test (Dinno, 2015), we tested the following hypotheses for each pair of methods a and b:

  H0: The distributions of Kendall's τ for a and b are equal
  H1: The distributions of Kendall's τ for a and b are not equal

Table 5 indicates that there is no statistical evidence that the methods are different. The statistical tests confirm our observation that, although LRT generally obtains better results than RT and ERT, the latter approaches are competitive.

Table 5: Dunn's test results (p-values)

         RT     ERT    LRT
  RT            0.22   0.37
  ERT    0.22          0.13
  LRT    0.37   0.13
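For reference, the tests can be run in R roughly as follows, on an accuracy matrix with one row per dataset and one column per method (for illustration, the values shown are the first three rows of Table 4; the actual analysis uses all sixteen datasets):

  acc <- cbind(RT  = c(.883, .111, .182),
               ERT = c(.889, .182, .291),
               LRT = c(.882, .117, .324))
  friedman.test(acc)   # global test: rows are blocks (datasets), columns are methods

  library(dunn.test)   # Dinno (2015)
  dunn.test(x = as.vector(acc),
            g = rep(colnames(acc), each = nrow(acc)))  # pairwise comparisons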
4.3 Results with Label Ranking Forests

We generated forests with 100 trees and aggregated the predicted rankings with the average ranking method (Brazdil et al., 2003). Table 6 presents the results obtained by the Label Ranking Forests using RT and ERT, referred to as LRF-RT and LRF-ERT, respectively.

Table 6: Results obtained for Label Ranking Forests on KEBI datasets, using two different label ranking trees, RT and ERT (the mean accuracy is given in terms of Kendall's τ; the best mean accuracy values are in bold)

               LRF-RT             LRF-ERT
               mean      mean     mean      mean
  Dataset      accuracy  depth    accuracy  depth
  authorship   .912       8.3     .906       7.7
  bodyfat      .212      10.6     .211       5.3
  calhousing   .185       1.0     .294       8.3
  cpu-small    .469      13.9     .471       7.8
  elevators    .605      10.0     .721       9.5
  fried        .887      15.5     .841      14.5
  glass        .874       6.0     .849       2.7
  housing      .780      10.9     .699       3.7
  iris         .973       4.9     .933       2.3
  segment      .930      10.8     .917       5.2
  stock        .892       9.9     .869       5.5
  vehicle      .850      10.0     .849       9.4
  vowel        .844      11.5     .701       4.9
  wine         .932       4.3     .925       2.8
  wisconsin    .460       8.8     .429       3.7
  average      .720       9.1     .708       6.2

The average depth of the trees for LRF-RT is, in most cases, smaller than that of the tree obtained with the RT algorithm, while the accuracy is better. On average, the accuracy increased by 0.019 while the average depth of the trees decreased by 1.8. One exception is the elevators dataset, which suffered a significant decrease in accuracy with the LRF method.

The comparison between ERT and LRF-ERT leads to different observations. The average depth of the trees increases when using LRF. This can be explained by the fact that the measure of entropy for rankings used in ERT is very robust to noise in rankings (de Sá et al., 2016). Hence, it requires a larger amount of dissimilarity in a set of rankings to find a partition. As noted in Section 4.2 (Figure 1), the depth of the trees is much smaller with ERT than with RT. An additional observation is that, on the elevators dataset, LRF with ERT also yielded a significant reduction in accuracy.

In Figure 2, we can observe how much the accuracy increases or decreases with LRF when compared to the corresponding base-level trees alone. In the vast majority of datasets, there is some improvement in accuracy. The only exception is the elevators dataset, as mentioned above.

Figure 2: Accuracy gained/lost per dataset when using the ensemble method LRF instead of the standalone decision trees RT (blue) and ERT (red) on KEBI datasets

Using the same statistical tests as before (Section 4.2), we compare LRF-RT and LRF-ERT with the RT, ERT and LRT methods. With Friedman's test, we obtained a p-value < 1%, which shows strong evidence against H0. Knowing that there are some differences among the methods, we then tested which ones are different from one another with Dunn's Multiple Comparison Procedure (Table 7). Since we obtained a p-value of around 25% for the comparison between LRF-RT and LRF-ERT, there is no statistical evidence that these two methods are different.

Table 7: Dunn's test for all the methods (p-values)

            RT     ERT    LRT    LRF-RT   LRF-ERT
  RT               0.23   0.34   0.31     0.44
  ERT       0.23          0.13   0.11     0.28
  LRT       0.34   0.13          0.46     0.29
  LRF-RT    0.31   0.11   0.46            0.25
  LRF-ERT   0.44   0.28   0.29   0.25

In the pairwise comparisons of the methods (Table 8), we count how many times each method wins in terms of accuracy. In this analysis, we conclude that Label Ranking Forests using RT give the best results, demonstrating the effectiveness of the approach.

Table 8: Pairwise comparisons of the methods in terms of win statistics

            RT   ERT   LRT   LRF-RT   LRF-ERT   Total (Rank)
  RT              9     6     3        7        25 (4)
  ERT        6          3     2        4        15 (5)
  LRT        9   12           7        9        37 (2)
  LRF-RT    12   13     8             13        46 (1)
  LRF-ERT    8   11     6     2                 27 (3)

On the other hand, even though LRF-ERT shows some improvement in accuracy relative to ERT, it did not behave much better than RT or LRT (Table 8). Again, this might be caused by the fact that the measure of entropy for rankings used in ERT is very robust to noise. For this reason, the depth of the trees in LRF-ERT is, on average, 70% of the depth of the trees in LRF-RT. While this can be an advantage for standalone Label Ranking Trees, in Label Ranking Forests it is less relevant, because it is hard to interpret 100 trees per dataset.

5 Conclusions

In this work, we propose an ensemble of decision tree methods for Label Ranking, called Label Ranking Forests (LRF). The method is tested with two different base-level methods: Ranking Trees (RT) and Entropy-based Ranking Trees (ERT). We present an empirical evaluation using well-known datasets in this field. We also extend the analysis from previous work on the tree-based methods, RT and ERT, and compare them with the state-of-the-art Label Ranking Trees (LRT) approach.

The analysis of the decision trees shows that both RT and ERT are valid and competitive approaches. While RT usually gives better accuracy, ERT generates trees with much smaller depth (around 50% less, in comparison to RT).
Our results were also compared with the published results for Label Ranking Trees (LRT) (Cheng et al., 2009). LRT has, in general, better accuracy than RT and ERT; however, statistical tests showed that none of the methods is significantly different. This means that both RT and ERT are competitive approaches and, since they are distance-based methods, this kind of approach is worth pursuing.

The two ensemble approaches, LRF-RT and LRF-ERT, used the base ranking tree models RT and ERT, respectively. Similarly to the application of Random Forests to other tasks, there was a general increase in accuracy when compared to the corresponding base-level methods. The results confirm that both LRF-RT and LRF-ERT are highly competitive LR methods. LRF-RT, in particular, stands out as a clear winner in terms of accuracy.

As future work, we might improve the comparison with the LRT method (Cheng et al., 2009) by implementing it and testing it both as a learning algorithm and as the base-level method for Label Ranking Forests. Also, LRF can potentially produce benefits similar to those of the Random Forest method, in terms of feature selection or input variable importance measurement, when applied to LR datasets. Finally, the experiments in this paper were carried out on a set of standard benchmark datasets, which represent artificial LR problems. We plan to apply these approaches to real-world datasets, e.g., related to user preferences (Kamishima, 2003).

Acknowledgments

This work is financed by the ERDF - European Regional Development Fund through the Operational Programme for Competitiveness and Internationalisation - COMPETE 2020 Programme within project POCI-01-0145-FEDER-006961, and by National Funds through the FCT - Fundação para a Ciência e a Tecnologia (Portuguese Foundation for Science and Technology) as part of project UID/EEA/50014/2013.

References

Agresti, A. (2010), Analysis of Ordinal Categorical Data, Vol. 656, John Wiley & Sons.

Aiguzhinov, A., Soares, C. & Serra, A. P. (2010), A similarity-based adaptation of naive Bayes for label ranking: Application to the metalearning problem of algorithm recommendation, in 'Discovery Science - 13th International Conference, DS 2010, Canberra, Australia, October 6-8, 2010. Proceedings', pp. 16–26.

Biau, G. (2012), 'Analysis of a random forests model', Journal of Machine Learning Research 13, 1063–1095.

Brazdil, P. & Soares, C. (1999), Exploiting past experience in ranking classifiers, in H. Bacelar-Nicolau, F. C. Nicolau & J. Janssen, eds, 'Applied Stochastic Models and Data Analysis', Instituto Nacional de Estatística, pp. 299–304.

Brazdil, P., Soares, C. & da Costa, J. P. (2003), 'Ranking learning algorithms: Using IBL and meta-learning on accuracy and time results', Machine Learning 50(3), 251–277.

Breiman, L. (1996), 'Bagging predictors', Machine Learning 24(2), 123–140.

Breiman, L. (2001), 'Random forests', Machine Learning 45(1), 5–32.

Breiman, L., Friedman, J. H., Olshen, R. A. & Stone, C. J. (1984), Classification and Regression Trees, Wadsworth.

Cheng, W. & Hüllermeier, E. (2011), 'Label ranking with abstention: Predicting partial orders by thresholding probability distributions (extended abstract)', Computing Research Repository, CoRR.

Cheng, W., Dembczynski, K. & Hüllermeier, E. (2010), Label ranking methods based on the Plackett-Luce model, in 'ICML', pp. 215–222.
Cheng, W., Huhn, J. C. & Hüllermeier, E. (2009), Decision tree and instance-based learning for label ranking, in 'Proceedings of the 26th Annual International Conference on Machine Learning, ICML 2009, Montreal, Quebec, Canada, June 14-18, 2009', pp. 161–168.

Cheng, W., Hüllermeier, E., Waegeman, W. & Welker, V. (2012), Label ranking with partial abstention based on thresholded probabilistic models, in 'Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United States', pp. 2510–2518.

Clémençon, S., Depecker, M. & Vayatis, N. (2013), 'Ranking forests', Journal of Machine Learning Research 14(1), 39–73.

de Borda, J. C. (1781), 'Mémoire sur les élections au scrutin'.

de Sá, C. R., Rebelo, C., Soares, C. & Knobbe, A. J. (2015), Distance-based decision tree algorithms for label ranking, in 'Progress in Artificial Intelligence - 17th Portuguese Conference on Artificial Intelligence, EPIA 2015, Coimbra, Portugal, September 8-11, 2015. Proceedings', pp. 525–534.

de Sá, C. R., Soares, C. & Knobbe, A. J. (2016), 'Entropy-based discretization methods for ranking data', Information Sciences 329, 921–936.

de Sá, C. R., Soares, C., Jorge, A. M., Azevedo, P. J. & da Costa, J. P. (2011), Mining association rules for label ranking, in 'Advances in Knowledge Discovery and Data Mining - 15th Pacific-Asia Conference, PAKDD 2011, Shenzhen, China, May 24-27, 2011, Proceedings, Part II', pp. 432–443.

Dinno, A. (2015), dunn.test: Dunn's Test of Multiple Comparisons Using Rank Sums. R package version 1.2.3.

Dwork, C., Kumar, R., Naor, M. & Sivakumar, D. (2001), Rank aggregation methods for the web, in 'Proceedings of the Tenth International World Wide Web Conference, WWW 10, Hong Kong, China, May 1-5, 2001', pp. 613–622.

Fayyad, U. M. & Irani, K. B. (1993), Multi-interval discretization of continuous-valued attributes for classification learning, in 'Proceedings of the 13th International Joint Conference on Artificial Intelligence, Chambéry, France, August 28 - September 3, 1993', pp. 1022–1029.

Fürnkranz, J., Hüllermeier, E., Loza Mencía, E. & Brinker, K. (2008), 'Multilabel classification via calibrated label ranking', Machine Learning 73(2), 133–153.

Genuer, R., Poggi, J. & Tuleau-Malot, C. (2010), 'Variable selection using random forests', Pattern Recognition Letters 31(14), 2225–2236.

Hüllermeier, E., Fürnkranz, J., Cheng, W. & Brinker, K. (2008), 'Label ranking by learning pairwise preferences', Artificial Intelligence 172(16-17), 1897–1916.

Kamishima, T. (2003), Nantonac collaborative filtering: recommendation based on order responses, in 'Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, August 24-27, 2003', pp. 583–588.

Kemeny, J. & Snell, J. (1972), Mathematical Models in the Social Sciences, MIT Press.

Kendall, M. & Gibbons, J. (1970), Rank Correlation Methods, Griffin, London.

Mitchell, T. (1997), Machine Learning, McGraw-Hill.

Neave, H. & Worthington, P. (1992), Distribution-free Tests, Routledge.

Quinlan, J. R. (1986), 'Induction of decision trees', Machine Learning 1(1), 81–106.

R Development Core Team (2010), R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.

Rebelo, C., Soares, C. & Costa, J. (2008), Empirical evaluation of ranking trees on some metalearning problems, in J. Chomicki, V. Conitzer, U. Junker & P. Perny, eds, 'Proceedings of the 4th AAAI Multidisciplinary Workshop on Advances in Preference Handling'.
Ribeiro, G., Duivesteijn, W., Soares, C. & Knobbe, A. J. (2012), Multilayer perceptron for label ranking, in 'Artificial Neural Networks and Machine Learning - ICANN 2012 - 22nd International Conference on Artificial Neural Networks, Lausanne, Switzerland, September 11-14, 2012, Proceedings, Part II', pp. 25–32.

Scornet, E., Biau, G. & Vert, J.-P. (2014), 'Consistency of random forests', ArXiv e-prints.

Spearman, C. (1904), 'The proof and measurement of association between two things', American Journal of Psychology 15, 72–101.

Todorovski, L., Blockeel, H. & Džeroski, S. (2002), Ranking with predictive clustering trees, in T. Elomaa, H. Mannila & H. Toivonen, eds, 'Proc. of the 13th European Conf. on Machine Learning', number 2430 in 'LNAI', Springer-Verlag, pp. 444–455.

Vembu, S. & Gärtner, T. (2010), Label ranking algorithms: A survey, in J. Fürnkranz & E. Hüllermeier, eds, 'Preference Learning', Springer-Verlag, pp. 45–64.

Yasutake, S., Hatano, K., Takimoto, E. & Takeda, M. (2012), Online rank aggregation, in 'Proceedings of the 4th Asian Conference on Machine Learning, ACML 2012, Singapore, Singapore, November 4-6, 2012', pp. 539–553.