1 Introduction

Data analysis is a pervasive task in many industrial and scientific fields, often involving Machine Learning (ML) approaches, which have been successfully applied in a wide variety of domains [35]. However, the variability of dataset characteristics makes it challenging to select the most adequate ML approach for new data. This issue has itself been tackled from an ML perspective: based on meta-data extracted from several datasets related to a given application or domain, a model is learned to predict which ML approach is likely to be better suited for that domain. This approach, commonly referred to as Meta-Learning (MtL), has shown promising results in predicting classifier performance on a problem from statistical or geometrical characteristics of the dataset [16].

MtL approaches perform a data-driven selection of techniques, using knowledge extracted from previous tasks, which can then be applied by a recommendation system to predict the preferred approach for a new, previously unseen problem [4]. This depends on the construction of a meta-dataset from information on a group of datasets in the class of the problem. For each dataset in the group, descriptive characteristics (meta-features) are extracted and combined into one or more meta-examples, which are then labeled according to the observed performances of different ML algorithms, or an order-preserving transformation such as their rank [4], composing the target feature to be predicted. From the meta-dataset, a meta-model can be induced by a learning algorithm and then used in a recommendation system to predict which algorithm is expected to perform best when a new problem needs to be addressed [34].

Despite their success, most MtL studies still lack an in-depth analysis of the meta-features [29], which are known to be crucial to the successful use of MtL [3]. Several works propose new meta-features [15,16,17], but only a few present important details such as their asymptotic computational cost, the degree of information they provide, or their importance for the investigated problems [16, 24].

This paper presents the use of clustering measures as meta-features in an MtL framework to learn a recommendation system for classifiers. These clustering measures are based on internal indices which extract information, such as compactness or separation, to evaluate the goodness of a clustering structure. Although some of these measures have been used in unsupervised MtL scenarios [26, 37], this work proposes their use for classification problems. The proposed approach uses class labels as cluster indicators rather than the output of a clustering algorithm, and is expected to yield informative measures quantifying statistical or geometrical characteristics of datasets.

The main goal of this paper is to investigate whether clustering measures can be applied in this context, and whether they influence the choices of the recommendation system. The experimental results suggest that including these measures as meta-features contributes to more accurate recommendations in the problem of classifier selection. Additionally, an initial evaluation of computational costs indicates that this MtL approach is substantially less computationally expensive than testing all classifiers on the dataset for the selection of the best one.

2 Background

This section presents the background information necessary to describe the proposed approach: Sect. 2.1 introduces the Meta-Learning framework, including the process of building a meta-dataset and how to recommend algorithms. Section 2.2 elaborates on the clustering meta-features, a set of internal indices that evaluate the goodness of a clustering structure based on properties such as compactness and separation.

2.1 Meta-learning

Different algorithms have distinct learning strategies; thus, the essence of learning can only be captured by considering learning algorithms with diverse biases to acquire domain-specific information [4]. This concept was initially introduced to make the algorithm selection problem systematic [30], and the goal is to predict the best algorithm to solve a specific problem when more than one option is available.

The components of this model are the space of problem instances (\(\mathcal {P}\)); the space of instance meta-features (\(\mathcal {F}\)); the space of algorithms (\(\mathcal {A}\)); and the space of evaluation measures (\(\mathcal {Y}\)). From these components, an MtL system can obtain a model capable of mapping a dataset or problem \(p\in \mathcal {P}\), described by meta-features \(f\in \mathcal {F}\), into one algorithm \(\alpha \in \mathcal {A}\) able to solve the problem with a good predictive performance according to measure \(y\in \mathcal {Y}\). The meta-learner recommendation would be the algorithm with the best expected \(y(\alpha (p))\). This can be further improved, for instance, by the inclusion of components which may provide theoretical support to refine the recommendation system [34].

A crucial component of these previous approaches is the definition of the meta-features (\(\mathcal {F}\)) used to describe general properties of datasets [3]. They must be able to provide evidence on the future performance of the algorithms in \(\mathcal {A}\) and to discriminate, with an acceptable computational cost, the performance of a group of algorithms. The main meta-features used in the MtL literature can be divided into six main groups [31], called standard meta-features in this work. They represent meta-features based on general high-level summaries of the data; statistical and information-theoretic properties of the data; properties of Decision Trees (DTs) induced from the data; and the performance of simple and fast learning algorithms [9, 25, 29].

Another concern is the definition of the set of problem instances (\(\mathcal {P}\)), since the use of a large number of diverse datasets is recommended to induce a reliable meta-model [4]. Attempts to reduce the bias in this choice include using datasets from different data repositories, such as UCI [13] and OpenML [36]. The importance of problem diversity is based on the underlying assumption that the meta-model is expected to generalize the acquired knowledge when faced with new problem instances without explicit constraints in terms of expected problem characteristics.

The selected algorithms \(\mathcal {A}\) represent a set of candidate algorithms to be recommended in the algorithm selection process. Ideally, these algorithms should also be sufficiently different from each other and represent all regions in the algorithm space [34]. The models induced by the algorithms can be evaluated by different measures. Quality measures \(\mathcal {Y}\) for assessing the models depend on the nature of the problem. Classifiers, for example, can be evaluated by different measures such as accuracy, \(F_\beta \), area under the ROC curve (AUC) or the Kappa coefficient, among others.

The step following the extraction of meta-features from the datasets and the training of the set of algorithms is the labeling of the meta-examples in the meta-base. The three properties most frequently used in this task are [4]: (i) the best performing algorithm on the meta-example’s dataset (a meta-classification problem); (ii) the ranking of the algorithms according to their performances (a meta-ranking problem); and (iii) the raw performance value of each algorithm on the dataset (a meta-regression problem).
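The three labeling strategies can be illustrated on a hypothetical performance matrix (a NumPy sketch; the accuracy values are invented for illustration):

```python
import numpy as np

# Hypothetical performance matrix: rows = datasets (meta-examples),
# columns = candidate algorithms; entries = observed accuracies.
perf = np.array([[0.81, 0.79, 0.90],
                 [0.62, 0.70, 0.68],
                 [0.93, 0.91, 0.88]])

# (i) Meta-classification target: index of the best algorithm per dataset.
best = perf.argmax(axis=1)

# (ii) Meta-ranking target: rank of each algorithm (1 = best) per dataset.
ranks = (-perf).argsort(axis=1).argsort(axis=1) + 1

# (iii) Meta-regression target: the raw performance values themselves.
raw = perf
```

Each choice defines a different meta-level task over the same meta-base; this paper adopts option (iii), predicting the raw accuracies.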

Unlike previous works on MtL applied to clustering data [26, 37], this work presents an MtL regression task based on clustering meta-features derived from internal indices, which extract information such as separation and compactness, in a supervised scenario using the class labels. In this case, the main objective is to improve the algorithm recommendation using internal indices, motivated by their generally low computational cost and high degree of information. The standard meta-features [31] are considered to provide a comparison baseline and to allow an objective analysis of the variation in performance resulting from the new clustering meta-features. The evaluation of the recommender system includes assessing performance at the base- and meta-levels and the cost in execution time.

2.2 Clustering Meta-features

A few definitions must be presented before the description of the clustering measures. Let \(X\in \mathbb {R}^{N\times {q}}\) be a matrix of N observations, each represented by a q-dimensional vector \(\mathbf {x}_\ell \in \mathbb {R}^q\); and let \(U_K(X) = \{X_1,X_2,\dotsc ,X_K\}\) be an exhaustive partition of X into K mutually exclusive clusters \(X_k\) with sizes \(n_k > 0,~k=1,\dotsc ,K\). In all definitions below, the clustering meta-features are functions of a partition \(U_K(X)\), which in this paper is given by the class labels. The partition \(U_K(X)\) is kept implicit in the definitions to keep the notation cleaner, but the reader should keep this relationship in mind.

Let \(\mathbf {\bar{x}}_k\) denote the mean point of all observations belonging to cluster \(X_k\), and \(\mathbf {\bar{\bar{x}}}\) denote the grand mean of all observations in X. Also, let \(dist(\cdot ,\cdot )\) denote the distance (e.g., Euclidean) between two points. Then,

$$\begin{aligned} \delta _{ij} \triangleq \min _{\begin{array}{c} \mathbf {x}\in X_i\\ \mathbf {y}\in X_j \end{array}} dist(\mathbf {x},\mathbf {y}) \end{aligned}$$
(1)

denotes the single-linkage distance between the two clusters \(X_i,~X_j\), and

$$\begin{aligned} \varDelta _k \triangleq \max _{\mathbf {x},\mathbf {y}\in X_k} dist(\mathbf {x},\mathbf {y}) \end{aligned}$$
(2)

represents the diameter of a cluster \(X_k\).

Additionally, consider some vectors of distances. Let \(\mathbf {d}^+\in \mathbb {R}^{N_W}\) be a vector of all \(N_W\) within-cluster distances (i.e., all distances between pairs of points having equal class labels in the data), \(\mathbf {d}^-\in \mathbb {R}^{N_B}\) be a vector containing all \(N_B\) between-cluster distances (between pairs of points having different class labels), and \(\mathbf {d}^\bullet \in \mathbb {R}^{N_T}\) denote a vector containing the distances between all \(N_T\) pairs of points in X, with \(N_T = N_W + N_B\).

Finally, let the following quantities be defined: the within-groups sum of squares,

$$\begin{aligned} WGSS = \sum _{k=1}^K\sum _{\mathbf {x}_i\in X_k} \left[ dist(\mathbf {x}_i, \mathbf {\bar{x}}_k)\right] ^2, \end{aligned}$$
(3)

and the between-groups sum of squares,

$$\begin{aligned} BGSS = \sum _{k=1}^K n_k\left[ dist(\mathbf {\bar{x}}_k, \mathbf {\bar{\bar{x}}})\right] ^2. \end{aligned}$$
(4)
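As a concrete illustration, the distance vectors \(\mathbf {d}^+\) and \(\mathbf {d}^-\) and the two sums of squares can be computed directly from a data matrix and its class labels. The NumPy/SciPy sketch below assumes Euclidean distance; the function and variable names are illustrative, not from the paper:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform


def basic_quantities(X, y):
    """Compute d+, d-, WGSS and BGSS, treating class labels y as clusters."""
    D = squareform(pdist(X))            # all pairwise Euclidean distances
    iu = np.triu_indices(len(X), k=1)   # each pair of points counted once
    same = y[iu[0]] == y[iu[1]]
    d_within = D[iu][same]              # vector d+ (N_W within-class distances)
    d_between = D[iu][~same]            # vector d- (N_B between-class distances)

    grand_mean = X.mean(axis=0)
    wgss = bgss = 0.0
    for k in np.unique(y):
        Xk = X[y == k]
        ck = Xk.mean(axis=0)            # cluster centroid (class mean)
        wgss += ((Xk - ck) ** 2).sum()  # squared distances to the centroid
        bgss += len(Xk) * ((ck - grand_mean) ** 2).sum()
    return d_within, d_between, wgss, bgss
```

Note that \(N_T = N_W + N_B = N(N-1)/2\) follows from counting each pair once, as in the upper-triangle indexing above.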

Given the preceding definitions, the clustering meta-features used in this work are formalized as follows:

  • Dunn’s separation index (VDU) [14]:

    $$\begin{aligned} VDU = \min _{\begin{array}{c} i,j\in [1,K]\\ i\ne j \end{array}}\frac{\delta _{ij}}{\max \limits _{k\in [1,K]}\varDelta _k}. \end{aligned}$$
    (5)
  • Davies-Bouldin index (VDB) [11]:

    $$\begin{aligned} VDB = \frac{1}{K}\sum _{i=1}^K\max _{j\ne i}\frac{\varDelta _i + \varDelta _j}{dist(\mathbf {\bar{x}}_i,\mathbf {\bar{x}}_j)}. \end{aligned}$$
    (6)
  • Baker-Hubert index (\(\varGamma \)) [1]: let

    $$\begin{aligned} s^+ = \sum _{\forall d_i\in \mathbf {d}^+}\sum _{\forall d_j\in \mathbf {d}^-}One(d_i<d_j) \quad \mathrm {and}\quad s^- = \sum _{\forall d_i\in \mathbf {d}^+}\sum _{\forall d_j\in \mathbf {d}^-}One(d_i>d_j) \end{aligned}$$
    (7)

    where One(condition) is a function that returns 1 if the condition is true and 0 otherwise; then:

    $$\begin{aligned} \varGamma = \frac{s^+ - s^-}{s^+ + s^-}. \end{aligned}$$
    (8)
  • Tie-corrected Kendall tau (\(\tau \)) [12]:

    $$\begin{aligned} \tau = \frac{s^+ - s^-}{\sqrt{N_WN_BN_T(N_T-1)/2}}. \end{aligned}$$
    (9)
  • Ray-Turi index (\(\nu \)) [28]:

    $$\begin{aligned} \nu = \frac{1}{N}\frac{WGSS}{\min \limits _{i\ne j} \delta _{ij}^2}. \end{aligned}$$
    (10)
  • Mean inter-centroid distance (INT) [2]:

    $$\begin{aligned} INT = \frac{2}{K(K-1)}\sum _{k=1}^{K-1} dist(\mathbf {\bar{x}}_k, \mathbf {\bar{x}}_{k+1}). \end{aligned}$$
    (11)
  • Global silhouette index (SIL) [32]: for a given point \(\mathbf {x}\in {X_k}\), let \(a(\mathbf {x})\) denote the mean distance between this point and all other points belonging to the same cluster, \(X_k\); \(\mathfrak {d}(\mathbf {x},k^\prime )\) denote the mean distance between this point and all points belonging to a distinct cluster \(X_{k^\prime }\) (\(k^\prime \ne k\)); and

    $$\begin{aligned} b(\mathbf {x}) = \min _{k^\prime \ne k} \mathfrak {d}(\mathbf {x},k^\prime ). \end{aligned}$$
    (12)

    The silhouette width of point \(\mathbf {x}\) is then calculated as

    $$\begin{aligned} \mathcal {S}(\mathbf {x}) = \frac{b(\mathbf {x}) - a(\mathbf {x})}{\max (b(\mathbf {x}), a(\mathbf {x}))}, \end{aligned}$$
    (13)

    and the global silhouette index can be calculated as

    $$\begin{aligned} SIL = \frac{1}{K}\sum _{k=1}^K\sum _{\forall \mathbf {x}\in X_k}\frac{\mathcal {S}(\mathbf {x})}{n_k}. \end{aligned}$$
    (14)
  • Point biserial index (PB) [12]:

    $$\begin{aligned} PB = \left( \frac{|\mathbf {d}^+|_1}{N_W} - \frac{|\mathbf {d}^-|_1}{N_B}\right) \sqrt{\frac{N_WN_B}{N_T^2}}. \end{aligned}$$
    (15)

    with \(|\cdot |_1\) denoting the \(\ell _1\) norm of a vector.

  • Calinski-Harabasz index (CH) [8]:

    $$\begin{aligned} CH = \frac{\left( N-K\right) BGSS}{\left( K-1\right) WGSS}. \end{aligned}$$
    (16)
  • Xie-Beni index (XB) [38]:

    $$\begin{aligned} XB = \frac{1}{N}\frac{WGSS}{\min \limits _{i\ne j}\delta _{ij}}. \end{aligned}$$
    (17)
  • Normalized Relative Entropy (NRE) [26]:

    $$\begin{aligned} NRE = \sum _{k=1}^K\frac{n_k}{N}\log _2\left( \frac{n_k}{N}\right) . \end{aligned}$$
    (18)
  • C index (C) [21]: let \(s_{min}\) denote the sum of the \(N_W\) smallest elements of \(\mathbf {d}^\bullet \) and \(s_{max}\) the sum of its \(N_W\) largest elements. Then:

    $$\begin{aligned} C = \frac{|\mathbf {d}^+|_1 - s_{min}}{s_{max} - s_{min}}. \end{aligned}$$
    (19)
  • Mean of distances to cluster centroids (CM):

    $$\begin{aligned} CM = \frac{1}{K}\sum _{k=1}^K\sum _{\forall \mathbf {x}_i\in X_k}dist(\mathbf {x}_i,\mathbf {\bar{x}}_k) \end{aligned}$$
    (20)
  • Connectivity (CN) [7, 19]:

    $$\begin{aligned} CN = \sum _{i=1}^N\sum _{j=1}^L\mathcal {I}\left( \mathbf {x}_i,\eta _j\left( \mathbf {x}_i\right) \right) , \end{aligned}$$
    (21)

    where \(\eta _j\left( \mathbf {x}_i\right) \) denotes the j-th nearest-neighbor to point \(\mathbf {x}_i\), and \(\mathcal {I}\left( \mathbf {x}_i,\eta _j\left( \mathbf {x}_i\right) \right) \) is an indicator function that receives the value of 1/j if \(\mathbf {x}_i\) and \(\eta _j\left( \mathbf {x}_i\right) \) belong to the same cluster, zero otherwise.

  • Average scattering for clusters (\(SD_{scat}\)) [18]:

    $$\begin{aligned} SD_{scat} = \frac{\sum \limits _{k=1}^K \left| \hat{\boldsymbol{\sigma }}^2_{k}\right| _2}{K\left| \hat{\boldsymbol{\sigma }}^2_{\bullet }\right| _2}, \end{aligned}$$
    (22)

    where \(\hat{\boldsymbol{\sigma }}^2_{\bullet }\) denotes the vector of variance estimates for all attributes in all clusters, and \(\hat{\boldsymbol{\sigma }}^2_{k}\) is the vector of variance estimates for all attributes considering only the observations in the k-th cluster.

  • Total separation between clusters (\(SD_{dis}\)) [18]:

    $$\begin{aligned} SD_{dis} = \frac{\kappa _{max}}{\kappa _{min}}\sum \limits _{k=1}^{K}\left( \sum _{\begin{array}{c} k^\prime = 1\\ k^\prime \ne k \end{array}}^K dist\left( \bar{\mathbf {x}}_k,\bar{\mathbf {x}}_{k^\prime }\right) \right) ^{-1} \end{aligned}$$
    (23)

    where:

    $$\begin{aligned} \kappa _{max} = \max \limits _{k^\prime \ne k} dist\left( \bar{\mathbf {x}}_k,\bar{\mathbf {x}}_{k^\prime }\right) \quad \mathrm {and}\quad \kappa _{min} = \min \limits _{k^\prime \ne k} dist\left( \bar{\mathbf {x}}_k,\bar{\mathbf {x}}_{k^\prime }\right) . \end{aligned}$$
    (24)
  • Akaike’s Information Criterion (AIC) [33]: the AIC of a first-order multiple linear regression model of the class labels on the attributes of the dataset. The linear model is given by

    $$\begin{aligned} class(\mathbf {x}_i) = \hat{\beta }_0 + \hat{\boldsymbol{\beta }}\mathbf {x}_i + e_i, \end{aligned}$$
    (25)

    with \(class(\mathbf {x}_i)\) denoting the numerically-encoded class of point \(\mathbf {x}_i\in {X}\), \(\hat{\beta }_0\) and \(~\hat{\boldsymbol{\beta }}\) representing the fitted coefficients of the model, and \(e_i\) the residual related to the i-th observation in the dataset. The AIC of the model is then given as:

    $$\begin{aligned} AIC = -2\ln (L) + 2q \end{aligned}$$
    (26)

    where \(\ln {L}\) is the log-likelihood value estimated for the model (25).

  • Bayesian Information Criterion (BIC) [33]: The BIC of the regression model described in (25), which is calculated as:

    $$\begin{aligned} BIC = -2\ln (L) + q\ln (N) \end{aligned}$$
    (27)
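Several of these indices have off-the-shelf implementations in scikit-learn, which can be applied with the class labels playing the role of the partition \(U_K(X)\). Note that scikit-learn follows the original formulations, which differ in detail from some of the definitions above (e.g., its Davies-Bouldin score uses mean within-cluster distances rather than diameters, and its silhouette averages \(\mathcal {S}(\mathbf {x})\) over all points rather than per cluster), so the values are comparable but not necessarily identical:

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score,
                             calinski_harabasz_score,
                             davies_bouldin_score)

# Toy dataset whose class labels double as the cluster partition U_K(X).
X, y = make_blobs(n_samples=200, centers=3, cluster_std=1.0, random_state=0)

sil = silhouette_score(X, y)         # global silhouette (cf. Eq. 14)
ch = calinski_harabasz_score(X, y)   # Calinski-Harabasz index (cf. Eq. 16)
vdb = davies_bouldin_score(X, y)     # Davies-Bouldin variant (cf. Eq. 6)
```

On well-separated classes, SIL approaches 1 and VDB approaches 0, reflecting compact, well-separated groups; overlapping classes push SIL toward 0 and inflate VDB.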

3 Methodology

This work aims to investigate the use of clustering measures as meta-features for learning a recommendation system for classification tasks. More specifically, the standard and clustering meta-features are used in an MtL setup designed to predict the accuracy of some popular classification techniques for a given dataset. The objectives are: (i) to determine whether the MtL approach results in an improved recommendation system; (ii) to investigate whether the clustering meta-features contribute to the performance of this recommendation system; and (iii) to characterize the execution times required to extract each set of meta-features.

To train and assess the meta-learner, a meta-dataset was populated with the meta-feature values (both clustering and standard meta-features) for a collection of problem instances, labeled with the performances of known classification techniques. The meta-models were induced by regression techniques and evaluated at the base-level, measuring the predictive performance of the classifiers, and at the meta-level, measuring the impact of recommending the best one. The computational costs of these processes were recorded for evaluation.

Four hundred datasets from the OpenML repository [36], representing diverse application contexts and domains, were used in this experiment. They were selected considering a maximum of 10,000 observations, 500 features and 10 classes, to constrain the computational costs of the process. For each dataset, both standard and clustering meta-features, as described in Sect. 2.2, were computed. The averages of 10-fold cross-validated predictive accuracies achieved by each of five classification techniques, for each dataset, were also calculated for labeling the meta-examples.

The classification approaches used were: C4.5 decision tree [27] with pruning based on subtree raising; k-Nearest Neighbors (kNN) model [22] with \(k=3\); Multilayer Perceptron (MLP) [20] with learning rate of 0.3, momentum of 0.5 and a single hidden layer; Random Forest (RF) [5] with 500 trees; and Support Vector Machine (SVM) [10] with radial basis kernel. These hyper-parameter values were defined following the standard configurations of the implementations used.

This process resulted in a meta-dataset containing 130 meta-features (112 standard and 18 clustering) and 400 samples for each classifier, labeled by the mean accuracy of the classification method. This meta-dataset was then used to train regression models to estimate the expected accuracy of each classifier, as a function of the meta-features. Five regression techniques known to have different biases were tested: Classification And Regression Trees (CART) [6] algorithm with pruning; Distance-weighted k-Nearest Neighbor (DWNN) [22] with \(k=3\) and Gaussian kernel; Multilayer Perceptron Regression (MLPR) [20] with learning rate of 0.3, momentum of 0.5 and a single hidden layer; Random Forest Regressor (RFR) [5] with 500 trees; and Support Vector Regression (SVR) [10] using radial basis kernel. As with the classifiers, the regressor hyper-parameters were also set as the default values of the implementations used without any problem-specific tuning.
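The overall pipeline can be sketched as follows. This is a minimal, hypothetical stand-in for the actual setup: synthetic datasets replace the 400 OpenML problems, a three-value meta-feature vector replaces the 130 meta-features, and only two of the five classifiers are shown:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import silhouette_score
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

classifiers = {"kNN": KNeighborsClassifier(n_neighbors=3),
               "RF": RandomForestClassifier(n_estimators=50, random_state=0)}

meta_X, meta_y = [], {name: [] for name in classifiers}
for seed in range(8):                        # stand-ins for the 400 datasets
    X, y = make_classification(n_samples=150, n_features=8,
                               n_informative=4, random_state=seed)
    # Tiny meta-feature vector: two simple measures plus one clustering index.
    meta_X.append([X.shape[0], X.shape[1], silhouette_score(X, y)])
    for name, clf in classifiers.items():    # label: mean 10-fold CV accuracy
        meta_y[name].append(cross_val_score(clf, X, y, cv=10).mean())

# One meta-regressor per classifier, predicting its expected accuracy.
meta_models = {name: RandomForestRegressor(n_estimators=100, random_state=0)
               .fit(meta_X, accs) for name, accs in meta_y.items()}
```

At recommendation time, the meta-feature vector of a new dataset is fed to each meta-model, and the classifier with the highest predicted accuracy is suggested.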

To train and evaluate the regression models, 10-fold cross-validation rounds were also executed. The quality of the obtained models was assessed by comparison to two simple baselines: Random (RD) and Default (DF). The RD baseline represents the observed performance of a randomly chosen classifier in the meta-base. The DF baseline was set as the accuracy of the classifier that most often presented the best classification performance across all datasets.
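Both baselines can be computed directly from the matrix of observed accuracies; a small sketch with simulated accuracy values (the real meta-base is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical meta-base: accuracies of 5 classifiers on 400 datasets.
acc = rng.uniform(0.5, 1.0, size=(400, 5))

# Default (DF): always pick the classifier that wins most often overall.
winners = acc.argmax(axis=1)          # best classifier per dataset
df_choice = np.bincount(winners).argmax()
df_perf = acc[:, df_choice].mean()

# Random (RD): expected accuracy of a uniformly random choice, which
# equals the mean accuracy over all classifiers and datasets.
rd_perf = acc.mean()
```

Any useful meta-regressor must recommend classifiers whose observed accuracies beat both `rd_perf` and `df_perf` on average.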

The final step is the analysis of the trade-off between the computational cost of extracting each set of meta-features and that of evaluating all classifiers in a cross-validation setup, through direct comparison of single-threaded experiments. The experiments were run on a cluster node with two Intel Xeon E5-2680v2 processors and 128 GB of DDR3 memory. The standard meta-features, as well as some of the clustering ones, are provided by the mfe package.

4 Experimental Results

Since the main goal of algorithm recommender systems using MtL is to suggest, among the known options, the algorithm with the most appropriate bias for a particular dataset [4], the first analysis of this experiment focuses on this objective. Figure 1 presents the results of two analyses on the meta-base involving the classifiers on the selected 400 datasets.

Fig. 1. Performance of classifiers over the 400 datasets.

The distribution of accuracy values, summarized by the boxplots in Fig. 1a, suggests that all approaches have generally similar distributions of performance values across the datasets employed, with reasonably high median accuracies for most problems. Random Forest has a slightly higher median value than the others, while kNN has the lowest median and the largest variability in performance values. Figure 1b shows how many times each algorithm was the “winner”, i.e. the number of datasets for which it presented the best performance. These results suggest that the set of choices is adequate for MtL, since each algorithm was the best performer on a non-empty subset of the datasets. It is clear from the figures that RF was most frequently the best classifier across the datasets, which means it will be the most probable choice of the random baseline recommender and the fixed choice of the default baseline.

To investigate their quality on the task of learning to recommend a classifier, the selected regressors (MLPR, CART, DWNN, SVR, and RFR) and the two baseline regressors (RD and DF) were applied to the meta-dataset considering two sets of meta-features: the 112 standard meta-features (described in Sect. 2.1) and the full set of 130 meta-features (incorporating the 18 clustering measures from Sect. 2.2). Figure 2 shows the average normalized mean squared error (NMSE) of each meta-regressor on predicting the expected accuracy of the classifiers for each set of meta-features, estimated by 10-fold cross-validation.

Fig. 2. The log-scaled NMSE for each combination of meta-feature set, classification and regression models.

The boxplots indicate that all meta-regressors provided better predictions than the baseline approaches (with the exception of a single observation for the MLPR regressor). RFR seems to have a slight advantage when compared to the other regressors, for both sets of features. MLPR, CART, DWNN and SVR show a similar performance distribution in most cases, with MLPR having a higher interquartile range and a few outliers.

Although visual inspection does not necessarily provide a clear winner between the two sets of meta-features, it is sufficient to indicate that: (i) they are clearly less error-prone than the baselines, and (ii) their NMSE values show a relatively low variance. More objectively, an ANOVA model followed by Dunnett pairwise comparisons indicates a statistically significant difference at the \(95\%\) confidence level between the MtL algorithms (in aggregate) against the two baselines (\(p < 10^{-10}\) for both the MtL\(\times \)DF and MtL\(\times \)RD comparisons). The effects of different meta-feature sets, classifier methods and distinct replicates were removed by blocking [23].

A second analysis was conducted to investigate the effect of adding the clustering meta-features on the predictive ability of the meta-regressors. This analysis was performed by removing the two baseline methods and fitting a blocked ANOVA model [23] on rank-transformed data (required in this case to meet the ANOVA assumptions), with the meta-feature sets as the experimental factor and the classifier methods, meta-regressors, and replicates as blocking factors. This test revealed a statistically significant positive effect of adding the clustering meta-features on the performance of the best regressor, RFR, across all classifiers (paired t-test for the standard\(\times \)full meta-feature sets, \(p = 0.036\)).

Figure 3 shows the results of a direct comparison between the recommended classifiers. The x-axis lists the meta-regressors and the y-axis shows the percent differences in accuracy, averaged over all datasets. As in the analysis of Fig. 2, there is a noticeable advantage of the MtL approaches over the random baseline (RD), as well as over the fixed choice of RF, the best overall classifier (DF), for the three most successful regressors. The magnitude of the gains is also substantial, with 40–\(50\%\) gains over RD and 10–\(20\%\) gains over DF in the case of DWNN, RFR and SVR.

Fig. 3. Differences between base-classifier accuracies over baselines. Vertical lines represent standard errors.

The impact of the meta-features can be further investigated via the analysis of the RFR model. Figure 4 shows the most relevant meta-features, according to their individual contribution to the reduction of the mean squared error, MSE (measured as the increase in MSE when that meta-feature was omitted), as well as their corresponding groups. The x-axis lists the meta-features in decreasing order of relevance, with the mean change in MSE shown in the y-axis. Vertical bars indicate standard errors of estimation.

Fig. 4. Top-ranked meta-features selected by the RFRs in log-scale, based on the average increase in MSE when the meta-feature is omitted.

The standard meta-features are more strongly represented, as expected due to their informative value [31]. Some meta-features appear more than once because distinct summarization functions are used. Landmarking measures, which mainly relate to the performance of simple meta-models induced by the kNN, Naïve Bayes and single-node DT algorithms, represent over half of the top features. Interestingly, six of the 18 clustering features are ranked among the 20 most relevant for the RFR model, in this order: sil, ch, ray, vdb, c_index and nre. This suggests that these features are relevant to the MtL task, at least for the RFR meta-regressor, corroborating the result of the hypothesis test performed on the data of Fig. 2.

Finally, the practical applications of using clustering meta-features for a MtL recommender system must consider the computational costs involved. The trade-off between the runtime of the characterization process and the evaluation of all alternative data modeling algorithms considered to solve the task under study should favor the former in order to make the recommendation system useful.

Figure 5 presents the results of a runtime analysis to evaluate this trade-off for the proposed approach. This analysis exhaustively compared the single-threaded runtime cost of extracting the standard and clustering meta-features to the cost of running all classification algorithms. In the figure, each point represents the cost of processing a dataset on a log\(\times \)log scale. The x-axis shows the cost of running all classifiers, while the y-axis shows the cost of extracting the meta-features. This enables a straightforward visual analysis: if a point lies below the identity line, the y value is less than the x value, and thus running the classifiers is more expensive than extracting the meta-features.

Fig. 5. Execution times for computing meta-features compared to applying all classifiers.

The standard meta-features are cheaper to extract than running all classification algorithms in around 89% of the cases. The clustering meta-features are considerably cheaper than running all classification algorithms. This low computational cost, together with the informative nature of some of the clustering meta-features (seen in Fig. 4), suggests their usefulness in, and potential for, improving MtL recommender systems.

5 Conclusions and Future Works

This work investigated both the gains in performance associated with MtL in general, and the use of clustering meta-features in a recommender system for classification algorithms. Experimental results showed that the selected classifiers were adequate for a recommender system and that using the recommended model resulted in improvements over a random or fixed (best expected) algorithm choice.

The results presented in this paper suggest two tentative conclusions: (i) that meta-learning, as an approach to recommend classifiers for unseen problems, has the potential to provide good choices with a reduced computational budget; and (ii) that clustering-based meta-features are suitable to enhance this MtL task, which may indicate that they are able to capture relevant properties from datasets to describe the performance of classifiers.

Future work includes investigating the impact of the underlying grouping structure of the datasets on the adequacy of the clustering meta-features, as well as investigating the correlations among the meta-features to enable the selection of a more efficient, parsimonious subset of meta-features. The incorporation of a more diverse set of classification algorithms into the recommender system, and of hyper-parameter tuning for the classifiers and regressors, also represents an interesting direction for research.