Abstract
Automated Machine Learning (AutoML) has achieved high popularity in recent years. However, most studies have investigated solutions to single-label classification problems, so the multi-label classification scenario still requires further investigation. From the AutoML point of view, the few studies on multi-label classification focus on automatically finding the best models through mono-objective optimization: these tools train several multi-label classifiers in search of the one with the best performance on a single objective. In this work, we propose AutoMMLC, a new multi-objective AutoML method for multi-label classification that finds models maximizing the f-score measure and minimizing the training time. Experiments were carried out with ten multi-label datasets and versions of the proposed method using two multi-objective optimization algorithms: Multi-objective Random Search and the Non-Dominated Sorting Genetic Algorithm II. We evaluated the Pareto fronts obtained by these methods through the hypervolume metric. The Wilcoxon test showed that the AutoMMLC versions had similar results for this metric. Multi-label Classification (MLC) algorithms were selected from the Pareto frontiers through the Frugality Score and compared with baseline algorithms. The Friedman test showed that the MLC algorithms from the AutoMMLC versions performed equally regarding f-score and training time. Furthermore, they outperformed all baseline algorithms in f-score and most baseline algorithms in training time.
1 Introduction
Designing a high-performance Machine Learning (ML) model is an arduous task that requires expert knowledge [7]. Automated Machine Learning (AutoML) is a research field that seeks to improve how ML applications are built by automating their construction [20], creating automatically configured tools that perform well and are easy to use. However, AutoML is not restricted to the search for ML models. It can also be applied from the pre-processing phase to the model interpretability phase, automating a specific sub-task of the ML pipeline or even the entire pipeline [18, 20].
AutoML can solve Single-label Classification (SLC) and Multi-label Classification (MLC) problems. Each dataset instance is associated with a single label in SLC, while in MLC, each instance is associated with a subset of labels [9, 17]. Most research dealing with AutoML and MLC focuses on automating the search for the best algorithms and hyperparameters used in the model induction task [12,13,14, 18, 19].
Given the search space, an optimization algorithm, and an evaluation criterion, AutoML trains and evaluates several models in search of the best model evaluated. As the search space comprises different classification algorithms and their corresponding hyperparameters, these studies deal with instances of Combined Algorithm Selection and Hyperparameter Optimization (CASH) optimization problems [7]. Furthermore, in these studies, the optimization algorithms for MLC are mono-objective, with an evaluation criterion usually related to the classifier performance.
However, we could evaluate multi-label classifiers considering more than one evaluation criterion [19]. Thus, we would have a multi-objective optimization problem, where a task involves more than one objective function to be maximized or minimized and produces a set of optimal solutions rather than a single solution, known as Pareto optimal solutions [3]. Multi-objective ML balances distinct evaluation criteria (objective functions) to improve model generalization and avoid models with local optima [8].
In this study, we propose a multi-objective AutoML method for MLC, denominated Automated Multi-objective Multi-label Classification (AutoMMLC), optimizing both the f-score performance measure and the training time of the classifiers. Thus, we want to automatically find models that maximize the f-score and have a low computational cost. We performed experiments evaluating the performance of AutoMMLC with two different multi-objective algorithms: Multi-objective Random Search (MORS) [8] and the Non-Dominated Sorting Genetic Algorithm II (NSGA-II) [2]. These algorithms were chosen because MORS serves as a reasonable baseline, and NSGA-II is one of the most popular multi-objective evolutionary algorithms [8]. In addition, we evaluated the performance of AutoMMLC against simpler baselines.
The main contributions of this paper are: i) the proposal of a new AutoML method for MLC, contributing to a subject still little explored, and ii) incorporating multi-objective criteria to search for the best ML classifiers. The remainder of this article is structured as follows: in Sect. 2, we review MLC, multi-objective optimization algorithms, and related works; in Sect. 3, we describe the experimental methodology adopted in this study; in Sect. 4, we present the results and discuss their implications; finally, Sect. 5 presents the findings and future works.
2 Background
2.1 Multi-label Classification
Given a label set, SLC associates a single label with each instance in the dataset. In MLC, an instance can be classified into two or more labels simultaneously. For example, data such as texts, images, videos, and music can be simultaneously associated with multiple labels, such as the emotions they evoke. Since \(L = \{l_j: j = 1, ..., q\}\) is the finite label set, MLC associates to each instance \({\textbf {x}}_i\) of the dataset a set of labels \({\textbf {y}}_i\) such that \({\textbf {y}}_i \subseteq L\) [17]. According to Madjarov et al. [9], MLC algorithms can be characterized into three approaches: i) Problem Transformation (PT), which transforms multi-label problems into single-label problems and then employs traditional classification algorithms; ii) Algorithm Adaptation (AA), which creates algorithms that handle multiple labels simultaneously; and iii) Ensemble, which employs a set of MLC algorithms as base classifiers.
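The Problem Transformation idea above can be sketched in a few lines: each label becomes an independent binary problem (the Binary Relevance scheme). This is a minimal illustration, not the paper's code; the function names and the toy threshold learner are invented for the example.

```python
# Sketch of Problem Transformation via Binary Relevance: one binary
# problem per label. Names here are illustrative, not the paper's code.

def binary_relevance_fit(X, Y, fit_binary):
    """Y is a list of label subsets; train one binary model per label."""
    labels = sorted({l for y in Y for l in y})
    return {l: fit_binary(X, [1 if l in y else 0 for y in Y]) for l in labels}

def binary_relevance_predict(models, x):
    """Predicted label set = labels whose binary model fires."""
    return {l for l, m in models.items() if m(x)}

# Toy base learner (assumption, for illustration only): predicts 1 when the
# first feature reaches the mean of the positive instances' first feature.
def toy_learner(X, t):
    pos = [x[0] for x, ti in zip(X, t) if ti == 1]
    thr = sum(pos) / len(pos) if pos else float("inf")
    return lambda x: x[0] >= thr
```

In a real setting, `fit_binary` would be any SLC algorithm from the search space; the transformation itself is independent of the base learner.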
2.2 Multi-objective Optimization
Multi-objective optimization aims to solve real problems with conflicting objectives, subject to possible constraints of the problem [3]. In the context of multi-objective optimization for ML problems, Karl et al. [8] formally define the problem according to Eq. 1.

\(\min _{\lambda \in \varLambda } c(\lambda ) = \min _{\lambda \in \varLambda } \bigl (c_{1}(\lambda ), \dots , c_{m}(\lambda )\bigr )\)    (1)
Since \(\varLambda \) is the set of possible ML candidate solutions, we want to evaluate each candidate solution \(\lambda \). The \(\lambda \) evaluation is calculated from the m evaluation criteria of the ML models: \(c_{1}:\varLambda \rightarrow \mathbb {R},\dots ,c_{m}:\varLambda \rightarrow \mathbb {R}\), with \(m\in \mathbb {N}\). The objective is to find, among all the candidate solutions, the one with the minimum evaluation \(c(\lambda )\), being \(c:\varLambda \rightarrow \mathbb {R}^m\).
The multi-objective optimization result is a set of optimal solutions rather than a single solution [3]. This set is known as Pareto optimal solutions or Pareto frontier. They have different values for the objectives, where a solution can be good considering one of the objectives but poor concerning the others. It is up to the users to choose the best solution among the resulting solutions.
Most multi-objective algorithms use the concept of dominance to find Pareto optimal solutions. For a minimization problem, solution \(\lambda \) dominates \(\lambda ^{'}\) if and only if \(\forall i \in \{1, ..., m\}: c_i(\lambda ) \le c_i(\lambda ^{'})\) and \(\exists j \in \{1, ..., m\}: c_j(\lambda ) < c_j(\lambda ^{'})\) [8]. In other words, \(\lambda \) dominates \(\lambda ^{'}\) if the objective values of \(\lambda \) are equivalent to or better than those of \(\lambda ^{'}\) and \(\lambda \) surpasses \(\lambda ^{'}\) in at least one of the objectives. Pareto optimal solutions are not dominated by any other solution. Based on the dominance concept, we describe below two algorithms:
-
Multi-objective Random Search (MORS): Random Search (RS) is one of the most popular mono-objective optimization algorithms. Given T trials, the mono-objective RS algorithm randomly produces candidate solutions and evaluates these solutions in search of the best one. According to Karl et al. [8], the multi-objective version of RS can be derived from RS and the concept of dominance. For this, we assess the T random samples and return the set of non-dominated solutions, the Pareto front.
-
Non-Dominated Sorting Genetic Algorithm II (NSGA-II): NSGA-II is an evolutionary algorithm [2]. Given an initial population, the population evolves over generations by crossover, mutation, and tournament selection operations. The fittest individuals, among parents and offspring, are selected for subsequent generations. Fitness is measured by non-domination rank and crowding distance density estimate. Thus, the populations are composed of individuals from the first Pareto frontiers with good dispersion.
We can assess the quality of a Pareto frontier using distance-based indicators (which require the true Pareto frontier) or volume-based indicators (which calculate the volume between the frontier and a reference point). Hypervolume is a volume-based quality indicator. For a minimization problem, it is the region bounded by the Pareto frontier points and a reference point [6]. The larger the calculated hypervolume, the more the Pareto frontier minimizes the objectives of the problem.
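The dominance test, the non-dominated filter, and the two-objective hypervolume described above can be sketched as follows (a minimal illustration for the two-objective minimization case; function names are ours, not from the paper):

```python
# Minimal sketch (minimization): Pareto dominance, non-dominated filtering,
# and the 2-D hypervolume w.r.t. a reference point.

def dominates(a, b):
    """a dominates b: no objective worse, at least one strictly better."""
    return (all(ai <= bi for ai, bi in zip(a, b))
            and any(ai < bi for ai, bi in zip(a, b)))

def pareto_front(points):
    """Keep only the non-dominated points."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

def hypervolume_2d(front, ref):
    """Area dominated by the frontier and bounded by the reference point."""
    pts = sorted(set(front))            # ascending in f1 => descending in f2
    hv = 0.0
    for (f1, f2), nxt in zip(pts, pts[1:] + [(ref[0], None)]):
        hv += (nxt[0] - f1) * (ref[1] - f2)   # one vertical slab per point
    return hv
```

For the frontier \{(1,3), (2,2), (3,1)\} with reference point (4,4), the dominated area is 6, matching the slab-by-slab sum.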
2.3 Related Works
Most studies covering AutoML and MLC explore mono-objective optimization. Given a search space composed of single-label algorithms, multi-label algorithms, and hyperparameters [12,13,14, 18, 19], optimization algorithms explore the search space and evaluate candidate multi-label classifiers in search of the best one. The studies used different evaluation criteria to evaluate candidate classifiers during optimization, either aggregating several performance measures into a single evaluation criterion or using a single well-known criterion, such as Hamming Loss or F-score.
While handling AutoML for MLC as a mono-objective optimization problem, the related studies usually propose new algorithms and compare the results with other optimizers already used for MLC. In this sense, de Sá et al. [13] used Genetic Algorithms as the optimization algorithm, de Sá et al. [12] used Grammar-based Genetic Programming, and de Sá et al. [14] extended their previous works by also employing Bayesian Optimization. Other authors explored the search space with Hierarchical Task Networks [19] or used Hyperband Optimization, Bayesian Optimization, and the combination of Bayesian Optimization and Hyperband as optimization algorithms [18].
These works focus on automating the search for ML models. Some suggest employing AutoML in the data pre-processing and post-processing steps [12, 13]. All of them employ a single evaluation criterion in the optimization process. There are suggestions to study the evaluation criteria in the application domain to use the most appropriate ones [14] and to employ multi-objective optimization to improve the generalization of the models [19]. The authors also cite the need to improve the search space [13, 18], for example, using Meta-Learning [18].
3 Experimental Methodology
3.1 Datasets
We evaluated AutoMMLC on ten multi-label datasets traditionally used in studies on MLC [9, 18], from different application domains: biology, multimedia (audio and music), and text. Table 1 presents the main characteristics of the datasets used. The number of instances ranges from 194 to 7,395, the number of features from 19 to 1,836, and the number of labels from 6 to 374. All these datasets are available for download from the Mulan website.
3.2 Preprocessing
All datasets were split using a 10-fold cross-validation resampling strategy. To ensure the same proportion of instances of each class in the folds, we used the iterative stratification strategy [15], coded with the scikit-multilearn library [16]. Even after applying the iterative stratification, some folds ended up without positive instances for some classes. This happened because some datasets originally had very few positive instances for some labels. Thus, before the iterative stratification, we also preprocessed the datasets.
The data preprocessing step included data organization, cleaning, and data transformation. None of the selected datasets had missing values. So, in the preprocessing step, we treated the following situations:
-
Duplicated data: we removed instances with duplicated values; and
-
Labels associated with few instances: we also removed labels with ten or fewer positive instances. With this, we also excluded unlabeled instances.
In the data transformation step, we applied min-max normalization to all datasets to prevent features with larger scales from prevailing over the others. The three right-most columns of Table 1 show the main statistics of the datasets after the preprocessing. One may note a change in the number of labels in some datasets: birds, corel5k, enron, genbase, and medical. In these datasets, the preprocessing also decreased the available number of instances. Once the preprocessing was finished, we generated the 10 resampling folds through iterative stratification, as mentioned before. The datasets and codes used in this work are freely available.
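The preprocessing steps above (duplicate removal, rare-label removal, unlabeled-instance removal, min-max normalization) can be sketched as one function. This is an illustrative reimplementation under our own naming, not the paper's released code; the `min_positives` threshold mirrors the "ten or fewer positive instances" rule.

```python
import numpy as np

# Sketch of the preprocessing pipeline described in the text:
# 1) drop duplicated (features + labels) rows,
# 2) drop labels with <= min_positives positive instances,
# 3) drop instances left without any label,
# 4) min-max normalize the features.
def preprocess(X, Y, min_positives=10):
    X, Y = np.asarray(X, dtype=float), np.asarray(Y, dtype=int)
    _, keep = np.unique(np.hstack([X, Y]), axis=0, return_index=True)
    X, Y = X[np.sort(keep)], Y[np.sort(keep)]        # remove duplicates
    Y = Y[:, Y.sum(axis=0) > min_positives]          # drop rare labels
    labeled = Y.sum(axis=1) > 0                      # drop unlabeled rows
    X, Y = X[labeled], Y[labeled]
    span = X.max(axis=0) - X.min(axis=0)
    X = (X - X.min(axis=0)) / np.where(span == 0, 1, span)   # min-max
    return X, Y
```

Note that normalization here uses the whole matrix for simplicity; in the cross-validation setting described above, the min/max statistics would be computed per training fold.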
3.3 Search Space
During its execution, AutoMMLC assesses different ML models to find the best one. The search space definition is essential in this process, as it defines the possible ML algorithms (or neural architectures) and hyperparameter values for inducing models. Figure 1 presents the search space of AutoMMLC, composed of 12 SLC algorithms and 10 MLC algorithms from the scikit-learn [10] and scikit-multilearn [16] libraries.
The SLC algorithms included in the search space are K-Nearest Neighbors (KNN), Support Vector Machine (SVM), AdaBoost, Bernoulli Naïve Bayes (BNB), Decision Tree (DT), Extra Tree (ET), Gradient Boosting (GB), Logistic Regression (LR), Multinomial Naïve Bayes (MNB), Random Forest (RF), Multi-Layer Perceptron (MLP), and Stochastic Gradient Descent (SGD). The MLC algorithms included in the search space belong to the following categories:
-
Ensemble: Random k labelsets Disjoint (RakelD);
-
PT: Binary Relevance (BR) and Classifier Chains (CC); and
-
AA: BR and KNN in version A (BRkNNa), BR and KNN in version B (BRkNNb), DT, Multi-Label KNN (MLkNN), Multi-Label Support Vector Machines (MLTSVM), Multi-Label Adaptive Resonance Associative Map (MLARAM), and RF.
In Fig. 1, the RakelD algorithm is a Label Powerset (LP) ensemble and has a base classifier, which is one of the SLC algorithms listed. The multi-label algorithms of the PT category also have a base classifier, which can be any of the 12 single-label algorithms included in the search space.
It is essential to mention that all these SLC and MLC algorithms can have tunable hyperparameters. Therefore, the search space also contains information about the possible values of each hyperparameter. In this work, we stored the possible values of the hyperparameters in arrays, which could assume discrete or continuous values. More information about the search space algorithms (their hyperparameters and values) is available in the Supplementary Material.
3.4 Evaluation Criteria and Objectives
The evaluation criteria in our ML model evaluation approach were the training time and the example-based f-score (also known as F-measure). The selected criteria are related to the computational time and the performance of the models, respectively. The f-score is the harmonic mean between precision and recall and can assume values in [0, 1]. Equation 2 presents the definition of the f-score, where N is the number of instances in the dataset, \({\textbf {y}}_i\) is the label set for the instance \({\textbf {x}}_i\), and \(h({\textbf {x}}_i)\) is the predicted label set for instance \({\textbf {x}}_i\) [9, 17].

\(\text {f-score} = \frac{1}{N} \sum _{i=1}^{N} \frac{2\,|{\textbf {y}}_i \cap h({\textbf {x}}_i)|}{|{\textbf {y}}_i| + |h({\textbf {x}}_i)|}\)    (2)
Given that we want to maximize the f-score and minimize training time, we defined the objectives as training time and 1 - f-score.
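Putting both objectives together, evaluating one candidate amounts to timing the training step and computing the example-based f-score of Eq. 2. A minimal sketch, where `train`, `predict`, and the data splits are illustrative placeholders rather than the paper's API:

```python
import time

# Sketch of the two minimized objectives: training time and 1 - f-score.
def example_based_fscore(Y_true, Y_pred):
    """Mean over instances of 2|y ∩ h(x)| / (|y| + |h(x)|), on label sets.
    Convention assumed here: an instance with empty true and predicted
    label sets contributes 1 (a perfect match)."""
    total = 0.0
    for y, h in zip(Y_true, Y_pred):
        total += 1.0 if not y and not h else 2 * len(y & h) / (len(y) + len(h))
    return total / len(Y_true)

def evaluate_candidate(train, predict, X_tr, Y_tr, X_te, Y_te):
    start = time.perf_counter()
    model = train(X_tr, Y_tr)                       # induce the MLC model
    train_time = time.perf_counter() - start
    f = example_based_fscore(Y_te, [predict(model, x) for x in X_te])
    return train_time, 1.0 - f                      # both objectives minimized
```

With scikit-learn binary indicator matrices, `f1_score(..., average='samples')` computes the same example-based average.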
3.5 AutoMMLC\(_{{\textbf {MORS}}}\)
\({\textrm{AutoMMLC}_\textrm{MORS}}\) is the AutoMMLC method with the MORS algorithm. This method evaluates candidate solutions randomly sampled from the search space and returns the set of non-dominated ones (the Pareto frontier). It receives the termination criterion and the training and test sets as parameters. Then, it randomly draws an MLC algorithm and its hyperparameters from the search space. In sequence, it trains this multi-label algorithm with the training data and calculates the training time and f-score. The evaluation criteria (training time and 1 - f-score) are necessary to find the Pareto front, obtained as described by Deb et al. [2]. Our implementation allows executing the AutoML method per fold and training and evaluating the models in parallel from a pool of threads, speeding up \({\textrm{AutoMMLC}_\textrm{MORS}}\).
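The core MORS loop can be sketched as below. `sample_candidate` and `evaluate` stand in for the search-space sampling and the train/score step described above; this sequential sketch omits the thread pool used in the actual implementation.

```python
import random

# Sketch of the MORS loop: sample `trials` random candidates, evaluate both
# objectives, and return the non-dominated subset (the Pareto front).
def mors(sample_candidate, evaluate, trials, seed=0):
    rng = random.Random(seed)
    evaluated = []
    for _ in range(trials):
        cand = sample_candidate(rng)
        evaluated.append((cand, evaluate(cand)))

    def dominated(obj):
        # obj is dominated if some other point is no worse everywhere
        # and strictly better somewhere (minimization).
        return any(all(o <= b for o, b in zip(other, obj)) and
                   any(o < b for o, b in zip(other, obj))
                   for _, other in evaluated)

    return [(c, o) for c, o in evaluated if not dominated(o)]
```

The thread-pool version would simply parallelize the `evaluate` calls before the dominance filter.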
3.6 AutoMMLC\(_{{\textbf {NSGA}}}\)
Individual. In the \({\textrm{AutoMMLC}_\textrm{NSGA}}\) optimization process, we represented candidate solutions (individuals) as arrays of integers. They are created dynamically, and their structure is related to the search space. Figure 2 depicts examples of individuals and a reduced search space, considering only two candidate MLC algorithms (MLkNN and BR). This illustrative figure will be used to explain the individual. Each array position corresponds to a gene, and the first position specifies the MLC algorithm. We can get the MLC algorithms from the search space ([MLkNN, BR]). Thus, the possible values of the first gene are zero (MLkNN) or one (BR), which are indexes into the array of available MLC algorithms.
The following genes of the individual are the hyperparameters of the MLC algorithms in the multi-label search space. They are allocated in the individual in the same order as they appear in the multi-label search space. Thus, they refer to the hyperparameters (Fig. 2): k and classifier. The remaining genes of the individual are the hyperparameters of the SLC algorithms in the single-label search space. They are also allocated in the same order as in the single-label search space. Therefore, the other positions refer to the hyperparameters (Fig. 2): c, gamma, and n_neighbors.
The search space contains the possible values that a hyperparameter can assume. Since a hyperparameter can take on a finite number of possible values, the hyperparameter space is discrete and was represented using arrays in the search space. Thus, the value of a gene in the individual indicates one of the positions (indices) of the corresponding hyperparameter array. For example, the gamma hyperparameter in Fig. 2 has a set of possible values (\([2e-5, 2e-3, 2e-1, 2e1]\)). The value of the gene referring to this hyperparameter in the individual is 3, which means \(gamma = 2e1\). Following the described logic, individuals are randomly initialized based on the search space.
Figure 2 shows the decoding of the individual. The individual with the first gene equal to 0 has MLkNN as a MLC algorithm. This algorithm has hyperparameter k with hyperparameter space: [1, 3, 5, 7, 9]. The individual’s gene referring to this hyperparameter contains a value of 2. Therefore, we have the MLkNN algorithm with k equal to 5. The hatched genes in Fig. 2 have values, but they are not used in the MLkNN algorithm.
On the other hand, when the individual’s first gene is 1, the MLC algorithm is BR. This algorithm has the hyperparameter classifier with hyperparameter space: [SVM, KNN]. The individual’s gene referring to this hyperparameter contains a value of 0. Therefore, we have the BR algorithm with the classifier equal to SVM. We must also decode the SVM base classifier (\(c = 2e-1\) and \(gamma = 2e1\)).
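The decoding walkthrough above can be sketched against the reduced search space of Fig. 2. The value arrays mirror the examples given in the text, except the value set of `c`, which is not listed and is assumed here for illustration.

```python
# Sketch of decoding an individual (array of gene indices) against the
# reduced search space of Fig. 2. The `c` value set is an assumption.
MLC_ALGOS = ["MLkNN", "BR"]
MLC_HYPERS = {
    "MLkNN": [("k", [1, 3, 5, 7, 9])],
    "BR":    [("classifier", ["SVM", "KNN"])],
}
SLC_ALGOS = ["SVM", "KNN"]
SLC_HYPERS = {
    "SVM": [("c", [2e-5, 2e-3, 2e-1, 2e1]),       # `c` values assumed
            ("gamma", [2e-5, 2e-3, 2e-1, 2e1])],
    "KNN": [("n_neighbors", [1, 3, 5, 7, 9])],
}

def decode(genes):
    algo = MLC_ALGOS[genes[0]]
    # Fixed gene layout: all MLC hyperparameters, then all SLC ones.
    layout = ([h for a in MLC_ALGOS for h in MLC_HYPERS[a]]
              + [h for a in SLC_ALGOS for h in SLC_HYPERS[a]])
    values = {name: space[g] for (name, space), g in zip(layout, genes[1:])}
    # Keep only the genes the chosen algorithm uses ("hatched" genes ignored).
    config = {n: values[n] for n, _ in MLC_HYPERS[algo]}
    if algo == "BR":                      # PT algorithm: decode base classifier
        base = values["classifier"]
        config["classifier"] = (base, {n: values[n] for n, _ in SLC_HYPERS[base]})
    return algo, config
```

Decoding the individual [0, 2, ...] yields MLkNN with k = 5, and [1, 2, 0, 2, 3, 0] yields BR with an SVM base classifier (c = 2e-1, gamma = 2e1), matching the walkthrough in the text.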
AutoMMLC\(_{{\textbf {NSGA}}}\) Implementation. \({\textrm{AutoMMLC}_\textrm{NSGA}}\) is the AutoMMLC method with the NSGA-II algorithm. In the implementation, we used the NSGA-II algorithm from Pymoo, a multi-objective optimization framework in Python [1]. For this, we defined the search space and developed classes that inherit from the Pymoo framework classes:
-
Sampling: we specified how to create an initial population. Our algorithm selects the initial population randomly, selecting a valid value for each gene in the search space.
-
Problem: in this class, we initialized the number of objectives and defined how to evaluate a candidate solution. This implementation involved decoding the individual into an MLC algorithm, assessing the corresponding classification model, measuring the training time, and calculating the f-score of the model. This process was similar to the one defined for MORS and was implemented in parallel.
-
Mutation: we developed the mutation operator, which verifies whether individuals may or may not mutate with a 5% probability. The mutation alters one of the individual’s genes with a valid search space value.
For the crossover, we employed the uniform crossover available in the Pymoo framework, where each child's gene comes from one of the parents with a probability of 50%. The individuals resulting from crossover and mutation are valid, as the search space guarantees the consistency of the generated individuals.
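The two variation operators reduce to a few lines when written against the integer-gene representation. This is a framework-free sketch of the behavior described above, not the Pymoo subclasses themselves; names are ours.

```python
import random

# Uniform crossover: each child gene comes from either parent with
# probability 50%, as in the Pymoo operator used by AutoMMLC_NSGA.
def uniform_crossover(p1, p2, rng):
    return [rng.choice((g1, g2)) for g1, g2 in zip(p1, p2)]

# Mutation: with probability `rate` (5% in the paper), reset one random
# gene to a valid index of its hyperparameter array.
def mutate(ind, gene_spaces, rng, rate=0.05):
    """gene_spaces[i] = number of valid values for gene i."""
    child = list(ind)
    if rng.random() < rate:
        i = rng.randrange(len(child))
        child[i] = rng.randrange(gene_spaces[i])
    return child
```

Because every gene is an index into a valid value array, both operators always yield consistent individuals, which is exactly why the paper needs no repair step.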
3.7 Baseline Algorithms
The multi-label search space algorithms were used as baseline algorithms for comparison. We adopted the SVM algorithm as the base classifier of the MLC algorithms of the Problem Transformation approach, as SVM is traditionally used as a base classifier [9]. For the penalty hyperparameter of the MLTSVM algorithm, we used the value \(2e-3\) (one of the values of our search space). For the other hyperparameters, we adopted the defaults of the implementations available in the Scikit-learn [10] and Scikit-multilearn [16] libraries.
3.8 Settings for Running
We executed \({\textrm{AutoMMLC}_\textrm{NSGA}}\) with a population of size 100 and 100 generations, totaling 10,000 candidate solutions evaluated. The mutation rate was set to \(5\%\). We adopted early stopping, so if there was no change in the average of the objectives (training time and f-score) for ten consecutive generations, \({\textrm{AutoMMLC}_\textrm{NSGA}}\) ended. The budget size for \({\textrm{AutoMMLC}_\textrm{MORS}}\) was the same, i.e., it evaluated 10,000 random candidate solutions.
\({\textrm{AutoMMLC}_\textrm{NSGA}}\) and \({\textrm{AutoMMLC}_\textrm{MORS}}\) were implemented in parallel with a pool of 10 threads. The runtime of the MLC algorithms was limited to 20 minutes. Limiting the runtime of these algorithms is expected and commonly adopted in the AutoML context [12,13,14, 18, 19]. Given the complexity of some multi-label algorithms and the dimensionality of some datasets, our goal with the runtime limit was to ensure AutoML results promptly. If an algorithm's runtime exceeded this pre-established limit, the objectives received their maximum values: 1 for 1 - f-score and 20 minutes for training time. These values were also the reference point for the hypervolume calculation. The runtime of the baseline algorithms was also limited to 20 minutes.
In addition, we stored all the evaluated candidate solutions and their corresponding objective values for both \({\textrm{AutoMMLC}_\textrm{NSGA}}\) and \({\textrm{AutoMMLC}_\textrm{MORS}}\). Thus, before evaluating a new candidate solution, we checked whether it was among the solutions already evaluated. If so, we reused the objectives already calculated, avoiding training the same classifier (algorithm and hyperparameters) more than once. With this mechanism, we aimed to decrease the total runtime required by AutoMMLC.
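This reuse mechanism is a memoization cache keyed by the candidate's gene array. A minimal sketch (our naming, not the paper's code):

```python
# Sketch of the evaluation cache described above: before training, look up
# the candidate (as a hashable tuple of genes) and reuse stored objectives.
def make_cached_evaluator(evaluate):
    cache = {}

    def cached(candidate):
        key = tuple(candidate)          # gene array -> hashable key
        if key not in cache:
            cache[key] = evaluate(candidate)   # train + score only once
        return cache[key]

    return cached
```

Since identical individuals recur often under random sampling and elitist selection, each repeated candidate costs a dictionary lookup instead of a full training run.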
3.9 Statistical Analysis
To compare the \({\textrm{AutoMMLC}_\textrm{NSGA}}\) and \({\textrm{AutoMMLC}_\textrm{MORS}}\) algorithms, we considered the hypervolume obtained from the Pareto frontiers of each fold of the datasets. We used the Wilcoxon non-parametric test [4] to evaluate the hypervolumes statistically with a significance level of 5%. Our null hypothesis was that the AutoMMLC versions performed equally, producing equivalent hypervolumes.
We selected Pareto frontier algorithms using the Frugality Score to compare \({\textrm{AutoMMLC}_\textrm{NSGA}}\) and \({\textrm{AutoMMLC}_\textrm{MORS}}\) with the baseline algorithms. The Frugality Score is a measure for evaluating algorithms that combines a performance measure with a resource measure [5]. In this work, we combined the f-score with the training time, penalizing the f-score by the training time, as shown in Eq. 3.
The result of \({\textrm{AutoMMLC}_\textrm{NSGA}}\) and \({\textrm{AutoMMLC}_\textrm{MORS}}\) on each fold/dataset is a Pareto frontier. We calculated the Frugality Score for the MLC algorithms on the Pareto frontier and selected the one with the highest score. Thus, we have one MLC algorithm representing the results of AutoMMLC in each fold/dataset. We used two Friedman tests [4] with a significance level of 5% to compare the MLC algorithms representing \({\textrm{AutoMMLC}_\textrm{NSGA}}\) and \({\textrm{AutoMMLC}_\textrm{MORS}}\) with the baseline algorithms regarding f-score and training time. Our null hypotheses were that the MLC algorithms had the same f-scores and that the MLC algorithms had the same training times. If a null hypothesis was rejected, the Nemenyi post-hoc test was applied, where the performances of two MLC algorithms are significantly different if the corresponding mean ranks differ by at least one critical difference value.
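The frontier-selection step reduces to an argmax over a frugality-style score. Since Eq. 3 is not reproduced here, the concrete penalty below (f-score minus a weighted, bounded time term) is an illustrative assumption of the general "performance penalized by resource" shape, not the paper's exact formula.

```python
# Sketch of picking one solution from the Pareto frontier by a
# frugality-style score. The penalty form below is an ASSUMPTION:
#   frugality = fscore - w * t / (1 + t)
# i.e., performance minus a weight-scaled, bounded training-time penalty.
def select_frugal(front, w=0.1):
    """front: list of (model, fscore, train_time); returns the best entry."""
    def frugality(item):
        _, fscore, t = item
        return fscore - w * t / (1.0 + t)
    return max(front, key=frugality)
```

Under such a score, a slightly less accurate but much faster classifier can win over a marginally better but expensive one, which is the intended trade-off.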
4 Results and Discussion
4.1 AutoMMLC\(_\textrm{NSGA}\) and AutoMMLC\(_\textrm{MORS}\)
\({\textrm{AutoMMLC}_\textrm{NSGA}}\) and \({\textrm{AutoMMLC}_\textrm{MORS}}\) were run as described in the experimental setup (Sect. 3.8). Figure 3 presents the average hypervolume (over the k folds) for the ten datasets obtained while evaluating candidate solutions. The hypervolume calculation occurred every 100 evaluations for \({\textrm{AutoMMLC}_\textrm{MORS}}\) and at the end of each generation for \({\textrm{AutoMMLC}_\textrm{NSGA}}\). \({\textrm{AutoMMLC}_\textrm{NSGA}}\) converges before 10,000 evaluations (i.e., before the 100th generation), and its execution ends. Convergence means that, after a certain generation, there were no changes in the values of the objectives and, consequently, in the value of the hypervolume. \({\textrm{AutoMMLC}_\textrm{MORS}}\) shows the same tendency: the hypervolume stops increasing after a certain number of evaluations. However, this method always evaluated the 10,000 candidate solutions.
Since \({\textrm{AutoMMLC}_\textrm{MORS}}\) evaluates more candidate solutions than \({\textrm{AutoMMLC}_\textrm{NSGA}}\), its runtime is higher. Both AutoMMLC methods consult candidate solutions that have already been evaluated, which reduces the execution time of the methods. Furthermore, the methods stop training the MLC algorithms when the training time exceeds 20 minutes. This occurs more frequently in \({\textrm{AutoMMLC}_\textrm{MORS}}\) and increases its runtime. The evolutionary process of \({\textrm{AutoMMLC}_\textrm{NSGA}}\) contributes to the lower number of interrupted trainings, since candidate solutions subject to interruption cease to be part of the population in the first generations.
The results of \({\textrm{AutoMMLC}_\textrm{NSGA}}\) and \({\textrm{AutoMMLC}_\textrm{MORS}}\) are the outputs of the multi-objective algorithms, i.e., the Pareto frontiers. Thus, we could evaluate them by calculating the hypervolume. Figure 4 shows the boxplot of the hypervolumes calculated for the 10 folds of each dataset, considering the results of AutoMMLC with the NSGA-II and MORS algorithms. As we wanted to minimize the objective values, the larger the hypervolume, the better the obtained Pareto frontier. Analyzing Fig. 4, we cannot tell which multi-objective algorithm obtained the best results, so we applied the Wilcoxon test. Our null hypothesis was that the hypervolumes obtained by the AutoMMLC versions were the same. The null hypothesis was accepted for all datasets; that is, \({\textrm{AutoMMLC}_\textrm{NSGA}}\) and \({\textrm{AutoMMLC}_\textrm{MORS}}\) had statistically equivalent results for the hypervolume.
The Pareto frontiers resulting from \({\textrm{AutoMMLC}_\textrm{NSGA}}\) and \({\textrm{AutoMMLC}_\textrm{MORS}}\) comprise sets of MLC algorithms drawn from the ten MLC algorithms in the search space (Sect. 3.3). Figure 5 shows heat maps with the occurrences of each MLC algorithm in the \({\textrm{AutoMMLC}_\textrm{NSGA}}\) and \({\textrm{AutoMMLC}_\textrm{MORS}}\) solutions after running the algorithms on the ten datasets. There is a scale difference between the graphs of Fig. 5. This difference occurred because \({\textrm{AutoMMLC}_\textrm{MORS}}\) randomly selected from the search space many repeated MLC algorithms (as well as their hyperparameters), which ended up being part of the Pareto frontier. Among the most repeated MLC algorithms is BRkNNa, followed by BRkNNb.
Apart from the replications produced by \({\textrm{AutoMMLC}_\textrm{MORS}}\), both methods produced frontiers with different MLC algorithms. In the medical dataset, for example, the Pareto frontiers resulting from \({\textrm{AutoMMLC}_\textrm{MORS}}\) (over the 10 folds) contained the MLC algorithms BRkNNa, BRkNNb, BR, CC, RakelD, and Decision Tree. These MLC algorithms also composed the Pareto frontiers resulting from \({\textrm{AutoMMLC}_\textrm{NSGA}}\) for the medical dataset. Neither version of AutoMMLC ever included the MLTSVM and MLkNN algorithms in its solutions.
4.2 Comparison of AutoMMLC with Baselines Algorithms
The Friedman test compared the MLC algorithms selected by the Frugality Score from the \({\textrm{AutoMMLC}_\textrm{NSGA}}\) and \({\textrm{AutoMMLC}_\textrm{MORS}}\) solutions with the baseline algorithms regarding f-score and training time. The null hypothesis was that the MLC algorithms presented equal performances. The null hypothesis was rejected for both the f-score and the training time tests.
Figure 6 presents the critical difference diagram of the Nemenyi test for the f-score. The analysis indicates no statistical difference between the two best MLC algorithms: those from \({\textrm{AutoMMLC}_\textrm{NSGA}}\) and \({\textrm{AutoMMLC}_\textrm{MORS}}\). Figure 6 also shows the critical difference diagram of the Nemenyi test for the training time. In this test, the MLC algorithms with the best training time are BRkNNb and BRkNNa, followed by the algorithms from \({\textrm{AutoMMLC}_\textrm{NSGA}}\) and \({\textrm{AutoMMLC}_\textrm{MORS}}\).
The MLC algorithms representing \({\textrm{AutoMMLC}_\textrm{NSGA}}\) and \({\textrm{AutoMMLC}_\textrm{MORS}}\) show no statistical differences in f-score or training time between themselves. However, these algorithms are superior to the baseline algorithms regarding f-score and superior to the BR, CC, DT, MLARAM, MLkNN, MLTSVM, RakelD, and Random Forest algorithms regarding training time.
5 Conclusions
This work presented AutoMMLC, a new multi-objective AutoML method for MLC, which seeks solutions that maximize the f-score and minimize the training time of the classifiers. AutoMMLC was developed using the NSGA-II and MORS optimization algorithms. We ran both versions of AutoMMLC with ten datasets, the same search space, and the same running settings. The Wilcoxon test indicated that \({\textrm{AutoMMLC}_\textrm{NSGA}}\) and \({\textrm{AutoMMLC}_\textrm{MORS}}\) had statistically equivalent results concerning hypervolume, but \({\textrm{AutoMMLC}_\textrm{NSGA}}\) had a better runtime than \({\textrm{AutoMMLC}_\textrm{MORS}}\). Regarding the baseline algorithms, the AutoMMLC versions were statistically better in f-score and had lower training times than most baseline algorithms.
The results presented are still preliminary for a recent research topic that requires further exploration. Thus, in future work, we intend to expand the analysis of data from AutoMMLC runs and the resulting Pareto frontiers. In addition, we intend to study other ways to select Pareto frontier solutions to compare with other solutions already available in the literature. We can expand the comparisons by considering other AutoML solutions as baselines, whether based on mono-objective or multi-objective optimization. We can also produce new results by improving the search space. For this, we need to improve our representation of the search space to allow algorithms with deeper hierarchical levels. We can also add more algorithms to the search space, such as those in the Multi-label Extension to Weka (MEKA) library [11]. Finally, more objectives could be used, with different weights, in the multi-objective optimization.
References
Blank, J., Deb, K.: Pymoo: multi-objective optimization in python. IEEE Access 8, 89497–89509 (2020)
Deb, K., Pratap, A., Agarwal, S., Meyarivan, T.: A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans. Evol. Comput. 6(2), 182–197 (2002)
Deb, K.: Multi-Objective Optimization, pp. 403–449. Springer US, Boston, MA (2014)
Demšar, J.: Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7, 1–30 (2006)
Evchenko, M.M.: Frugal learning: applying machine learning with minimal resources (2016)
Fonseca, C., Paquete, L., Lopez-Ibanez, M.: An improved dimension-sweep algorithm for the hypervolume indicator. In: 2006 IEEE International Conference on Evolutionary Computation, pp. 1157–1163 (2006)
He, X., Zhao, K., Chu, X.: AutoML: a survey of the state-of-the-art. Knowl.-Based Syst. 212, 106622 (2021)
Karl, F., et al.: Multi-objective hyperparameter optimization - an overview (2022)
Madjarov, G., Kocev, D., Gjorgjevikj, D., Džeroski, S.: An extensive experimental comparison of methods for multi-label learning. Pattern Recogn. 45(9), 3084–3104 (2012). Best Papers of Iberian Conference on Pattern Recognition and Image Analysis (IbPRIA 2011)
Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Read, J., Reutemann, P., Pfahringer, B., Holmes, G.: MEKA: a multi-label/multi-target extension to weka. J. Mach. Learn. Res. 17(21), 1–5 (2016)
de Sá, A.G.C., Freitas, A.A., Pappa, G.L.: Automated selection and configuration of multi-label classification algorithms with grammar-based genetic programming. In: Auger, A., Fonseca, C.M., Lourenço, N., Machado, P., Paquete, L., Whitley, D. (eds.) Parallel Problem Solving from Nature - PPSN XV, pp. 308–320. Springer International Publishing, Cham (2018)
de Sá, A.G.C., Pappa, G.L., Freitas, A.A.: Towards a method for automatically selecting and configuring multi-label classification algorithms. In: Proceedings of the Genetic and Evolutionary Computation Conference Companion, pp. 1125–1132. GECCO 2017, Association for Computing Machinery, New York, NY, USA (2017)
de Sá, A.G.C., Pimenta, C.G., Pappa, G.L., Freitas, A.A.: A robust experimental evaluation of automated multi-label classification methods. In: Proceedings of the 2020 Genetic and Evolutionary Computation Conference, pp. 175–183. GECCO 2020, Association for Computing Machinery, New York, NY, USA (2020)
Sechidis, K., Tsoumakas, G., Vlahavas, I.: On the stratification of multi-label data. In: Gunopulos, D., Hofmann, T., Malerba, D., Vazirgiannis, M. (eds.) Machine Learning and Knowledge Discovery in Databases, pp. 145–158. Springer, Berlin, Heidelberg (2011)
Szymanski, P., Kajdanowicz, T.: Scikit-multilearn: a scikit-based python environment for performing multi-label classification. J. Mach. Learn. Res. 20(1), 209–230 (2019)
Tsoumakas, G., Katakis, I., Vlahavas, I.: Mining Multi-Label Data, pp. 667–685. Springer, US, Boston, MA (2010)
Wever, M., Tornede, A., Mohr, F., Hüllermeier, E.: AutoML for multi-label classification: overview and empirical evaluation. IEEE Trans. Pattern Anal. Mach. Intell. 43(09), 3037–3054 (2021)
Wever, M.D., Mohr, F., Tornede, A., Hüllermeier, E.: Automating multi-label classification extending ML-Plan. In: 6th ICML Workshop on Automated Machine Learning, Long Beach, CA, USA (2019)
Zöller, M.A., Huber, M.F.: Benchmark and survey of automated machine learning frameworks. J. Artif. Int. Res. 70, 409–472 (2021)
Acknowledgments
The authors would like to thank the Brazilian research agencies FAPESP, CAPES and CNPq for financial support.
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Del Valle, A.M., Mantovani, R.G., Cerri, R. (2023). AutoMMLC: An Automated and Multi-objective Method for Multi-label Classification. In: Naldi, M.C., Bianchi, R.A.C. (eds) Intelligent Systems. BRACIS 2023. Lecture Notes in Computer Science(), vol 14196. Springer, Cham. https://doi.org/10.1007/978-3-031-45389-2_20
Print ISBN: 978-3-031-45388-5
Online ISBN: 978-3-031-45389-2