An evolutionary decomposition-based multi-objective feature selection for multi-label classification

Azam Asilian Bidgoli1, Hossein Ebrahimpour-Komleh1 and Shahryar Rahnamayan2
1 Department of Electrical and Computer Engineering, University of Kashan, Kashan, Iran
2 Nature Inspired Computational Intelligence (NICI) Lab, Department of Electrical, Computer, and Software Engineering, Ontario Tech University, Oshawa, ON, Canada

ABSTRACT
Data classification is a fundamental task in data mining. Within this field, the classification of multi-labeled data has received serious attention in recent years. In such problems, each data entity can simultaneously belong to several categories. Multi-label classification is important because of many recent real-world applications in which each entity has more than one label. To improve the performance of multi-label classification, feature selection plays an important role. It involves identifying and removing irrelevant and redundant features that unnecessarily increase the dimensionality of the search space for the classification problem. However, classification may fail with an extreme decrease in the number of relevant features. Thus, minimizing the number of features and maximizing the classification accuracy are two desirable but conflicting objectives in multi-label feature selection. In this article, we introduce a multi-objective optimization algorithm customized for selecting the features of multi-label data. The proposed algorithm is an enhanced variant of a decomposition-based multi-objective optimization approach, in which the multi-label feature selection problem is divided into single-objective subproblems that can be solved simultaneously using an evolutionary algorithm. This approach accelerates the optimization process and finds more diverse feature subsets. The proposed method benefits from a local search operator to find better solutions for each subproblem. We also define a pool of genetic operators to generate new feature subsets from the previous generation. To evaluate the performance of the proposed algorithm, we compare it with two other multi-objective feature selection approaches on eight real-world benchmark datasets that are commonly used for multi-label classification. The reported values of multi-objective evaluation measures, such as the hypervolume indicator and set coverage, illustrate an improvement in the results obtained by the proposed method. Moreover, the proposed method achieved better results in terms of classification accuracy with fewer features compared with state-of-the-art methods.

Subjects Artificial Intelligence, Data Mining and Machine Learning, Optimization Theory and Computation
Keywords Feature selection, Multi-label classification, Multi-objective optimization, Decomposition-based algorithm, Evolutionary algorithm

How to cite this article Asilian Bidgoli A, Ebrahimpour-Komleh H, Rahnamayan S. 2020. An evolutionary decomposition-based multi-objective feature selection for multi-label classification. PeerJ Comput. Sci. 6:e261 DOI 10.7717/peerj-cs.261

Submitted 7 November 2019; Accepted 22 January 2020; Published 2 March 2020
Corresponding author: Azam Asilian Bidgoli, asilian@grad.kashanu.ac.ir
Academic editor: Gang Mei
DOI 10.7717/peerj-cs.261
Copyright 2020 Asilian Bidgoli et al. Distributed under the Creative Commons CC-BY 4.0 license.
INTRODUCTION
In traditional classification approaches, each sample in a dataset belongs to one class. However, in recent years, to adapt to real-world problems, researchers have studied multi-label learning (Zhang & Zhou, 2014). In such problems, each sample in a dataset can simultaneously belong to several classes; therefore, a set of labels is defined for each data entity. Because this is supervised learning, the objective of the classification is to build a model from the training data that predicts the labels of unseen data. In real-world applications, it is uncommon for each entity to have exactly one label; for this reason, multi-label learning is an important direction for research. In multi-label text classification, each text sample can simultaneously belong to different classes (such as "politics" and "sports") (Ueda & Saito, 2003). Another example is digital image classification: an image may contain a mountain, a lake, and a tree; hence, the image is included in each of these classes (Boutell et al., 2004). In the functional classification of genes, every gene is also a member of different functional classes (such as "metabolism" and "protein synthesis") (Li, Miao & Pedrycz, 2017).

The accuracy of a classification task strongly depends on the selected features, which provide the most relevant knowledge about the data to construct a reliable model. Feature selection is a data mining preprocessing task that removes irrelevant and redundant features. It reduces the computational complexity of the learning process and improves the classifier's performance (Zhang, Gong & Rong, 2016). In multi-label datasets, each sample is related to more than one label, and the corresponding labels are not necessarily independent of each other; hence, feature selection in such a dataset is more complicated than in single-label classification (Zhang et al., 2017). Several researchers have reported that classification performance can be improved using a proper feature selection strategy for multi-label data (Madjarov et al., 2012; Lee & Kim, 2015; Dembczynski et al., 2012).

The feature selection methods for both multi-label and single-label datasets can be divided into three categories: wrapper, filter, and embedded methods (Pereira et al., 2016). The wrapper methods select the features based on the resulting classification performance; hence, the learning task is a part of the feature selection process. Wrapper methods have also been used for multi-label feature selection (Dendamrongvit, Vateekul & Kubat, 2011; Wandekokem, Varejão & Rauber, 2010). In filter methods, the best set of features is selected using the statistical characteristics of the data (e.g., the correlation among features and classes). Many filter-based feature selection methods have been proposed for multi-label data (SpolaôR et al., 2013a, 2013b; Reyes, Morell & Ventura, 2015; Lin et al., 2016; Li, Miao & Pedrycz, 2017). The embedded methods select the best subset of features as an integrated part of the learning algorithm. One of the well-known embedded methods is the decision tree algorithm (Safavian & Landgrebe, 1991).
This classifier constructs a tree-structured model that selects the best feature at each node in terms of a discriminancy criterion. To obtain the best subset of features out of d features, we need to evaluate 2^d possible subsets. Consequently, selecting the best subset out of all possible subsets is extremely time-consuming; therefore, it is not practical to employ a brute-force approach. In fact, feature selection is an NP-hard problem (Chandrashekar & Sahin, 2014; Blum & Langley, 1997). Therefore, the use of meta-heuristic search strategies, such as evolutionary algorithms, can be beneficial in this regard (Ibrahim et al., 2019b; El Aziz & Hassanien, 2018; Elaziz et al., 2017; Mousavirad & Ebrahimpour-Komleh, 2014). Evolutionary algorithms have attracted significant attention because they are more robust in avoiding local optima compared with traditional optimization methods (Ibrahim et al., 2019a; Elaziz et al., 2020). Various evolutionary algorithms have been used for multi-label feature selection (Zhang, Pena & Robles, 2009; Lee & Kim, 2015; Reyes, Morell & Ventura, 2014; Shao et al., 2013).

Some studies in feature selection have considered only classification accuracy in their optimization algorithm, whereas several other objectives can be simultaneously optimized using multi-objective optimization algorithms. Although feature selection can enhance the accuracy of the classification task and decrease the computational complexity, an extreme reduction of relevant features will degrade the accuracy. On the other hand, increasing the number of appropriate features provides more relevant knowledge about the data to construct an accurate model. At the same time, a massive number of features increases the computational complexity of a classification task because of the complexity of its search space. Therefore, multi-objective feature selection pursues two conflicting objectives, that is, to minimize the number of features while maintaining an acceptable classification accuracy (Dembczynski et al., 2012; Xue, Zhang & Browne, 2013).

To the best of our knowledge, only a few articles have used multi-objective optimization methods for feature selection of multi-label data. Yin, Tao & Xu (2015) attempted to find the best subset of features by using the nondominated sorting genetic algorithm II (NSGA-II) (Deb et al., 2000). In another study, feature selection for multi-label datasets was performed using a differential evolution algorithm (Zhang, Gong & Rong, 2016). Zhang et al. (2017) presented a particle swarm optimization (PSO)-based multi-objective optimization algorithm and achieved a better accuracy compared with the previous methods. Lee, Seo & Kim (2018) proposed an evolutionary multi-label feature selection method that uses dependencies between the features and labels to select more relevant features. Their method selects features that are highly correlated with the labels and that have not yet been selected by the genetic operators during the optimization process. In another study, the most salient features were selected by mapping the features to a multi-dimensional space based on the correlation between the features and each label (Kashef & Nezamabadi-pour, 2019). However, the authors have only used the Pareto-dominance concept inspired by multi-objective optimization.
In other words, they do not search the feature space using a multi-objective optimization algorithm. Evolutionary multi-objective optimization algorithms can be divided into three categories: dominance-based, decomposition-based, and indicator-based methods (Trivedi et al., 2017). The dominance-based methods attempt to find the solutions that optimize the objective functions by using a concept called dominance, which will be defined in the next section. All the above-mentioned studies on multi-label feature selection belong to this category of multi-objective optimization algorithms. The indicator-based methods evaluate the fitness of each solution by assessing an indicator (such as hypervolume) to improve the convergence and diversity criteria simultaneously. In contrast, the decomposition-based methods decompose the whole search space into smaller subproblems and solve all of them simultaneously. Consequently, the convergence rate of the algorithm improves significantly, and the diversity of the obtained solutions is enhanced (Zhang & Li, 2007). An advantage of decomposition-based methods is their potential scalability to optimization problems with many objectives (Zhang & Li, 2007).

Research on feature selection for multi-label data has started only recently; therefore, few studies have been conducted in this area, especially for multi-objective problems. The most important aim of this paper is to address this problem from the multi-objective optimization perspective. In this article, we propose a decomposition-based method for multi-label feature selection. The objective functions used in this paper are the Hamming loss and the number of features. The main contributions of the paper can be summarized as follows: (1) we address the problem of multi-objective feature selection by solving several single-objective subproblems, that is, for the first time, decomposition-based evolutionary multi-objective optimization is used for multi-label classification; (2) we apply a local search strategy to increase the exploitation power of the proposed method; (3) we propose a hybrid crossover scheme that switches among crossover operators with a predefined probability. Because some of the benchmark datasets have more than 1,000 features, we used decomposition-based algorithms, which are beneficial for large-scale problems. To validate the results, we compared the proposed multi-label feature selection method with state-of-the-art methods. Furthermore, to validate the performance of the proposed algorithm, we conducted an extensive set of experiments on real-world multi-label datasets. The results show a significant improvement compared with the other methods in terms of multi-objective evaluation measures, such as the hypervolume indicator and the set-coverage criterion.

This article is organized as follows. "Background Review" describes related work on multi-label classification, multi-objective optimization, and the existing methods for multi-objective multi-label feature selection. The proposed algorithm is explained in "Proposed Method". The experiments are presented in "Experimental Design". "Results and Discussion" describes and discusses the results. Finally, "Concluding Remarks" concludes the article.

Background review
In the following subsections, we briefly review related concepts.
We start with a brief explanation of multi-label classification to clarify the importance of this research problem. Next, we explain multi-objective optimization and the corresponding challenges. Finally, we examine existing multi-label feature selection methods that are based on multi-objective optimization algorithms.

Multi-label classification
If a dataset X contains d-dimensional samples and Y represents the set of the q possible labels in a multi-label problem, the objective of multi-label classification is to learn a model of the form h : X → 2^Y from m training examples, D = {(X_i, Y_i) | 1 ≤ i ≤ m}. For each multi-label sample (X_i, Y_i), X_i is a d-dimensional feature vector and Y_i is the set of labels associated with X_i. For an unseen sample X, the multi-label classifier predicts h(X) as its set of labels. Multi-label learning algorithms can be divided into two main categories: problem transformation and algorithm adaptation methods (Zhang & Zhou, 2014). In the problem transformation methods, the multi-label classification problem is converted into a single-label problem so that the data can be classified using existing single-label classifiers. The basic idea of the algorithm adaptation methods is to adapt single-label classifiers to deal with multi-label data directly. Multi-label K-nearest neighbor (ML-KNN) (Zhang & Zhou, 2007) is one of the most well-known adaptive methods, and it was used in this study to evaluate feature subsets. In the single-label version of this algorithm, to predict the class label of a query sample, the algorithm calculates the distance between the query sample and all other samples in the dataset, picks the K nearest neighbors (smallest distances), retrieves the labels of these K entries, and returns the mode of the K labels as the class of the query sample. Despite its simplicity, this classifier is commonly used in various applications. In the multi-label version, ML-KNN, as in the single-label version, the sample is labeled with the classes in which the distribution of its neighbors is higher. Accordingly, decision making is performed for every class as follows:

$Y = \{\, y_j \mid P(H_j \mid C_j) / P(\neg H_j \mid C_j) > 1, \; 1 \le j \le q \,\}$  (1)

Sample x belongs to class j if the posterior probability P(H_j | C_j) that x belongs to class j, given that x has exactly C_j neighbors with label y_j, is greater than P(¬H_j | C_j). To obtain the posterior probabilities, Bayes' theorem, as expressed in Eq. (2), is applied. The ratio of the two posterior probabilities determines whether the sample belongs to class j. According to this equation, the posterior probability depends on the prior probabilities (P(H_j) and P(¬H_j)) and the likelihood functions (P(C_j | H_j) and P(C_j | ¬H_j)).

$\frac{P(H_j \mid C_j)}{P(\neg H_j \mid C_j)} = \frac{P(H_j)\, P(C_j \mid H_j)}{P(\neg H_j)\, P(C_j \mid \neg H_j)}$  (2)

To calculate P(H_j), we take the ratio of the samples that have label y_j to the total number of samples. The value of P(C_j | H_j) is calculated using Eq. (3), where k_j(r) is the number of samples in the training set that have label y_j and have exactly r neighbors with label y_j. Based on this definition, k_j(C_j) is the number of samples that belong to class j and have exactly C_j neighbors in this class.

$P(C_j \mid H_j) = \frac{k_j(C_j)}{\sum_{r=0}^{k} k_j(r)}, \quad 1 \le j \le q, \; 0 \le C_j \le k$  (3)

Because of the simplicity and popularity of ML-KNN, we use this classifier to evaluate the quality of the selected features in our proposed method. Moreover, we use the same classifier to compare the different algorithms.
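For illustration only, the following is a minimal Python sketch of the ML-KNN decision rule in Eqs. (1)-(3); it is not the implementation used in this study. The function name mlknn_predict, the brute-force NumPy neighbor search, and the Laplace-smoothing term s are assumptions introduced here (smoothing is common in ML-KNN implementations but is not part of the equations above).

```python
import numpy as np

def mlknn_predict(X_train, Y_train, x_query, k=10, s=1.0):
    """Sketch of the ML-KNN decision rule (Eqs. 1-3): label y_j is assigned
    when P(H_j | C_j) / P(~H_j | C_j) > 1.  s is a Laplace-smoothing term."""
    n, q = Y_train.shape

    # Prior probabilities P(H_j) estimated from label frequencies.
    prior = (s + Y_train.sum(axis=0)) / (2 * s + n)

    # For every training sample, count neighbours (among the other samples)
    # that carry each label; these counts feed the likelihoods P(C_j | H_j).
    dists = np.linalg.norm(X_train[:, None, :] - X_train[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    neigh = np.argsort(dists, axis=1)[:, :k]
    counts = Y_train[neigh].sum(axis=1).astype(int)      # shape (n, q)

    # k_j(r): how many training samples with / without label j have r such neighbours.
    k_pos = np.zeros((q, k + 1))
    k_neg = np.zeros((q, k + 1))
    for i in range(n):
        for j in range(q):
            if Y_train[i, j] == 1:
                k_pos[j, counts[i, j]] += 1
            else:
                k_neg[j, counts[i, j]] += 1

    # Neighbours of the query sample and their per-label counts C_j.
    qd = np.linalg.norm(X_train - x_query, axis=1)
    c = Y_train[np.argsort(qd)[:k]].sum(axis=0).astype(int)

    # Posterior ratio of Eq. (2); predict label j when the ratio exceeds 1 (Eq. 1).
    y_pred = np.zeros(q, dtype=int)
    for j in range(q):
        like_pos = (s + k_pos[j, c[j]]) / (s * (k + 1) + k_pos[j].sum())   # Eq. (3)
        like_neg = (s + k_neg[j, c[j]]) / (s * (k + 1) + k_neg[j].sum())
        ratio = (prior[j] * like_pos) / ((1 - prior[j]) * like_neg)
        y_pred[j] = int(ratio > 1)
    return y_pred
```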
Multi-objective optimization
Most real-world optimization problems involve multiple conflicting objectives (Konak, Coit & Smith, 2006); hence, multi-objective optimization problems have various practical applications. The use of evolutionary algorithms is attractive for solving such problems because, owing to their population-based nature, they provide a set of solutions in every run. In a multi-objective optimization problem, the definition of optimality is not as simple as in single-objective optimization. When the optimal solution of one objective function conflicts with the optimal solution of another objective function, the problem becomes challenging. Therefore, to solve such problems, it is necessary to find a trade-off between the objective functions. The obtained solutions of multi-objective algorithms are called nondominated solutions or Pareto-optimal solutions. Theoretically, if a multi-objective optimization problem is a minimization problem, it is formulated as follows (Mirjalili et al., 2016):

$\min F(x) = [f_1(x), f_2(x), \ldots, f_M(x)]$, subject to $L_i \le x_i \le U_i, \; i = 1, 2, \ldots, d$  (4)

subject to the following equality and inequality constraints:

$g_j(x) \le 0, \; j = 1, 2, \ldots, J; \qquad h_k(x) = 0, \; k = 1, 2, \ldots, K$  (5)

where M is the number of objectives and d is the number of decision variables (the dimension) of solution x, so that x_i should lie in the interval [L_i, U_i] (i.e., a box constraint). Finally, f_i is an objective function that should be minimized. To compare two candidate solutions in multi-objective problems, we can use the concept of Pareto dominance. Mathematically, Pareto dominance is defined as follows. If x = (x_1, x_2, ..., x_d) and x̄ = (x̄_1, x̄_2, ..., x̄_d) are two vectors in the search space, x dominates x̄ (denoted x ≺ x̄) if and only if

$\forall i \in \{1, 2, \ldots, M\}: f_i(x) \le f_i(\bar{x}) \;\wedge\; \exists j \in \{1, 2, \ldots, M\}: f_j(x) < f_j(\bar{x})$  (6)

This means that solution x dominates solution x̄ (is better) if and only if x is not worse than x̄ in any of the objective functions and is strictly better than x̄ in at least one of them. If solution x is better than x̄ in all objectives, we speak of strong dominance; if they are equal in at least one objective, weak dominance occurs. All nondominated solutions construct a Pareto front.

Crowding distance (Deb et al., 2000) is another measure used to compute the distribution of candidate solutions in the objective space. It is calculated as the sum of the distances between the neighbors of each solution, as given in Eq. (7):

$CD_i = \sum_{j=1}^{M} \left| f_j(i+1) - f_j(i-1) \right|$  (7)

where f_j(i − 1) and f_j(i + 1) indicate the jth objective value of the previous and next neighbors of solution i. A larger distance indicates a less crowded region; hence, selecting solutions from such a region creates a better distribution. In fact, the crowding distance represents the distribution of the members surrounding each solution.
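As a small illustration of Eq. (6) and Eq. (7), the following Python sketch implements Pareto dominance and the crowding distance for a minimization problem; the function names and the NumPy representation of the objective vectors are assumptions made only for this example.

```python
import numpy as np

def dominates(f_x, f_y):
    """Pareto dominance for minimization (Eq. 6): x dominates y if it is no worse
    in every objective and strictly better in at least one."""
    f_x, f_y = np.asarray(f_x), np.asarray(f_y)
    return bool(np.all(f_x <= f_y) and np.any(f_x < f_y))

def crowding_distance(front):
    """Crowding distance of Eq. (7): for each solution, sum over the objectives of
    the absolute difference between its two neighbours when sorted by that objective.
    `front` is an (n, M) array of objective values; extreme points get infinity."""
    front = np.asarray(front, dtype=float)
    n, m = front.shape
    cd = np.zeros(n)
    for j in range(m):
        order = np.argsort(front[:, j])
        cd[order[0]] = cd[order[-1]] = np.inf          # keep boundary solutions
        for pos in range(1, n - 1):
            i = order[pos]
            cd[i] += abs(front[order[pos + 1], j] - front[order[pos - 1], j])
    return cd

# Example: the first point dominates the second one.
print(dominates([0.1, 3.0], [0.2, 5.0]))                     # True
print(crowding_distance([[0.1, 3.0], [0.2, 2.0], [0.4, 1.0]]))
```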
Decomposition-based methods are a category of multi-objective optimization algorithms that decompose the approximation of the Pareto front into a number of single-objective optimization subproblems. All subproblems can be solved simultaneously using classical optimization or evolutionary methods. A strategy is required to convert the multi-objective optimization problem into single-objective subproblems. During optimization, the trade-off relation between the objectives is handled mainly by exchanging information among neighboring subproblems. The neighborhood concept is defined by producing a set of weight vectors in the objective space. If the subproblems are solved by an evolutionary algorithm, neighbors communicate with each other to reproduce new candidate solutions and update the existing individuals in the population. The steps of a decomposition-based method are explained in the "Proposed Method" section.

Tchebycheff method
As stated before, multi-objective optimization problems can be solved by different methods. Traditional multi-objective optimization methods seek to convert the multi-objective problem into a single-objective problem. One of these methods is the Tchebycheff method (Jaszkiewicz, 2002), which was used in this study to solve the multi-objective subproblems. The Tchebycheff method looks for the optimal solutions that have the minimum distance from a reference point. The single-objective optimization problem is defined in Eq. (8):

$\text{minimize} \;\; g^{te}(x \mid \lambda, z^*) = \max_{1 \le i \le m} \{ \lambda_i \, | f_i(x) - z_i^* | \}, \quad \text{subject to } x \in S$  (8)

where z* = (z*_1, ..., z*_m)^T is a reference point used to evaluate the quality of the obtained solutions, m is the number of objective functions, and S is the search space. According to this equation, the distances between the objective values of each solution x and the reference point z* are calculated, and the single-objective optimization problem is to minimize the maximum of these weighted distances. A weight vector λ = (λ_1, λ_2, ..., λ_m) with $\sum_{i=1}^{m} \lambda_i = 1$ is defined for each subproblem, so that weight λ_i is assigned to objective function f_i. To obtain each optimal solution of the minimization problem defined in Eq. (8), we need an appropriate weight vector, and the obtained optimal solution is one of the Pareto-optimal solutions. As a result, the traditional methods are time-consuming because the weights must be changed continuously to obtain the set of best solutions. Therefore, in the decomposition-based evolutionary methods we consider a set of distributed weight vectors, one for each subproblem.

Reference point selection is another issue that should be considered in the Tchebycheff method. For a minimization problem, the minimum value obtained for each objective function can serve as the reference point:

$z_i^* = \min \{ f_i(x) \mid x \in S \}$  (9)

Therefore, the value of the reference point is also updated after each iteration. Figure 1 shows the Tchebycheff method for obtaining an optimal solution on the Pareto front. In this example, the reference point is placed at the origin of the coordinates, where the values of both objective functions are minimal. A sample weight vector (λ_1, λ_2) is shown, and the solutions from each iteration are shown in blue. The solutions converge toward the reference point in the direction of the weight vector until the optimal point on the Pareto front (in red) is reached. At each iteration, the previous solution is replaced with a new solution if the new one outperforms the previous one.

[Figure 1: Illustration of the Tchebycheff method.]
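The Tchebycheff scalarization of Eq. (8) and the reference-point update of Eq. (9) can be sketched as follows; this is an illustrative Python fragment, not the authors' code, and the function names and example values are assumptions.

```python
import numpy as np

def tchebycheff(f_x, weight, z_star):
    """Tchebycheff scalarization of Eq. (8): the weighted maximum distance of the
    objective vector f(x) from the reference point z*."""
    return np.max(np.asarray(weight) * np.abs(np.asarray(f_x) - np.asarray(z_star)))

def update_reference_point(z_star, f_x):
    """Eq. (9): the reference point keeps the best (minimum) value observed so far
    for every objective."""
    return np.minimum(np.asarray(z_star), np.asarray(f_x))

# A new solution replaces its neighbour in a subproblem when it has a smaller
# Tchebycheff value for that subproblem's weight vector.
z = np.array([0.0, 0.0])
w = np.array([0.3, 0.7])
print(tchebycheff([0.2, 0.1], w, z) < tchebycheff([0.4, 0.05], w, z))   # True
```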
Multi-label feature selection using multi-objective evolutionary algorithms
A review of the literature shows little research in the area of multi-label feature selection using multi-objective evolutionary algorithms. Next, we briefly explain the state-of-the-art methods.

Multi-label feature selection algorithm based on NSGA-II
Yin, Tao & Xu (2015) selected the optimal features for multi-label data classification using the NSGA-II algorithm. The Hamming loss and the average precision criteria were considered as the objective functions, and the Pareto front was obtained using NSGA-II. NSGA-II uses fast non-dominated sorting to rank feature subsets. The fast non-dominated sorting technique categorizes the population members into different ranks. For each solution p, the number of members that solution p dominates and the number of members that dominate solution p are computed. All solutions that are not dominated by any member (members with a domination count of zero) are added to a set named F1. Here, F1 is the first Pareto front, which contains the best-qualified members of the current population. In the next step, the members included in F1 are removed from the population, and the remaining members that are not dominated construct the second rank, F2. This procedure continues in the same way until all population members are ranked. At the end of the algorithm, the members of the first front F1 are presented as the optimal Pareto front. The method was tested on standard multi-label datasets classified using the ML-KNN classifier, and the authors compared it with several filter-based feature selection methods.

PSO-based multi-objective multi-label feature selection
PSO is a well-known population-based evolutionary algorithm. Zhang et al. (2017) presented a multi-objective PSO-based method for feature selection of multi-label data. They considered the number of features and the accuracy of classification as conflicting objectives. In the PSO algorithm, the population consists of particles that have two properties: position and velocity. The position and the velocity of the i-th particle are represented as P_i(t) = (p_{i,1}, p_{i,2}, ..., p_{i,D}) and V_i(t) = (v_{i,1}, v_{i,2}, ..., v_{i,D}), respectively. The position of the particle is updated based on its previous position and velocity. Moreover, the particle velocity is updated according to Eq. (10) based on two quantities: (1) the best individual position of the particle so far, Lb_i(t) = (lb_{i,1}, lb_{i,2}, ..., lb_{i,D}), and (2) the best global position among all particles, Gb(t) = (gb_1, gb_2, ..., gb_D).

$v_{i,j}(t+1) = w\, v_{i,j}(t) + r_1 c_1 \left( lb_{i,j}(t) - p_{i,j}(t) \right) + r_2 c_2 \left( gb_j(t) - p_{i,j}(t) \right)$  (10)

$p_{i,j}(t+1) = p_{i,j}(t) + v_{i,j}(t+1)$  (11)

where t is the iteration number; r_1 and r_2 are two random vectors uniformly distributed in the range (0, 1); c_1 and c_2 are two parameters that represent the particle's confidence in itself and in the swarm, respectively; and w, called the inertia weight, determines the effect of the previous velocity.
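A minimal sketch of the particle update of Eqs. (10) and (11) is given below; the default values of w, c1, and c2 are placeholders, not the settings used in the cited study.

```python
import numpy as np

def pso_update(position, velocity, local_best, global_best,
               w=0.7, c1=1.5, c2=1.5):
    """One particle update following Eqs. (10)-(11): the new velocity mixes the
    previous velocity with attraction toward the particle's own best position and
    toward the best global position; the position is then shifted by that velocity."""
    r1 = np.random.rand(*position.shape)
    r2 = np.random.rand(*position.shape)
    velocity = (w * velocity
                + r1 * c1 * (local_best - position)
                + r2 * c2 * (global_best - position))
    return position + velocity, velocity
```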
Generating an initial population is the first step of PSO-based multi-label feature selection; then, an archive of nondominated solutions is built. The velocities and positions of all particles are updated in each iteration. We also need to update each particle's best individual position and the best global position. The particle's best individual position is determined using the domination concept, and the best global position is selected among the particles' historical positions by using the crowding distance criterion. An adaptive mutation operator is used to produce offspring; the number of mutated elements in a particle is determined using a non-linear function, and for this purpose, K variables of some particles are randomly selected to be reinitialized. The method has been evaluated on standard benchmark problems using the ML-KNN classifier, and the results show significant improvements compared with the previous state-of-the-art methods.

PROPOSED METHOD
Decomposition methods (such as the Tchebycheff method) are traditional methods of multi-objective optimization. They transform the problem of approximating the Pareto front into a number of scalar optimization problems. As mentioned before, these methods may be time-consuming because the weights of the objective functions must be modified continuously to obtain each Pareto solution, and some of them are unable to discover all Pareto points of non-convex problems effectively (Zhang & Li, 2007). An evolutionary algorithm can be used to overcome this problem. Recently, a multi-objective evolutionary algorithm based on decomposition (MOEA/D) was proposed (Zhang & Li, 2007). MOEA/D decomposes the problem space into scalar subproblems and solves them simultaneously using an evolutionary algorithm. Hence, it increases the speed of finding Pareto-optimal solutions and improves the diversity of the obtained solutions (Zhang, Gong & Rong, 2016). The scalar subproblems are solved simultaneously by receiving information from neighboring subproblems; therefore, the algorithm has a lower computational complexity compared with domination-based algorithms. MOEA/D has several advantages over Pareto dominance-based algorithms, such as computational efficiency, scalability to optimization problems with many objectives, and high search ability for combinatorial optimization problems (Jiang et al., 2011).

In this article, we propose a decomposition-based multi-objective feature selection method for multi-label data classification. This is the first time that a decomposition-based approach has been customized to tackle multi-label classification. Figure 2 shows the overall flowchart of the proposed method.

[Figure 2: Flowchart of the overall structure of the proposed method: initialization (population, weight vectors, reference point), fitness evaluation (number of features, Hamming loss), reproduction using the proposed genetic operators, replacement of neighbors based on the Tchebycheff algorithm, reference point update, and a local search to improve the Pareto front of trade-off feature subsets.]

According to this overall structure, the search process needs an encoding strategy to define the search space, which is explained in the next subsection. The algorithm, as an iterative process, starts with an initialization step.
At each iteration, based on the proposed operators, new feature subsets are created and evaluated using the objective functions. Using the Tchebycheff method, the neighbors of the generated solutions and the reference point are updated. After applying a local search, a set of non-dominated solutions is obtained as trade-off feature subsets. Algorithm 1 presents the pseudo-code of the proposed method; in the following subsections, we describe the details of its main components.

Representation of individuals
Each member of the population indicates a candidate solution in the search space. In this paper, an individual for feature selection is represented as a string whose length equals the number of features. Each cell of the vector is randomly filled with a real value between 0 and 1. This representation is used in problems that need a continuous representation of the solutions (Xue, Zhang & Browne, 2013); the use of real values is due to the use of continuous genetic operators. A cell with a value greater than 0.5 indicates that the corresponding feature is selected, and a value less than 0.5 indicates that it is not. If the length of the feature vector is D, the i-th population member is defined as c_i(t) = (c_{i,1}, c_{i,2}, ..., c_{i,D}). The feature subsets use the following notation: when a feature is selected, the corresponding cell value changes to 1; otherwise, it becomes 0. Hence, the string is converted to a binary vector, where 0 indicates the rejection of a feature and 1 indicates its selection, and the number of selected features is equivalent to the count of ones in the vector. An instance of a feature vector is shown in Fig. 3.

[Figure 3: An instance of the feature selection representation: a real-valued vector (0.8, ..., 0.35, 0.61, ..., 0.44, 0.3, 0.7) whose length equals the number of features is thresholded at 0.5 into the binary vector (1, ..., 0, 1, ..., 0, 0, 1).]

Objective functions
To acquire the best solutions in feature selection, we consider two objective functions: the number of selected features and the Hamming loss. As mentioned before, the goal of feature selection is to remove irrelevant and redundant features and, therefore, to reduce the complexity of the search space in the classification task or any other feature-based process. The ratio of the features selected by each solution to all the features (a value between 0 and 1) is our first objective function. The second objective function evaluates the learning accuracy on the multi-label data. The Hamming loss is one of the most well-known measures for computing the classification error for multi-label data; it has been used in several papers on multi-label wrapper feature selection (Zhang et al., 2017; Yin, Tao & Xu, 2015; Jungjit & Freitas, 2015). The Hamming loss evaluates the fraction of misclassified instance-label pairs, that is, a relevant label is missed or an irrelevant one is predicted (Zhang & Zhou, 2014). The Hamming loss is defined as follows:

$hloss(h) = \frac{1}{p} \sum_{i=1}^{p} \frac{1}{q} \left| h(x_i) \,\Delta\, Y_i \right|$  (12)

where q is the number of labels and p is the total number of multi-label samples. If x_i denotes the i-th sample, h(x_i) represents the label set predicted by model h, and Y_i is the actual label set of the i-th sample; Δ denotes the symmetric difference between the predicted and actual label sets. The Hamming loss error is our second objective function for feature selection; hence, multi-objective optimization can be applied to minimize this objective as well. According to the definitions of the two objective functions, the proposed method attempts to find feature sets with a minimum number of features and a minimum classification error.
Algorithm 1. Pseudo-code for the proposed method.
Input: NP: the number of subproblems; T: the number of neighbors in the decomposition-based optimization algorithm; K: the number of neighbors in the multi-label KNN classifier; R: the number of iterations
Output: final feature selection subsets
// Initialization
1.  Divide the multi-label data into training and test sets;
2.  Produce the weight vectors by uniformly distributed aggregation values;
3.  Generate the initial population uniformly at random;
4.  Evaluate the objective functions for each candidate solution according to Eq. (12) using the training set;
5.  Compute the T neighbors of each weight vector using the Euclidean distance;
6.  Initialize the reference point according to Eq. (9);
7.  Determine the non-dominated solutions of the initial population as an archive (AC);
8.  it = 0;
// Main algorithm
9.  while it < R do
10.   for i = 1 to NP do                      // for each individual x_i in the population
        // Regeneration
11.     Randomly select two candidate solutions from among the neighbors of x_i;
12.     Produce two new candidate solutions y_1, y_2 using the proposed genetic operators;
        // Comparison and replacement (Eq. 8)
13.     for j = 1 to T do                     // for each neighbor
14.       if g^te(y_1 | W^j, Z) ≤ g^te(x_j | W^j, Z) then
15.         x_j = y_1
16.       end
          // Update the reference point (Eq. 9)
17.       if f_1(y_1) < z_1 then
18.         z_1 = f_1(y_1)
19.       end
20.       if f_2(y_1) < z_2 then
21.         z_2 = f_2(y_1)
22.       end
23.       if g^te(y_2 | W^j, Z) ≤ g^te(x_j | W^j, Z) then
24.         x_j = y_2
25.       end
          // Update the reference point (Eq. 9)
26.       if f_1(y_2) < z_1 then
27.         z_1 = f_1(y_2)
28.       end
29.       if f_2(y_2) < z_2 then
30.         z_2 = f_2(y_2)
31.       end
32.     end
33.   end
      // Local search and obtaining the final Pareto front
34.   Separate the non-dominated solutions from the updated population (NS);
35.   Separate the non-dominated solutions from AC and NS (EP);
36.   Select the solution with the maximum crowding distance (X_cr) and two random solutions X_n1, X_n2 from EP;
37.   Produce a new solution X̄ by using Eq. (13);
38.   Select the non-dominated solutions from EP and X̄ as the final Pareto set;
39.   Update the archive;
40.   it = it + 1;
41. end
42. Obtain the Hamming loss for the test data with the selected features of the solutions in the final Pareto front.

The proposed genetic operators
In this paper, we introduce a pool of crossover operators to combine the benefits of various operators in producing better solutions. Three genetic operators, namely single-point, double-point, and uniform crossover, are used to produce the new generation of candidate solutions. In each iteration, one of these crossover operators is selected: a random number P between 0 and 1 is generated as the selection probability of one of the operators, and the selection ranges are specified by P1 and P2, which are determined in the experiments. If the generated number is less than P1, the single-point crossover is applied to the parent solutions: a random point is selected, and the tails of the two parents are swapped to generate new offspring. The double-point crossover is selected if P is between P1 and P2; it is a generalization of the single-point crossover in which alternating segments are swapped to generate new offspring.
A probability greater than P2 causes the selection of the uniform crossover to produce offspring. The uniform crossover swaps the parents' genes gene by gene: for each variable, a uniform random real number between 0 and 1 is generated, and this number decides whether the first child takes the i-th gene from the first or the second parent. If the random number is greater than 0.5, the first parent's variable is selected, and vice versa. Figure 4 shows the process of selecting the crossover operators.

[Figure 4: Selection of a crossover operator: a random real number p is produced and compared with the thresholds P1 and P2 to choose the single-point, double-point, or uniform crossover.]

A uniform mutation is applied to each newly produced individual to guarantee the diversity property. A random number of features is selected from the generated subset, and the values of the variables related to the corresponding features are replaced with new uniform random numbers between 0 and 1.
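A minimal sketch of the crossover pool and the uniform mutation described above is shown below; the threshold values P1 and P2 and the mutation rate are placeholders (the paper determines P1 and P2 experimentally), and the NumPy-array representation of the parents is an assumption.

```python
import numpy as np

def crossover_pool(p1, p2, P1=0.33, P2=0.66):
    """Pick one of three crossover operators with a random number p: single-point
    (p < P1), double-point (P1 <= p < P2), or uniform (p >= P2)."""
    d = len(p1)
    p = np.random.rand()
    c1, c2 = p1.copy(), p2.copy()
    if p < P1:                                  # single-point: swap the tails
        cut = np.random.randint(1, d)
        c1[cut:], c2[cut:] = p2[cut:], p1[cut:]
    elif p < P2:                                # double-point: swap the middle segment
        a, b = sorted(np.random.choice(np.arange(1, d), 2, replace=False))
        c1[a:b], c2[a:b] = p2[a:b], p1[a:b]
    else:                                       # uniform: gene-wise swap
        swap = np.random.rand(d) > 0.5
        c1[swap], c2[swap] = p2[swap], p1[swap]
    return c1, c2

def uniform_mutation(child, rate=0.1):
    """Re-draw a random subset of variables uniformly in [0, 1] to keep diversity."""
    child = child.copy()
    idx = np.random.rand(len(child)) < rate
    child[idx] = np.random.rand(int(idx.sum()))
    return child
```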
Local search
The domination concept is used to separate the best candidate solutions at the end of each iteration, and all dominated solutions are omitted from the population. To improve the Pareto front obtained by the decomposition-based algorithm, a local search (Zhang et al., 2017) is applied to produce a candidate solution in a region of the search space with a large crowding distance. We estimate the density of the solutions surrounding each solution; hence, producing a new solution in an area with lower density is desirable. For this purpose, at the end of each iteration, the final Pareto front is saved in the archive (AC). A solution with the maximum crowding distance (X_cr) is selected among the non-dominated solutions of the Pareto front obtained from the current generation and the solutions in the archive (from the previous generations). A new solution is then produced by using X_cr and two random solutions, X_n1 and X_n2, based on the following equation:

$\bar{X} = X_{cr} + F \cdot (X_{n1} - X_{n2})$  (13)
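As an illustration of the local-search step in Eq. (13), the following sketch generates a new candidate from X_cr and two random solutions; the value of F and the clipping of the result back to [0, 1] are assumptions added only for this example.

```python
import numpy as np

def local_search_step(x_cr, x_n1, x_n2, F=0.5):
    """Eq. (13): generate a new candidate around the least-crowded solution X_cr
    using the scaled difference of two random archive members, in the style of a
    differential-evolution step; the result is clipped to the [0, 1] encoding range."""
    x_new = x_cr + F * (x_n1 - x_n2)
    return np.clip(x_new, 0.0, 1.0)
```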