Feature selection using Joint Mutual Information Maximisation

Expert Systems With Applications 42 (2015) 8520–8532

Mohamed Bennasar, Yulia Hicks, Rossitza Setchi∗
School of Engineering, Cardiff University, Cardiff CF24 3AA, UK

∗ Corresponding author. Tel: +44 2920875720; fax: +44 2920874716. E-mail addresses: BennasarM@cf.ac.uk (M. Bennasar), HicksYA@cf.ac.uk (Y. Hicks), Setchi@cf.ac.uk (R. Setchi).

Keywords: Feature selection; Mutual information; Joint mutual information; Conditional mutual information; Subset feature selection; Classification; Dimensionality reduction; Feature selection stability

Abstract

Feature selection is used in many application areas relevant to expert and intelligent systems, such as data mining and machine learning, image processing, anomaly detection, bioinformatics and natural language processing. Feature selection based on information theory is a popular approach due to its computational efficiency, scalability in terms of the dataset dimensionality, and independence from the classifier. Common drawbacks of this approach are the lack of information about the interaction between the features and the classifier, and the selection of redundant and irrelevant features. The latter is due to the limitations of the employed goal functions, which lead to overestimation of the feature significance. To address this problem, this article introduces two new nonlinear feature selection methods, namely Joint Mutual Information Maximisation (JMIM) and Normalised Joint Mutual Information Maximisation (NJMIM); both methods use mutual information and the 'maximum of the minimum' criterion, which alleviates the problem of overestimation of the feature significance, as demonstrated both theoretically and experimentally. The proposed methods are compared with five competing methods using eleven publicly available datasets. The results demonstrate that the JMIM method outperforms the other methods on most tested public datasets, reducing the relative average classification error by almost 6% in comparison to the next best performing method. The statistical significance of the results is confirmed by the ANOVA test. Moreover, this method produces the best trade-off between accuracy and stability.

© 2015 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

1. Introduction

High dimensional data is a significant problem in both supervised and unsupervised learning (Janecek, Gansterer, Demel, & Ecker, 2008), which is becoming even more prominent with the recent explosion in the size of the available datasets, both in terms of the number of data samples and the number of features in each sample (Zhang et al., 2015). The main motivation for reducing the dimensionality of the data and keeping the number of features as low as possible is to decrease the training time and enhance the classification accuracy of the algorithms (Guyon & Elisseeff, 2003; Jain, Duin, & Mao, 2000; Liu & Yu, 2005).

Dimensionality reduction methods can be divided into two main groups: those based on feature extraction and those based on feature selection. Feature extraction methods transform existing features into a new feature space of lower dimensionality. During this process, new features are created based on linear or nonlinear combinations of features from the original set. Principal Component Analysis (PCA)
(Bajwa, Naweed, Asif, & Hyder, 2009; Turk & Pentland, 1991) and Linear Discriminant Analysis (LDA) (Tang, Suganthana, Yao, & Qina, 2005; Yu & Yang, 2001) are two examples of such algorithms. Feature selection methods reduce the dimensionality by selecting a subset of features which minimises a certain cost function (Guyon, Gunn, Nikravesh, & Zadeh, 2006; Jain et al., 2000). Unlike feature extraction, feature selection does not alter the data and, as a result, it is the preferred choice when an understanding of the underlying physical process is required. Feature extraction may be preferred when only discrimination is needed (Jain et al., 2000).

Feature selection is used in many application areas relevant to expert and intelligent systems, such as data mining and machine learning, image processing, anomaly detection, bioinformatics and natural language processing (Hoque, Bhattacharyya, & Kalita, 2014). Feature selection is normally used at the data pre-processing stage before training a classifier. This process is also known as variable selection, feature reduction or variable subset selection.

The topic of feature selection has been reviewed in detail in a number of recent review articles (Bolón-Canedo, Sánchez-Maroño, & Alonso-Betanzos, 2013; Brown, Pocock, Zhao, & Lujan, 2012; Chandrashekar & Sahin, 2014; Vergara & Estévez, 2014). Usually, feature selection methods are divided into two categories in terms of evaluation strategy, in particular, classifier dependent ('wrapper' and 'embedded' methods) or classifier independent ('filter' methods). Wrapper methods search the feature space, and test all possible subsets of feature combinations by using the prediction accuracy of a classifier as a measure of the selected subset's quality, without modifying the learning function. Therefore, wrapper methods can be combined with any learning machine (Guyon et al., 2006). They perform well because the selected subset is optimised for the classification algorithm. On the other hand, wrapper methods may suffer from over-fitting to the learning algorithm. This means that any changes in the learning model may reduce the usefulness of the subset. In addition, these methods are very expensive in terms of computational complexity, especially when handling extremely high-dimensional data (Brown et al., 2012; Cheng et al., 2011; Ding & Peng, 2003; Karegowda, Jayaram, & Manjunath, 2010).

The feature selection stage in the embedded methods is combined with the learning stage.
These methods are less expensive in terms of computational complexity and less prone to over-fitting; however, they are limited in terms of generalisation, because they are very specific to the learning algorithm used (Guyon et al., 2006).

Classifier-independent methods rank features according to their relevance to the class label in supervised learning. The relevance score is calculated using distance, information, correlation and consistency measures. Many techniques have been proposed to compute the relevance score, including Pearson correlation coefficients (Rodgers & Nicewander, 1988), Fisher's discriminant ratio ("F score") (Lin, Li, & Tsai, 2004), the Scatter criterion (Duda, Hart, & Stork, 2001), the Single Variable Classifier (SVC) (Guyon & Elisseeff, 2003), Mutual Information (Battiti, 1994), the Relief algorithm (Kira & Rendell, 1992; Liu & Motoda, 2008), Rough Set Theory (Liang, Wang, Dang, & Qian, 2014) and Data Envelopment Analysis (Zhang, Yang, Xiong, Wang, & Zhang, 2014).

The main advantages of the filter methods are their computational efficiency, scalability in terms of the dataset dimensionality, and independence from the classifier (Saeys, Inza, & Larranaga, 2007). A common drawback of these methods is the lack of information about the interaction between the features and the classifier, and the selection of redundant and irrelevant features due to the limitations of the employed goal functions, which lead to overestimation of the feature significance.

Information theory (Cover & Thomas, 2006) has been widely applied in filter methods, where information measures such as mutual information (MI) are used as a measure of the features' relevance and redundancy (Battiti, 1994). MI does not make an assumption of linearity between the variables, and can deal with categorical and numerical data with two or more class values (Meyer, Schretter, & Bontempi, 2008). There are several alternative measures in information theory that can be used to compute the relevance of features, namely mutual information, interaction information, conditional mutual information, and joint mutual information.

This paper contributes to the knowledge in the area of feature selection by proposing two new nonlinear feature selection methods based on information theory. The proposed methods aim to overcome the limitations of the current state-of-the-art filter feature selection methods, such as overestimation of the feature significance, which causes selection of redundant and irrelevant features. This is achieved through the introduction of a new goal function based on joint mutual information and the 'maximum of the minimum' nonlinear approach. As shown in the evaluation section, one of the proposed methods outperforms the competing feature selection methods in terms of classification accuracy, decreasing the average classification error by 0.88% in absolute terms and by almost 6% in relative terms in comparison to the next best performing method. In addition, it produces the best trade-off between accuracy and stability. The statistical significance of the reported results is further confirmed by the ANOVA test.

This paper also reviews existing feature selection methods, highlighting their common limitations, and compares the performance of the proposed and existing methods on the basis of several criteria. For example, a nonlinear approach, which employs the 'maximum of the minimum' criterion, is compared to a linear approach, which employs cumulative summation approximation.
To optimise the nonlinear approach, a goal function based on joint mutual information is compared to a goal function based on conditional mutual information. Finally, the effect of using normalised mutual information instead of mutual information is tested.

The rest of the paper is organised as follows. Section 2 presents the principles of information theory, Section 3 reviews related work, Section 4 discusses the limitations of current feature selection criteria, and Section 5 introduces the proposed methods. Section 6 describes the conducted experiments and discusses the results. Section 7 concludes the paper.

2. Information theory

This section introduces the principles of information theory by focusing on entropy and mutual information, and explains the reasons for employing them in feature selection.

The entropy of a random variable is a measure of its uncertainty and a measure of the average amount of information required to describe the random variable (Cover & Thomas, 2006). The entropy of a discrete random variable X = (x1, x2, …, xN) is denoted by H(X), where xi refers to the possible values that X can take. H(X) is defined as:

H(X) = −∑_{i=1}^{N} p(xi) log(p(xi)),   (1)

where p(xi) is the probability mass function. The value of p(xi), when X is discrete, is:

p(xi) = (number of instances with value xi) / (total number of instances N).   (2)

The base of the logarithm is 2, so that entropy is measured in bits and, for a binary variable, 0 ≤ H(X) ≤ 1. For any two discrete random variables X and C = (c1, c2, …, cM), the joint entropy is defined as:

H(X, C) = −∑_{j=1}^{M} ∑_{i=1}^{N} p(xi, cj) log(p(xi, cj)),   (3)

where p(xi, cj) is the joint probability mass function of the variables X and C. The conditional entropy of the variable C given X is defined as:

H(C|X) = −∑_{j=1}^{M} ∑_{i=1}^{N} p(xi, cj) log(p(cj|xi)).   (4)

The conditional entropy is the amount of uncertainty left in C when the variable X is introduced, so it is less than or equal to the entropy of C. The conditional entropy is equal to the entropy if, and only if, the two variables are independent. The relation between joint entropy and conditional entropy is:

H(X, C) = H(X) + H(C|X),   (5)

H(X, C) = H(C) + H(X|C).   (6)

Mutual Information (MI) is the amount of information that both variables share, and is defined as:

I(X; C) = H(C) − H(C|X).   (7)

MI can be expressed as the amount of information provided by variable X, which reduces the uncertainty of variable C. MI is zero if the random variables are statistically independent. MI is symmetric, so:

I(X; C) = I(C; X),   (8)

I(X; C) = H(X) − H(X|C),   (9)

I(X; C) = H(X) + H(C) − H(X, C).   (10)

The conditional MI and the joint MI are defined as:

I(X; C|Y) = H(X|Y) − H(X|Y, C),   (11)

I(X, Y; C) = I(X; C|Y) + I(Y; C),   (12)

where Y is a discrete variable, Y = (y1, y2, …, yN). Interaction information can be defined as the amount of information that is shared by all features, but is not found within any feature subset. Mathematically, the relation between interaction information and MI is defined as:

I(X; Y; C) = I(X, Y; C) − I(X; C) − I(Y; C).   (13)

High interaction information means that a large amount of information can be obtained by considering the three variables together (Jakulin, 2003). Interaction information can be positive, negative or zero (Jakulin, 2005).
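To make these definitions concrete, the following is a minimal Python sketch that estimates the quantities above from samples of discrete variables, using empirical probability mass functions (Eq. (2)) and base-2 logarithms as assumed above. The function names are illustrative and are not part of the original work.

import numpy as np
from collections import Counter

def entropy(*variables):
    # Empirical (joint) entropy in bits of one or more discrete variables,
    # Eqs. (1) and (3), with probabilities estimated as relative frequencies (Eq. (2)).
    joint = list(zip(*variables))
    counts = np.array(list(Counter(joint).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def mutual_information(x, c):
    # I(X; C) = H(X) + H(C) - H(X, C), Eq. (10)
    return entropy(x) + entropy(c) - entropy(x, c)

def joint_mutual_information(x, y, c):
    # I(X, Y; C), obtained by treating the pair (X, Y) as a single variable in Eq. (10)
    return entropy(x, y) + entropy(c) - entropy(x, y, c)

def interaction_information(x, y, c):
    # I(X; Y; C) = I(X, Y; C) - I(X; C) - I(Y; C), Eq. (13)
    return (joint_mutual_information(x, y, c)
            - mutual_information(x, c) - mutual_information(y, c))

For example, with x = [0, 0, 1, 1] and c = [0, 0, 1, 1], mutual_information(x, c) returns 1.0 bit, whereas two statistically independent variables give a value close to zero.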
3. Related work

The focus of the work presented in this article is on filter feature selection methods due to their popularity, and thus the review part of this article focuses specifically on these methods. For a more detailed review of feature selection methods, the recent review articles in this area are recommended (Bolón-Canedo et al., 2013; Brown et al., 2012; Chandrashekar & Sahin, 2014; Vergara & Estévez, 2014).

Information theory has been employed by many filter feature selection methods. Information Gain (IG) (Guyon & Elisseeff, 2003) is the simplest of these methods. It is classified as a univariate feature selection method, as it ranks features based on the value of their mutual information with the class label. Simplicity and low computational cost are the main advantages of this method. However, it does not take into consideration the dependency between the features; rather, it assumes independence, which is not always the case. Therefore, some of the selected features may carry redundant information. To tackle this problem, new methods have been proposed for selecting relevant features which are non-redundant with respect to each other.

For a feature set F = {f1, f2, …, fN}, the feature selection process identifies a subset of features S with dimension k, where k ≤ N and S ⊆ F. In theory, the selected subset S should maximise the joint mutual information between the class label C and the subset S of a fixed size k:

I(S; C) = I(f1, f2, …, fk; C).   (14)

However, such an approach is impractical, due to the number of calculations and the limited number of observations available for the calculation of the high-dimensional probability density function. As a result, many methods use heuristic approaches to approximate the ideal solution.

Generally, the filter criteria are based on the concepts of feature relevance, redundancy and complementarity (Vergara & Estévez, 2014). The methods which are based on information theory can be split into two groups: linear criteria, which are linear combinations of MI terms; and nonlinear criteria, which use maximum or minimum operations or normalised MI in their goal functions (Brown et al., 2012).

Battiti (1994) introduces a first-order incremental search algorithm, known as the Mutual Information Feature Selection (MIFS) method, for selecting the most relevant k features from an initial set of n features. A greedy selection method is used to build the subset. Instead of calculating the joint MI between the selected features and the class label, Battiti studies the MI between the candidate feature and the class, and the relationship between the candidate and the already-selected features.

Kwok and Choi (2002) propose the MIFS-U method to improve the performance of the MIFS method by making a better estimation of the MI between the input feature and the class label. Another variant of MIFS, the mRMR method, is proposed by Peng, Long, and Ding (2005). The redundancy term in mRMR is divided by the cardinality |S| of the selected subset S to balance the magnitude of this term, and to avoid it growing very large as the subset expands. As reported in the existing literature (Brown et al., 2012; Peng et al., 2005), this modification allows mRMR to outperform the conventional MIFS and MIFS-U methods.

Estévez, Tesmer, Perez, and Zurada (2009) propose an enhanced version of MIFS, MIFS-U and mRMR, called Normalised Mutual Information Feature Selection (NMIFS). It uses normalised MI in the redundancy term instead of MI.
The normalisation of MI prevents bias towards multivalued features and limits the value of MI to the range of zero to unity (Estévez et al., 2009).

Hoque et al. (2014) propose a method called MIFS-ND. The method calculates the mutual information between the candidate feature and the class label, and the average of the mutual information between the candidate feature and the features within the selected subset. A genetic algorithm is employed to select the feature that maximises the mutual information with the class, and minimises the average mutual information with the other selected features.

Other proposed criteria (Yang & Moody, 1999; Fleuret, 2004; Meyer & Bontempi, 2006; Vidal-Naquet & Ullman, 2003) use the MI between the candidate feature and the class label in the context of the selected subset features. They utilise conditional mutual information, joint mutual information or feature interaction. Some of them apply cumulative summation approximations (Yang & Moody, 1999; Meyer & Bontempi, 2006), while others use the 'maximum of the minimum' criterion (Fleuret, 2004; Vidal-Naquet & Ullman, 2003).

Yang and Moody (1999) propose a feature selection method called Joint Mutual Information (JMI). In this method, the candidate feature that maximises the cumulative summation of joint mutual information with the features of the selected subset is chosen and added to the subset. This method is reported to perform well in terms of classification accuracy and stability (Brown et al., 2012). Meyer and Bontempi (2006) introduce a similar method known as Double Input Symmetrical Relevance (DISR). The joint mutual information in the goal function of this method is substituted with symmetrical relevance.

Other methods that employ the 'maximum of the minimum' criterion have been proposed. Vidal-Naquet and Ullman (2003) introduce a method called Information Fragment (IF), while Fleuret (2004) proposes Conditional Mutual Information Maximisation (CMIM); both have been reported to perform well with KNN and SVM classifiers in later work (Freeman, Kulić, & Basir, 2015).

There are also a number of other methods which rely on maximising feature interaction. For example, Jakulin (2005) proposes the Interaction Capping (IC) method, while El Akadi, El Ouardighi, and Aboutajdine (2008) propose a method which uses feature interaction, known as Interaction Gain Based Feature Selection (IGFS). However, this criterion is effectively the same as JMI.

A general formula based on conditional likelihood has been proposed by Brown et al. (2012), based on a study of MI-based feature selection criteria; this formula can be used to derive many of the methods listed in this section. In practice, most of the methods which are linear combinations of MI can be derived from this formula. However, the authors state that the goal functions of the nonlinear methods cannot be generated by their formula.

Feature selection techniques have also been used for multi-label data sets. Lee and Kim (2015) proposed a multi-label feature selection method based on information theory, in which they introduce a new score function to measure the importance of each feature to the multiple labels.
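For reference, and using the notation introduced in Section 2, the goal functions of the main MI-based criteria reviewed above are commonly stated as follows. This summary paraphrases Battiti (1994), Peng et al. (2005), Yang and Moody (1999), Meyer and Bontempi (2006) and Fleuret (2004), as presented in Brown et al. (2012), rather than reproducing their exact notation; β denotes the redundancy weight used by MIFS:

fMIFS = arg max_{fi∈F−S} [ I(fi; C) − β ∑_{fs∈S} I(fi; fs) ],

fmRMR = arg max_{fi∈F−S} [ I(fi; C) − (1/|S|) ∑_{fs∈S} I(fi; fs) ],

fJMI = arg max_{fi∈F−S} ∑_{fs∈S} I(fi, fs; C),

fDISR = arg max_{fi∈F−S} ∑_{fs∈S} I(fi, fs; C) / H(fi, fs, C),

fCMIM = arg max_{fi∈F−S} min_{fs∈S} I(fi; C|fs).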
Two other notable approaches in the area of filter feature selection are the application of rough set theory (Liang et al., 2014) and the application of Data Envelopment Analysis (Zhang et al., 2014). One of the issues affecting the methods based on fuzzy-rough sets is their time inefficiency, with many existing attempts to improve it (Qian, Wang, Cheng, Liang, & Dang, 2015). The methods using DEA for feature selection also suffer from the problem of large computational cost, although it was improved in a more recent publication (Zhang et al., 2015), as well as the problem of the selection of redundant features. The latter problem is characteristic of most of the methods listed above, and the reasons for this problem are investigated in more detail in Section 4.

4. Limitations of the current feature selection criteria

In general, most of the methods listed in the previous section use criteria consisting of two elements: the relevancy term and the redundancy term. The methods attempt to simultaneously maximise the relevancy term whilst minimising the redundancy term. It has been noted in the literature that such feature selection methods have a number of limitations (Estévez et al., 2009; Peng et al., 2005).

For example, MIFS and MIFS-U share a common problem: when the number of selected features grows, the redundancy term grows in magnitude with respect to the relevancy term. In this case some irrelevant features may be selected. This problem has been partly solved in the mRMR, NMIFS and MIFS-ND methods by dividing the redundancy term by the cardinality of the subset.

Another problem shared by all of the above methods (MIFS, MIFS-U, mRMR, NMIFS, and MIFS-ND) is that the redundancy term is calculated based on the value of the MI between the candidate feature and the features within the selected subset, without any consideration of the class label. The features may share information with each other, but that does not mean they are redundant; they may in fact share different information with the class.

Yet another problem, particular to the methods employing cumulative summation and forward search to approximate the solution of Eq. (14) (such as MIFS, mRMR, NMIFS, MIFS-ND, DISR, IGFS, and JMI), is the overestimation of the significance of some candidate features. For example, this can occur when the candidate feature is in complete correlation with one or several pre-selected features, but at the same time is almost independent from the majority of the subset. In such a situation, the value of the goal function will be high despite the redundancy of the candidate feature to some features within the subset.

In practice, the significance of each of the above problems depends on the data and the characteristics of each particular data set.

5. Proposed methods for feature selection

In this paper, two new methods for feature selection are proposed. The methods employ joint mutual information, and use the 'maximum of the minimum' approach. The proposed methods aim to address the problem of overestimation of the significance of some features, which occurs when cumulative summation approximation is employed.
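To illustrate this problem concretely, the following self-contained Python sketch builds a small synthetic example; all data, names and parameters are assumptions introduced for illustration and are not taken from the article. Three noisy copies of the class label form the already-selected subset S, and two candidates are scored: one that duplicates a selected feature, and one that carries genuinely new, slightly weaker information. On such data the cumulative sum of joint MI terms typically favours the duplicated candidate, whereas taking the minimum over the subset exposes its redundancy.

import numpy as np
from collections import Counter

def H(*variables):
    # Empirical joint entropy in bits of discrete variables (Section 2)
    counts = np.array(list(Counter(zip(*variables)).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def jmi(fi, fs, c):
    # Joint mutual information I(fi, fs; C)
    return H(fi, fs) + H(c) - H(fi, fs, c)

rng = np.random.default_rng(0)
n = 5000
C = rng.integers(0, 2, n)                               # binary class label
noisy = lambda acc: np.where(rng.random(n) < acc, C, 1 - C)
S = [noisy(0.8), noisy(0.8), noisy(0.8)]                # already-selected features

candidates = {
    "duplicate of a selected feature": S[0].copy(),
    "new, weaker feature": noisy(0.7),
}
for name, cand in candidates.items():
    scores = [jmi(cand, fs, C) for fs in S]
    print(f"{name}: sum = {sum(scores):.3f}, min = {min(scores):.3f}")

With this toy data the duplicated candidate obtains the larger cumulative sum but the smaller minimum, which is exactly the situation addressed by the 'maximum of the minimum' criterion introduced below.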
For a feature set F = {f1, f2, …, fN} of a data set D of dimension N, the feature selection process identifies a subset of features S with dimension K, where K ≤ N and S ⊆ F. The subset S should produce equal or better classification accuracy compared to the feature set F. In other words, feature selection defines the subset of features that maximises the mutual information with the class label, I(S; C).

In the past, a number of alternative definitions of feature relevance have been used (Battiti, 1994; Brown et al., 2012; Vergara & Estévez, 2014; Estévez et al., 2009). The following definitions are used in this work.

Definition 1. (Feature relevance). Feature fi is more relevant to the class label C than feature fj in the context of the already selected subset S when I(fi, S; C) > I(fj, S; C).

Definition 2. (Minimum joint mutual information). Let F be the full set of features, and let S be the subset of features that are selected already. Let fi ∈ F−S and fs ∈ S. The m-Joint MI is the minimum value of the joint mutual information that the candidate feature fi shares with the class label C when it is joined with every feature within the subset S individually, hence min_{s=1,2,…,k} I(fi, fs; C).

Lemma 1. For a feature fi, if the m-Joint MI is larger than that of all other features fj, where fi and fj ∈ F−S (i ≠ j), then fi is the most relevant feature to the class label C in the context of the subset S.

Proof. Let S = {f1, f2, …, fK}. The joint mutual information of fi and each feature in S with C is calculated. The minimum value of this mutual information (m-Joint) is the lowest amount of new information that the feature fi adds to the information shared between S and C. The feature that produces the maximum m-Joint is the feature that adds the maximum information to that shared between S and C, which means it is the feature which is the most relevant to the class label C in the context of the subset S, according to Definition 1.

Definition 3. Candidate feature fi is redundant to the selected features within the subset S if fi does not share new information with the class C.

Lemma 2. Let F be the full set of features, let S be the subset of features that are selected already, and let fi ∈ F−S and fs ∈ S. If the feature fi is highly correlated with a feature fs in the subset, then I(fi; C) ≅ I(fs; C) ≅ I(fi, fs; C).

Proof. If the feature fi is highly correlated with a feature fs, then the probability mass functions of fi, fs, and (fi, fs) are approximately equal, p(fi) ≅ p(fs) ≅ p(fi, fs). Since the entropy is defined as H(X) = −∑_{i=1}^{N} p(xi) log(p(xi)), it follows that H(fi) ≅ H(fs) ≅ H(fi, fs). Since the mutual information is defined as I(X; C) = H(X) + H(C) − H(X, C), then I(fi; fs) ≅ H(fs) ≅ H(fi) and I(fi; C) ≅ I(fs; C). By the same definition, I(fi, fs; C) = H(fi, fs) + H(C) − H(fi, fs, C), which can be simplified to I(fi, fs; C) ≅ H(fi) + H(C) − H(fi, C). According to Eq. (10), I(fi, fs; C) ≅ I(fi; C) ≅ I(fs; C).

5.1. Joint Mutual Information Maximisation (JMIM)

All methods listed in the previous section attempt to optimise the relationship between relevancy and redundancy when selecting features by approximating the solution of Eq. (14). The JMI method is reported in the existing literature as being the method which selects the most relevant features (Brown et al., 2012). It studies relevancy and redundancy, and takes into consideration the class label when calculating MI.
However, the method still allows overestimation of the significance of some features, for example, when the candidate feature is in complete correlation with one or a few pre-selected features, but at the same time is almost independent from the majority of the subset. In such a situation, the value of the JMI goal function will be high despite the redundancy of the candidate feature to some features within the subset. This drawback is evident in almost all methods that use the cumulative sum approximation.

For this reason, a new method called Joint Mutual Information Maximisation (JMIM) is proposed in this research. JMIM employs joint mutual information and the 'maximum of the minimum' approach, which should choose the most relevant features according to Lemma 1. Following from this, the features are selected by JMIM according to the following new criterion:

fJMIM = arg max_{fi∈F−S} ( min_{fs∈S} I(fi, fs; C) ),   (22)

where

I(fi, fs; C) = I(fs; C) + I(fi; C|fs),   (23)

I(fi, fs; C) = H(C) − H(C|fi, fs),   (24)

I(fi, fs; C) = [ −∑_{c∈C} p(c) log(p(c)) ] + [ ∑_{c∈C} ∑_{fi∈F−S} ∑_{fs∈S} p(fi, fs, c) log(p(c|fi, fs)) ].   (25)

The method uses the following iterative forward greedy search algorithm to find the relevant feature subset of size k within the feature space:

Algorithm 1. Forward greedy search.
1. (Initialisation) Set F ← "initial set of n features"; S ← "empty set".
2. (Computation of the MI with the output class) For every fi ∈ F compute I(C; fi).
3. (Choice of the first feature) Find a feature fi that maximises I(C; fi); set F ← F \ {fi}; set S ← {fi}.
4. (Greedy selection) Repeat until |S| = k: (Selection of the next feature) Choose the feature fi = arg max_{fi∈F−S} ( min_{fs∈S} I(fi, fs; C) ); set F ← F \ {fi}; set S ← S ∪ {fi}.
5. (Output) Output the set S with the selected features.

5.2. Advantages over existing alternative methods

The Venn diagrams in Fig. 1 show different scenarios for the relationship between the candidate feature fi, the selected feature fs, and the class label C. Fig. 1a illustrates the case in which methods like MIFS, NMIFS or mRMR will fail to select fi because it is redundant with respect to fs, although each of them shares different information about C, and the correlation is not in the context of C.

The goal function of JMIM is similar to the goal function of CMIM (Section 3), as CMIM also uses the 'maximum of the minimum' approach. The main difference is that CMIM maximises the amount of information the candidate feature fi contributes given the pre-selected feature fs (i.e. fi is selected for any complementing fs), whereas JMIM selects the feature that maximises the joint mutual information with fs. Fig. 1b and c is used to explain this difference further. The figures represent two candidate features fi and fj, and the subsequent selection of one of them. I(fi, fs; C) is the union of areas 1, 2, and 3; I(fi; C|fs) is area 1 in Fig. 1b. The CMIM method would select fi in Fig. 1b, even though its complementing feature fs from the subset does not carry as much information as the feature fj in Fig. 1c. Conversely, JMIM would select the feature that maximises the joint mutual information, so it would select feature fj in Fig. 1c. Therefore, the joint mutual information between the candidate feature and at least one of the pre-selected features will be high, which can increase the discrimination power of the selected subset.

Fig. 1. Venn diagrams illustrating the relation between features and class.
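As a concrete illustration of Algorithm 1, the following minimal Python sketch implements the forward greedy search with the JMIM criterion of Eq. (22), assuming the features have already been discretised; probabilities are estimated empirically, and the function and variable names are illustrative rather than taken from the authors' implementation.

import numpy as np
from collections import Counter

def H(*variables):
    # Empirical joint entropy in bits of discrete variables (Section 2)
    counts = np.array(list(Counter(zip(*variables)).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def mi(x, c):
    return H(x) + H(c) - H(x, c)               # I(X; C), Eq. (10)

def jmi(fi, fs, c):
    return H(fi, fs) + H(c) - H(fi, fs, c)     # I(fi, fs; C), as in Eq. (24)

def jmim_select(X, c, k):
    # X: (n_samples, n_features) array of discretised features; c: class labels.
    remaining = set(range(X.shape[1]))
    # Steps 2-3: the first feature maximises I(C; fi)
    first = max(remaining, key=lambda j: mi(X[:, j], c))
    selected, remaining = [first], remaining - {first}
    # Step 4: greedy selection with the 'maximum of the minimum' joint MI, Eq. (22)
    while len(selected) < k and remaining:
        best = max(remaining,
                   key=lambda j: min(jmi(X[:, j], X[:, s], c) for s in selected))
        selected.append(best)
        remaining.remove(best)
    return selected

A call such as jmim_select(X_discretised, y, k=10) returns the indices of the selected features. The NJMIM variant described in the next subsection is obtained by replacing the jmi score with the symmetrical relevance I(fi, fs; C)/H(fi, fs, C).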
5.3. Normalised Joint Mutual Information Maximisation (NJMIM)

The second method proposed in this paper uses a goal function which is very similar to the one used in JMIM (Section 5.1), with the difference being that symmetrical relevance is used as an alternative to MI. This method is called Normalised Joint Mutual Information Maximisation (NJMIM). It is proposed in order to study the effect of using normalised MI instead of MI. The proposed NJMIM selection criterion is presented in Eq. (26):

fNJMIM = arg max_{fi∈F−S} ( min_{fs∈S} SR(fi, fs; C) ),   (26)

where the symmetrical relevance is defined as

SR(F; C) = I(F; C) / H(F, C),   (27)

so that the criterion can be written as:

fNJMIM = arg max_{fi∈F−S} ( min_{fs∈S} I(fi, fs; C) / H(fi, fs, C) ).   (28)

The same iterative forward greedy search algorithm is used to find the subset of features within the candidate feature space.

6. Evaluation

The performance of the two methods proposed in this paper, JMIM and NJMIM, is compared with the results produced by five other methods: CMIM, DISR, mRMR, JMI, and IG. These methods are chosen for the following four reasons: (i) these methods are reported in the literature to provide good performance (Brown et al., 2012; Freeman et al., 2015); (ii) the choice of these methods allows the comparison of the 'maximum of the minimum' approach used by JMIM and NJMIM with the cumulative summation used by JMI and DISR; (iii) it enables the analysis of the effect of using symmetrical relevance instead of MI on the algorithm's performance; and (iv) it allows the comparison of the effects of using joint mutual information and conditional mutual information, which are employed in JMIM and CMIM, respectively.

The seven methods are applied to data from different domains, such as life sciences, physical sciences, engineering, business, handwriting recognition, and gene microarrays. The features within these datasets have different characteristics, being binary, discrete or categorical, or continuous. The continuous features are discretised into 10 equal intervals, using the Equal Width Discretisation (EWD) method (Dougherty, Kohavi, & Sahami, 1995).
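The following is a minimal sketch of this discretisation step, assuming each continuous feature is processed independently; the handling of boundary values and constant features in the original experiments may differ.

import numpy as np

def equal_width_discretise(x, n_bins=10):
    # Split the range of a continuous feature into n_bins equal-width intervals
    # and return the bin index (0 .. n_bins-1) of each value.
    x = np.asarray(x, dtype=float)
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    return np.digitize(x, edges[1:-1])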
Two classifiers are used to evaluate the quality of the selected subsets. These are Naïve Bayes with kernel density estimation, and 3-Nearest Neighbours. Both classifiers are available in the Matlab Statistics Toolbox. The average classification accuracy is used as a measure of the quality of the selected features. Five-fold cross-validation is employed when processing feature selection and feature validation; therefore each fold is used for validation once. This means that 80% of the data is used for feature selection and classification training, whilst 20% is used for validation. This is repeated five times, using the whole dataset for validation over the course of the five experiments. Overall, five different subsets of samples are used to generate five different subsets of features. Discretisation is performed as a pre-processing step for all data prior to the feature selection step.

Fig. 2 shows the evaluation framework used in this experiment. To test the impact of adding each feature to the subset on the classification accuracy, training and validation are performed after the selection of each feature in the subset.

Fig. 2. Evaluation framework.

6.1. Data

Eight datasets from the UCI Repository (Bache & Lichman, 2013) are used in the experiment (Table 1). These datasets have been previously used in similar research (Brown et al., 2012; El Akadi et al., 2008; Cheng et al., 2011). They have different characteristics in terms of the number of classes, features, instances and feature types.

Table 1
UCI datasets used in the experiment.

No  Data set         Number of features  Number of instances  Number of classes  Ratio
1   Credit approval  15                  690                  2                  54
2   Gas sensor       128                 13874                6                  198
3   Libra movement   90                  483                  15                 3
4   Parkinson        22                  195                  2                  11
5   Breast           30                  569                  2                  28
6   Sonar            60                  208                  2                  10
7   Musk             166                 7074                 2                  354
8   Handwriting      649                 2000                 10                 20

An example-feature ratio (Brown et al., 2012) is used as an indication of the difficulty of the feature selection task for a dataset. This ratio is computed as N/(mC), where N is the number of instances, m is the median number of values that the features take, and C is the number of classes. The most challenging feature selection tasks are those performed using datasets with a small example-feature ratio; the libra movement dataset is the most challenging dataset.

To test the behaviour of the methods with extremely small samples, datasets from Peng et al. (2005) are also used in the evaluation process; these are shown in Table 2.

Table 2
Additional datasets used in the experiment (Peng et al., 2005).

No  Data set  Number of features  Number of instances  Number of classes  Ratio
1   Colon     2000                62                   2                  10
2   Leukemia  7070                72                   2                  12
3   Lymphoma  4026                96                   9                  4

6.2. Performance analysis on low dimensional datasets

Figs. 3–5 show the average classification accuracy for the three datasets with low numbers of features (Parkinson, credit approval and breast). The classification accuracy is computed over the whole size of the selected subset, from 1 feature up to 20 features (or all features of the dataset in the case of the credit approval dataset).

Fig. 3. Average classification accuracy achieved with the Parkinson dataset.
Fig. 4. Average classification accuracy achieved with the credit approval dataset.
Fig. 5. Average classification accuracy achieved with the breast dataset.

As shown in Fig. 3, which illustrates the experiment with the first dataset, JMIM achieves the highest average accuracy (90.77%) with just 8 features, which is higher than the accuracy of CMIM (90.26%) and JMI (88.97%). On the other hand, methods that use normalised MI, such as NJMIM and DISR, perform less well than JMIM and JMI, which use MI. This is expected for datasets with discrete features, because the normalisation may reduce the significance of a feature when it has high entropy and shares a high amount of information with the class label. The mRMR and IG methods perform poorly on this dataset.
JMIM and JMI again achieve the highest classification accuracy on the credit approval dataset, using only 4 features to reach an accuracy of 82.92%. The accuracy of CMIM is 79.17% with the same number of features. The other methods perform worse compared to JMIM and JMI with the same number of features. The figure also shows that the methods using normalised MI do not perform as well as those which use MI. Features selected by the JMIM and JMI methods have a higher discriminative power than the features selected by NJMIM and DISR. NJMIM performs better than DISR, yet both perform poorly.

The breast dataset has 20 features selected. As seen in Fig. 5, JMIM does not achieve the highest classification accuracy. However, it produces a high accuracy (95.87%) with only 5 features, while mRMR requires 14 features to achieve the same accuracy. JMIM performs better in comparison with JMI and CMIM. The performance of NJMIM and DISR is not as good as that of JMIM and JMI, as with 4 features their classification accuracies are 87.61% and 89.28%, respectively.

6.3. Performance analysis on high dimensional datasets

The second experiment involves high dimensional data (the musk, sonar, gas sensor, and handwriting datasets). The experiments with the gas sensor and sonar datasets include the selection of 50 features, with JMIM achieving high classification accuracy with a relatively small number of features. The other methods require more features to achieve this level of accuracy (Figs. 6 and 7).

Fig. 6. Average classification accuracy achieved with the gas sensor dataset.
Fig. 7. Average classification accuracy achieved with the sonar dataset.

Fig. 8 shows the results for the handwriting dataset, for which 50 features are selected. JMIM performs well, but is inferior to JMI and mRMR. In terms of classification accuracy of the selected subset, JMI performed better than JMIM for subsets with 11–21 features, by a maximum difference in accuracy of 0.5%. The mRMR method also performs well with this dataset; however, JMIM produces the highest accuracy (97.68%) with the selected subset of 33 features.

Fig. 8. Average classification accuracy achieved with the handwriting dataset.

The experimental results using the libra movement dataset, for which 50 features are selected, are shown in Fig. 9. JMIM is the best method with this dataset for almost any number of selected features, followed by NJMIM. JMIM outperforms JMI by up to 3% in terms of classification accuracy. NJMIM also outperforms DISR for all of the selected subsets.

Fig. 9. Average classification accuracy achieved with the libra movement dataset.

The methods are also applied to the musk dataset. Fig. 10 shows the result when 50 features are selected.
With this dataset, JMIM selects the best subset and outperforms the other methods in terms of classification accuracy. NJMIM does not perform as well as JMIM, but produces better accuracy than DISR and mRMR for most of the features selected.

Fig. 10. Average classification accuracy achieved with the musk dataset.

6.4. Performance analysis with Peng et al. (2005) datasets

The results using the three datasets employed by Peng et al. (2005) are shown in Fig. 11. The leukemia dataset (Fig. 11a) has a small number of samples. The results show that none of the feature selection methods perform particularly well, confirming the findings reported in the review article by Brown et al. (2012). The colon dataset, which is the least challenging dataset of the three in terms of the number of classes and features, is shown in Fig. 11b. The results indicate the better performance of JMIM and JMI compared to the other methods, especially CMIM, which performs poorly. However, CMIM is the method that provides the best accuracy with the lymphoma dataset, while JMIM, JMI and mRMR also perform well, with JMIM being the best of these. NJMIM performs better than DISR with all of the subsets below 34 features.

6.5. Evaluating and validating results

The ANOVA statistical test is employed to analyse the results, and to confirm that the results are systematic and were not obtained by chance. The classification experiment is run five times for each dataset and the average accuracy results are submitted to the ANOVA test. Table 3 shows the ANOVA results, where the P-value is the probability of the improvement occurring by chance, and MS is the mean square error. When the P-value is less than 0.05 it is unlikely that the improvement in classification accuracy happened by chance. This is shown to be the case for all the datasets (Table 3).

Table 3
ANOVA test.

Dataset          MS        F         P-value
Credit approval  0.027537  731.3342  1.87E−37
Gas sensor       0.004117  77.17653  1.38E−16
Libra movement   0.009677  114.5907  2.94E−23
Parkinson        0.009677  114.5907  2.94E−23
Breast           0.001414  101.4627  2.37E−22
Sonar            0.00094   5.760126  9.62E−05
Musk             0.000505  304.4366  1.11E−30
Handwriting      8.84E−05  35.99929  6.35E−15
Colon            0.000411  3.532383  0.006395
Leukemia         0.000161  10.36207  2.21E−07
Lymphoma         0.011501  232.6585  1.28E−28

6.6. Stability of the methods

This section focuses on the stability of the feature selection methods discussed. The selected subset of features depends on the dataset provided, and therefore any change to the data might lead to different selected features. In this context, the present study investigates the influence of changes in the data on the features selected.

Kuncheva's measure of stability (Kuncheva, 2007), known as the consistency index, uses Eq.
(29) to compute the consistency between two selected feature subsets, S1 and S2:

IC(S1, S2) = (rn − k²) / (k(n − k)),   (29)

where S1 and S2 are feature subsets selected using different groups of dataset samples, i.e. S1, S2 ⊆ F, where F is the total set of features, |S1| = |S2| = k, |F| = n, and r = |S1 ∩ S2|. However, this method does not take into consideration the correlation between features.

Yu, Ding, and Loscalzo (2008) proposed a method for measuring stability based on similarity. This method takes into account the correlation between features. It calculates the weight between each pair of features from the subsets S1 and S2, computes the similarity between S1 and S2, and constructs a bipartite graph. If fi is a feature belonging to S1 and fj is a feature belonging to S2, the value of the weight can be the correlation coefficient, or any other similarity measure. This article uses symmetrical uncertainty (Yu & Liu, 2004) to calculate the weight w:

w(s1i, s2j) = 2 [ I(s1i, s2j) / (H(s1i) + H(s2j)) ],   (30)

where s1i ∈ S1, s2j ∈ S2 and 0 ≤ w(s1i, s2j) ≤ 1.0. To find the maximum weighted bipartite matching, the Hungarian Algorithm (Kuhn, 1955) is used to find the optimal solution.

This experiment uses the eight UCI datasets shown in Table 1. Each dataset is divided into 5 folds, 4 of which are used for feature selection using the CMIM, NJMIM, DISR, JMIM, mRMR, JMI, and IG methods. Eq. (30) is used to calculate the weight between the features within each pair of selected subsets from each dataset. The final cost is divided by the cardinality of the subset used, and therefore the magnitude of the final cost should be less than or equal to 0.5 (it is 0.5 if all selected subsets are the same).

The relationship between accuracy and stability is computed by comparing the average classification accuracy and the average stability with different numbers of features. Table 4 shows the average accuracy and stability for each method, in no particular order.

Table 4
Average stability, average accuracy and the compromise between accuracy and stability.

Method  Accuracy  Stability  Accuracy/stability
CMIM    0.8488    0.8598     0.9197
NJMIM   0.8264    0.8344     0.8954
DISR    0.8129    0.9054     0.8807
JMIM    0.8578    0.8598     0.9294
mRMR    0.8278    0.8868     0.8969
JMI     0.8490    0.8838     0.9199
IG      0.8226    0.9228     0.8913

It is worth noting that the methods employing the 'maximum of the minimum' criterion (JMIM, NJMIM and CMIM) tend to have lower stability than the methods using the cumulative summation approximation (JMI and DISR). The best method in terms of stability is IG. JMIM has the best compromise between accuracy and stability. Moreover, it demonstrates the best average classification accuracy among all methods.

Fig. 11. Average classification accuracy with the additional datasets: (a) Leukemia, (b) Colon, (c) Lymphoma.
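A minimal Python sketch of the two stability measures used above is given below, assuming subsets are represented as collections of feature indices and features as arrays of discrete values; the maximum-weight bipartite matching step, solved with the Hungarian algorithm (Kuhn, 1955) in the experiments, is not reproduced here, and the names are illustrative.

import numpy as np
from collections import Counter

def kuncheva_consistency(s1, s2, n):
    # Consistency index between two subsets of equal size k drawn from n features, Eq. (29)
    k, r = len(s1), len(set(s1) & set(s2))
    return (r * n - k * k) / (k * (n - k))

def H(*variables):
    # Empirical joint entropy in bits of discrete variables (Section 2)
    counts = np.array(list(Counter(zip(*variables)).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def symmetrical_uncertainty(x, y):
    # Weight between a pair of features, Eq. (30)
    mi = H(x) + H(y) - H(x, y)
    return 2.0 * mi / (H(x) + H(y))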
In these experiments, he maximum average classification accuracy achieved by JMIM with he Parkinson dataset was 90.77%. JMIM and JMI achieved the accu- acy of 82.92% with the credit approval dataset whilst JMI and CMIM chieved 93.83% and 95.22%, respectively. The JMIM method also per- ormed well on high dimensional datasets, such as the musk, sonar, as sensor and handwriting datasets. JMIM and JMI also outperform the other methods on extremely mall sample datasets with a large number of features, such as the olon dataset. However, CMIM produces the best performance with he lymphoma dataset. JMIM, JMI, and mRMR also perform better than he other three methods. Overall, JMIM decreases the average classi- cation error by 0.88% in absolute terms and almost 6% in relative erms in comparison to the next best performing method, JMI. The MIM classification accuracy is also higher than that reported in lit- rature by other filter methods (Zhang et al., 2015), although no firm onclusions can be made on this account due to the variety of the atasets used in the most recent articles (Liang et al., 2014; Zhang et l., 2015). In addition to the quantitative assessment of the accuracy of the roposed methods, several experiments are conducted to enable an n-depth comparison of different feature selection methods, accord- ng to several criteria. For example, the nonlinear approach, which ses the ‘maximum of the minimum’ criterion, is compared to the inear approach that employs cumulative summation approximation. n particular, JMIM is compared to JMI, with the results showing that he non-linear approach performed better than the linear approach hen tested with most of the datasets. The goal function based on joint mutual information is compared o the goal function based on conditional mutual information, with he result showing better performance of joint mutual information in ombination with the non-linear criterion. Finally, the effect of using normalised mutual information instead f mutual information is tested by comparing the performance of MIM and JMI with NJMIM and DISR. The results show that, with the iscretised datasets, the methods employing non normalised mutual nformation such as JMI and JMIM perform better than those using ormalised mutual information, such as DISR and NJMIM. This sug- ests that division of the mutual information over the joint entropy oes not improve performance. In addition, the methods are compared in terms of their stability, s described in detail in Section 6.5. The results demonstrate that the ethods employing ‘maximum of the minimum’ criterion, such as MIM, JMIM, and NJMIM, show less average stability than the meth- ds which employ cumulative summation, although there is no dom- nant method. . Conclusion This paper presents two new feature selection methods based on nformation theory: Joint Mutual Information Maximisation (JMIM) nd Normalised Joint Mutual Information Maximisation (NJMIM). hese methods are designed to resolve the problem of choosing edundant and irrelevant features in certain circumstances, which s characteristic of filter feature selection methods. The latter is chieved through the use of the mutual information and the ‘max- mum of the minimum’ nonlinear approach for the goal function esign. 
The methods have been evaluated using public datasets and compared with five other feature selection methods: Joint Mutual Information (JMI), Conditional Mutual Information Maximisation (CMIM), Maximum Relevancy Minimum Redundancy (mRMR), Double Input Symmetrical Relevance (DISR), and Information Gain (IG), in terms of their ability to select features with high discriminative power, and their stability. To evaluate the performance of the proposed methods, an experiment is conducted using eight datasets from the UCI Repository. In addition, to test the behaviour of the methods with extremely small sample datasets, three other datasets from Peng et al. (2005) are used.

Overall, JMIM decreases the average classification error by 0.88% in absolute terms and by almost 6% in relative terms in comparison to the next best performing method, JMI. The statistical significance of the reported results is further confirmed by the ANOVA test. Moreover, this method produces the best trade-off between accuracy and stability.

The limitations of our approach are those which are characteristic of other filter approaches: it disregards the interaction between the features and the classifier, as well as the higher dimensional joint mutual information between more than two features, which can sometimes lead to a suboptimal choice of features.

Future work includes more experiments using other search strategies to validate the proposed method in a wider range of search algorithms; employing parallel computation techniques to estimate higher dimensional joint mutual information, in which two or more of the features from the selected subset are used simultaneously to test the significance of the candidate feature; and automating the selection of the optimal subset by introducing a cut-off parameter measuring the relevancy of the features. Further improvements can be made by studying the information shared between features and class labels and classifying the features into strongly relevant, relevant, weakly relevant, and redundant, based on the information that each feature adds to the selected subset.

In terms of applications relevant to expert and intelligent systems, the JMIM method would be of benefit for choosing the most relevant features in classification tasks. In addition to the analysis of the public datasets in this article, the method could be used in many other applications where the relevance of the features for the classification task needs to be analysed.

References

Bache, K., & Lichman, M. (2013). UCI machine learning repository. Irvine, CA: University of California, School of Information and Computer Science. (http://archive.ics.uci.edu/ml).
Bajwa, I., Naweed, M., Asif, M., & Hyder, S. (2009). Feature based image classification by using principal component analysis. ICGST International Journal on Graphics Vision and Image Processing, 9, 11–17.
Battiti, R. (1994). Using mutual information for selecting features in supervised neural net learning. IEEE Transactions on Neural Networks, 5, 537–550.
Bolón-Canedo, V., Sánchez-Maroño, N., & Alonso-Betanzos, A. (2013). A review of feature selection methods on synthetic data. Knowledge and Information Systems, 34, 483–519.
Brown, G., Pocock, A., Zhao, M., & Lujan, M. (2012). Conditional likelihood maximisation: a unifying framework for information theoretic feature selection. Journal of Machine Learning Research, 13, 27–66.
Chandrashekar, G., & Sahin, F. (2014). A survey on feature selection methods. Computers and Electrical Engineering, 40, 16–28.
Cheng, H., Qin, Z., Feng, C., Wang, Y., & Li, F. (2011). Conditional mutual information-based feature selection analysing for synergy and redundancy. Electronics and Telecommunications Research Institute, 33, 210–218.
Cover, T., & Thomas, J. (2006). Elements of information theory. New York: John Wiley & Sons.
Ding, C., & Peng, H. (2003). Minimum redundancy feature selection from microarray gene expression data. In Proceedings of the computational systems bioinformatics: IEEE Computer Society (pp. 523–528).
Dougherty, J., Kohavi, R., & Sahami, M. (1995). Supervised and unsupervised discretization of continuous features. In Proceedings of the twelfth international conference on machine learning (pp. 194–202).
Duda, R., Hart, P., & Stork, D. (2001). Pattern classification. New York: John Wiley and Sons.
El Akadi, A., El Ouardighi, A., & Aboutajdine, D. (2008). A powerful feature selection approach based on mutual information. International Journal of Computer Science and Network Security, 8, 116–121.
Estévez, P. A., Tesmer, M., Perez, A., & Zurada, J. M. (2009). Normalized mutual information feature selection. IEEE Transactions on Neural Networks, 20, 189–201.
Fleuret, F. (2004). Fast binary feature selection with conditional mutual information. Journal of Machine Learning Research, 5, 1531–1555.
Freeman, C., Kulić, D., & Basir, O. (2015). An evaluation of classifier-specific filter measure performance for feature selection. Pattern Recognition, 48, 1812–1826.
Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3, 1157–1182.
Guyon, I., Gunn, S., Nikravesh, M., & Zadeh, L. A. (2006). Feature extraction: foundations and applications. New York/Berlin, Heidelberg: Springer (Studies in Fuzziness and Soft Computing).
Hoque, N., Bhattacharyya, D. K., & Kalita, J. K. (2014). MIFS-ND: a mutual information-based feature selection method. Expert Systems with Applications, 41(14), 6371–6385.
Jain, A. K., Duin, R. P. W., & Mao, J. (2000). Statistical pattern recognition: a review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 4–37.
Jakulin, A. (2003). Attribute interactions in machine learning (M.Sc. thesis). Computer and Information Science, University of Ljubljana.
Jakulin, A. (2005). Machine learning based on attribute interactions (Ph.D. thesis). Computer and Information Science, University of Ljubljana.
Janecek, A., Gansterer, W., Demel, M., & Ecker, G. (2008). On the relationship between feature selection and classification accuracy. Journal of Machine Learning Research: Workshop and Conference Proceedings, 4, 90–105.
Karegowda, A. G., Jayaram, M. A., & Manjunath, A. S. (2010). Feature subset selection problem using wrapper approach in supervised learning. International Journal of Computer Applications, 1, 13–17.
Kira, K., & Rendell, L. (1992). A practical approach to feature selection. In Proceedings of the 10th international workshop on machine learning (ML92) (pp. 249–256).
Kuhn, H. (1955). The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2, 83–97.
Kuncheva, L. (2007). A stability index for feature selection. In Proceedings of the 25th IASTED international multi-conference on artificial intelligence and applications (pp. 390–395).
Kwok, N., & Choi, C. (2002). Input feature selection for classification problems. IEEE Transactions on Neural Networks, 13, 143–159.
Lee, J., & Kim, D. (2015). Fast multi-label feature selection based on information-theoretic feature ranking. Pattern Recognition, 48, 2761–2771.
Liang, J., Wang, F., Dang, C., & Qian, Y. (2014). A group incremental approach to feature selection applying rough set technique. IEEE Transactions on Knowledge and Data Engineering, 26(2), 294–308.
Lin, T., Li, H., & Tsai, K. (2004). Implementing the Fisher's discriminant ratio in a k-means clustering algorithm for feature selection and dataset trimming. Journal of Chemical Information and Computer Sciences, 44, 76–87.
Liu, H., & Motoda, H. (2008). Computational methods of feature selection. New York: Chapman & Hall/CRC, Taylor & Francis Group.
Liu, H., & Yu, L. (2005). Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on Knowledge and Data Engineering, 17, 491–502.
Meyer, P. E., & Bontempi, G. (2006). On the use of variable complementarity for feature selection in cancer classification. In Proceedings of European workshop on applications of evolutionary computing: EvoWorkshops (pp. 91–102).
Meyer, P. E., Schretter, C., & Bontempi, G. (2008). Information-theoretic feature selection in microarray data using variable complementarity. IEEE Journal of Selected Topics in Signal Processing, 2, 261–274.
Peng, H., Long, F., & Ding, C. (2005). Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27, 1226–1238.
Rodgers, J., & Nicewander, W. A. (1988). Thirteen ways to look at the correlation coefficient. The American Statistician, 42, 59–66.
Qian, Y., Wang, Q., Cheng, H., Liang, J., & Dang, C. (2015). Fuzzy-rough feature selection accelerator. Fuzzy Sets and Systems, 258, 61–78.
Saeys, Y., Inza, I., & Larranaga, P. (2007). A review of feature selection techniques in bioinformatics. Bioinformatics, 23, 2507–2517.
Tang, E. K., Suganthana, P. N., Yao, X., & Qina, A. K. (2005). Linear dimensionality reduction using relevance weighted LDA. Pattern Recognition, 38, 485–493.
Turk, M., & Pentland, A. (1991). Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3, 72–86.
Vergara, J., & Estévez, P. (2014). A review of feature selection methods based on mutual information. Neural Computing and Applications, 24, 175–186.
Vidal-Naquet, M., & Ullman, S. (2003). Object recognition with informative features and linear classification. In Proceedings of the 10th IEEE international conference on computer vision (pp. 281–289).
Yang, H., & Moody, J. (1999). Feature selection based on joint mutual information. In Proceedings of international ICSC symposium on advances in intelligent data analysis (pp. 22–25).
Yu, H., & Yang, J. (2001). A direct LDA algorithm for high-dimensional data with application to face recognition. Pattern Recognition, 34, 2067–2070.
Yu, L., & Liu, H. (2004). Efficient feature selection via analysis of relevance and redundancy. Journal of Machine Learning Research, 5, 1205–1224.
Yu, L., Ding, C., & Loscalzo, S. (2008). Stable feature selection via dense feature groups. In Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 803–811).
Zhang, Y., Yang, A., Xiong, C., Wang, T., & Zhang, Z. (2014). Feature selection using data envelopment analysis. Knowledge-Based Systems, 64, 70–80.
Zhang, Y., Yang, C., Yang, A., Xiong, C. Y., Zhou, X., & Zhang, Z. (2015). Feature selection for classification with class-separability strategy and data envelopment analysis. Neurocomputing, 166, 172–184.