Feature selection using Joint Mutual Information Maximisation

Expert Systems With Applications 42 (2015) 8520–8532

Mohamed Bennasar, Yulia Hicks, Rossitza Setchi∗
School of Engineering, Cardiff University, Cardiff CF24 3AA, UK

∗ Corresponding author. Tel: +44 2920875720; fax: +44 2920874716. E-mail addresses: BennasarM@cf.ac.uk (M. Bennasar), HicksYA@cf.ac.uk (Y. Hicks), Setchi@cf.ac.uk (R. Setchi).

Keywords: Feature selection; Mutual information; Joint mutual information; Conditional mutual information; Subset feature selection; Classification; Dimensionality reduction; Feature selection stability

Abstract

Feature selection is used in many application areas relevant to expert and intelligent systems, such as data mining and machine learning, image processing, anomaly detection, bioinformatics and natural language processing. Feature selection based on information theory is a popular approach due to its computational efficiency, scalability in terms of the dataset dimensionality, and independence from the classifier. Common drawbacks of this approach are the lack of information about the interaction between the features and the classifier, and the selection of redundant and irrelevant features. The latter is due to the limitations of the employed goal functions, which lead to overestimation of the feature significance. To address this problem, this article introduces two new nonlinear feature selection methods, namely Joint Mutual Information Maximisation (JMIM) and Normalised Joint Mutual Information Maximisation (NJMIM); both methods use mutual information and the 'maximum of the minimum' criterion, which alleviates the problem of overestimation of the feature significance, as demonstrated both theoretically and experimentally. The proposed methods are compared with five competing methods using eleven publicly available datasets. The results demonstrate that the JMIM method outperforms the other methods on most tested public datasets, reducing the relative average classification error by almost 6% in comparison to the next best performing method. The statistical significance of the results is confirmed by the ANOVA test. Moreover, this method produces the best trade-off between accuracy and stability.

© 2015 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

1. Introduction

High dimensional data is a significant problem in both supervised and unsupervised learning (Janecek, Gansterer, Demel, & Ecker, 2008), which is becoming even more prominent with the recent explosion in the size of the available datasets, both in terms of the number of data samples and the number of features in each sample (Zhang et al., 2015). The main motivation for reducing the dimensionality of the data and keeping the number of features as low as possible is to decrease the training time and enhance the classification accuracy of the algorithms (Guyon & Elisseeff, 2003; Jain, Duin, & Mao, 2000; Liu & Yu, 2005).

Dimensionality reduction methods can be divided into two main groups: those based on feature extraction and those based on feature selection. Feature extraction methods transform existing features into a new feature space of lower dimensionality. During this process, new features are created based on linear or nonlinear combinations of features from the original set. Principal Component Analysis (PCA)
(Bajwa, Naweed, Asif, & Hyder, 2009; Turk & Pentland, 1991) and Linear Discriminant Analysis (LDA) (Tang, Suganthana, Yao, & Qina, 2005; Yu & Yang, 2001) are two examples of such algorithms. Feature selection methods reduce the dimensionality by selecting a subset of features which minimises a certain cost function (Guyon, Gunn, Nikravesh, & Zadeh, 2006; Jain et al., 2000). Unlike feature extraction, feature selection does not alter the data and, as a result, it is the preferred choice when an understanding of the underlying physical process is required. Feature extraction may be preferred when only discrimination is needed (Jain et al., 2000).

Feature selection is used in many application areas relevant to expert and intelligent systems, such as data mining and machine learning, image processing, anomaly detection, bioinformatics and natural language processing (Hoque, Bhattacharyya, & Kalita, 2014). Feature selection is normally used at the data pre-processing stage before training a classifier. This process is also known as variable selection, feature reduction or variable subset selection.

The topic of feature selection has been reviewed in detail in a number of recent review articles (Bolón-Canedo, Sánchez-Maroño, & Alonso-Betanzos, 2013; Brown, Pocock, Zhao, & Lujan, 2012; Chandrashekar & Sahin, 2014; Vergara & Estévez, 2014). Usually, feature selection methods are divided into two categories in terms of evaluation strategy, in particular, classifier dependent ('wrapper' and 'embedded' methods) or classifier independent ('filter' methods). Wrapper methods search the feature space, and test all possible subsets of feature combinations by using the prediction accuracy of a classifier as a measure of the selected subset's quality, without modifying the learning function. Therefore, wrapper methods can be combined with any learning machine (Guyon et al., 2006). They perform well because the selected subset is optimised for the classification algorithm. On the other hand, wrapper methods may suffer from over-fitting to the learning algorithm. This means that any changes in the learning model may reduce the usefulness of the subset. In addition, these methods are very expensive in terms of computational complexity, especially when handling extremely high-dimensional data (Brown et al., 2012; Cheng et al., 2011; Ding & Peng, 2003; Karegowda, Jayaram, & Manjunath, 2010).

The feature selection stage in the embedded methods is combined with the learning stage.
These methods are less expensive in terms of computational complexity and less prone to over-fitting; however, they are limited in terms of generalisation, because they are very specific to the learning algorithm used (Guyon et al., 2006).

Classifier-independent methods rank features according to their relevance to the class label in supervised learning. The relevance score is calculated using distance, information, correlation and consistency measures. Many techniques have been proposed to compute the relevance score, including Pearson correlation coefficients (Rodgers & Nicewander, 1988), Fisher's discriminant ratio ("F score") (Lin, Li, & Tsai, 2004), the Scatter criterion (Duda, Hart, & Stork, 2001), the Single Variable Classifier (SVC) (Guyon & Elisseeff, 2003), Mutual Information (Battiti, 1994), the Relief algorithm (Kira & Rendell, 1992; Liu & Motoda, 2008), Rough Set Theory (Liang, Wang, Dang, & Qian, 2014) and Data Envelopment Analysis (Zhang, Yang, Xiong, Wang, & Zhang, 2014).

The main advantages of the filter methods are their computational efficiency, scalability in terms of the dataset dimensionality, and independence from the classifier (Saeys, Inza, & Larranaga, 2007). A common drawback of these methods is the lack of information about the interaction between the features and the classifier, and the selection of redundant and irrelevant features due to the limitations of the employed goal functions, which lead to overestimation of the feature significance.

Information theory (Cover & Thomas, 2006) has been widely applied in filter methods, where information measures such as mutual information (MI) are used as a measure of the features' relevance and redundancy (Battiti, 1994). MI does not make an assumption of linearity between the variables, and can deal with categorical and numerical data with two or more class values (Meyer, Schretter, & Bontempi, 2008). There are several alternative measures in information theory that can be used to compute the relevance of features, namely mutual information, interaction information, conditional mutual information, and joint mutual information.

This paper contributes to the knowledge in the area of feature selection by proposing two new nonlinear feature selection methods based on information theory. The proposed methods aim to overcome the limitations of the current state-of-the-art filter feature selection methods, such as overestimation of the feature significance, which causes selection of redundant and irrelevant features. This is achieved through the introduction of a new goal function based on joint mutual information and the 'maximum of the minimum' nonlinear approach. As shown in the evaluation section, one of the proposed methods outperforms the competing feature selection methods in terms of classification accuracy, decreasing the average classification error by 0.88% in absolute terms and by almost 6% in relative terms in comparison to the next best performing method. In addition, it produces the best trade-off between accuracy and stability. The statistical significance of the reported results is further confirmed by the ANOVA test.

This paper also reviews existing feature selection methods, highlighting their common limitations, and compares the performance of the proposed and existing methods on the basis of several criteria. For example, a nonlinear approach, which employs the 'maximum of the minimum' criterion, is compared to a linear approach, which employs cumulative summation approximation.
To optimise the nonlinear approach, a goal function based on joint mutual information is compared to a goal function based on conditional mutual information. Finally, the effect of using normalised mutual information instead of mutual information is tested.

The rest of the paper is organised as follows. Section 2 presents the principles of information theory, Section 3 reviews related work, Section 4 discusses the limitations of current feature selection criteria, and Section 5 introduces the proposed methods. Section 6 describes the conducted experiments and discusses the results. Section 7 concludes the paper.

2. Information theory

This section introduces the principles of information theory by focusing on entropy and mutual information, and explains the reasons for employing them in feature selection.

The entropy of a random variable is a measure of its uncertainty and a measure of the average amount of information required to describe the random variable (Cover & Thomas, 2006). The entropy of a discrete random variable X = (x1, x2, …, xN) is denoted by H(X), where xi refers to the possible values that X can take. H(X) is defined as:

H(X) = −∑_{i=1}^{N} p(xi) log(p(xi)),   (1)

where p(xi) is the probability mass function. The value of p(xi), when X is discrete, is:

p(xi) = (number of instances with value xi) / (total number of instances N).   (2)

The base of the logarithm is 2, so that entropy is measured in bits and, for a binary variable, 0 ≤ H(X) ≤ 1. For any two discrete random variables X and C = (c1, c2, …, cM), the joint entropy is defined as:

H(X, C) = −∑_{j=1}^{M} ∑_{i=1}^{N} p(xi, cj) log(p(xi, cj)),   (3)

where p(xi, cj) is the joint probability mass function of the variables X and C. The conditional entropy of the variable C given X is defined as:

H(C|X) = −∑_{j=1}^{M} ∑_{i=1}^{N} p(xi, cj) log(p(cj|xi)).   (4)

The conditional entropy is the amount of uncertainty left in C when the variable X is introduced, so it is less than or equal to the entropy of C. The conditional entropy is equal to the entropy if, and only if, the two variables are independent. The relation between joint entropy and conditional entropy is:

H(X, C) = H(X) + H(C|X),   (5)

H(X, C) = H(C) + H(X|C).   (6)

Mutual Information (MI) is the amount of information that both variables share, and is defined as:

I(X; C) = H(C) − H(C|X).   (7)

MI can be expressed as the amount of information provided by variable X, which reduces the uncertainty of variable C. MI is zero if the random variables are statistically independent. MI is symmetric, so:

I(X; C) = I(C; X),   (8)

I(X; C) = H(X) − H(X|C),   (9)

I(X; C) = H(X) + H(C) − H(X, C).   (10)

The conditional MI and the joint MI are defined as:

I(X; C|Y) = H(X|Y) − H(X|Y, C),   (11)

I(X, Y; C) = I(X; C|Y) + I(Y; C),   (12)

where Y is a discrete variable, Y = (y1, y2, …, yN). Interaction information can be defined as the amount of information that is shared by all features, but is not found within any feature subset. Mathematically, the relation between interaction information and MI is defined as:

I(X; Y; C) = I(X, Y; C) − I(X; C) − I(Y; C).   (13)

High interaction information means that a large amount of information can be obtained by considering the three variables together (Jakulin, 2003). Interaction information can be positive, negative or zero (Jakulin, 2005).
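To make these definitions concrete, the following is a minimal Python sketch that estimates the quantities above from samples of discrete variables, using empirical probability mass functions (Eq. (2)) and base-2 logarithms as assumed above. The function names are illustrative and are not part of the original work.

import numpy as np
from collections import Counter

def entropy(*variables):
    # Empirical (joint) entropy in bits of one or more discrete variables,
    # Eqs. (1) and (3), with probabilities estimated as relative frequencies (Eq. (2)).
    joint = list(zip(*variables))
    counts = np.array(list(Counter(joint).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def mutual_information(x, c):
    # I(X; C) = H(X) + H(C) - H(X, C), Eq. (10)
    return entropy(x) + entropy(c) - entropy(x, c)

def joint_mutual_information(x, y, c):
    # I(X, Y; C), obtained by treating the pair (X, Y) as a single variable in Eq. (10)
    return entropy(x, y) + entropy(c) - entropy(x, y, c)

def interaction_information(x, y, c):
    # I(X; Y; C) = I(X, Y; C) - I(X; C) - I(Y; C), Eq. (13)
    return (joint_mutual_information(x, y, c)
            - mutual_information(x, c) - mutual_information(y, c))

For example, with x = [0, 0, 1, 1] and c = [0, 0, 1, 1], mutual_information(x, c) returns 1.0 bit, whereas two statistically independent variables give a value close to zero.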
3. Related work

The focus of the work presented in this article is on filter feature selection methods due to their popularity, and thus the review part of this article focuses specifically on these methods. For a more detailed review of feature selection methods, the recent review articles in this area are recommended (Bolón-Canedo et al., 2013; Brown et al., 2012; Chandrashekar & Sahin, 2014; Vergara & Estévez, 2014).

Information theory has been employed by many filter feature selection methods. Information Gain (IG) (Guyon & Elisseeff, 2003) is the simplest of these methods. It is classified as a univariate feature selection method, as it ranks features based on the value of their mutual information with the class label. Simplicity and low computational cost are the main advantages of this method. However, it does not take into consideration the dependency between the features; rather, it assumes independence, which is not always the case. Therefore, some of the selected features may carry redundant information. To tackle this problem, new methods have been proposed for selecting relevant features which are non-redundant with respect to each other.

For a feature set F = {f1, f2, …, fN}, the feature selection process identifies a subset of features S with dimension k, where k ≤ N and S ⊆ F. In theory, the selected subset S should maximise the joint mutual information between the class label C and the subset S of a fixed size k:

I(S; C) = I(f1, f2, …, fk; C).   (14)

However, such an approach is impractical, due to the number of calculations and the limited number of observations available for the calculation of the high-dimensional probability density function. As a result, many methods use heuristic approaches to approximate the ideal solution.

Generally, the filter criteria are based on the concepts of feature relevance, redundancy and complementarity (Vergara & Estévez, 2014). The methods which are based on information theory can be split into two groups: linear criteria, which are linear combinations of MI terms; and nonlinear criteria, which use maximum or minimum operations or normalised MI in their goal functions (Brown et al., 2012).

Battiti (1994) introduces a first-order incremental search algorithm, known as the Mutual Information Feature Selection (MIFS) method, for selecting the most relevant k features from an initial set of n features. A greedy selection method is used to build the subset. Instead of calculating the joint MI between the selected features and the class label, Battiti studies the MI between the candidate feature and the class, and the relationship between the candidate and the already-selected features.

Kwok and Choi (2002) propose the MIFS-U method to improve the performance of the MIFS method by making a better estimation of the MI between the input feature and the class label. Another variant of MIFS, the mRMR method, is proposed by Peng, Long, and Ding (2005). The redundancy term in mRMR is divided by the cardinality |S| of the selected subset S to balance the magnitude of this term, and to avoid it growing very large as the subset expands. As reported in the existing literature (Brown et al., 2012; Peng et al., 2005), this modification allows mRMR to outperform the conventional MIFS and MIFS-U methods.

Estévez, Tesmer, Perez, and Zurada (2009) propose an enhanced version of MIFS, MIFS-U and mRMR, called Normalised Mutual Information Feature Selection (NMIFS). It uses normalised MI in the redundancy term instead of MI.
The normalisation of MI prevents bias towards multivalued features and limits the value of MI to the range of zero to unity (Estévez et al., 2009).

Hoque et al. (2014) propose a method called MIFS-ND. The method calculates the mutual information between the candidate feature and the class label, and the average of the mutual information between the candidate feature and the features within the selected subset. A genetic algorithm is employed to select the feature that maximises the mutual information with the class, and minimises the average mutual information with the other selected features.

Other proposed criteria (Yang & Moody, 1999; Fleuret, 2004; Meyer & Bontempi, 2006; Vidal-Naquet & Ullman, 2003) use the MI between the candidate feature and the class label in the context of the selected subset features. They utilise conditional mutual information, joint mutual information or feature interaction. Some of them apply cumulative summation approximations (Yang & Moody, 1999; Meyer & Bontempi, 2006), while others use the 'maximum of the minimum' criterion (Fleuret, 2004; Vidal-Naquet & Ullman, 2003).

Yang and Moody (1999) propose a feature selection method called Joint Mutual Information (JMI). In this method, the candidate feature that maximises the cumulative summation of joint mutual information with the features of the selected subset is chosen and added to the subset. This method is reported to perform well in terms of classification accuracy and stability (Brown et al., 2012). Meyer and Bontempi (2006) introduce a similar method known as Double Input Symmetrical Relevance (DISR). The joint mutual information in the goal function of this method is substituted with symmetrical relevance.

Other methods that employ the 'maximum of the minimum' criterion have been proposed. Vidal-Naquet and Ullman (2003) introduce a method called Information Fragment (IF), while Fleuret (2004) proposes Conditional Mutual Information Maximisation (CMIM); both have been reported to perform well with KNN and SVM classifiers in later work (Freeman, Kulić, & Basir, 2015).

There are also a number of other methods which rely on maximising feature interaction. For example, Jakulin (2005) proposes the Interaction Capping (IC) method, while El Akadi, El Ouardighi, and Aboutajdine (2008) propose a method which uses feature interaction, known as Interaction Gain Based Feature Selection (IGFS). However, this criterion is effectively the same as JMI.

A general formula based on conditional likelihood has been proposed by Brown et al. (2012), based on a study of MI-based feature selection criteria; this formula can be used to derive many of the methods listed in this section. In practice, most of the methods which are linear combinations of MI can be derived from this formula. However, the authors state that the goal functions of the nonlinear methods cannot be generated by their formula.

Feature selection techniques have also been used for multi-label data sets. Lee and Kim (2015) proposed a multi-label feature selection method based on information theory, in which they introduce a new score function to measure the importance of each feature to the multiple labels.
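For reference, and using the notation introduced in Section 2, the goal functions of the main MI-based criteria reviewed above are commonly stated as follows. This summary paraphrases Battiti (1994), Peng et al. (2005), Yang and Moody (1999), Meyer and Bontempi (2006) and Fleuret (2004), as presented in Brown et al. (2012), rather than reproducing their exact notation; β denotes the redundancy weight used by MIFS:

fMIFS = arg max_{fi∈F−S} [ I(fi; C) − β ∑_{fs∈S} I(fi; fs) ],

fmRMR = arg max_{fi∈F−S} [ I(fi; C) − (1/|S|) ∑_{fs∈S} I(fi; fs) ],

fJMI = arg max_{fi∈F−S} ∑_{fs∈S} I(fi, fs; C),

fDISR = arg max_{fi∈F−S} ∑_{fs∈S} I(fi, fs; C) / H(fi, fs, C),

fCMIM = arg max_{fi∈F−S} min_{fs∈S} I(fi; C|fs).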
Two other notable approaches in the area of filter feature selection are the application of rough set theory (Liang et al., 2014) and the application of Data Envelopment Analysis (Zhang et al., 2014). One of the issues affecting the methods based on fuzzy-rough sets is their time inefficiency, with many existing attempts to improve it (Qian, Wang, Cheng, Liang, & Dang, 2015). The methods using DEA for feature selection also suffer from the problem of large computational cost, although it was improved in a more recent publication (Zhang et al., 2015), as well as the problem of the selection of redundant features. The latter problem is characteristic of most of the methods listed above, and the reasons for this problem are investigated in more detail in Section 4.

4. Limitations of the current feature selection criteria

In general, most of the methods listed in the previous section use criteria consisting of two elements: the relevancy term and the redundancy term. The methods attempt to simultaneously maximise the relevancy term whilst minimising the redundancy term. It has been noted in the literature that such feature selection methods have a number of limitations (Estévez et al., 2009; Peng et al., 2005).

For example, MIFS and MIFS-U share a common problem: when the number of selected features grows, the redundancy term grows in magnitude with respect to the relevancy term. In this case some irrelevant features may be selected. This problem has been partly solved in the mRMR, NMIFS and MIFS-ND methods by dividing the redundancy term by the cardinality of the subset.

Another problem shared by all of the above methods (MIFS, MIFS-U, mRMR, NMIFS, and MIFS-ND) is that the redundancy term is calculated based on the value of the MI between the candidate feature and the features within the selected subset, without any consideration of the class label. The features may share information with each other, but that does not mean they are redundant; they may in fact share different information with the class.

Yet another problem, particular to the methods employing cumulative summation and forward search to approximate the solution of Eq. (14) (such as MIFS, mRMR, NMIFS, MIFS-ND, DISR, IGFS, and JMI), is the overestimation of the significance of some candidate features. For example, this can occur when the candidate feature is in complete correlation with one or several pre-selected features, but at the same time is almost independent from the majority of the subset. In such a situation, the value of the goal function will be high despite the redundancy of the candidate feature to some features within the subset.

In practice, the significance of each of the above problems depends on the data and the characteristics of each particular data set.

5. Proposed methods for feature selection

In this paper, two new methods for feature selection are proposed. The methods employ joint mutual information, and use the 'maximum of the minimum' approach. The proposed methods aim to address the problem of overestimation of the significance of some features, which occurs when cumulative summation approximation is employed.
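To illustrate this problem concretely, the following self-contained Python sketch builds a small synthetic example; all data, names and parameters are assumptions introduced for illustration and are not taken from the article. Three noisy copies of the class label form the already-selected subset S, and two candidates are scored: one that duplicates a selected feature, and one that carries genuinely new, slightly weaker information. On such data the cumulative sum of joint MI terms typically favours the duplicated candidate, whereas taking the minimum over the subset exposes its redundancy.

import numpy as np
from collections import Counter

def H(*variables):
    # Empirical joint entropy in bits of discrete variables (Section 2)
    counts = np.array(list(Counter(zip(*variables)).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def jmi(fi, fs, c):
    # Joint mutual information I(fi, fs; C)
    return H(fi, fs) + H(c) - H(fi, fs, c)

rng = np.random.default_rng(0)
n = 5000
C = rng.integers(0, 2, n)                               # binary class label
noisy = lambda acc: np.where(rng.random(n) < acc, C, 1 - C)
S = [noisy(0.8), noisy(0.8), noisy(0.8)]                # already-selected features

candidates = {
    "duplicate of a selected feature": S[0].copy(),
    "new, weaker feature": noisy(0.7),
}
for name, cand in candidates.items():
    scores = [jmi(cand, fs, C) for fs in S]
    print(f"{name}: sum = {sum(scores):.3f}, min = {min(scores):.3f}")

With this toy data the duplicated candidate obtains the larger cumulative sum but the smaller minimum, which is exactly the situation addressed by the 'maximum of the minimum' criterion introduced below.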
For a feature set F = {f1, f2, …, fN} of a data set D of dimension N, the feature selection process identifies a subset of features S with dimension K, where K ≤ N and S ⊆ F. The subset S should produce equal or better classification accuracy compared to the feature set F. In other words, feature selection defines the subset of features that maximises the mutual information with the class label, I(S; C).

In the past, a number of alternative definitions of feature relevance have been used (Battiti, 1994; Brown et al., 2012; Vergara & Estévez, 2014; Estévez et al., 2009). The following definitions are used in this work.

Definition 1. (Feature relevance). Feature fi is more relevant to the class label C than feature fj in the context of the already selected subset S when I(fi, S; C) > I(fj, S; C).

Definition 2. (Minimum joint mutual information). Let F be the full set of features, and let S be the subset of features that are selected already. Let fi ∈ F−S and fs ∈ S. The m-Joint MI is the minimum value of the joint mutual information that the candidate feature fi shares with the class label C when it is joined with every feature within the subset S individually, hence min_{s=1,2,…,k} I(fi, fs; C).

Lemma 1. For a feature fi, if the m-Joint MI is larger than that of all other features fj, where fi and fj ∈ F−S (i ≠ j), then fi is the most relevant feature to the class label C in the context of the subset S.

Proof. Let S = {f1, f2, …, fK}. The joint mutual information of fi and each feature in S with C is calculated. The minimum value of this mutual information (m-Joint) is the lowest amount of new information that the feature fi adds to the information shared between S and C. The feature that produces the maximum m-Joint is the feature that adds the maximum information to that shared between S and C, which means it is the feature which is the most relevant to the class label C in the context of the subset S, according to Definition 1.

Definition 3. Candidate feature fi is redundant to the selected features within the subset S if fi does not share new information with the class C.

Lemma 2. Let F be the full set of features, let S be the subset of features that are selected already, and let fi ∈ F−S and fs ∈ S. If the feature fi is highly correlated with a feature fs in the subset, then I(fi; C) ≅ I(fs; C) ≅ I(fi, fs; C).

Proof. If the feature fi is highly correlated with a feature fs, then the probability mass functions of fi, fs, and (fi, fs) are approximately equal, p(fi) ≅ p(fs) ≅ p(fi, fs). Since the entropy is defined as H(X) = −∑_{i=1}^{N} p(xi) log(p(xi)), it follows that H(fi) ≅ H(fs) ≅ H(fi, fs). Since the mutual information is defined as I(X; C) = H(X) + H(C) − H(X, C), then I(fi; fs) ≅ H(fs) ≅ H(fi) and I(fi; C) ≅ I(fs; C). By the same definition, I(fi, fs; C) = H(fi, fs) + H(C) − H(fi, fs, C), which can be simplified to I(fi, fs; C) ≅ H(fi) + H(C) − H(fi, C). According to Eq. (10), I(fi, fs; C) ≅ I(fi; C) ≅ I(fs; C).

5.1. Joint Mutual Information Maximisation (JMIM)

All methods listed in the previous section attempt to optimise the relationship between relevancy and redundancy when selecting features by approximating the solution of Eq. (14). The JMI method is reported in the existing literature as being the method which selects the most relevant features (Brown et al., 2012). It studies relevancy and redundancy, and takes into consideration the class label when calculating MI.
However, the method still allows overestimation of the significance of some features, for example, when the candidate feature is in complete correlation with one or a few pre-selected features, but at the same time is almost independent from the majority of the subset. In such a situation, the value of the JMI goal function will be high despite the redundancy of the candidate feature to some features within the subset. This drawback is evident in almost all methods that use the cumulative sum approximation.

For this reason, a new method called Joint Mutual Information Maximisation (JMIM) is proposed in this research. JMIM employs joint mutual information and the 'maximum of the minimum' approach, which should choose the most relevant features according to Lemma 1. Following from this, the features are selected by JMIM according to the following new criterion:

fJMIM = arg max_{fi∈F−S} ( min_{fs∈S} I(fi, fs; C) ),   (22)

where

I(fi, fs; C) = I(fs; C) + I(fi; C|fs),   (23)

I(fi, fs; C) = H(C) − H(C|fi, fs),   (24)

I(fi, fs; C) = [ −∑_{c∈C} p(c) log(p(c)) ] + [ ∑_{c∈C} ∑_{fi∈F−S} ∑_{fs∈S} p(fi, fs, c) log(p(c|fi, fs)) ].   (25)

The method uses the following iterative forward greedy search algorithm to find the relevant feature subset of size k within the feature space:

Algorithm 1. Forward greedy search.
1. (Initialisation) Set F ← "initial set of n features"; S ← "empty set".
2. (Computation of the MI with the output class) For every fi ∈ F compute I(C; fi).
3. (Choice of the first feature) Find a feature fi that maximises I(C; fi); set F ← F \ {fi}; set S ← {fi}.
4. (Greedy selection) Repeat until |S| = k: (Selection of the next feature) Choose the feature fi = arg max_{fi∈F−S} ( min_{fs∈S} I(fi, fs; C) ); set F ← F \ {fi}; set S ← S ∪ {fi}.
5. (Output) Output the set S with the selected features.

5.2. Advantages over existing alternative methods

The Venn diagrams in Fig. 1 show different scenarios for the relationship between the candidate feature fi, the selected feature fs, and the class label C. Fig. 1a illustrates the case in which methods like MIFS, NMIFS or mRMR will fail to select fi because it is redundant with respect to fs, although each of them shares different information about C, and the correlation is not in the context of C.

The goal function of JMIM is similar to the goal function of CMIM (Section 3), as CMIM also uses the 'maximum of the minimum' approach. The main difference is that CMIM maximises the amount of information the candidate feature fi contributes given the pre-selected feature fs (i.e. fi is selected for any complementing fs), whereas JMIM selects the feature that maximises the joint mutual information with fs. Fig. 1b and c is used to explain this difference further. The figures represent two candidate features fi and fj, and the subsequent selection of one of them. I(fi, fs; C) is the union of areas 1, 2, and 3; I(fi; C|fs) is area 1 in Fig. 1b. The CMIM method would select fi in Fig. 1b, even though its complementing feature fs from the subset does not carry as much information as the feature fj in Fig. 1c. Conversely, JMIM would select the feature that maximises the joint mutual information, so it would select feature fj in Fig. 1c. Therefore, the joint mutual information between the candidate feature and at least one of the pre-selected features will be high, which can increase the discrimination power of the selected subset.

Fig. 1. Venn diagrams illustrating the relation between features and class.
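As a concrete illustration of Algorithm 1, the following minimal Python sketch implements the forward greedy search with the JMIM criterion of Eq. (22), assuming the features have already been discretised; probabilities are estimated empirically, and the function and variable names are illustrative rather than taken from the authors' implementation.

import numpy as np
from collections import Counter

def H(*variables):
    # Empirical joint entropy in bits of discrete variables (Section 2)
    counts = np.array(list(Counter(zip(*variables)).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def mi(x, c):
    return H(x) + H(c) - H(x, c)               # I(X; C), Eq. (10)

def jmi(fi, fs, c):
    return H(fi, fs) + H(c) - H(fi, fs, c)     # I(fi, fs; C), as in Eq. (24)

def jmim_select(X, c, k):
    # X: (n_samples, n_features) array of discretised features; c: class labels.
    remaining = set(range(X.shape[1]))
    # Steps 2-3: the first feature maximises I(C; fi)
    first = max(remaining, key=lambda j: mi(X[:, j], c))
    selected, remaining = [first], remaining - {first}
    # Step 4: greedy selection with the 'maximum of the minimum' joint MI, Eq. (22)
    while len(selected) < k and remaining:
        best = max(remaining,
                   key=lambda j: min(jmi(X[:, j], X[:, s], c) for s in selected))
        selected.append(best)
        remaining.remove(best)
    return selected

A call such as jmim_select(X_discretised, y, k=10) returns the indices of the selected features. The NJMIM variant described in the next subsection is obtained by replacing the jmi score with the symmetrical relevance I(fi, fs; C)/H(fi, fs, C).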
5.3. Normalised Joint Mutual Information Maximisation (NJMIM)

The second method proposed in this paper uses a goal function which is very similar to the one used in JMIM (Section 5.1), with the difference being that symmetrical relevance is used as an alternative to MI. This method is called Normalised Joint Mutual Information Maximisation (NJMIM). It is proposed in order to study the effect of using normalised MI instead of MI. The proposed NJMIM selection criterion is presented in Eq. (26):

fNJMIM = arg max_{fi∈F−S} ( min_{fs∈S} SR(fi, fs; C) ),   (26)

where the symmetrical relevance is defined as

SR(F; C) = I(F; C) / H(F, C),   (27)

so that the criterion can be written as:

fNJMIM = arg max_{fi∈F−S} ( min_{fs∈S} I(fi, fs; C) / H(fi, fs, C) ).   (28)

The same iterative forward greedy search algorithm is used to find the subset of features within the candidate feature space.

6. Evaluation

The performance of the two methods proposed in this paper, JMIM and NJMIM, is compared with the results produced by five other methods: CMIM, DISR, mRMR, JMI, and IG. These methods are chosen for the following four reasons: (i) these methods are reported in the literature to provide good performance (Brown et al., 2012; Freeman et al., 2015); (ii) the choice of these methods allows the comparison of the 'maximum of the minimum' approach used by JMIM and NJMIM with the cumulative summation used by JMI and DISR; (iii) it enables the analysis of the effect of using symmetrical relevance instead of MI on the algorithm's performance; and (iv) it allows the comparison of the effects of using joint mutual information and conditional mutual information, which are employed in JMIM and CMIM, respectively.

The seven methods are applied to data from different domains, such as life sciences, physical sciences, engineering, business, handwriting recognition, and gene microarrays. The features within these datasets have different characteristics, being binary, discrete or categorical, or continuous. The continuous features are discretised into 10 equal intervals, using the Equal Width Discretisation (EWD) method (Dougherty, Kohavi, & Sahami, 1995).
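The following is a minimal sketch of this discretisation step, assuming each continuous feature is processed independently; the handling of boundary values and constant features in the original experiments may differ.

import numpy as np

def equal_width_discretise(x, n_bins=10):
    # Split the range of a continuous feature into n_bins equal-width intervals
    # and return the bin index (0 .. n_bins-1) of each value.
    x = np.asarray(x, dtype=float)
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    return np.digitize(x, edges[1:-1])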
Two classifiers are used to evaluate the quality of the selected subsets. These are Naïve Bayes with kernel density estimation, and 3-Nearest Neighbours. Both classifiers are available in the Matlab Statistics Toolbox. The average classification accuracy is used as a measure of the quality of the selected features. Five-fold cross-validation is employed when processing feature selection and feature validation; therefore each fold is used for validation once. This means that 80% of the data is used for feature selection and classification training, whilst 20% is used for validation. This is repeated five times, using the whole dataset for validation over the course of the five experiments. Overall, five different subsets of samples are used to generate five different subsets of features. Discretisation is performed as a pre-processing step for all data prior to the feature selection step.

Fig. 2 shows the evaluation framework used in this experiment. To test the impact of adding each feature to the subset on the classification accuracy, training and validation are performed after the selection of each feature in the subset.

Fig. 2. Evaluation framework.

6.1. Data

Eight datasets from the UCI Repository (Bache & Lichman, 2013) are used in the experiment (Table 1). These datasets have been previously used in similar research (Brown et al., 2012; El Akadi et al., 2008; Cheng et al., 2011). They have different characteristics in terms of the number of classes, features, instances and feature types.

Table 1
UCI datasets used in the experiment.

No  Data set         Number of features  Number of instances  Number of classes  Ratio
1   Credit approval  15                  690                  2                  54
2   Gas sensor       128                 13874                6                  198
3   Libra movement   90                  483                  15                 3
4   Parkinson        22                  195                  2                  11
5   Breast           30                  569                  2                  28
6   Sonar            60                  208                  2                  10
7   Musk             166                 7074                 2                  354
8   Handwriting      649                 2000                 10                 20

An example-feature ratio (Brown et al., 2012) is used as an indication of the difficulty of the feature selection task for a dataset. This ratio is computed as N/(mC), where N is the number of instances, m is the median number of values that the features take, and C is the number of classes. The most challenging feature selection tasks are those performed using datasets with a small example-feature ratio; the libra movement dataset is the most challenging dataset.

To test the behaviour of the methods with extremely small samples, datasets from Peng et al. (2005) are also used in the evaluation process; these are shown in Table 2.

Table 2
Additional datasets used in the experiment (Peng et al., 2005).

No  Data set  Number of features  Number of instances  Number of classes  Ratio
1   Colon     2000                62                   2                  10
2   Leukemia  7070                72                   2                  12
3   Lymphoma  4026                96                   9                  4

6.2. Performance analysis on low dimensional datasets

Figs. 3–5 show the average classification accuracy for the three datasets with low numbers of features (Parkinson, credit approval and breast). The classification accuracy is computed over the whole size of the selected subset, from 1 feature up to 20 features (or all features of the dataset in the case of the credit approval dataset).

Fig. 3. Average classification accuracy achieved with the Parkinson dataset.
Fig. 4. Average classification accuracy achieved with the credit approval dataset.
Fig. 5. Average classification accuracy achieved with the breast dataset.

As shown in Fig. 3, which illustrates the experiment with the first dataset, JMIM achieves the highest average accuracy (90.77%) with just 8 features, which is higher than the accuracy of CMIM (90.26%) and JMI (88.97%). On the other hand, methods that use normalised MI, such as NJMIM and DISR, perform less well than JMIM and JMI, which use MI. This is expected for datasets with discrete features, because the normalisation may reduce the significance of a feature when it has high entropy and shares a high amount of information with the class label. The mRMR and IG methods perform poorly on this dataset.
JMIM and JMI again achieve the highest classification accuracy on the credit approval dataset, using only 4 features to reach an accuracy of 82.92%. The accuracy of CMIM is 79.17% with the same number of features. The other methods perform worse compared to JMIM and JMI with the same number of features. The figure also shows that the methods using normalised MI do not perform as well as those which use MI. Features selected by the JMIM and JMI methods have a higher discriminative power than the features selected by NJMIM and DISR. NJMIM performs better than DISR, yet both perform poorly.

The breast dataset has 20 features selected. As seen in Fig. 5, JMIM does not achieve the highest classification accuracy. However, it produces a high accuracy (95.87%) with only 5 features, while mRMR requires 14 features to achieve the same accuracy. JMIM performs better in comparison with JMI and CMIM. The performance of NJMIM and DISR is not as good as that of JMIM and JMI, as with 4 features their classification accuracies are 87.61% and 89.28%, respectively.

6.3. Performance analysis on high dimensional datasets

The second experiment involves high dimensional data (the musk, sonar, gas sensor, and handwriting datasets). The experiments with the gas sensor and sonar datasets include the selection of 50 features, with JMIM achieving high classification accuracy with a relatively small number of features. The other methods require more features to achieve this level of accuracy (Figs. 6 and 7).

Fig. 6. Average classification accuracy achieved with the gas sensor dataset.
Fig. 7. Average classification accuracy achieved with the sonar dataset.

Fig. 8 shows the results for the handwriting dataset, for which 50 features are selected. JMIM performs well, but is inferior to JMI and mRMR. In terms of classification accuracy of the selected subset, JMI performed better than JMIM for subsets with 11–21 features, by a maximum difference in accuracy of 0.5%. The mRMR method also performs well with this dataset; however, JMIM produces the highest accuracy (97.68%) with the selected subset of 33 features.

Fig. 8. Average classification accuracy achieved with the handwriting dataset.

The experimental results using the libra movement dataset, for which 50 features are selected, are shown in Fig. 9. JMIM is the best method with this dataset for almost any number of selected features, followed by NJMIM. JMIM outperforms JMI by up to 3% in terms of classification accuracy. NJMIM also outperforms DISR for all of the selected subsets.

Fig. 9. Average classification accuracy achieved with the libra movement dataset.

The methods are also applied to the musk dataset. Fig. 10 shows the result when 50 features are selected.
With this dataset, JMIM selects the best subset and outperforms the other methods in terms of classification accuracy. NJMIM does not perform as well as JMIM, but produces better accuracy than DISR and mRMR for most of the features selected.

Fig. 10. Average classification accuracy achieved with the musk dataset.

6.4. Performance analysis with Peng et al. (2005) datasets

The results using the three datasets employed by Peng et al. (2005) are shown in Fig. 11. The leukemia dataset (Fig. 11a) has a small number of samples. The results show that none of the feature selection methods perform particularly well, confirming the findings reported in the review article by Brown et al. (2012). The colon dataset, which is the least challenging dataset of the three in terms of the number of classes and features, is shown in Fig. 11b. The results indicate the better performance of JMIM and JMI compared to the other methods, especially CMIM, which performs poorly. However, CMIM is the method that provides the best accuracy with the lymphoma dataset, while JMIM, JMI and mRMR also perform well, with JMIM being the best of these. NJMIM performs better than DISR with all of the subsets below 34 features.

6.5. Evaluating and validating results

The ANOVA statistical test is employed to analyse the results, and to confirm that the results are systematic and were not obtained by chance. The classification experiment is run five times for each dataset and the average accuracy results are submitted to the ANOVA test. Table 3 shows the ANOVA results, where the P-value is the probability of the improvement occurring by chance, and MS is the mean square error. When the P-value is less than 0.05 it is unlikely that the improvement in classification accuracy happened by chance. This is shown to be the case for all the datasets (Table 3).

Table 3
ANOVA test.

Dataset          MS        F         P-value
Credit approval  0.027537  731.3342  1.87E−37
Gas sensor       0.004117  77.17653  1.38E−16
Libra movement   0.009677  114.5907  2.94E−23
Parkinson        0.009677  114.5907  2.94E−23
Breast           0.001414  101.4627  2.37E−22
Sonar            0.00094   5.760126  9.62E−05
Musk             0.000505  304.4366  1.11E−30
Handwriting      8.84E−05  35.99929  6.35E−15
Colon            0.000411  3.532383  0.006395
Leukemia         0.000161  10.36207  2.21E−07
Lymphoma         0.011501  232.6585  1.28E−28

6.6. Stability of the methods

This section focuses on the stability of the feature selection methods discussed. The selected subset of features depends on the dataset provided, and therefore any change to the data might lead to different selected features. In this context, the present study investigates the influence of changes in the data on the features selected.

Kuncheva's measure of stability (Kuncheva, 2007), known as the consistency index, uses Eq.
(29) to compute the consistency between two selected feature subsets, S1 and S2:

IC(S1, S2) = (rn − k²) / (k(n − k)),   (29)

where S1 and S2 are feature subsets selected using different groups of dataset samples, i.e. S1, S2 ⊆ F, where F is the total set of features, |S1| = |S2| = k, |F| = n, and r = |S1 ∩ S2|. However, this method does not take into consideration the correlation between features.

Yu, Ding, and Loscalzo (2008) proposed a method for measuring stability based on similarity. This method takes into account the correlation between features. It calculates the weight between each pair of features from the subsets S1 and S2, computes the similarity between S1 and S2, and constructs a bipartite graph. If fi is a feature belonging to S1 and fj is a feature belonging to S2, the value of the weight can be the correlation coefficient, or any other similarity measure. This article uses symmetrical uncertainty (Yu & Liu, 2004) to calculate the weight w:

w(s1i, s2j) = 2 [ I(s1i, s2j) / (H(s1i) + H(s2j)) ],   (30)

where s1i ∈ S1, s2j ∈ S2 and 0 ≤ w(s1i, s2j) ≤ 1.0. To find the maximum weighted bipartite matching, the Hungarian Algorithm (Kuhn, 1955) is used to find the optimal solution.

This experiment uses the eight UCI datasets shown in Table 1. Each dataset is divided into 5 folds, 4 of which are used for feature selection using the CMIM, NJMIM, DISR, JMIM, mRMR, JMI, and IG methods. Eq. (30) is used to calculate the weight between the features within each pair of selected subsets from each dataset. The final cost is divided by the cardinality of the subset used, and therefore the magnitude of the final cost should be less than or equal to 0.5 (it is 0.5 if all selected subsets are the same).

The relationship between accuracy and stability is computed by comparing the average classification accuracy and the average stability with different numbers of features. Table 4 shows the average accuracy and stability for each method, in no particular order.

Table 4
Average stability, average accuracy and the compromise between accuracy and stability.

Method  Accuracy  Stability  Accuracy/stability
CMIM    0.8488    0.8598     0.9197
NJMIM   0.8264    0.8344     0.8954
DISR    0.8129    0.9054     0.8807
JMIM    0.8578    0.8598     0.9294
mRMR    0.8278    0.8868     0.8969
JMI     0.8490    0.8838     0.9199
IG      0.8226    0.9228     0.8913

It is worth noting that the methods employing the 'maximum of the minimum' criterion (JMIM, NJMIM and CMIM) tend to have lower stability than the methods using the cumulative summation approximation (JMI and DISR). The best method in terms of stability is IG. JMIM has the best compromise between accuracy and stability. Moreover, it demonstrates the best average classification accuracy among all methods.

Fig. 11. Average classification accuracy with the additional datasets: (a) Leukemia, (b) Colon, (c) Lymphoma.
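A minimal Python sketch of the two stability measures used above is given below, assuming subsets are represented as collections of feature indices and features as arrays of discrete values; the maximum-weight bipartite matching step, solved with the Hungarian algorithm (Kuhn, 1955) in the experiments, is not reproduced here, and the names are illustrative.

import numpy as np
from collections import Counter

def kuncheva_consistency(s1, s2, n):
    # Consistency index between two subsets of equal size k drawn from n features, Eq. (29)
    k, r = len(s1), len(set(s1) & set(s2))
    return (r * n - k * k) / (k * (n - k))

def H(*variables):
    # Empirical joint entropy in bits of discrete variables (Section 2)
    counts = np.array(list(Counter(zip(*variables)).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def symmetrical_uncertainty(x, y):
    # Weight between a pair of features, Eq. (30)
    mi = H(x) + H(y) - H(x, y)
    return 2.0 * mi / (H(x) + H(y))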
In these experiments, he maximum average classification accuracy achieved by JMIM with he Parkinson dataset was 90.77%. JMIM and JMI achieved the accu- acy of 82.92% with the credit approval dataset whilst JMI and CMIM chieved 93.83% and 95.22%, respectively. The JMIM method also per- ormed well on high dimensional datasets, such as the musk, sonar, as sensor and handwriting datasets. JMIM and JMI also outperform the other methods on extremely mall sample datasets with a large number of features, such as the olon dataset. However, CMIM produces the best performance with he lymphoma dataset. JMIM, JMI, and mRMR also perform better than he other three methods. Overall, JMIM decreases the average classi- cation error by 0.88% in absolute terms and almost 6% in relative erms in comparison to the next best performing method, JMI. The MIM classification accuracy is also higher than that reported in lit- rature by other filter methods (Zhang et al., 2015), although no firm onclusions can be made on this account due to the variety of the atasets used in the most recent articles (Liang et al., 2014; Zhang et l., 2015). In addition to the quantitative assessment of the accuracy of the roposed methods, several experiments are conducted to enable an n-depth comparison of different feature selection methods, accord- ng to several criteria. For example, the nonlinear approach, which ses the ‘maximum of the minimum’ criterion, is compared to the inear approach that employs cumulative summation approximation. n particular, JMIM is compared to JMI, with the results showing that he non-linear approach performed better than the linear approach hen tested with most of the datasets. The goal function based on joint mutual information is compared o the goal function based on conditional mutual information, with he result showing better performance of joint mutual information in ombination with the non-linear criterion. Finally, the effect of using normalised mutual information instead f mutual information is tested by comparing the performance of MIM and JMI with NJMIM and DISR. The results show that, with the iscretised datasets, the methods employing non normalised mutual nformation such as JMI and JMIM perform better than those using ormalised mutual information, such as DISR and NJMIM. This sug- ests that division of the mutual information over the joint entropy oes not improve performance. In addition, the methods are compared in terms of their stability, s described in detail in Section 6.5. The results demonstrate that the ethods employing ‘maximum of the minimum’ criterion, such as MIM, JMIM, and NJMIM, show less average stability than the meth- ds which employ cumulative summation, although there is no dom- nant method. . Conclusion This paper presents two new feature selection methods based on nformation theory: Joint Mutual Information Maximisation (JMIM) nd Normalised Joint Mutual Information Maximisation (NJMIM). hese methods are designed to resolve the problem of choosing edundant and irrelevant features in certain circumstances, which s characteristic of filter feature selection methods. The latter is chieved through the use of the mutual information and the ‘max- mum of the minimum’ nonlinear approach for the goal function esign. 
The methods have been evaluated using public datasets and compared with five other feature selection methods: Joint Mutual Information (JMI), Conditional Mutual Information Maximisation (CMIM), Maximum Relevancy Minimum Redundancy (mRMR), Double Input Symmetrical Relevance (DISR), and Information Gain (IG), in terms of their ability to select features with high discriminative power, and their stability. To evaluate the performance of the proposed methods, an experiment is conducted using eight datasets from the UCI Repository. In addition, to test the behaviour of the methods with extremely small sample datasets, three other datasets from Peng et al. (2005) are used.

Overall, JMIM decreases the average classification error by 0.88% in absolute terms and by almost 6% in relative terms in comparison to the next best performing method, JMI. The statistical significance of the reported results is further confirmed by the ANOVA test. Moreover, this method produces the best trade-off between accuracy and stability.

The limitations of our approach are those which are characteristic of other filter approaches: it disregards the interaction between the features and the classifier, as well as the higher dimensional joint mutual information between more than two features, which can sometimes lead to a suboptimal choice of features.

Future work includes more experiments using other search strategies to validate the proposed method in a wider range of search algorithms; employing parallel computation techniques to estimate higher dimensional joint mutual information, in which two or more of the features from the selected subset are used simultaneously to test the significance of the candidate feature; and automating the selection of the optimal subset by introducing a cut-off parameter measuring the relevancy of the features. Further improvements can be made by studying the information shared between features and class labels and classifying the features into strongly relevant, relevant, weakly relevant, and redundant, based on the information that each feature adds to the selected subset.

In terms of applications relevant to expert and intelligent systems, the JMIM method would be of benefit for choosing the most relevant features in classification tasks. In addition to the analysis of the public datasets in this article, the method could be used in many other applications where the relevance of the features for the classification task needs to be analysed.

References

Bache, K., & Lichman, M. (2013). UCI machine learning repository. Irvine, CA: University of California, School of Information and Computer Science. (http://archive.ics.uci.edu/ml).
Bajwa, I., Naweed, M., Asif, M., & Hyder, S. (2009). Feature based image classification by using principal component analysis. ICGST International Journal on Graphics Vision and Image Processing, 9, 11–17.
Battiti, R. (1994). Using mutual information for selecting features in supervised neural net learning. IEEE Transactions on Neural Networks, 5, 537–550.
Bolón-Canedo, V., Sánchez-Maroño, N., & Alonso-Betanzos, A. (2013). A review of feature selection methods on synthetic data. Knowledge and Information Systems, 34, 483–519.
Brown, G., Pocock, A., Zhao, M., & Lujan, M. (2012). Conditional likelihood maximisation: a unifying framework for information theoretic feature selection. Journal of Machine Learning Research, 13, 27–66.
Chandrashekar, G., & Sahin, F. (2014). A survey on feature selection methods. Computers and Electrical Engineering, 40, 16–28.
Cheng, H., Qin, Z., Feng, C., Wang, Y., & Li, F. (2011). Conditional mutual information-based feature selection analysing for synergy and redundancy. Electronics and Telecommunications Research Institute, 33, 210–218.
Cover, T., & Thomas, J. (2006). Elements of information theory. New York: John Wiley & Sons.
Ding, C., & Peng, H. (2003). Minimum redundancy feature selection from microarray gene expression data. In Proceedings of the computational systems bioinformatics: IEEE Computer Society (pp. 523–528).
Dougherty, J., Kohavi, R., & Sahami, M. (1995). Supervised and unsupervised discretization of continuous features. In Proceedings of the twelfth international conference on machine learning (pp. 194–202).
Duda, R., Hart, P., & Stork, D. (2001). Pattern classification. New York: John Wiley and Sons.
El Akadi, A., El Ouardighi, A., & Aboutajdine, D. (2008). A powerful feature selection approach based on mutual information. International Journal of Computer Science and Network Security, 8, 116–121.
Estévez, P. A., Tesmer, M., Perez, A., & Zurada, J. M. (2009). Normalized mutual information feature selection. IEEE Transactions on Neural Networks, 20, 189–201.
Fleuret, F. (2004). Fast binary feature selection with conditional mutual information. Journal of Machine Learning Research, 5, 1531–1555.
Freeman, C., Kulić, D., & Basir, O. (2015). An evaluation of classifier-specific filter measure performance for feature selection. Pattern Recognition, 48, 1812–1826.
Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3, 1157–1182.
Guyon, I., Gunn, S., Nikravesh, M., & Zadeh, L. A. (2006). Feature extraction: foundations and applications. New York/Berlin, Heidelberg: Springer (Studies in Fuzziness and Soft Computing).
Hoque, N., Bhattacharyya, D. K., & Kalita, J. K. (2014). MIFS-ND: a mutual information-based feature selection method. Expert Systems with Applications, 41(14), 6371–6385.
Jain, A. K., Duin, R. P. W., & Mao, J. (2000). Statistical pattern recognition: a review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 4–37.
Jakulin, A. (2003). Attribute interactions in machine learning (M.Sc. thesis). Computer and Information Science, University of Ljubljana.
Jakulin, A. (2005). Machine learning based on attribute interactions (Ph.D. thesis). Computer and Information Science, University of Ljubljana.
Janecek, A., Gansterer, W., Demel, M., & Ecker, G. (2008). On the relationship between feature selection and classification accuracy. Journal of Machine Learning Research: Workshop and Conference Proceedings, 4, 90–105.
Karegowda, A. G., Jayaram, M. A., & Manjunath, A. S. (2010). Feature subset selection problem using wrapper approach in supervised learning. International Journal of Computer Applications, 1, 13–17.
Kira, K., & Rendell, L. (1992). A practical approach to feature selection. In Proceedings of the 10th international workshop on machine learning (ML92) (pp. 249–256).
Kuhn, H. (1955). The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2, 83–97.
Kuncheva, L. (2007). A stability index for feature selection. In Proceedings of the 25th IASTED international multi-conference on artificial intelligence and applications (pp. 390–395).
Kwok, N., & Choi, C. (2002). Input feature selection for classification problems. IEEE Transactions on Neural Networks, 13, 143–159.
Lee, J., & Kim, D. (2015). Fast multi-label feature selection based on information-theoretic feature ranking. Pattern Recognition, 48, 2761–2771.
Liang, J., Wang, F., Dang, C., & Qian, Y. (2014). A group incremental approach to feature selection applying rough set technique. IEEE Transactions on Knowledge and Data Engineering, 26(2), 294–308.
Lin, T., Li, H., & Tsai, K. (2004). Implementing the Fisher's discriminant ratio in a k-means clustering algorithm for feature selection and dataset trimming. Journal of Chemical Information and Computer Sciences, 44, 76–87.
Liu, H., & Motoda, H. (2008). Computational methods of feature selection. New York: Chapman & Hall/CRC, Taylor & Francis Group.
Liu, H., & Yu, L. (2005). Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on Knowledge and Data Engineering, 17, 491–502.
Meyer, P. E., & Bontempi, G. (2006). On the use of variable complementarity for feature selection in cancer classification. In Proceedings of European workshop on applications of evolutionary computing: EvoWorkshops (pp. 91–102).
Meyer, P. E., Schretter, C., & Bontempi, G. (2008). Information-theoretic feature selection in microarray data using variable complementarity. IEEE Journal of Selected Topics in Signal Processing, 2, 261–274.
Peng, H., Long, F., & Ding, C. (2005). Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27, 1226–1238.
Rodgers, J., & Nicewander, W. A. (1988). Thirteen ways to look at the correlation coefficient. The American Statistician, 42, 59–66.
Qian, Y., Wang, Q., Cheng, H., Liang, J., & Dang, C. (2015). Fuzzy-rough feature selection accelerator. Fuzzy Sets and Systems, 258, 61–78.
Saeys, Y., Inza, I., & Larranaga, P. (2007). A review of feature selection techniques in bioinformatics. Bioinformatics, 23, 2507–2517.
Tang, E. K., Suganthana, P. N., Yao, X., & Qina, A. K. (2005). Linear dimensionality reduction using relevance weighted LDA. Pattern Recognition, 38, 485–493.
Turk, M., & Pentland, A. (1991). Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3, 72–86.
Vergara, J., & Estévez, P. (2014). A review of feature selection methods based on mutual information. Neural Computing and Applications, 24, 175–186.
Vidal-Naquet, M., & Ullman, S. (2003). Object recognition with informative features and linear classification. In Proceedings of the 10th IEEE international conference on computer vision (pp. 281–289).
Yang, H., & Moody, J. (1999). Feature selection based on joint mutual information. In Proceedings of international ICSC symposium on advances in intelligent data analysis (pp. 22–25).
Yu, H., & Yang, J. (2001). A direct LDA algorithm for high-dimensional data with application to face recognition. Pattern Recognition, 34, 2067–2070.
Yu, L., & Liu, H. (2004). Efficient feature selection via analysis of relevance and redundancy. Journal of Machine Learning Research, 5, 1205–1224.
Yu, L., Ding, C., & Loscalzo, S. (2008). Stable feature selection via dense feature groups. In Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 803–811).
Zhang, Y., Yang, A., Xiong, C., Wang, T., & Zhang, Z. (2014). Feature selection using data envelopment analysis. Knowledge-Based Systems, 64, 70–80.
Zhang, Y., Yang, C., Yang, A., Xiong, C. Y., Zhou, X., & Zhang, Z. (2015). Feature selection for classification with class-separability strategy and data envelopment analysis. Neurocomputing, 166, 172–184.