Submitted 16 May 2019
Accepted 20 October 2019
Published 18 November 2019

Corresponding author
Davide Nardone,
davide.nardone@live.it

Academic editor
Tzung-Pei Hong

Additional Information and
Declarations can be found on
page 21

DOI 10.7717/peerj-cs.237

Copyright
2019 Nardone et al.

Distributed under
Creative Commons CC-BY 4.0

OPEN ACCESS

A Sparse-Modeling Based Approach for
Class Specific Feature Selection
Davide Nardone, Angelo Ciaramella and Antonino Staiano
Dipartimento di Scienze e Tecnologie, Università degli Studi di Napoli "Parthenope", Naples, Italy

ABSTRACT
In this work, we propose a novel Feature Selection framework called Sparse-Modeling
Based Approach for Class Specific Feature Selection (SMBA-CSFS), that simultaneously
exploits the idea of Sparse Modeling and Class-Specific Feature Selection. Feature
selection plays a key role in several fields (e.g., computational biology), making it
possible to build models with fewer variables which, in turn, are easier to explain,
provide valuable insights into the importance of each variable, and are likely to speed up
experimental validation. Unfortunately, as the no free lunch theorems also suggest,
no single approach in the literature is best suited to detecting the optimal
feature subset for building a final model, so the problem remains a challenge. The
proposed feature selection procedure conceives a two-step approach: (a) a sparse
modeling-based learning technique is first used to find the best subset of features,
for each class of a training set; (b) the discovered feature subsets are then fed to a
class-specific feature selection scheme, in order to assess the effectiveness of the selected
features in classification tasks. To this end, an ensemble of classifiers is built, where each
classifier is trained on its own feature subset discovered in the previous phase, and a
proper decision rule is adopted to compute the ensemble responses. In order to evaluate
the performance of the proposed method, extensive experiments have been performed
on publicly available datasets, in particular belonging to the computational biology field
where feature selection is indispensable: the acute lymphoblastic leukemia and acute
myeloid leukemia, the human carcinomas, the human lung carcinomas, the diffuse large
B-cell lymphoma, and the malignant glioma. SMBA-CSFS is able to identify/retrieve
the most representative features that maximize the classification accuracy. With top 20
and 80 features, SMBA-CSFS exhibits a promising performance when compared to its
competitors from literature, on all considered datasets, especially those with a higher
number of features. Experiments show that the proposed approach may outperform
the state-of-the-art methods when the number of features is high. For this reason, the
introduced approach is well suited to the selection and classification of data with a large
number of features and classes.

Subjects Bioinformatics, Data Mining and Machine Learning, Data Science
Keywords Feature selection, Sparse coding, Bioinformatics, Dictionary learning, Ensemble
learning

INTRODUCTION
Data analysis is the process of evaluating data, which are often represented in
high-dimensional feature spaces whatever the area of study, from biology
to pattern recognition to computer vision. High dimensionality often translates into

How to cite this article Nardone D, Ciaramella A, Staiano A. 2019. A Sparse-Modeling Based Approach for Class Specific Feature Selection. PeerJ Comput. Sci. 5:e237 http://doi.org/10.7717/peerj-cs.237

over-fitting, large computational costs and poor performance, thus hampering the learning
task. Consequently, the dimensionality of the feature space needs to be reduced, since
many of its features are generally uninformative, redundant, correlated with each other and
also noisy. In this paper, we focus on feature selection, which identifies
discriminative features by eliminating, based on certain criteria, those with little or no
predictive information, so that the data can be handled in a lower-dimensional space.

Feature Selection (FS) is the process of selecting a subset of relevant features to use in
model construction. FS plays a key role in computational biology: for instance, microarray
data analysis involves a huge number of genes with respect to (w.r.t.) a small number
of samples, and effectively identifying the most significant differentially expressed genes
under different conditions is of prime importance (Xiong, Fang & Zhao, 2001). The selected genes
are very useful in clinical applications such as recognizing diseased profiles (Calcagno
et al., 2010; Staiano et al., 2013; Di Taranto et al., 2015; Camastra, Di Taranto & Staiano,
2015). Nonetheless, because of the high cost of microarray experiments, the number of samples
available for classification purposes is usually small compared to the large number of genes
in an experiment. This gives rise to the Curse of Dimensionality problem (Friedman, Hastie &
Tibshirani, 2001), which challenges classification as well as other data analysis tasks
(Staiano et al., 2004; Ciaramella et al., 2008). Furthermore, microarray data are usually
affected by several issues concerning sensitivity, accuracy, specificity, reproducibility of
results, and noise (Draghici et al., 2006). For these reasons, it is unsuitable to use
microarray data as they are; however, after several corrections, the relevant genes can be
selected by FS approaches and the results validated by, for instance, Real-Time PCR
(Xiong, Fang & Zhao, 2001).

Searching the literature for the keyword "feature selection", one gets
lost in an ocean of techniques (the reader may refer to the classical reviews in Saeys, Inza
& Larrañaga (2007), Guyon & Elisseeff (2003), Hoque, Bhattacharyya & Kalita (2014) on
the topic), often designed to tackle a specific data set. The reasons for this abundance of
techniques lie in the heterogeneity of the available scientific data sets and in the
limitations dictated by the no free lunch theorems (Wolpert & Macready, 1997), which imply
that no general-purpose technique is well suited to a plethora of different
kinds of data. A typical taxonomy organizes FS techniques (Jović, Brkić & Bogunović, 2015)
in three main categories, namely filter, wrapper and embedded methods, whose
algorithms select a single feature subset from the complete list of features. Another perspective
instead divides FS techniques into two classes, namely Traditional Feature Selection (TFS)
for all classes (which includes the filter, wrapper and embedded methods mentioned so far),
and Class-Specific Feature Selection (CSFS) (Fu & Wang, 2002). Usually, a TFS algorithm
selects one subset of features for all classes, although it may not be the best one for some
classes, thus leading to undesirable results. In contrast, a CSFS policy selects
a distinct subset of features for each class, and it can use any traditional feature selector
for choosing, given the set of classes of a classification problem, one distinct group of
features for each class. Depending on the type of the feature selector, the overall process may
slightly change. Nevertheless, it is worth pointing out that a CSFS scheme heavily depends
on the use of a specific classifier, while its use should be independent of both the classifier
Nardone et al. (2019), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.237 2/25



of the classification step and the feature selector strategy. To this end, a General Framework
for CSFS has been proposed in Pineda-Bautista, Carrasco-Ochoa & Martínez-Trinidad (2011),
which allows using any traditional feature selector as well as any classifier.

In this paper, on the basis of the general framework for CSFS, we propose a novel
FS strategy, namely a Sparse-Modeling Based Approach for Class-Specific Feature
Selection, consisting of a two-step procedure. Firstly, a sparse modeling based learning
technique is used to find the best subset of features for each class of the training set. In doing
so, it is assumed that a class is represented by a subset of features, called representatives,
such that each sample in a specific class can be described as a linear combination of them.
Secondly, the discovered feature subsets are fed to a class-specific feature selection scheme
in order to assess the effectiveness of the selected features in classification tasks. To this
end, an ensemble of classifiers is built by training a given classifier, one for each class, on
its own feature subset, i.e., the one discovered in the previous step, and a proper decision
rule is adopted to compute the ensemble responses. In this way, the dilemma of choosing
a specific TFS strategy and classifier in the CSFS framework is effectively mitigated.

METHODS
The sparse-modeling based approach for class-specific feature selection builds on the
concepts of sparse modeling and class-specific feature selection, which are introduced
first.

Sparse Modeling fundamentals
An actively developing field of statistical learning is focused on the notion of
sparsity (Tibshirani, 1994; Ciaramella & Giunta, 2016). A Sparse Model (SM) is a model that
can be much easier to estimate and interpret than a dense model. The sparsity assumption
allows extracting meaningful features from large data sets. The aim of the first phase of the
proposed approach is to use sparse modeling to find data representatives directly in the
data space, without any transformation. In other words, we wish to
find a ranking of the most representative features that best reconstruct the data collection.
Most approaches are based on an $\ell_1$-norm regularization, such as LASSO (Tibshirani,
1994) and Sparse Dictionary Learning (Elhamifar, Sapiro & Vidal, 2012). Formally, given
a set of features in $\mathbb{R}^m$ arranged as columns of a data matrix $X=[x_1,\dots,x_n]$, the task is
to find representative features given a fixed feature space belonging to a collection of
data points (see Mairal et al., 2008; Aharon, Elad & Bruckstein, 2006; Engan, Aase & Husoy,
1999; Jolliffe, 1986; Ramirez, Sprechmann & Sapiro, 2010). That task can conveniently be
described in the Dictionary Learning (DL) framework, where the aim is to simultaneously
learn a compact dictionary $D=[d_1,\dots,d_k]\in\mathbb{R}^{m\times k}$ and coefficients $C=[c_1,\dots,c_n]\in\mathbb{R}^{k\times n}$,
with $k \ll n$, that can well represent collections of data points (Ciaramella, Gianfico &
Giunta, 2016). The best representation of the data is obtained by minimizing the following
objective function
$$\sum_{i=1}^{n} \|x_i - D c_i\|_2^2 = \|X - DC\|_F^2 \tag{1}$$



w.r.t. the dictionary $D$ and the coefficient matrix $C$, subject to appropriate constraints.
However, the learned dictionary atoms almost never correspond to the original feature
space (Aharon, Elad & Bruckstein, 2006; Ramirez, Sprechmann & Sapiro, 2010; Mairal et al.,
2009). In order to find a subset of features that best represents the entire feature space, the
optimization problem in Eq. (1) is reformulated by forcing the dictionary $D$ to be the data
matrix $X$ (Elhamifar, Sapiro & Vidal, 2012):
$$\sum_{i=1}^{n} \|x_i - X c_i\|_2^2 = \|X - XC\|_F^2, \tag{2}$$

where $\|\cdot\|_F$ denotes the Frobenius norm. Equation (2) is minimized w.r.t. the coefficient matrix
$C \triangleq [c_1,\dots,c_n]\in\mathbb{R}^{n\times n}$, subject to additional constraints. In other words, the reconstruction
error of each feature component is minimized by linearly combining all the components
of the feature space. To choose $k \ll n$ representatives involved in the linear reconstruction
of each component in Eq. (2), the following constraint is added to the model

$$\|C\|_{0,q} \le k, \tag{3}$$

where the mixed $\ell_0/\ell_q$ norm is defined as $\|C\|_{0,q} \triangleq \sum_{i=1}^{n} I(\|c^i\|_q > 0)$, $c^i$ denotes the $i$-th
row of $C$, and $I(\cdot)$ denotes the indicator function. In a nutshell, $\|C\|_{0,q}$ counts the number
of nonzero rows of $C$. The indices of the nonzero rows of $C$ correspond to the indices of
the columns of $X$ which are chosen as the representative features. Since the aim is to select
$k \ll n$ representative features that can reconstruct each feature of the $X$ matrix up to a fixed
error, the optimization problem to solve is

$$\underset{C}{\text{minimize}} \;\; \|X - XC\|_F^2 \quad \text{subject to} \quad \|C\|_{0,q} \le k, \;\; \mathbf{1}^\top C = \mathbf{1}^\top \tag{4}$$

where $\mathbf{1}^\top C = \mathbf{1}^\top$ is the affine constraint for selecting representatives that are invariant w.r.t.
a global translation of the data (as requested by dimensionality reduction methods). This
is an NP-hard problem, as it implies a combinatorial search over every subset of $k$
columns of $X$. Therefore, relaxing the $\ell_0$ to the $\ell_1$ norm, the problem becomes

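$$\underset{C}{\text{minimize}} \;\; \|X - XC\|_F^2 \quad \text{subject to} \quad \|C\|_{1,q} \le \tau, \;\; \mathbf{1}^\top C = \mathbf{1}^\top \tag{5}$$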

where $\|C\|_{1,q} \triangleq \sum_{i=1}^{n} \|c^i\|_q$ is the sum of the $\ell_q$ norms of the rows of $C$ and $\tau > 0$ is an
appropriately chosen parameter. The solution of the optimization problem (Eq. (5)) not only provides
the representative features as the nonzero rows of $C$, but also provides information about
the ranking of the selected features. More precisely, a representative with a higher ranking
takes part in the reconstruction process more than the others; hence, its corresponding
row in the optimal coefficient matrix $C$ has many nonzero elements with large values.
Conversely, a representative with a lower ranking takes part in the reconstruction process
less than the others; hence, its corresponding row in $C$ has few nonzero elements with
smaller values. Thus, the $k$ representative features $x_{i_1},\dots,x_{i_k}$ are ranked as $i_1 \ge i_2 \ge \dots \ge i_k$
whenever for the corresponding rows of $C$ one gets
$$\|c^{i_1}\|_q \ge \|c^{i_2}\|_q \ge \dots \ge \|c^{i_k}\|_q. \tag{6}$$
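The ranking rule of Eq. (6) can be illustrated with a minimal Python sketch. The function name `rank_representatives` and the tolerance `tol` are assumptions introduced here for illustration (a numerical solver rarely returns exactly zero rows); they are not part of the original method description.

```python
def rank_representatives(C, q=2, tol=1e-8):
    """Return the indices of the nonzero rows of C, sorted by decreasing l_q row norm.

    C is a list of rows (lists of floats). `tol` is a hypothetical threshold
    below which a row is treated as zero.
    """
    norms = [sum(abs(v) ** q for v in row) ** (1.0 / q) for row in C]
    selected = [i for i, nrm in enumerate(norms) if nrm > tol]
    return sorted(selected, key=lambda i: norms[i], reverse=True)


# Toy coefficient matrix: row 1 is zero, so feature 1 is not a representative.
C = [
    [0.9, 0.8, 0.0, 0.7],
    [0.0, 0.0, 0.0, 0.0],
    [0.1, 0.0, 1.0, 0.0],
    [0.0, 0.2, 0.0, 0.3],
]
ranking = rank_representatives(C)
print(ranking)
```

In this toy case the first row has the largest norm, so feature 0 is ranked first, followed by features 2 and 3.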



Procedure SMBA
Input:  X, N × M matrix, where N is the number of observations and M is the number of features
        θ = {α, δ, ρ, η}, parameters vector
Output: I, set of selected features

Variables initialization
while ε > δ and t > ρ do
    β^{t+1} ← (X^T X + ρI)^{-1}
    θ^{t+1} ← S_{λ/ρ}(β^{t+1} + µ^t/ρ)
    µ^{t+1} ← µ^t + ρ(β^{t+1} − θ^{t+1})
    ε ← compute_error(β, θ)
end
I ← find_representatives(θ, η)
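
The operator $S_{\lambda/\rho}$ in the θ update is not defined in the procedure. Assuming it denotes the element-wise soft-thresholding operator commonly used in ADMM formulations of $\ell_1$-regularized problems (Boyd et al., 2011), it would read as follows; for a row-sparsity norm with $q = 2$ a row-wise variant would presumably be used instead.

```latex
S_{\kappa}(a) = \operatorname{sign}(a)\,\max\bigl(|a| - \kappa,\; 0\bigr),
\qquad \kappa = \lambda/\rho .
```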

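From a practical point of view, the optimization problem (Eq. (5)) can be expressed by
using Lagrange multipliers as

$$\underset{C}{\text{minimize}} \;\; \frac{1}{2}\|X - XC\|_F^2 + \lambda \|C\|_{1,q} \quad \text{subject to} \quad \mathbf{1}^\top C = \mathbf{1}^\top. \tag{7}$$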
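
To make the structure of Eq. (7) concrete, the following is a minimal NumPy sketch under simplifying assumptions: it uses plain proximal gradient descent rather than the ADMM scheme adopted by the authors, it fixes q = 2, and it drops the affine constraint. The function names, the step-size rule and the toy data are illustrative choices, not the authors' implementation.

```python
import numpy as np


def row_soft_threshold(C, kappa):
    """Proximal operator of kappa * sum_i ||c^i||_2: shrinks whole rows toward zero."""
    norms = np.linalg.norm(C, axis=1, keepdims=True)
    scale = np.maximum(1.0 - kappa / np.maximum(norms, 1e-12), 0.0)
    return scale * C


def objective(X, C, lam):
    """0.5 * ||X - XC||_F^2 + lam * ||C||_{1,2} (affine constraint omitted)."""
    fit = 0.5 * np.linalg.norm(X - X @ C, "fro") ** 2
    return fit + lam * np.linalg.norm(C, axis=1).sum()


def sparse_self_representation(X, lam, n_iter=500):
    """Proximal-gradient sketch for the penalized problem (not ADMM)."""
    n = X.shape[1]
    G = X.T @ X
    step = 1.0 / max(np.linalg.norm(G, 2), 1e-12)  # 1 / Lipschitz constant
    C = np.zeros((n, n))
    for _ in range(n_iter):
        grad = G @ C - G  # gradient of the smooth term 0.5 * ||X - XC||_F^2
        C = row_soft_threshold(C - step * grad, step * lam)
    return C


rng = np.random.default_rng(0)
X = rng.standard_normal((20, 6))  # 20 samples, 6 features (columns)
C = sparse_self_representation(X, lam=1.0)
row_norms = np.linalg.norm(C, axis=1)
print(np.argsort(-row_norms))  # features ordered by decreasing row norm
```

Larger values of `lam` drive more rows of `C` to zero, so fewer features survive as representatives.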

In practice, the algorithm is implemented using an Alternating Direction Method of
Multipliers (ADMM) optimization framework (Boyd et al., 2011). In particular, the
representative features of a given data set are those with small pairwise coherence,
as in a sparse dictionary learning method. It is worth observing the resemblance
with the Least Absolute Shrinkage and Selection Operator (LASSO) (Tibshirani, 1994). The
latter is an approach to regression analysis that performs both variable selection
and regularization in order to enhance the prediction accuracy and interpretability
of the statistical model it produces. Recall that the objective of LASSO, in its basic form, is
to solve

$$\underset{\beta}{\text{minimize}} \;\; \frac{1}{N}\|y - X\beta\|_2^2 \quad \text{subject to} \quad \|\beta\|_1 \le t, \tag{8}$$

where $y = [y_1,\dots,y_N]$ is the $N$-dimensional vector of outcomes, $X$ is the covariate matrix, $t$
is a free parameter that determines the amount of regularization and $\beta$ is the sparse vector
to estimate.

From Eq. (8), one can observe that a sparse matrix can be estimated as in Eq. (7) by
considering $X$ itself as the outcome and adding the affine constraint. In the following, the
LASSO will be used for classification tasks, adopting a sigmoid function, as will be
described in the experimental setup.
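
As an illustration of how a LASSO-type model can be adapted to classification through a sigmoid, the sketch below fits an $\ell_1$-penalized logistic model with iterative soft-thresholding (ISTA). This is an assumption about one reasonable realization: it uses the penalized (Lagrangian) form rather than the constrained form of Eq. (8), a logistic loss rather than a squared loss, and hypothetical names and toy data.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def soft_threshold(v, kappa):
    """Element-wise soft-thresholding, the proximal operator of kappa * ||v||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)


def l1_logistic(X, y, lam, n_iter=2000):
    """ISTA for: mean logistic loss of sigmoid(X @ beta) vs. y, plus lam * ||beta||_1."""
    N, M = X.shape
    # Step size 1/L, where L = ||X||_2^2 / (4N) bounds the curvature of the loss.
    step = 4.0 * N / max(np.linalg.norm(X, 2) ** 2, 1e-12)
    beta = np.zeros(M)
    for _ in range(n_iter):
        grad = X.T @ (sigmoid(X @ beta) - y) / N
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta


# Toy data: only feature 0 carries information about the binary label.
rng = np.random.default_rng(0)
X = rng.standard_normal((60, 5))
y = (X[:, 0] > 0).astype(float)
beta = l1_logistic(X, y, lam=0.05)
print(np.argsort(-np.abs(beta)))  # features ranked by absolute coefficient
```

Features can then be ranked by the magnitude of their coefficients, with zero coefficients corresponding to discarded features.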



Algorithm 1: Sparse-Modeling Based Approach for Class-Specific Feature Selection

Input : X = {x1,...,xn}, data set
        y, class labels
        θ, SMBA parameters
        m, maximum number of features to select
        C, classifier model (e.g., SVM, KNN, etc.)
        K, number of folds for K-fold Cross Validation

Output: ACM, Average Classification Metrics on K folds
1  begin
2    X ← Data standardization
3    X ← Class balancing(X) by using SMOTE (Chawla et al., 2002)
4    X ← Random shuffling(X)
5    Divide X into K folds
6    foreach ki ∈ K folds do
7      Set the ki fold as the test set Xtest
8      Use the remaining K−1 folds as the train set Xtrain
9      Perform the Class-sample separation on the train set Xtrain
10     (Note that I is the subset of features selected for each class ci ∈ Xtrain)
11     foreach Xci ∈ Xtrain do
12       I = {Ici ... Icc} ← SMBA(Xci, θ)
13     end
14     for j ← 1 to m do
15       Build an ensemble classifier Ej = {e1,j,...,ec,j} using the j-th selected feature ∈ Ici and the classifier C
16       foreach O ∈ Xtest do
17         (ACMj) ← Use Ej to classify the instance O
18       end
19       (ACM) ← (ACMj)
20     end
21   end
22   (ACM) ← Average(ACM)
23 end

A Sparse-Modeling Based Approach for Class Specific Feature
Selection
A General Framework for Class-Specific Feature Selection (GF-CSFS) is described in
Pineda-Bautista, Carrasco-Ochoa & Martínez-Trinidad (2011). The proposed Sparse-
Modeling Based Approach for Class-Specific Feature Selection (SMBA-CSFS) tries to


Figure 1 A Sparse-Modeling Based Approach for Class-Specific Feature Selection.
Full-size DOI: 10.7717/peerjcs.237/fig-1

best represent each class-sample set of an input data set by only using few representative
features. More specifically, the method is made up of the following steps:
1. Class-sample separation: Unlike the GF-CSFS, SMBA-CSFS does not employ the Class
binarization stage to transform a c-class problem into c binary problems; instead, it
just uses a simple Class-sample separation. Basically, it consists of partitioning the
samples of the training set of a given data set into several disjoint
sets of samples, one for each class (see Fig. 1).

2. Class balancing: Once the training set has been split by class (by
applying the above Class-sample separation step), the resulting class-subsets
may be unbalanced. Therefore, the SMOTE (Chawla et al., 2002) re-sampling
method is applied to balance each class-subset. It is worth pointing out
that steps 1–2 are interchangeable, meaning that the order in which they
are performed makes no difference.

3. Intra-Class-Specific feature selection: The Sparse-Modeling Based Approach is used to
retrieve, by minimizing Eq. (7), the most representative features for each class-sample
set of the training set, i.e., those that best represent/reconstruct the whole class of objects. In
doing so, the approach takes advantage of the intra-class properties to select the
best feature subset describing each class, with the aim of improving the classification
accuracy over TFS and GF-CSFS.

4. Classification: Since the training set is split into different class-sample subsets, we
adopted an ensemble procedure for training a classification model
to discriminate new incoming instances. As in Pineda-Bautista, Carrasco-Ochoa &
Martínez-Trinidad (2011), given a class ci, a classifier ei is trained on the original
data set using only the features selected for ci, for i = 1,...,c. Overall, an ensemble
classifier E = {e1,...,ec} is constructed. In order to classify a new instance O through


the ensemble, the natural dimension of O needs to be lowered to the dimension di of
the classifier ei, i = 1,...,c. To determine the class to which O belongs, an
ad-hoc majority rule is used:
(a) If a classifier ei outputs the class ci for which its training features
were selected, then O belongs to ci. In case of a tie, i.e.,
when several classifiers output their own class, a majority vote among all classifiers
is used to determine the class of O. If a tie still occurs, O will belong to the class that
received more votes among the tied classes.

(b) If no classifier outputs the class for which its training features were selected, O
belongs to the class winning the majority vote. If there is a tie, then O will belong
to the class that received more votes among the tied classes.

Finally, since a recursive tie may occur, in that case the instance O is classified
by randomly choosing a class among all the tied classes. Algorithm 1
gives the pseudo-code of the SMBA-CSFS procedure. Basically, it first
standardizes, class-balances and shuffles the data set X, then divides it into K folds,
assigning the ki-th fold as test set Xtest and the remaining K − 1 folds as train set
Xtrain. The algorithm iteratively performs the class-sample separation, splitting the
samples of the train set by class into the subsets Xci, on each of which Procedure SMBA
is run to output the m most representative features for that class (line 12).
The selected features are first used, one at a time, for training an ensemble classifier Ej,
and later for classifying each instance O belonging to the test set Xtest. Finally, for all
the ensemble models up to m selected features, the algorithm outputs the ACM matrix,
storing several model evaluation metrics.
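
The decision rule of step 4 leaves some room for interpretation. The following Python sketch shows one possible reading, in which ties among self-confirming classifiers are broken by the votes of all classifiers and any remaining tie is resolved at random. The function name, the dictionary-based input and the fixed default seed are assumptions made for illustration.

```python
import random
from collections import Counter


def ensemble_decision(predictions, rng=None):
    """Combine the outputs of the class-specific classifiers for one instance O.

    predictions maps each class c_i to the label output by classifier e_i,
    i.e., the classifier trained on the features selected for c_i.
    """
    rng = rng or random.Random(0)  # fixed seed only to keep the sketch reproducible
    votes = Counter(predictions.values())
    # Classifiers that output the class their own features were selected for.
    confirmed = [c for c, out in predictions.items() if out == c]
    if len(confirmed) == 1:
        return confirmed[0]  # rule (a), no tie
    # Rule (a) with a tie: vote among the confirmed classes.
    # Rule (b): no classifier confirmed its class, vote among all output labels.
    candidates = confirmed if confirmed else list(votes)
    best = max(votes[c] for c in candidates)
    tied = sorted(c for c in candidates if votes[c] == best)
    return tied[0] if len(tied) == 1 else rng.choice(tied)


print(ensemble_decision({"a": "a", "b": "b", "c": "a"}))  # "a" wins the vote
```

In the usage line, both the classifiers for "a" and "b" confirm their own class, and "a" is chosen because it collects two votes against one.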

EXPERIMENTAL RESULTS
In the experiments, the SMBA-CSFS performance has been assessed on nine publicly
available microarray data sets. The classifiers used to assess the quality of the selected
feature subsets are a Support Vector Machine (SVM) with a linear kernel and parameter
C = 1, a Naive Bayes, a K-Nearest Neighbors (KNN) with k = 5, and a Decision Tree.

Data sets description
In order to validate the introduced approach, a number of data sets exemplifying the typical
data processing in the biological field are used in the experiments. In the following, a brief
description of all the data sets employed in the experiments is given.
1. The ALLAML data set (Golub et al., 1999) contains in total 72 samples in 2 classes: ALL
and AML, which have 47 and 25 samples, respectively. Every sample contains 7,129
gene expression values.

2. The LEUKEMIA data set (Golub et al., 1999) contains in total 72 samples in 2 classes:
acute lymphoblastic and acute myeloid. It is a modified version of the original ALLAML
data set, where the original baseline genes (7,129) were cut off before further analysis.
The number of genes that are used in the binary classification task is 7,070.

3. The CLL_SUB_111 data set (Haslinger et al., 2004) has gene expressions from high
density oligonucleotide arrays containing genetically and clinically distinct subgroups



of B-cell chronic lymphocytic leukemia (B-CLL). The data set consists of 11,340
attributes, 111 instances and 3 classes.

4. The GLIOMA data set (Nutt et al., 2003) contains in total 50 samples in 4 classes: cancer
glioblastomas, non-cancer glioblastomas, cancer oligodendrogliomas and non-cancer
oligodendrogliomas, which have 14, 14, 7, 15 samples, respectively. Each sample has
12,625 genes. After a preprocessing, the data set has been shrunk to 50 samples and
4,433 genes.

5. The LUNG data set (Bhattacharjee et al., 2001) contains in total 203 samples in 5 classes:
adenocarcinomas, squamous cell lung carcinomas, pulmonary carcinoids, small-cell
lung carcinomas and normal lung, with 139, 21, 20, 6, 17 samples, respectively. The
genes with standard deviations smaller than 50 expression units were removed, yielding
a data set with 203 samples and 3,312 genes.

6. The LUNG_DISCRETE data set (Peng, Long & Ding, 2005) contains 73 samples in 7
classes, where each sample consists of 325 gene expressions. The class cardinalities
in the LUNG_DISCRETE data set are 6, 5, 5, 16, 7, 13, 21, respectively.

7. The DLBCL data set (Alizadeh et al., 2000) is a modified version of the original DLBCL
data set. It consists of 96 samples in 9 classes, where each sample is defined by the
expression of 4,026 genes. The class cardinalities in the DLBCL data set are
46, 10, 9, 11, 6, 6, 4, 2, 2, respectively.

8. The CARCINOM data set (Su et al., 2001) contains 174 samples in 11 classes:
prostate, bladder/ureter, breast, colorectal, gastroesophagus, kidney, liver, ovary,
pancreas, lung adenocarcinomas and lung squamous cell carcinoma, with
26, 8, 26, 23, 12, 11, 7, 27, 6, 14, 14 samples, respectively. After a preprocessing as
described in Yang et al. (2006), the data set has been shrunk to 174 samples and
9,182 genes.

9. The GCM data set (Ramaswamy et al., 2001) contains 190 samples in 14 classes: breast,
prostate, lung, colorectal, lymphoma, bladder, melanoma, uterus, leukemia, renal,
pancreas, ovary, mesothelioma and central nervous system, where each sample consists
of 16,063 gene expression signatures. The class cardinalities in the data set
are 11, 11, 20, 11, 30, 11, 22, 10, 11, 11, 11, 10, 11, 10, respectively.

All data sets are available at the following data repository (Nardone, Ciaramella &
Staiano, 2019a). All the information about the data sets is summarized in Table 1.

Experiment setup
To validate the effectiveness of the SMBA-CSFS model, it has been compared against several
TFS methods and the GF-CSFS proposed in Pineda-Bautista, Carrasco-Ochoa & Martínez-Trinidad
(2011). SMBA-CSFS is first compared against TFS methods and, since the framework
in Pineda-Bautista, Carrasco-Ochoa & Martínez-Trinidad (2011) can use any TFS method
as the base for performing CSFS, some experiments using both filter and wrapper methods
(injection process) were also carried out. In addition, the accuracy results were compared against
those obtained using all the features (BSL). The following TFS methods have been
chosen for comparison purposes:



Table 1 Data sets description.

Data set       Size   # Features   # Classes
ALLAML           72        7,129           2
LEUKEMIA         72        7,070           2
CLL_SUB_111     111       11,340           3
GLIOMA           50        4,434           4
LUNG_C          203        3,312           5
LUNG_D           73          325           7
DLBCL            96        4,026           9
CARCINOM        174        9,182          11
GCM             190       16,063          14

• LASSO (Tibshirani, 1994): LASSO method involves penalizing the absolute size of
the regression coefficients and it is usually used for creating parsimonious models in
presence of a large number of features. The model implemented is a modified version
of the classical LASSO, adapted for classification purposes. In particular, in Eq. (8), the
product Xβ is transformed by a sigmoid function in order to address the classification
problem.
• EN (Zou & Hastie, 2005): Elastic Net is a hybrid of ridge regression and LASSO
regularization. Like LASSO, Elastic Net can generate reduced models by achieving
zero-valued coefficients. Experimental studies have suggested that the Elastic Net
technique can outperform LASSO on data with highly correlated features. As for LASSO,
a modified version adapted for classification purposes has been implemented.
• RFS (Nie et al., 2010): Robust Feature Selection is a sparse learning-based
approach for feature selection which emphasizes joint $\ell_{2,1}$-norm minimization on
both the loss function and the regularization.
• ls-$\ell_{2,1}$ (Tang, Alelyani & Liu, 2014): ls-$\ell_{2,1}$ is a supervised sparse feature selection
method. It exploits the $\ell_{2,1}$-norm regularized regression model for joint feature selection
from multiple tasks, where the classification objective function is a quadratic loss.
• ll-$\ell_{2,1}$ (Tang, Alelyani & Liu, 2014): ll-$\ell_{2,1}$ is a supervised sparse feature selection method
which uses the same concept as ls-$\ell_{2,1}$ but with a logistic loss as classification
objective function.
• Fisher (Gu, Li & Han, 2012): Fisher is one of the most widely used supervised filter
feature selection methods. It scores each feature by the ratio of inter-class separation
to intra-class variance; features are evaluated independently, and the final feature
selection aggregates the m top-ranked ones.
• Relief-F (Kira & Rendell, 1992; Kononenko, 1994): Relief-F is an iterative, randomized
and supervised filter approach that estimates the quality of the features according to
how well their values differentiate data samples that are near to each other; it does not
discriminate among redundant features and performance decreases with few data.
• mRMR (Peng, Long & Ding, 2005): Minimum-Redundancy-Maximum-Relevance is
a mutual information filter based algorithm which selects features according to the
maximal statistical dependency criterion.



• MI (Kraskov, Stögbauer & Grassberger, 2004; Ross, 2014): Mutual Information is a
non-negative value, which measures the dependency between the variables. Features are
selected in a univariate way. The function relies on nonparametric methods based on
entropy estimation from k-nearest neighbors distances.
• SMBA: Sparse-Modeling Based Approach is our SMBA-CSFS model restricted to the
Sparse Dictionary Learning (SDL) strategy alone, i.e., it selects a single subset of features
considering all the classes in the feature selection process.
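
As an example of how a filter method from the list above scores features independently, the sketch below computes one common form of the Fisher score (between-class scatter of the class means divided by the pooled within-class variance). The exact variant used in the experiments is not specified here, so the formula, the function name and the toy data are assumptions.

```python
def fisher_scores(X, y):
    """Score each feature as between-class scatter over within-class variance.

    X is a list of samples (each a list of feature values), y the class labels.
    """
    n, m = len(X), len(X[0])
    classes = sorted(set(y))
    scores = []
    for j in range(m):
        col = [row[j] for row in X]
        mu = sum(col) / n
        between = within = 0.0
        for c in classes:
            vals = [v for v, label in zip(col, y) if label == c]
            mu_c = sum(vals) / len(vals)
            var_c = sum((v - mu_c) ** 2 for v in vals) / len(vals)
            between += len(vals) * (mu_c - mu) ** 2
            within += len(vals) * var_c
        if within > 0:
            scores.append(between / within)
        else:
            scores.append(float("inf") if between > 0 else 0.0)
    return scores


# Feature 0 separates the two classes, feature 1 does not.
X = [[1.0, 5.0], [1.2, 1.0], [3.0, 4.0], [3.2, 2.0]]
y = [0, 0, 1, 1]
scores = fisher_scores(X, y)
top = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
print(top)  # feature indices by decreasing score
```

Selecting the m top-ranked features then amounts to keeping the first m entries of `top`.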

We pre-processed all the data sets by using the Z-score (Kreyszig, 2010) normalization. To
fairly compare the considered supervised feature selection methods, we first tuned
the parameters of all methods by using a "grid-search" strategy (Tang, Alelyani & Liu,
2014); then, to evaluate the performance of all the methods, a number of features
ranging from 1 to 80 was considered, performing a 5-fold Cross Validation (CV).
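
The pre-processing and validation protocol just described can be sketched with the Python standard library. The helper names, the use of the population standard deviation, and the shuffling seed are assumptions for illustration; the grid search over method parameters is omitted.

```python
import random
import statistics


def zscore_columns(X):
    """Standardize each feature (column) to zero mean and unit standard deviation."""
    cols = list(zip(*X))
    means = [statistics.fmean(c) for c in cols]
    # Constant columns have zero deviation; divide by 1 to leave them at zero.
    stds = [statistics.pstdev(c) or 1.0 for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in X]


def kfold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists for a shuffled k-fold cross validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test


# 72 samples, as in the ALLAML data set; each fold is used once as test set.
for train, test in kfold_indices(72, k=5):
    print(len(train), len(test))
```

For each split, feature subsets of size 1 to 80 would then be evaluated on the test fold and the metrics averaged over the five folds.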

The performance of the classification algorithms for all the methods has
been evaluated by using Accuracy along with its standard deviation
(ACC ± STD), Precision (P), Recall (R) and F-measure (F), which are computed as
illustrated in Sokolova & Lapalme (2009). In addition, to give a more compact
comparison of the performance of the models, we also computed the Receiver Operating
Characteristic (ROC) curves and the Area Under the Curve (AUC): the
former is a useful tool for evaluating the quality of class separation of a classifier, while the
latter summarizes the ROC curve in a single value, making it easier to compare one model
with another.
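
For multi-class problems, precision, recall and F-measure require an averaging scheme. The sketch below uses macro-averaging, one of the schemes discussed in Sokolova & Lapalme (2009); whether this is the scheme adopted in the experiments is an assumption, as are the function name and the toy labels.

```python
def macro_metrics(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall and F-measure."""
    pairs = list(zip(y_true, y_pred))
    classes = sorted(set(y_true) | set(y_pred))
    acc = sum(t == p for t, p in pairs) / len(pairs)
    precisions, recalls = [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    P = sum(precisions) / len(classes)
    R = sum(recalls) / len(classes)
    F = 2 * P * R / (P + R) if P + R else 0.0
    return acc, P, R, F


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
acc, P, R, F = macro_metrics(y_true, y_pred)
print(round(acc, 3), round(P, 3), round(R, 3), round(F, 3))
```

Here the F-measure is computed from the macro-averaged precision and recall; averaging per-class F-measures instead is another common convention and gives slightly different values.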

DISCUSSION
The experiments have been performed on a workstation with a dual Intel(R) Xeon(R)
2.40 GHz and 64 GB RAM. The developed code is available at Nardone, Ciaramella &
Staiano (2019b). For the sake of readability, all the results presented here account only for
the SVM classifier, since the results showed that the proposed approach is only slightly
sensitive to the choice of a specific classifier (indeed, the performance of the classifiers
is rather comparable). Nevertheless, the interested reader may refer to the Supplemental
Material for details on additional results concerning all the used classifiers. The experimental
results on 5-fold CV for the SVM classifier are summarized in Tables 2–5. Figures 2–5
show all the considered model evaluation metrics for the ten feature selection methods on
the nine considered data sets.

We compared the performance of our method against TFS methods (see Tables 2–3)
and the GF-CSFS framework (see Tables 4–5). Looking at accuracy, precision, recall
and F-measure, SMBA-CSFS is able to better discriminate among the classes of the
LUNG_C, LUNG_D, CARCINOM, DLBCL and GCM data sets in most cases, when the
top 20 and 80 features are considered. In the latter case, when SMBA-CSFS performs
worse than its competitors, the corresponding performance tends to be comparable. On
the remaining data sets, each with fewer than 5 classes, namely ALLAML,
LEUKEMIA, CLL_SUB_111 and GLIOMA, SMBA-CSFS is instead outperformed by some
of the competitors. Consequently, we can assert that SMBA-CSFS behaves better when
working with data sets with many classes (at least 5). One possible reason is due to the
Nardone et al. (2019), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.237 11/25

https://peerj.com
http://dx.doi.org/10.7717/peerj-cs.237


Table 2 SVM accuracy results (ACC ± STD) on top 20 features using 5-fold CV on different data sets. TFS methods are compared against our methods (SMBA and SMBA-CSFS). FS: Fisher Score, mRMR: Minimum-Redundancy-Maximum-Relevance, MI: Mutual Information, RFS: Robust Feature Selector, EN: Elastic Net, BSL: all features. The best results are highlighted in bold. The number in parentheses is the number of features when the performance is achieved.

Average Accuracy of top 20 features (%)

| Method | ALLAML | LEUKEMIA | CLL_SUB_111 | GLIOMA | LUNG_C | LUNG_D | DLBCL | CARCINOM | GCM |
|---|---|---|---|---|---|---|---|---|---|
| Fisher | 96.84±0.04(19) | 98.95±0.02(16) | 75.20±0.1(19) | 80±0.04(13) | 91.94±0.02(19) | 91.24±0.1(20) | 97.11±0.02(19) | 65.33±0.05(20) | 94.9±0.00(20) |
| Relief | 95.78±0.04(8) | 97.89±0.03(12) | 76.45±0.03(15) | 80±0.07(19) | 97.12±0.01(20) | 95.2±0.03(14) | 99.76±0.00(20) | 86.52±0.03(18) | 97.14±0.01(20) |
| mRMR | 66.14±0.13(12) | 98.95±0.02(9) | 71.27±0.1(20) | 66.67±0.1(17) | 95.68±0.013(19) | 95.22±0.02(20) | 99.03±0.01(16) | 89.57±0.04(20) | 97.79±0.01(20) |
| MI | 96.84±0.042(15) | 98.95±0.02(10) | 81.03±0.06(17) | 78.33±0.04(12) | 97.41±0.014(17) | 94.53±0.03(18) | 98.79±0.01(19) | 93.25±0.05(20) | 95.58±0.01(20) |
| ls_l21 | 71.34±0.14(19) | 59.42±0.2(12) | 60.30±0.14(19) | 55±0.07(20) | 92.66±0.05(19) | 93.86±0.04(20) | 92.52±0.01(20) | 66.99±0.03(20) | 96.56±0.01(20) |
| ll_l21 | 83±0.11(15) | 88.36±0.06(20) | 73.12±0.06(15) | 0.75±0.12(17) | 98.27±0.015(16) | 93.24±0.04(16) | 94.44±0.02(19) | 83.49±0.03(20) | 97.69±0.01(20) |
| RFS | 87±0.01(15) | 74.33±0.1(18) | 64.73±0.09(15) | 66.67±0.07(17) | 94.10±0.022(20) | 89.77±0.02(19) | 91.06±0.03(18) | 81.85±0.07(18) | 96.77±0.01(20) |
| LASSO | 98.95±0.02(17) | 71.3±0.08(21) | 68.02±0.06(20) | 83.33±0.05(17) | 97.99±0.012(16) | 92.51±0.03(12) | 99.52±0.01(16) | 82.14±0.05(18) | 97.07±0.01(20) |
| EN | 98.95±0.02(17) | 71.3±0.08(21) | 68.02±0.06(20) | 83.33±0.05(17) | 97.99±0.012(16) | 92.51±0.03(12) | 99.52±0.01(16) | 82.14±0.05(18) | 97.07±0.01(20) |
| SMBA | 93.68±0.084(16) | 88.36±0.06(20) | 70.60±0.10(19) | 71.67±0.134(17) | 97.84±0.00(20) | 92.55±0.03(20) | 99.28±0.01(20) | 83.49±0.03(20) | 97.69±0.01(20) |
| SMBA-CSFS | 88.24±0.04(20) | 81.93±0.02(20) | 75.53±0.06(20) | 73.34±0.18(16) | 98.41±0.014(19) | 97.93±0.03(19) | 98.30±0.02(13) | 94.95±0.02(19) | 99.2±0.01(20) |
| BSL | 97.89±0.04 | 98.95±0.021 | 84.26±0.06 | 85±0.1 | 99.57±0.00 | 98.62±0.02 | 100±0.00 | 98.65±0.01 | 100±0.00 |


Table 3 SVM Precision (P), Recall (R) and F-measure (F) on top 20 features using 5-fold CV on different data sets. TFS methods are compared against our methods (SMBA and SMBA-CSFS). FS: Fisher Score, mRMR: Minimum-Redundancy-Maximum-Relevance, MI: Mutual Information, RFS: Robust Feature Selector, EN: Elastic Net, BSL: all features. The best results are highlighted in bold. The number in parentheses is the number of features when the performance is achieved.

Each cell lists P / R / F for the corresponding data set.

| Method | ALLAML | LEUKEMIA | CLL_SUB_111 | GLIOMA | LUNG_C | LUNG_D | DLBCL | CARCINOM | GCM |
|---|---|---|---|---|---|---|---|---|---|
| Fisher | 0.98(18) / 0.98(18) / 0.98 | 0.99(15) / 0.99(15) / 0.99 | 0.75(11) / 0.75(11) / 0.75 | 0.68(20) / 0.67(14) / 0.67 | 0.92(19) / 0.92(19) / 0.92 | 0.89(20) / 0.88(15) / 0.88 | 0.9(17) / 0.99(20) / 0.93 | 0.9(19) / 0.89(20) / 0.89 | 0.64(20) / 0.64(20) / 0.64 |
| Relief | 0.96(12) / 0.96(12) / 0.96 | 0.99(4) / 0.99(4) / 0.99 | 0.75(17) / 0.75(17) / 0.75 | 0.77(19) / 0.77(19) / 0.77 | 0.97(20) / 0.97(20) / 0.97 | 0.95(20) / 0.95(15) / 0.95 | 0.89(18) / 1.0(20) / 0.94 | 0.89(18) / 0.88(18) / 0.88 | 0.8(20) / 0.8(20) / 0.8 |
| mRMR | 0.8(19) / 0.8(19) / 0.8 | 0.98(6) / 0.98(17) / 0.98 | 0.64(14) / 0.66(14) / 0.65 | 0.7(12) / 0.7(12) / 0.7 | 0.97(20) / 0.97(20) / 0.97 | 0.96(19) / 0.95(19) / 0.95 | 0.95(20) / 0.99(14) / 0.92 | 0.88(20) / 0.91(20) / 0.89 | 0.85(20) / 0.85(20) / 0.85 |
| MI | 0.98(12) / 0.98(12) / 0.98 | 0.98(2) / 0.98(2) / 0.98 | 0.76(16) / 0.76(16) / 0.76 | 0.74(20) / 0.73(17) / 0.73 | 0.97(20) / 0.97(20) / 0.97 | 0.95(20) / 0.95(20) / 0.95 | 0.95(17) / 0.99(19) / 0.9 | 0.95(17) / 0.95(17) / 0.83 | 0.69(20) / 0.69(20) / 0.69 |
| ls_l21 | 0.83(18) / 0.81(18) / 0.82 | 0.84(20) / 0.82(20) / 0.83 | 0.7(20) / 0.7(20) / 0.7 | 0.7(16) / 0.7(17) / 0.7 | 0.97(20) / 0.97(20) / 0.97 | 0.89(19) / 0.88(19) / 0.88 | 0.81(19) / 0.93(17) / 0.87 | 0.81(19) / 0.81(20) / 0.81 | 0.76(20) / 0.76(20) / 0.76 |
| ll_l21 | 0.92(15) / 0.91(15) / 0.91 | 0.83(20) / 0.83(20) / 0.83 | 0.69(20) / 0.69(20) / 0.69 | 0.65(9) / 0.65(9) / 0.65 | 0.98(18) / 0.98(18) / 0.98 | 0.94(20) / 0.93(20) / 0.93 | 0.92(18) / 0.96(19) / 0.92 | 0.9(17) / 0.86(20) / 0.88 | 0.84(20) / 0.84(20) / 0.84 |
| RFS | 0.86(18) / 0.84(19) / 0.85 | 0.84(20) / 0.76(20) / 0.8 | 0.63(12) / 0.64(12) / 0.63 | 0.71(12) / 0.7(12) / 0.7 | 0.96(19) / 0.96(19) / 0.96 | 0.88(18) / 0.86(18) / 0.87 | 0.89(19) / 0.93(16) / 0.84 | 0.89(18) / 0.84(19) / 0.86 | 0.77(20) / 0.77(20) / 0.77 |
| LASSO | 0.84(20) / 0.84(13) / 0.84 | 0.77(20) / 0.77(20) / 0.77 | 0.71(6) / 0.71(10) / 0.71 | 0.79(14) / 0.78(14) / 0.78 | 0.94(20) / 0.94(19) / 0.94 | 0.93(19) / 0.9(20) / 0.91 | 0.84(18) / 0.97(19) / 0.9 | 0.84(18) / 0.84(18) / 0.84 | 0.8(20) / 0.8(20) / 0.8 |
| EN | 0.84(20) / 0.84(13) / 0.84 | 0.77(20) / 0.77(20) / 0.77 | 0.71(6) / 0.71(10) / 0.71 | 0.79(14) / 0.78(14) / 0.78 | 0.94(20) / 0.94(19) / 0.94 | 0.91(19) / 0.9(20) / 0.9 | 0.84(18) / 0.97(19) / 0.9 | 0.84(18) / 0.84(18) / 0.84 | 0.8(20) / 0.8(20) / 0.8 |
| SMBA | 0.9(13) / 0.89(16) / 0.89 | 0.83(20) / 0.83(20) / 0.83 | 0.7(11) / 0.7(11) / 0.7 | 0.68(15) / 0.68(15) / 0.68 | 0.97(18) / 0.97(18) / 0.97 | 0.91(19) / 0.9(19) / 0.9 | 0.92(19) / 0.99(17) / 0.92 | 0.9(19) / 0.86(20) / 0.88 | 0.84(20) / 0.84(20) / 0.84 |
| SMBA-CSFS | 0.83(16) / 0.83(16) / 0.83 | 0.86(20) / 0.86(20) / 0.86 | 0.67(20) / 0.68(20) / 0.67 | 0.8(20) / 0.77(20) / 0.78 | 0.98(15) / 0.98(15) / 0.98 | 0.99(19) / 0.99(19) / 0.99 | 1.0(20) / 1.0(20) / 1.0 | 0.99(20) / 0.98(20) / 0.98 | 0.97(20) / 0.97(20) / 0.97 |
| BSL | 1 / 1 / 1 | 1 / 1 / 1 | 0.74 / 0.74 / 0.74 | 0.92 / 0.92 / 0.92 | 0.93 / 0.93 / 0.93 | 0.8 / 0.8 / 0.8 | 1 / 1 / 1 | 0.98 / 0.98 / 0.98 | 1 / 1 / 1 |


Table 4 SVM accuracy results (ACC ± STD) on top 20 features using 5-fold CV on different data sets. GF-CSFS (Pineda-Bautista, Carrasco-Ochoa & Martínez-Trinidad, 2011) framework is compared against our SMBA-CSFS. FS: Fisher Score, mRMR: Minimum-Redundancy-Maximum-Relevance, MI: Mutual Information, RFS: Robust Feature Selector, EN: Elastic Net, BSL: all features. The best results are highlighted in bold. The number in parentheses is the number of features when the performance is achieved.

Average Accuracy of top 20 features (%)

| Method | ALLAML | LEUKEMIA | CLL_SUB_111 | GLIOMA | LUNG_C | LUNG_D | DLBCL | CARCINOM | GCM |
|---|---|---|---|---|---|---|---|---|---|
| Fisher | 95.90±0.03(13) | 98.57±0.03(18) | 80.41±0.02(7) | 82±0.16(17) | 95.09±0.03(20) | 86.38±0.14(16) | 100±0.00(14) | 90.86±0.08(20) | 98.98±0.0(18) |
| Relief | 92.95±0.04(5) | 95.81±0.03(10) | 82.41±0.05(12) | 80±0.19(12) | 91.63±0.02(20) | 86.39±0.07(20) | 100±0.00(11) | 89.68±0.03(17) | 98.71±0.0(20) |
| mRMR | 75.14±0.09(16) | 98.57±0.03(11) | 70.69±0.07(12) | 62±0.12(14) | 89.16±0.03(20) | 86.48±0.09(17) | 99.52±0.01(15) | 81.61±0.07(20) | 98.71±0.0(20) |
| MI | 94.38±0.03(18) | 97.14±0.03(4) | 81.03±0.05(20) | 82±0.21(19) | 95.07±0.015(11) | 79.90±0.18(14) | 100±0.00(19) | 90.86±0.06(11) | 98.67±0.0(19) |
| ls_l21 | 76.47±0.13(6) | 65.52±0.08(3) | 63.44±0.03(20) | 46±0.21(7) | 73.88±0.04(19) | 75.43±0.07(18) | 93.46±0.03(20) | 39.68±0.04(19) | 97.59±0.0(19) |
| ll_l21 | 82.1±0.05(16) | 80.67±0.09(15) | 74.58±0.07(20) | 68±0.13(18) | 91.15±0.02(15) | 67.24±0.12(15) | 96.38±0.02(17) | 72.40±0.05(17) | 96.87±0.0(20) |
| RFS | 79.24±0.168(17) | 74.95±0.09(6) | 71.94±0.10(19) | 68±0.21(13) | 82.79±0.05(17) | 68.67±0.07(18) | 96.62±0.01(20) | 58.03±0.18(20) | 96.97±0.01(20) |
| LASSO | 95.73±0.02(6) | 70.3±0.08(15) | 71.29±0.05(18) | 81.67±0.08(19) | 96.26±0.00(18) | 93.22±0.021(20) | 100±0.00(10) | 87.88±0.03(18) | 96.09±0.0(20) |
| EN | 95.73±0.04(10) | 70.3±0.08(15) | 68.73±0.10(19) | 81.67±0.08(19) | 95.97±0.012(18) | 93.22±0.021(20) | 100±0.00(10) | 88.56±0.03(19) | 96.09±0.0(20) |
| SMBA-CSFS | 88.24±0.04(20) | 81.93±0.02(20) | 75.53±0.06(20) | 73.34±0.18(16) | 98.41±0.014(19) | 97.93±0.03(19) | 98.30±0.02(13) | 94.95±0.02(19) | 99.2±0.01(20) |
| BSL | 97.89±0.04 | 98.95±0.021 | 84.26±0.06 | 85±0.1 | 99.57±0.00 | 98.62±0.02 | 100±0.00 | 98.65±0.01 | 100±0.00 |


Table 5 SVM Precision (P), Recall (R) and F-measure (F) on top 20 features using 5-fold CV on different data sets. GF-CSFS (Pineda-Bautista, Carrasco-Ochoa & Martínez-Trinidad, 2011) framework is compared against our SMBA-CSFS. FS: Fisher Score, mRMR: Minimum-Redundancy-Maximum-Relevance, MI: Mutual Information, RFS: Robust Feature Selector, EN: Elastic Net, BSL: all features. The best results are highlighted in bold. The number in parentheses is the number of features when the performance is achieved.

Each cell lists P / R / F for the corresponding data set.

| Method | ALLAML | LEUKEMIA | CLL_SUB_111 | GLIOMA | LUNG_C | LUNG_D | DLBCL | CARCINOM | GCM |
|---|---|---|---|---|---|---|---|---|---|
| Fisher | 0.96(15) / 0.96(14) / 0.96 | 0.97(2) / 0.97(2) / 0.97 | 0.84(4) / 0.84(4) / 0.84 | 0.76(8) / 0.75(8) / 0.75 | 0.96(18) / 0.96(18) / 0.96 | 0.97(16) / 0.97(16) / 0.97 | 1.0(17) / 1.0(17) / 1.0 | 0.95(13) / 0.95(13) / 0.95 | 0.93(18) / 0.93(18) / 0.93 |
| Relief | 0.98(16) / 0.98(16) / 0.98 | 0.97(8) / 0.97(8) / 0.97 | 0.82(5) / 0.82(5) / 0.82 | 0.72(19) / 0.7(15) / 0.71 | 0.95(19) / 0.95(19) / 0.95 | 0.96(9) / 0.95(9) / 0.95 | 1.0(10) / 1.0(10) / 1.0 | 0.96(17) / 0.96(17) / 0.96 | 0.91(20) / 0.91(20) / 0.91 |
| mRMR | 0.69(8) / 0.69(8) / 0.69 | 0.97(13) / 0.97(4) / 0.97 | 0.84(15) / 0.84(15) / 0.84 | 0.77(20) / 0.77(20) / 0.77 | 0.97(18) / 0.97(18) / 0.97 | 0.97(17) / 0.97(17) / 0.97 | 1.0(11) / 1.0(11) / 1.0 | 0.97(15) / 0.97(15) / 0.97 | 0.91(20) / 0.91(20) / 0.91 |
| MI | 0.99(17) / 0.99(17) / 0.99 | 0.98(2) / 0.98(17) / 0.98 | 0.8(13) / 0.8(13) / 0.8 | 0.75(3) / 0.75(3) / 0.75 | 0.94(18) / 0.94(18) / 0.94 | 0.97(11) / 0.97(11) / 0.97 | 1.0(12) / 1.0(12) / 1.0 | 0.97(17) / 0.97(16) / 0.97 | 0.91(19) / 0.91(19) / 0.91 |
| ls_l21 | 0.82(18) / 0.78(18) / 0.8 | 0.92(17) / 0.91(17) / 0.91 | 0.7(14) / 0.69(14) / 0.69 | 0.67(20) / 0.67(20) / 0.67 | 0.96(20) / 0.96(20) / 0.96 | 0.9(16) / 0.9(16) / 0.9 | 0.91(19) / 0.91(19) / 0.91 | 0.77(18) / 0.77(18) / 0.77 | 0.83(19) / 0.83(19) / 0.83 |
| ll_l21 | 0.91(19) / 0.9(19) / 0.9 | 0.87(14) / 0.86(14) / 0.86 | 0.76(20) / 0.76(20) / 0.76 | 0.73(19) / 0.73(19) / 0.73 | 0.96(16) / 0.96(16) / 0.96 | 0.91(18) / 0.9(18) / 0.9 | 0.97(17) / 0.97(17) / 0.97 | 0.85(20) / 0.85(20) / 0.85 | 0.78(20) / 0.78(20) / 0.78 |
| RFS | 0.87(14) / 0.85(14) / 0.86 | 0.96(19) / 0.96(19) / 0.96 | 0.68(12) / 0.69(12) / 0.68 | 0.69(20) / 0.67(20) / 0.68 | 0.95(20) / 0.95(20) / 0.95 | 0.93(19) / 0.91(19) / 0.92 | 0.94(20) / 0.93(20) / 0.93 | 0.85(19) / 0.85(19) / 0.85 | 0.79(20) / 0.79(20) / 0.79 |
| LASSO | 0.87(16) / 0.87(16) / 0.87 | 0.72(16) / 0.71(16) / 0.71 | 0.78(18) / 0.78(18) / 0.78 | 0.8(18) / 0.78(18) / 0.79 | 0.94(17) / 0.94(17) / 0.94 | 0.89(20) / 0.88(20) / 0.88 | 0.97(19) / 0.97(19) / 0.97 | 0.84(20) / 0.85(20) / 0.84 | 0.73(20) / 0.73(20) / 0.73 |
| EN | 0.87(16) / 0.87(16) / 0.87 | 0.72(16) / 0.71(16) / 0.71 | 0.78(18) / 0.78(18) / 0.78 | 0.8(18) / 0.78(18) / 0.79 | 0.94(17) / 0.94(17) / 0.94 | 0.89(20) / 0.88(20) / 0.88 | 0.97(19) / 0.97(19) / 0.97 | 0.84(20) / 0.85(20) / 0.84 | 0.73(20) / 0.73(20) / 0.73 |
| SMBA-CSFS | 0.83(16) / 0.83(16) / 0.83 | 0.86(20) / 0.86(20) / 0.86 | 0.67(20) / 0.68(20) / 0.67 | 0.8(20) / 0.77(20) / 0.78 | 0.98(15) / 0.98(15) / 0.98 | 0.99(19) / 0.99(19) / 0.99 | 1.0(20) / 1.0(20) / 1.0 | 0.99(20) / 0.98(20) / 0.98 | 0.97(20) / 0.97(20) / 0.97 |
| BSL | 1 / 1 / 1 | 1 / 1 / 1 | 0.74 / 0.74 / 0.74 | 0.92 / 0.92 / 0.92 | 0.93 / 0.93 / 0.93 | 0.8 / 0.8 / 0.8 | 1 / 1 / 1 | 0.98 / 0.98 / 0.98 | 1 / 1 / 1 |


Figure 2 Comparison of several TFS accuracies against SMBA and SMBA-CSFS on nine data sets:
(A) ALLAML(2), (B) LEUKEMIA(2), (C) CLL_SUB_111(3), (D) GLIOMA(4), (E) LUNG_C(5), (F)
LUNG_D(7), (G) DLBCL(9), (H) CARCINOM(11), (I) GCM(14), when a varying number of features is
selected. SVM classifier with 5-fold CV was used.

Full-size DOI: 10.7717/peerjcs.237/fig-2

sparse-modeling approach in selecting the features and the use of an ensemble classifier.
Indeed, since the ensemble is based on a majority voting scheme, SMBA-CSFS is able to
guess, with higher probability, the class membership of samples coming from data sets
with many classes. Consider that, whenever our method classifies a sample from a
two-class data set, the probability of a right guess is comparable to that of a coin toss.
Therefore, while this leads to good performance when the data set consists of many
classes, the probability of failure increases for data sets consisting of fewer classes.
In any case, the local structure of the data distribution, which is crucial for feature
selection as stated in He, Cai & Niyogi (2005), may be a logical reason why the SMBA
scheme performs better on certain data sets than on others. In addition, as shown in
Fig. 2, it is worth observing that SMBA-CSFS seems to perform better than the TFS
competitors when fewer features are selected. This suggests that SMBA-CSFS is able to
identify the most representative features that maximize the classification accuracy. To
corroborate the


Figure 3 Average ROC curves and the corresponding AUC values on the first 20 features comparing
the classification performance among SMBA-CSFS and TFS methods for nine data sets: (A)
ALLAML(2), (B) LEUKEMIA(2), (C) CLL_SUB_111(3), (D) GLIOMA(4), (E) LUNG_C(5), (F) LUNG_D(7),
(G) DLBCL(9), (H) CARCINOM(11), (I) GCM(14). SVM classifier with 5-fold CV was used.

Full-size DOI: 10.7717/peerjcs.237/fig-3

previous results, we computed the average ROC curves for SMBA-CSFS
and the other TFS methods on subsets of 20 and 80 features, respectively. The AUC
values in Fig. 3 suggest SMBA-CSFS as the best model for identifying the most
representative features in a classification task when dealing with data sets with many
classes. Concerning the GF-CSFS competitors, Fig. 4 suggests that the sparse modeling
process underlying the proposed SMBA scheme for feature selection is more suitable for
retrieving the best features for the purpose of classification, often leading to
satisfactory results. This statement is also supported by the good balance between
precision and recall shown in Table 5 and by the average ROC curves shown in Fig. 5,
where SMBA-CSFS remains competitive with the GF-CSFS methods. The reader is
referred to the Supplemental Material for all the experimental results and
considerations on the top 80 features.
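The majority-voting decision rule of the class-specific ensemble discussed above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the names `majority_vote` and `ensemble_predict` are hypothetical, each ensemble member is assumed to be a pair made of its class-specific feature indices and a prediction function, and ties are broken by taking the smallest label, which is an arbitrary choice made here for determinism.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most frequent label; ties go to the smallest label."""
    counts = Counter(votes)
    best = max(counts.values())
    return min(label for label, n in counts.items() if n == best)


def ensemble_predict(sample, members):
    """Classify `sample` with an ensemble of class-specific classifiers.

    `members` is a list of (feature_indices, predict) pairs. Each classifier
    only sees the features selected for its own class (assumed layout).
    """
    votes = []
    for feature_indices, predict in members:
        projected = [sample[i] for i in feature_indices]
        votes.append(predict(projected))
    return majority_vote(votes)
```

With three members voting 0, 1 and 1 on a sample, `ensemble_predict` returns 1. The original framework may use a more elaborate rule for resolving disagreements between members.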

To statistically validate the results and compare all the competing classifiers against the
proposed SMBA-CSFS, on both 20 and 80 feature subsets, we ran Non-Parametric multiple


Figure 4 Comparison of several CSFS accuracies against SMBA-CSFS on nine data sets: (A)
ALLAML(2), (B) LEUKEMIA(2), (C) CLL_SUB_111(3), (D) GLIOMA(4), (E) LUNG_C(5), (F)
LUNG_D(7), (G) DLBCL(9), (H) CARCINOM(11), (I) GCM(14), when a varying number of features is
selected. SVM classifier with 5-fold CV was used.

Full-size DOI: 10.7717/peerjcs.237/fig-4

comparison tests (all vs all) (Demšar, 2006; Rodríguez-Fdez et al., 2015), which sequentially
perform a Friedman non-parametric test (Friedman, 1937) followed by a Nemenyi post-hoc
multiple comparison (Dunn, 1961). The rankings of the classifiers, when the top 20 and 80
features are selected, along with the corresponding p-values, are reported in the
Supplemental Material. Looking at the Cumulative Rank (CR) of each classifier, one can
notice that SMBA-CSFS achieves strong results, always finishing in the first three
positions. Moreover, our method systematically ranks in the top position when
considering data sets consisting of five or more classes (denoted CR≥5). These results
confirm that SMBA-CSFS achieves good performance on data sets with many classes.
Furthermore, using different classifiers does not produce noteworthy differences in the
results, meaning that the methodology is suitable for the classification of this kind of
data, independently of the selected classifier. Finally, by looking at the p-values
corresponding to each ranking method, one can better verify which algorithms have


Figure 5 Average ROC curves and the corresponding AUC values on the first 20 features comparing
the classification performance among SMBA-CSFS and several CSFS methods for nine data sets:
(A) ALLAML(2), (B) LEUKEMIA(2), (C) CLL_SUB_111(3), (D) GLIOMA(4), (E) LUNG_C(5), (F)
LUNG_D(7), (G) DLBCL(9), (H) CARCINOM(11), (I) GCM(14). SVM classifier with 5-fold CV was
used.

Full-size DOI: 10.7717/peerjcs.237/fig-5

significantly different performance w.r.t. SMBA-CSFS. For detailed information regarding
the results, see the Supplemental Material. Concerning the computational complexity, from
several experiments we observed that the proposed methodology may be slower
than other techniques (e.g., FS and Relief, whose running times are on the order of a few
seconds) but comparable with SMBA. Its running time depends on several parameters,
especially the number of instances and classes of the data sets, and may vary from
a couple of hours to at most one day (see Table S9 for details on the computational time).
Nevertheless, SMBA-CSFS achieves appreciable performance when working on large data
sets with many classes and, in the biological field, accuracy in finding the key features
responsible for some biological processes is sometimes preferred to execution
time. However, since most of the time consumed by the proposed approach is due to
the solution of the optimization problem by using the ADMM method, and because the


methodology is based on an ensemble of classifiers, a parallel computing approach could
be adopted to reduce the computational time (Deng et al., 2017).
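The Friedman ranking used in the statistical validation above can be sketched as follows, using the formulas reported in Demšar (2006): methods are ranked on each data set (rank 1 is best, ties share the average rank), the Friedman statistic is computed from the mean ranks, and the Nemenyi critical difference gives the minimum gap between mean ranks that is considered significant. The function names are hypothetical, and the critical value `q_alpha` is assumed to be supplied by the caller from a Studentized range table.

```python
def average_ranks(scores):
    """Mean rank of each method; scores[i][j] is method j on data set i.

    Higher scores are better, rank 1 is best, tied methods share the
    average of the positions they occupy.
    """
    n_methods = len(scores[0])
    totals = [0.0] * n_methods
    for row in scores:
        order = sorted(range(n_methods), key=lambda j: -row[j])
        pos = 0
        while pos < n_methods:
            end = pos
            while end + 1 < n_methods and row[order[end + 1]] == row[order[pos]]:
                end += 1
            shared = (pos + end) / 2 + 1  # mean of 1-based positions pos+1..end+1
            for j in order[pos:end + 1]:
                totals[j] += shared
            pos = end + 1
    return [t / len(scores) for t in totals]


def friedman_statistic(scores):
    """Friedman chi-square statistic for N data sets and k methods."""
    n, k = len(scores), len(scores[0])
    ranks = average_ranks(scores)
    return 12 * n / (k * (k + 1)) * (sum(r * r for r in ranks)
                                     - k * (k + 1) ** 2 / 4)


def nemenyi_critical_difference(q_alpha, k, n):
    """Two methods differ significantly if their mean ranks differ by more."""
    return q_alpha * (k * (k + 1) / (6 * n)) ** 0.5
```

The p-value of the Friedman statistic would then be read from a chi-square distribution with k − 1 degrees of freedom, which is omitted here to keep the sketch free of external dependencies.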

CONCLUSIONS
We proposed a Sparse-Modeling Based Approach for Feature Selection, emphasizing
joint ℓ1,2-norm minimization and Class-Specific Feature Selection. Experimental
results on nine different data sets validate the unique aspects of SMBA-CSFS and
demonstrate its promising performance against state-of-the-art methods. One
of the main characteristics of our framework is that, by jointly exploiting the ideas of
Sparse Modeling and Class-Specific Feature Selection, it is able to identify the
most representative features that maximize the classification accuracy in those cases where
a given data set is made up of many classes. Based on our experimental results, we can
conclude that applying TFS usually achieves better results than using all the
available features. However, in many cases, applying the proposed SMBA-CSFS method
improves on the performance of both plain TFS and GF-CSFS combined with several
TFS methods. It has to be stressed that SMBA-CSFS seems particularly suitable for large
data sets consisting of many classes, while on data sets with fewer than five classes other
methods appear to be more effective. Although the performances of SMBA, SMBA-CSFS
and TFS differ only slightly on the whole, it is worth highlighting that SMBA-CSFS
achieves its best performance when considering fewer features (i.e., from 1 to 20) on data
sets with many classes, which is an important goal when certain biological tasks are taken
into account. We believe that these techniques might be effectively used in a systematic
way after a microarray analysis. Indeed, a better gene selection step could avoid wasting
many resources in post-array wet analysis (e.g., Real Time-PCR), allowing researchers to
focus their attention on relevant features only. Finally, we think this method has proved
to be an interesting alternative among feature selection approaches on microarray data.

As future work, the focus will move towards the biological interpretation of the
behavior of the SMBA framework, by systematically studying the selected genes, especially
for the SMBA-CSFS approach which, as indicated by the experimental results,
is more effective in selecting genes of interest than the standard SMBA. Furthermore,
we are planning to test our approach on the EPIC data set (Demetriou et al., 2013), after a
thorough analysis of pre-filtering, and to develop a parallel implementation to
substantially reduce its computational time.

ACKNOWLEDGEMENTS
The research was entirely developed when Davide Nardone was a Master's degree student
in Applied Computer Science at the University of Naples Parthenope.


ADDITIONAL INFORMATION AND DECLARATIONS

Funding
This work was supported by Dipartimento di Scienze e Tecnologie Università degli Studi di
Napoli “Parthenope” (Sostegno alla ricerca individuale per il triennio 2016–2018 project).
The funders had no role in study design, data collection and analysis, decision to publish,
or preparation of the manuscript.

Grant Disclosures
The following grant information was disclosed by the authors:
Dipartimento di Scienze e Tecnologie Università degli Studi di Napoli “Parthenope”
(Sostegno alla ricerca individuale per il triennio 2016–2018 project).

Competing Interests
The authors declare there are no competing interests.

Author Contributions
• Davide Nardone conceived and designed the experiments, performed the experiments,
analyzed the data, prepared figures and/or tables, performed the computation work,
authored or reviewed drafts of the paper, approved the final draft.
• Angelo Ciaramella and Antonino Staiano conceived and designed the experiments,
prepared figures and/or tables, authored or reviewed drafts of the paper, approved the
final draft.

Data Availability
The following information was supplied regarding data availability:

The data supporting the experiments in this article are available at Zenodo: Davide
Nardone. (2019). Biological datasets for SMBA (Version 1.0.0).
http://doi.org/10.5281/zenodo.2709491.

A Python software package is available through GitHub at
https://github.com/DavideNardone/A-Sparse-Coding-Based-Approach-for-Class-Specific-Feature-Selection,
containing all the source code used to run the software.

Supplemental Information
Supplemental information for this article can be found online at
http://dx.doi.org/10.7717/peerj-cs.237#supplemental-information.

REFERENCES
Aharon M, Elad M, Bruckstein A. 2006. K-SVD: an algorithm for designing overcomplete
dictionaries for sparse representation. IEEE Transactions on Signal Processing
54(11):4311–4322 DOI 10.1109/TSP.2006.881199.

Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, Boldrick JC, Sabet H,
Tran T, Yu X, Powell JI, Yang L, Marti GE, Moore T, Hudson JJ, Lu L, Lewis DB,
Tibshirani R, Sherlock G, Chan WC, Greiner TC, Weisenburger DD, Armitage
JO, Warnke R, Levy R, Wilson W, Grever MR, Byrd JC, Botstein D, Brown P,
Staudt LM. 2000. Distinct types of diffuse large B-cell lymphoma identified by gene
expression profiling. Nature 403(6769):503–511 DOI 10.1038/35000501.

Bhattacharjee A, Richards WG, Staunton J, Li C, Monti S, Vasa P, Ladd C, Beheshti J,
Bueno R, Gillette M, Loda M, Weber G, Mark EJ, Lander ES, Wong W, Johnson
BE, Golub TR, Sugarbaker DJ, Meyerson M. 2001. Classification of human
lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma
subclasses. Proceedings of the National Academy of Sciences of the United States of
America 98(24):13790–13795 DOI 10.1073/pnas.191502998.

Boyd S, Parikh N, Chu E, Peleato B, Eckstein J. 2011. Distributed optimization and
statistical learning via the alternating direction method of multipliers. Foundations
and Trends in Machine learning 3(1):1–122 DOI 10.1561/2200000016.

Calcagno G, Staiano A, Fortunato G, Brescia-Morra V, Salvatore E, Liguori R,
Capone S, Filla A, Longo G, Sacchetti L. 2010. A multilayer perceptron neural
network-based approach for the identification of responsiveness to interferon
therapy in multiple sclerosis patients. Information Sciences 180(21):4153–4163
DOI 10.1016/j.ins.2010.07.004.

Camastra F, Di Taranto M, Staiano A. 2015. Statistical and computational methods for
genetic diseases: an overview. Computational and Mathematical Methods in Medicine
2015:954598.

Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. 2002. SMOTE: synthetic minority
over-sampling technique. Journal of Artificial Intelligence Research 16:321–357
DOI 10.1613/jair.953.

Ciaramella A, Cocozza S, Iorio F, Miele G, Napolitano F, Pinelli M, Raiconi G,
Tagliaferri R. 2008. Interactive data analysis and clustering of genomic data. Neural
Networks 21(2–3):368–378 DOI 10.1016/j.neunet.2007.12.026.

Ciaramella A, Gianfico M, Giunta G. 2016. Compressive sampling and adaptive dictionary
learning for the packet loss recovery in audio multimedia streaming. Multimedia
Tools and Applications 75(24):17375–17392 DOI 10.1007/s11042-015-3002-x.

Ciaramella A, Giunta G. 2016. Packet loss recovery in audio multimedia streaming by
using compressive sensing. IET Communications 10(4):387–392
DOI 10.1049/iet-com.2014.0995.

Demetriou CA, Chen J, Polidoro S, Van Veldhoven K, Cuenin C, Campanella G,
Brennan K, Clavel-Chapelon F, Dossus L, Kvaskoff M, Drogan D, Boeing H,
Kaaks R, Risch A, Trichopoulos D, Lagiou P, Masala G, Sieri S, Tumino R, Panico
S, Quirós JR, Sánchez Perez MJ, Amiano P, Huerta Castaño JM, Ardanaz E,
Onland-Moret C, Peeters P, Khaw KT, Wareham N, Key TJ, Travis RC, Romieu
I, Gallo V, Gunter M, Herceg Z, Kyriacou K, Riboli E, Flanagan JM, Vineis P. 2013.
Methylome analysis and epigenetic changes associated with menarcheal age. PLOS
ONE 8(11):e79391 DOI 10.1371/journal.pone.0079391.

Demšar J. 2006. Statistical comparisons of classifiers over multiple data sets. Journal of
Machine Learning Research 7(Jan):1–30.


Deng W, Lai M-J, Peng Z, Yin W. 2017. Parallel multi-block ADMM with o(1/k) convergence.
Journal of Scientific Computing 71(2):712–736 DOI 10.1007/s10915-016-0318-2.

Di Taranto MD, Staiano A, D’Agostino MN, D’Angelo A, Bloise E, Morgante A,
Marotta G, Gentile M, Rubba P, Fortunato G. 2015. Association of USF1 and
APOA5 polymorphisms with familial combined hyperlipidemia in an Italian population.
Molecular and Cellular Probes 29(1):19–24 DOI 10.1016/j.mcp.2014.10.002.

Draghici S, Khatri P, Eklund A, Szallasi Z. 2006. Reliability and reproducibility
issues in DNA microarray measurements. Trends in Genetics 22(2):101–109
DOI 10.1016/j.tig.2005.12.005.

Dunn OJ. 1961. Multiple comparisons among means. Journal of the American Statistical
Association 56(293):52–64 DOI 10.1080/01621459.1961.10482090.

Elhamifar E, Sapiro G, Vidal R. 2012. See all by looking at a few: sparse modeling for
finding representative objects. In: IEEE conference on computer vision and pattern
recognition. Piscataway: IEEE, 1600–1607.

Engan K, Aase SO, Husoy JH. 1999. Method of optimal directions for frame design.
In: 1999 IEEE international conference on acoustics, speech, and signal processing.
Piscataway: IEEE, 2443–2446.

Friedman J, Hastie T, Tibshirani R. 2001. The elements of statistical learning. Vol. 1.
New-York: Springer.

Friedman M. 1937. The use of ranks to avoid the assumption of normality implicit in the
analysis of variance. Journal of the American Statistical Association 32(200):675–701
DOI 10.1080/01621459.1937.10503522.

Fu X, Wang L. 2002. A GA-based RBF classifier with class-dependent features. In:
Evolutionary computation, 2002. CEC’02. Proceedings of the 2002 congress on, vol. 2.
IEEE, 1890–1894.

Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H,
Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES. 1999. Molecular
classification of cancer: class discovery and class prediction by gene expression
monitoring. Science 286(5439):531–537 DOI 10.1126/science.286.5439.531.

Gu Q, Li Z, Han J. 2012. Generalized fisher score for feature selection. ArXiv preprint.
arXiv:1202.3725.

Guyon I, Elisseeff A. 2003. An introduction to variable and feature selection. Journal of
Machine Learning Research 3:1157–1182.

Haslinger C, Schweifer N, Stilgenbauer S, Döhner H, Lichter P, Kraut N, Stratowa C,
Abseher R. 2004. Microarray gene expression profiling of B-cell chronic lymphocytic
leukemia subgroups defined by genomic aberrations and VH mutation status.
Journal of Clinical Oncology 22(19):3937–3949 DOI 10.1200/JCO.2004.12.133.

He X, Cai D, Niyogi P. 2005. Laplacian score for feature selection. In: Advances in neural
information processing systems. Cambridge: MIT Press.

Hoque N, Bhattacharyya DK, Kalita JK. 2014. MIFS-ND: a mutual information-based
feature selection method. Expert Systems with Applications 41(14):6371–6385
DOI 10.1016/j.eswa.2014.04.019.


Jolliffe IT. 1986. Principal component analysis and factor analysis. In: Principal
component analysis. New York: Springer, 115–128.

Jović A, Brkić K, Bogunović N. 2015. A review of feature selection methods with
applications. In: 2015 38th international convention on information and communication
technology, electronics and microelectronics (MIPRO). IEEE, 1200–1205.

Kira K, Rendell LA. 1992. A practical approach to feature selection. In: Proceedings of the
ninth international workshop on machine learning. 249–256.

Kononenko I. 1994. Estimating attributes: analysis and extensions of RELIEF. In:
European conference on machine learning. Berlin, Heidelberg: Springer, 171–182.

Kraskov A, Stögbauer H, Grassberger P. 2004. Estimating mutual information. Physical
Review E 69(6):066138 DOI 10.1103/PhysRevE.69.066138.

Kreyszig E. 2010. Advanced engineering mathematics. Chichester: John Wiley & Sons.

Mairal J, Bach F, Ponce J, Sapiro G, Zisserman A. 2008. Discriminative learned
dictionaries for local image analysis. In: IEEE conference on computer vision and pattern
recognition, 2008. CVPR 2008. Piscataway: IEEE, 1–8.

Mairal J, Bach F, Ponce J, Sapiro G, Zisserman A. 2009. Non-local sparse models for
image restoration. In: IEEE 12th international conference on computer vision and
pattern recognition. Piscataway: IEEE, 2272–2279.

Nardone D, Ciaramella A, Staiano A. 2019a. Biological datasets. Available at
https://zenodo.org/record/3405292#.XXkAtugzaUk.

Nardone D, Ciaramella A, Staiano A. 2019b. Source code. Available at
https://github.com/DavideNardone/A-Sparse-Coding-Based-Approach-for-Class-Specific-Feature-Selection.

Nie F, Huang H, Cai X, Ding CH. 2010. Efficient and robust feature selection via joint
ℓ2,1-norms minimization. In: Advances in neural information processing systems.
Vancouver, British Columbia, Canada, 1813–1821.

Nutt CL, Mani D, Betensky RA, Tamayo P, Cairncross JG, Ladd C, Pohl U, Hartmann
C, McLaughlin ME, Batchelor TT, Black PM, Von Deimling A, Pomeroy SL, Golub
TR, Louis DN. 2003. Gene expression-based classification of malignant gliomas
correlates better with survival than histological classification. Cancer Research
63(7):1602–1607.

Peng H, Long F, Ding C. 2005. Feature selection based on mutual information
criteria of max-dependency, max-relevance, and min-redundancy. IEEE
Transactions on Pattern Analysis and Machine Intelligence 27(8):1226–1238
DOI 10.1109/TPAMI.2005.159.

Pineda-Bautista BB, Carrasco-Ochoa JA, Martínez-Trinidad JF. 2011. General
framework for class-specific feature selection. Expert Systems with Applications
38(8):10018–10024 DOI 10.1016/j.eswa.2011.02.016.

Ramaswamy S, Tamayo P, Rifkin R, Mukherjee S, Yeang C-H, Angelo M, Ladd C, Reich
M, Latulippe E, Mesirov JP, Poggio T, Gerald WL, Loda MF, Lander ES, Golub
TR. 2001. Multiclass cancer diagnosis using tumor gene expression signatures.
Proceedings of the National Academy of Sciences of the United States of America
98(26):15149–15154 DOI 10.1073/pnas.211566398.

Ramirez I, Sprechmann P, Sapiro G. 2010. Classification and clustering via dictionary
learning with structured incoherence and shared features. In: 2010 IEEE conference on
computer vision and pattern recognition (CVPR). Piscataway: IEEE, 3501–3508.

Rodríguez-Fdez I, Canosa A, Mucientes M, Bugarín A. 2015. STAC: a web platform for
the comparison of algorithms using statistical tests. In: Fuzzy systems (FUZZ-IEEE),
2015 IEEE international conference on. Piscataway: IEEE, 1–8.

Ross BC. 2014. Mutual information between discrete and continuous data sets. PLOS
ONE 9(2):e87357 DOI 10.1371/journal.pone.0087357.

Saeys Y, Inza I, Larrañaga P. 2007. A review of feature selection techniques in
bioinformatics. Bioinformatics 23(19):2507–2517 DOI 10.1093/bioinformatics/btm344.

Sokolova M, Lapalme G. 2009. A systematic analysis of performance measures
for classification tasks. Information Processing & Management 45(4):427–437
DOI 10.1016/j.ipm.2009.03.002.

Staiano A, De Vinco L, Ciaramella A, Raiconi G, Tagliaferri R, Amato R, Longo G,
Donalek C, Miele G, Di Bernardo D. 2004. Probabilistic principal surfaces for
yeast gene microarray data mining. In: Proceedings of the fourth IEEE international
conference on data mining, ICDM 2004. Piscataway: IEEE, 202–208.

Staiano A, Di Taranto MD, Bloise E, D’Agostino MN, D’Angelo A, Marotta G, Gentile
M, Jossa F, Iannuzzi A, Rubba P, Fortunato G. 2013. Investigation of single
nucleotide polymorphisms associated to familial combined hyperlipidemia with
random forests. In: Neural nets and surroundings. Vol. 19(1). Berlin, Heidelberg:
Springer, 169–178.

Su AI, Welsh JB, Sapinoso LM, Kern SG, Dimitrov P, Lapp H, Schultz PG, Powell SM,
Moskaluk CA, Frierson H, Hampton GM. 2001. Molecular classification of human
carcinomas by use of gene expression signatures. Cancer Research 61(20):7388–7393.

Tang J, Alelyani S, Liu H. 2014. Feature selection for classification: a review. In:
Data classification: algorithms and applications. Boca Raton: CRC Press, 37–64
DOI 10.1201/b17320.

Tibshirani R. 1994. Regression shrinkage and selection via the lasso. Journal of the Royal
Statistical Society, Series B 58:267–288.

Wolpert DH, Macready WG. 1997. No free lunch theorems for optimization. IEEE
Transactions on Evolutionary Computation 1(1):67–82 DOI 10.1109/4235.585893.

Xiong M, Fang X, Zhao J. 2001. Biomarker identification by feature wrappers. Genome
Research 11(11):1878–1887 DOI 10.1101/gr.190001.

Yang K, Cai Z, Li J, Lin G. 2006. A stable gene selection in microarray data analysis. BMC
Bioinformatics 7(1):228 DOI 10.1186/1471-2105-7-228.

Zou H, Hastie T. 2005. Regularization and variable selection via the elastic net. Journal
of the Royal Statistical Society: Series B (Statistical Methodology) 67(2):301–320
DOI 10.1111/j.1467-9868.2005.00503.x.
