A Bi-objective Feature Selection Algorithm for Large Omics Datasets

Luís Cavique 1,2 [0000-0002-5590-1493], Armando B. Mendes 3,4 [0000-0003-3049-5852], Hugo F.M.C. Martiniano 1,5 [0000-0003-2490-8913] and Luís Correia 1 [0000-0003-2439-1168]

1 MAS-BioISI, FCUL, Lisboa, Portugal
2 Universidade Aberta, Lisboa, Portugal
3 Universidade Açores, Ponta Delgada, Portugal
4 Algoritmi, Universidade do Minho, Portugal
5 Instituto Dr. Ricardo Jorge, Lisboa, Portugal
luis.cavique@uab.pt, armando.b.mendes@uac.pt, hfmartiniano@fc.ul.pt, luis.correia@ciencias.ulisboa.pt

Abstract. Feature selection is one of the most important concepts in data mining when dimensionality reduction is needed. The performance measures of feature selection encompass predictive accuracy and result comprehensibility. Consistency based methods are a significant category of feature selection research that substantially improves the comprehensibility of the result using the parsimony principle. In this work, the bi-objective version of the algorithm Logical Analysis of Inconsistent Data is applied to large volumes of data. In order to deal with hundreds of thousands of attributes, heuristic decomposition uses parallel processing to solve a set covering problem and a cross-validation technique. The bi-objective solutions contain the number of reduced features and the accuracy. The algorithm is applied to omics datasets with genome-like characteristics of patients with rare diseases.

Keywords: feature selection, logical analysis of data, heuristic decomposition, bi-objective optimization

1. Introduction

In feature selection two main objectives sustain the performance measures, the predictive accuracy and the result comprehensibility [Liu, Yu 2005]. Some models do not consider the predictive accuracy, whereas others tend to destroy the underlying semantics of the features after reduction. It would be highly desirable to find a method that could not only reduce the number of features, but also preserve the data semantics. In this context, Rough Set theory emerges as an important tool to discover data dependencies and reduce dimensionality.

Rough Set theory was initially proposed as a tool to reason about vagueness and uncertainty in information systems by Pawlak [Pawlak 1982] and later it was also proposed for attribute selection [Pawlak 1991]. Rough Sets determine a lower and an upper approximation for each class, rather than correct or exclude data inconsistencies.

In parallel, Peter Hammer's group [Crama et al. 1988], [Boros et al. 2000], with works in discrete optimization, developed LAD, Logical Analysis of Data. The key features of LAD are the discovery of the minimum number of attributes necessary to explain all observations and the detection of hidden patterns in a dataset with two classes. An extension of the Boolean approach uses nominal non-binary attributes.

LAD and Rough Set approaches are a subset of filter models that aim to reduce the number of attributes of datasets using two phases: problem transformation and optimization. Their specificity is to keep the semantics of the data by only removing the redundant data. By combining the two approaches, we proposed Logical Analysis of Inconsistent Data, LAID [Cavique et al. 2013], which blends the best characteristics of both methods. The LAID method should deal with integer attributes associated with costs as in LAD and be tolerant to inconsistency as in Rough Sets.
The integration of both approaches is so close that LAID can be seen as a Rough Set extension.

One key area for the application of the LAID method is high-dimensional omics datasets, such as those generated by high-throughput sequencing technologies. In the past few years, technological developments in this area have resulted in an astonishing growth of the amount of data produced. As most disease related omics datasets are subject to restricted access, artificial datasets with similar characteristics have been created.

The goal of this work is to solve a feature selection problem with a dataset involving two thousand observations and one million attributes. To deal with millions of attributes a heuristic decomposition is used in a parallel computer environment. We consider two performance measures, the comprehensibility of the result and the accuracy. For the comprehensibility of the solution, we aim to achieve the minimum number of attributes with the maximum actionable knowledge that better explains the dataset to the final user. The accuracy is given by a cross-validation technique.

This document extends previous works on the LAID characterization [Cavique et al. 2011], [Cavique et al. 2013]. The application of LAID to larger datasets, with the development of a heuristic decomposition approach with a sub-problem algorithm, was presented in [Cavique et al. 2017]. Following the previous work, the novelty of this document is the presentation of the master problem algorithm with a bi-objective approach for omics datasets. The bi-objective approach reflects both the predictive accuracy concern and the result comprehensibility.

This document is organized as follows. In Section 2, we present the related concepts of feature selection, omics datasets and heuristic decomposition. In Section 3, we present the sub-problem algorithm that finds the feature selection with the minimum number of attributes and computes the accuracy. In Section 4, the heuristic decomposition algorithm and the bi-objective master problem algorithm are presented. In Section 5, the computational results are shown. Finally, in the last section we draw some conclusions about this new approach.

2. Related work

This work combines the areas of attribute selection and problem decomposition to deal with omics datasets. Consequently, in this section we introduce the related topics: feature selection, omics data and heuristic decomposition. As they will be combined in the proposed sub-problem algorithm, two feature selection methods, Rough Sets and LAD, are detailed.

2.1. Feature Selection

The motivation to reduce the dimensionality of the feature space is closely related to the ever-shorter time it takes for the amount of information in the world to double. Surveys on feature selection methods can be found in [Liu, Yu 2005], [Chandrashekar, Sahin 2014]. In feature selection, given thousands to millions of features, the goal is to select the most relevant ones. The performance measures have two main goals, the predictive accuracy and the result comprehensibility. In this work we emphasize the comprehensibility of the results, following Occam's razor principle that aims to obtain the simplest model.

We consider a dataset D={O, X∪C} where O={O1,O2,…,On} is a non-empty set of observations (instances or cases), X={X1,X2,…,Xm} is a non-empty set of features (attributes or columns) and C is the class attribute. In this brief review we use the feature selection taxonomies reported in [Liu, Yu 2005].
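For reference in the remainder of the paper, D={O, X∪C} can be held as a binary observation-by-feature matrix plus a class vector, as in the minimal C sketch below. The concrete values anticipate the running example of Section 3; the identifiers are illustrative assumptions, not the authors' implementation.

#include <stdio.h>

#define N_OBS  5   /* observations O1..O5                        */
#define N_FEAT 7   /* features X1..X7 (X7 is the 'jnsq' column)  */

/* dataset D = {O, X ∪ C}: binary feature matrix plus class vector */
static const int X[N_OBS][N_FEAT] = {
    {1,1,0,1,1,1,0},   /* O1 */
    {1,0,1,1,1,0,0},   /* O2 */
    {1,0,1,1,1,0,1},   /* O3 */
    {1,0,1,0,0,1,0},   /* O4 */
    {0,0,1,0,0,0,0}    /* O5 */
};
static const int C[N_OBS] = {0, 0, 1, 1, 1};   /* class attribute */

int main(void) {
    printf("D has %d observations and %d features\n", N_OBS, N_FEAT);
    return 0;
}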
There are three basic models in feature selection: the Filter, Wrapper and Hybrid models. In the Filter model the most popular independent criteria are consistency measures, distance measures, information measures and dependency measures. In this sub-section the distance measure and the consistency measure are detailed because they are used in this document.

Distance measure criteria are applied to observations within the same class and to observations of different classes. In the same class, to obtain the disagreement value, the logic operator XOR is applied, which returns True if (Ox, Xa) ≠ (Oy, Xa) and False otherwise. With different classes, the logic operator XOR can also be applied, where if (Ox, Xa) ≠ (Oy, Xa), feature Xa should be chosen because it differentiates the classes. Comparisons of observations within the same class measure the incoherence or noise of the feature. On the other hand, comparisons of observations between different classes measure how strong a feature is in the discrimination or separation of the classes. The RELIEF algorithm [Kira, Rendell 1992] evaluates a feature subset based on the subtraction between the distance of observations in different classes and the distance of observations within the same class.

Using the consistency measure criteria, an inconsistency occurs when two or more observations have the same values in all attributes but belong to different decision classes. This measure is used in algorithms that attempt to find the minimum number of features with the minimum number of inconsistencies. The well-known algorithm FOCUS [Almuallim, Dietterich 1991] uses the concept of consistency, where irrelevant features are removed. The LAD algorithm [Crama et al. 1988], [Boros et al. 2000] is developed in two steps. In the first step, a disjoint matrix is obtained using the XOR operator to compare observations from different classes. In the second step, a set covering algorithm is applied to ensure data consistency.

In the feature selection process using addition or removal of attributes, the information measure is given by the information gain from an added or removed feature. The dependency measure of a feature is related to how important the correlation is between the feature and the class.

The Filter model is divided into two sequential steps. The feature selection step is executed before the learning phase of the prediction model, and there is no interaction between the selection and the prediction model. The Wrapper model [John et al. 1994] is also divided into two steps, but with strong interaction between the feature selection phase and the learning phase, where the results of the prediction are used as a criterion of feature choice.

Filter models are more intuitive and show better computational performance since they build the solution in a constructive process without iterations. On the other hand, they present the disadvantage of ignoring the prediction process. Consequently, the most relevant features might not be adequate for the prediction model and the selection criterion is hard to estimate. In the Wrapper models, the selection criterion is easy to estimate since the features are chosen by the prediction model, and therefore they are classified as model-aware as they incorporate the knowledge of the predictor. Contrary to the Filter models, Wrapper models are computationally expensive and less intuitive. The main disadvantage of this method is the increasing overfitting risk when the number of observations is insufficient.
In other words, the Wrapper models do not identify statistical dependency, so the features might not be the main explanatory variables and therefore the model can lose its theoretical basis.

Hybrid methods have been proposed to reduce features in classification by combining the advantages of the two previous methods. The main disadvantage of these methods is their significantly lower scalability when the number of features increases.

To sum up, in Filter methods the selection criterion is hard to estimate, whereas Wrapper methods tend to destroy the underlying feature semantics. In this work, we opt for Filter methods for two main reasons: (i) given that large volumes of data should be processed, Filter methods are more computationally efficient; (ii) given that the result comprehensibility is an important criterion, Filter methods are preferable since they preserve the semantics of the data.

2.2. Rough Sets and Logical Analysis of Data (LAD)

In this sub-section two Filter methods are presented, Rough Sets and Logical Analysis of Data (LAD). Both methods try to preserve the semantics of the data and the algorithms are composed of two phases, the problem transformation and the problem optimization. The Rough Sets algorithm transforms data into a discernibility matrix and obtains the result by solving a SAT optimization problem. LAD first converts data into a disjoint matrix and then applies the Set Covering problem to obtain the result.

Rough Sets

Rough Sets theory was initially proposed as a tool to reason about vagueness and uncertainty in information systems by Pawlak [1982] and later it was also proposed for attribute selection by Pawlak [1991]. The applications of the Rough Sets method are wide, leading to meaningful results in many fields, such as conflict analysis, finance, industry, multimedia, medicine, and most recently bioinformatics [Polkowski 2002], [Peters, Skowron 2010]. A detailed example of attribute selection with Rough Sets can be found in Cavique et al. [2013].

In Rough Sets, table values can have any integer value rather than binary values. Inconsistencies occur when two cases have the same values for all attributes but belong to different decision classes. A practical example is two sick people that have the same symptoms but different diseases. With real data this is possible, because the table might have a missing attribute that could discriminate between them. Rough Sets do not correct or exclude the inconsistencies, but for each class they determine a lower and an upper approximation.

Given D={O, X∪C}, a subset of objects Y⊆O and a subset of attributes B⊆X, Pawlak's Rough Sets theory defines two approximation spaces: the lower and the upper rough approximation. The lower approximation B_L(Y) is the greatest composed set contained in Y, and the upper approximation B^U(Y) is the least composed set that contains Y. As a consequence, the approximation space leads to B_L(Y) ⊆ Y ⊆ B^U(Y). Also, the lower and upper approximations of a subset Y⊆O can be seen as operators in the universe of objects, dividing it into three disjoint regions, the positive region POS(Y), the negative region NEG(Y) and the boundary region BR(Y):

POS(Y) = B_L(Y), NEG(Y) = O \ B^U(Y) and BR(Y) = B^U(Y) \ B_L(Y).

When the lower and upper approximations are equal, B_L(Y) = B^U(Y), there are no inconsistencies and the set is called a crisp set. Another way to identify the roughness of the set is using measures. The accuracy of approximation measure is given by:

α(Y) = |B_L(Y)| / |B^U(Y)|

where |Y| denotes the cardinality of Y ≠ ∅ and 0 ≤ α(Y) ≤ 1. If α(Y) = 1, Y is crisp with respect to B; otherwise, Y is a Rough Set.
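To make these definitions concrete, the following C sketch computes B_L(Y), B^U(Y) and α(Y) on a small decision table with one inconsistency (it reuses the six original features of the running example of Section 3, before the inconsistency is fixed). The indiscernibility test, the data and the identifiers are illustrative assumptions, not the authors' implementation.

#include <stdio.h>

#define N_OBS 5
#define N_ATT 6

/* decision table: X1..X6 of the running example, before the 'jnsq' fix;
   O2 and O3 are inconsistent (same values, different classes)           */
static const int x[N_OBS][N_ATT] = {
    {1,1,0,1,1,1},  /* O1, class 0 */
    {1,0,1,1,1,0},  /* O2, class 0 */
    {1,0,1,1,1,0},  /* O3, class 1 */
    {1,0,1,0,0,1},  /* O4, class 1 */
    {0,0,1,0,0,0}   /* O5, class 1 */
};
static const int cls[N_OBS] = {0,0,1,1,1};

/* objects a and b are indiscernible w.r.t. the attribute subset B (bitmask) */
static int indisc(int a, int b, unsigned B) {
    for (int j = 0; j < N_ATT; j++)
        if ((B >> j) & 1u)
            if (x[a][j] != x[b][j]) return 0;
    return 1;
}

int main(void) {
    unsigned B = 0x3F;          /* B = {X1,...,X6}          */
    int y_class = 1;            /* Y = objects of class 1   */
    int lower = 0, upper = 0;

    for (int a = 0; a < N_OBS; a++) {
        int all_in_Y = 1, some_in_Y = 0;
        for (int b = 0; b < N_OBS; b++) {
            if (!indisc(a, b, B)) continue;
            if (cls[b] == y_class) some_in_Y = 1; else all_in_Y = 0;
        }
        if (all_in_Y) lower++;    /* a belongs to B_L(Y) */
        if (some_in_Y) upper++;   /* a belongs to B^U(Y) */
    }
    printf("|B_L(Y)|=%d |B^U(Y)|=%d alpha=%.2f\n",
           lower, upper, (double)lower / upper);
    return 0;
}

For this table the inconsistent pair {O2, O3} falls in the boundary region, so the program prints |B_L(Y)| = 2, |B^U(Y)| = 4 and α(Y) = 0.50.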
The goal of Rough Sets is to discover decision rules from dataset D using the minimum number of attributes required to explain all the observations. Discovering the minimum number of attributes is an NP-hard problem. One of the following techniques is normally used: Reduction by Heuristics or the Discernibility Matrix.

In Reduction by Heuristics, the search for a core is given by the following procedure: at each iteration, one attribute is removed and the augmentation of inconsistency is verified. As already referred, an inconsistency occurs when two or more observations have the same values for all attributes but belong to different decision classes. If the inconsistency does not increase, the attribute should be removed. When no further attributes can be removed, the remaining ones are considered indispensable and thus the core is found.

Using a discernibility matrix of D, denoted by M, an (n×n) matrix is defined as follows, where M(i,j) = ∅ denotes that the case does not need to be considered:

M(i,j) = {x ∈ X: x(oi) ≠ x(oj)} if c(oi) ≠ c(oj), and M(i,j) = ∅ otherwise.

The discernibility matrix keeps the distinct attributes for each pair of observations belonging to different classes. The discernibility function F(B) is a Boolean function, written in conjunctive normal form, that is, a conjunction of disjunction clauses. F(B) determines the minimum subset of attributes that allows the differentiation of classes:

F(B) = ∧ { ∨ M(i,j): i, j = 1, 2, …, n; M(i,j) ≠ ∅ }.

The F(B) decision problem is equivalent to the Satisfiability problem (SAT), which was the first known example of an NP-complete problem.

Logical Analysis of Data (LAD)

The LAD method developed by Peter Hammer's group [Crama et al. 1988], [Boros et al. 2000] refers to the discovery of the minimum number of attributes that are necessary to explain all observations and the detection of hidden patterns in a dataset with two classes. The method works on binary data. Let D be the dataset of all observations; each observation is described by several attributes and belongs to a class. An extension of the Boolean approach is needed when nominal non-binary attributes are used. The binarization (or discretization) of these attributes is performed by associating a string of Boolean variables to each attribute.

Dataset D is given as a set D+ of 'positive' observations and a set D− of 'negative' observations, where D = D+ ∪ D− and the sets are disjoint, D+ ∩ D− = ∅. Observations are classified as positive or negative based on a hidden function, and the goal of the LAD method is to approximate this hidden function with a union of intervals. In order to systematize the process, a disjoint matrix [a(i,j)] is defined as:

a(i,j) = 1 if xj(oa) ≠ xj(ob), with C(oa) ≠ C(ob) and (oa, ob) ∈ O×O, where i indexes the pair (oa, ob); a(i,j) = 0 otherwise.

Given the disjoint matrix, the Set Covering optimization problem is used to find the selected attributes.

2.3. Omics Datasets

Omics is a neologism associated with biology suffixes, as in genomics, transcriptomics, proteomics, or metabolomics. Omics datasets are typically characterized by high dimensionality and small samples, that is, a large number of features and few observations.
One prime example is data from DNA sequencing, especially those generated by the third generation, also called next-generation or high-throughput sequencing machines. A typical genome differs from the reference human genome in 4.1 to 5.0 million variants. For large cohorts, the number of genetic variants is even higher. The 1000 Genomes Project sequenced the genomes of 2504 individuals from 26 populations, producing a total of 88 million genetic variants [The 1000 Genomes Project Consortium, 2015].

Presently, the amount of sequencing data is doubling every seven months [Stephens et al. 2015] and, with the advent of new, high-throughput sequencing machines, the growth is expected to accelerate even more. When combined with the high dimensionality of the datasets, this growth makes the development of scalable and accurate feature selection methods even more essential.

Since biomedical genome sequence datasets are subject to restricted access, we generated artificial datasets with similar characteristics using SeqSIMLA [Yao, Chung 2016; Chung et al. 2014] (http://seqsimla.sourceforge.net/). SeqSIMLA generates simulated sequencing and phenotype data for case-control cohorts with allele frequencies and linkage disequilibrium patterns based on real datasets. The user can specify around 20 parameters, like disease prevalence or odds ratio for the marginal effects or interactions.

2.4. Heuristic Decomposition Approach

To solve a problem with hundreds of thousands of attributes, a Decomposition Heuristic is introduced. In Linear Programming optimization some approaches can be mentioned, such as Column Generation, Dantzig-Wolfe decomposition and Benders decomposition. A detailed introduction can be found in [Boyd et al. 2008].

Decomposition is the act of breaking a large problem into sub-problems. Decomposition of a problem is possible if the structure of the problem is maintained. If problem A is separable into sub-problems A^1, A^2, …, A^max_k, it can also run in a parallel computer environment. Instead of minimizing the evaluation function f(A), the decomposition method minimizes the sub-problems f(A^1), f(A^2), ..., f(A^max_k) and finally minimizes the master problem using a function g() that aggregates the solutions of the sub-problems. Algorithm 1 presents the parallel version and Algorithm 2 the sequential version.

Algorithm 1: General Heuristic Decomposition (parallel version)
Input: sub-problems A^1, A^2, …, A^max_k
Output: solution of the master-problem algorithm
1. solve the sub-problems:
   sub-problem 1: y1 = minimize f(A^1)
   sub-problem 2: y2 = minimize f(A^2)
   ...
   sub-problem max_k: y_max_k = minimize f(A^max_k)
2. solve the master-problem:
   master = minimize g(y1, y2, …, y_max_k)

In the sequential version of the heuristic decomposition, Algorithm 2 updates the master-problem at each iteration by adding the information from the solved sub-problem.

Algorithm 2: General Heuristic Decomposition (sequential version)
Input: sub-problems A^1, A^2, …, A^max_k
Output: solution of the master-problem algorithm
1. for k = 1 to max_k
   1.1. solve sub-problem: yk = minimize f(A^k)
   1.2. solve master-problem: master = minimize g(master, yk)
2. end for

The decomposition principle is applied in exact methods and in metaheuristic methods [Joncour et al. 2010]. Given the large number of attributes, a heuristic decomposition approach, as coined by [Smet et al. 2016], is used in this work; a minimal sketch of the sequential pattern is given below.
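The sketch follows the structure of Algorithm 2 in C, with f and g reduced to toy stubs (each sub-problem returns a block minimum and the master keeps the best value). The data, the stubs and the aggregation rule are illustrative assumptions only.

#include <stdio.h>

#define MAX_K 4

/* toy sub-problem data: each A^k is just a small block of values */
static const int A[MAX_K][3] = {
    {7, 3, 9}, {4, 8, 6}, {5, 2, 11}, {10, 1, 12}
};

/* f(A^k): solve sub-problem k (here: take the block minimum) */
static int solve_sub(int k) {
    int best = A[k][0];
    for (int i = 1; i < 3; i++)
        if (A[k][i] < best) best = A[k][i];
    return best;
}

/* g(master, y_k): aggregate a sub-solution into the master solution */
static int aggregate(int master, int yk) {
    return (yk < master) ? yk : master;
}

int main(void) {
    int master = 1 << 30;                  /* "infinity"              */
    for (int k = 0; k < MAX_K; k++) {      /* Algorithm 2, step 1     */
        int yk = solve_sub(k);             /* 1.1 solve sub-problem   */
        master = aggregate(master, yk);    /* 1.2 update master       */
    }
    printf("master solution = %d\n", master);
    return 0;
}

In the parallel version of Algorithm 1, the loop body of step 1 would simply run on different processors, since the sub-problems do not share any data.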
In Section 3 and Section 4, we detail the sub-problem and the master problem algorithms of the Logical Analysis of Inconsistent Data to deal with large datasets.

3. Sub-problem Algorithm

As already stated, in this work we propose a feature selection technique to deal with inconsistent data in large datasets. The inconsistent data is processed using the LAID algorithm. The sub-problem algorithm returns two criteria, the information about the quality of the filter (number of features) and the prediction quality (accuracy). Thus, the output of this filter method is the pair (number of features, accuracy).

To solve the sub-problem, in Algorithm 3 we describe the LAID algorithm for the bi-criteria feature selection. Firstly, to remove any inconsistency, we add a dummy binary variable named "je ne sais quoi", 'jnsq', since two observations with the same attributes that belong to different classes work against the consistency measure. In the second step, the Disjoint Matrix [Ai,j] and the Cost Vector [cj] are created, both based on the definition of the distance measure. In the third step, a heuristic for the Set Covering problem is applied, returning the reduced set of features. In the fourth step, a cross-validation method is applied to retrieve the accuracy of the solution. Finally, the algorithm returns the pair (number of features, accuracy). The algorithm can be specified as follows:

Algorithm 3: Logical Analysis of Inconsistent Data
Input: dataset D = {O, X∪C} with binary variables
Output: (number of features, accuracy)
1. check data inconsistencies and add dummy variable 'jnsq' as a discriminant feature
2. disjoint matrix generation [Ai,j] and cost vector [cj]
3. number of features = Minimum Set Covering Problem
4. accuracy = Cross-validation

To exemplify the different steps of the algorithm, a running example with six features and five observations is used.

3.1. Data inconsistencies

Given the inconsistency of two observations Oa and Ob with the same values for all the attributes, X(Oa) = X(Ob), and belonging to different classes, a dummy binary variable "je ne sais quoi" [Cavique et al. 2013] is added, assigning a value of 1 whenever the class is 1:

jnsq_a = 1 if X(Oa) = X(Ob) ∧ class(Oa) ≠ class(Ob) ∧ class(Oa) = 1, and jnsq_a = 0 otherwise.

Instead of keeping the data inconsistencies through the complex approximations of Rough Sets, LAID neither corrects nor excludes the inconsistency: it adds a dummy variable that allows the subsequent application of the other steps of the LAD method.

Given the dataset with six features {X1, …, X6} and five observations {O1, …, O5}, the first step of LAID is verifying the existence of inconsistencies in the dataset. For each inconsistency the dummy variable 'jnsq' is assigned '1' in class 1. In Table 1 the dummy variable 'jnsq' is added as the last feature column to avoid the data inconsistency between observations O2 and O3, which present the same values for all the features. A sketch of this step is shown after the table.

Table 1. Variable 'je ne sais quoi' is added
observation\feature   X1  X2  X3  X4  X5  X6  X7 (jnsq)  Class
O1                     1   1   0   1   1   1   0          0
O2                     1   0   1   1   1   0   0          0
O3                     1   0   1   1   1   0   1          1
O4                     1   0   1   0   0   1   0          1
O5                     0   0   1   0   0   0   0          1
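The following C fragment sketches step 1 of Algorithm 3 on the data of Table 1 (before the 'jnsq' column is added): it scans the class-1/class-0 pairs and raises the dummy variable for the class-1 member of every inconsistent pair. Identifiers and layout are illustrative, not the authors' code.

#include <stdio.h>

#define N_OBS  5
#define N_FEAT 6

static const int x[N_OBS][N_FEAT] = {
    {1,1,0,1,1,1}, {1,0,1,1,1,0}, {1,0,1,1,1,0},
    {1,0,1,0,0,1}, {0,0,1,0,0,0}
};
static const int cls[N_OBS] = {0,0,1,1,1};

int main(void) {
    int jnsq[N_OBS] = {0};
    /* step 1 of Algorithm 3: detect inconsistent pairs and set 'jnsq' */
    for (int a = 0; a < N_OBS; a++)
        for (int b = 0; b < N_OBS; b++) {
            if (cls[a] != 1 || cls[b] != 0) continue;  /* jnsq marks class 1 */
            int equal = 1;
            for (int j = 0; j < N_FEAT; j++)
                if (x[a][j] != x[b][j]) { equal = 0; break; }
            if (equal) jnsq[a] = 1;
        }
    for (int a = 0; a < N_OBS; a++)
        printf("O%d jnsq=%d\n", a + 1, jnsq[a]);   /* only O3 gets jnsq=1 */
    return 0;
}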
3.2. Disjoint Matrix Generation [Ai,j] and Cost Vector [cj]

In the second step the Cost Vector [cj] and the Disjoint Constraints Matrix [Ai,j] are created. Both the Disjoint Matrix [Ai,j] and the Cost Vector [cj] use the concept of distance measure for independent criteria in feature selection. The Cost Vector [cj] is obtained by adding the pair comparisons within the same class. This calculation is used to measure the general incoherence, or noise, of feature j. On the other hand, the Disjoint Matrix [Ai,j] is obtained by using the pair comparisons belonging to different classes. The disjoint matrix [Ai,j], for each attribute Xj and pair of observations (Oa, Ob), is defined as:

Ai,j = 1 if Xj(Oa) ≠ Xj(Ob), with class(Oa) ≠ class(Ob) and (Oa, Ob) ∈ O×O, where i indexes the pair (Oa, Ob); Ai,j = 0 otherwise.

The dimension of index i in matrix [Ai,j] depends on the constraint structure of dataset D. Each constraint results from the comparison of two different arbitrary observations Oa and Ob that belong to distinct classes. If one attribute j is different in the observations Oa and Ob, the value of Ai,j takes value 1, denoting that at least one column (attribute) j must be maintained in order to differentiate the rows (or constraints) i. The cost vector [cj] and the disjoint matrix [Ai,j] will be used as input to the Minimum Set Covering problem, where all constraints (or lines i) must be covered at least once by some attribute.

Table 2 presents the pairs of observations of the same class. For each pair, the features of observations of the same class should be equal to avoid disagreement. The sum of the disagreements of each feature is called cost and is stored in the Cost Vector [cj]. As stated before, for observations Oa, Ob and attribute Xj, the value of disagreement is given by the operator XOR, which returns True if (Oa, Xj) ≠ (Ob, Xj) and False otherwise.

Table 2. Cost vector [cj] shows the disagreement within the same class
Pairs same class   X1  X2  X3  X4  X5  X6  X7 (jnsq)
O1,O2               0   1   1   0   0   1   0
O3,O4               0   0   0   1   1   1   1
O3,O5               1   0   0   1   1   0   1
O4,O5               1   0   0   0   0   1   0
cost [cj]           2   1   1   2   2   3   2

To obtain the disjoint matrix [Ai,j], for each pair of observations of different classes, the features should be different in order to be able to discriminate the classes. The number of lines generated in [Ai,j] is N×M, where N and M are the number of observations of class 0 and class 1, respectively. The value kj corresponds to the number of lines covered by each feature j.

Table 3 presents the pairs of observations that belong to different classes, where the logic operator XOR can also be applied, since, if (Oa, Xj) ≠ (Ob, Xj), feature Xj should be chosen because it distinguishes the two classes.

Table 3. Matrix [Ai,j] shows the disjointness between different classes
Pairs different classes   X1  X2  X3  X4  X5  X6  X7 (jnsq)
1 - O1,O3                  0   1   1   0   0   1   1
2 - O1,O4                  0   1   1   1   1   0   0
3 - O1,O5                  1   1   1   1   1   1   0
4 - O2,O3                  0   0   0   0   0   0   1
5 - O2,O4                  0   0   0   1   1   1   0
6 - O2,O5                  1   0   0   1   1   0   0
kj                         2   3   3   4   4   3   2

In this running example the unitary cost of each column is f(cj, kj) = cj/kj. The unitary cost f(cj, kj) is shown in Table 4.

Table 4. Unitary cost of each column f(cj, kj)
Features   X1  X2   X3   X4   X5   X6  X7 (jnsq)
cj          2   1    1    2    2    3   2
kj          2   3    3    4    4    3   2
cj/kj       1  1/3  1/3  1/2  1/2   1   1

In this example two classes were used. However, the method can be extended to multiple classes. In that extension, observations of different classes should be compared in the generation of the disjoint matrix [Ai,j]. A sketch of the construction of [cj], [Ai,j] and the unitary costs is given below.
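The construction of both structures can be sketched as a single pass over the observation pairs. The C fragment below, with our own identifiers rather than the authors' implementation, reproduces the values of Tables 2-4 for the running example.

#include <stdio.h>

#define N_OBS    5
#define N_FEAT   7
#define MAX_ROWS (N_OBS * N_OBS)

static const int x[N_OBS][N_FEAT] = {           /* Table 1, X7 = jnsq */
    {1,1,0,1,1,1,0}, {1,0,1,1,1,0,0}, {1,0,1,1,1,0,1},
    {1,0,1,0,0,1,0}, {0,0,1,0,0,0,0}
};
static const int cls[N_OBS] = {0,0,1,1,1};

int main(void) {
    int c[N_FEAT] = {0}, k[N_FEAT] = {0};
    int A[MAX_ROWS][N_FEAT], rows = 0;

    for (int a = 0; a < N_OBS; a++)
        for (int b = a + 1; b < N_OBS; b++) {
            if (cls[a] == cls[b]) {
                /* same class: XOR disagreements accumulate the cost c_j */
                for (int j = 0; j < N_FEAT; j++)
                    c[j] += (x[a][j] != x[b][j]);
            } else {
                /* different classes: one covering constraint (row of A) */
                for (int j = 0; j < N_FEAT; j++) {
                    A[rows][j] = (x[a][j] != x[b][j]);
                    k[j] += A[rows][j];
                }
                rows++;
            }
        }

    printf("rows of A: %d\n", rows);             /* 6 in the example */
    printf("c_j : "); for (int j = 0; j < N_FEAT; j++) printf("%d ", c[j]);
    printf("\nk_j : "); for (int j = 0; j < N_FEAT; j++) printf("%d ", k[j]);
    printf("\nc/k : "); for (int j = 0; j < N_FEAT; j++)
        printf("%.2f ", (double)c[j] / k[j]);
    printf("\n");
    return 0;
}

Running it prints cj = (2,1,1,2,2,3,2), kj = (2,3,3,4,4,3,2) and the unitary costs of Table 4.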
3.3. Minimum Set Covering Problem

In the third step of LAID a heuristic is applied to solve the Set Covering Problem. Given the matrix [Ai,j] and variables yj that denote which columns/features are selected, the optimization problem that finds the minimum number of columns/features covering all the rows is the Set Covering problem, which can be defined in binary linear programming as:

Minimize f = ∑j cj·yj
Subject to ∑j Ai,j·yj ≥ 1, ∀i
yj ∈ {0,1}, j = 1, …, m

In this sub-section, a greedy heuristic approach is used to solve the Minimum Set Covering problem, which reuses the algorithm proposed in [Chvatal 1979]. Algorithm 4 is composed of two phases: a constructive phase that adds columns until all the lines are covered, and a second phase that removes the redundant columns.

Algorithm 4: Set Covering Problem
Input: sub-matrix [Ai,j] and vector [cj]
Output: a subset of attributes with minimum cost
-- Constructive Phase
1. Initialize R = [Ai,j], S = ∅
2. While R ≠ ∅ do
   2.1. Choose the best line i* ∈ R such that |Ai*,j| = min |Ai,j|, ∀j
   2.2. Choose the best column j* that covers line i*, considering f(cj, kj)
   2.3. Update R and S: R = R\Ai,j*, ∀i, S = S∪{j*}
3. End while
-- Remove redundant columns
4. Sort cover S by descending order of costs
5. For each Si do: if (S\Si is still a cover) then S = S\Si
6. Return S

In the Constructive Phase the algorithm chooses the line i* with the fewest elements (step 2.1). Next it selects the column j* that covers line i* with the lowest cost considering f(cj, kj) (step 2.2). In the first iteration over the features {X1, X2, X3, X4, X5, X6, X7} the solution {X2} is obtained, with cost 3. This solution is not admissible since there are 3 uncovered lines. In the second iteration X4 is added to the solution, with a global cost of 7, where one line remains uncovered. In the third iteration the admissible solution {X2, X4, X7} is found, with cost 9. The constructive phase ends and the redundancy removal phase takes place, where a new admissible solution, {X4, X7}, is found with cost 6. A C sketch of this heuristic is given below, after Table 5.

Given the subset {X4, X7}, Table 5 presents the reduced dataset. The number of features is reduced from 7 to 2 and the number of observations is also reduced from 5 to 3. Nearly all the combinations of the 2 features are present, except for X4=0 and X7=1, for which no observations occur.

Table 5. Reduced dataset
observation\feature   X4  X7 (jnsq)  Class
O1, O2                 1   0          0
O3                     1   1          1
O4, O5                 0   0          1
no observations        0   1          ?
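The sketch below applies a Chvatal-style greedy cover to the 6×7 disjoint matrix of Table 3, using the unitary costs cj/kj of Table 4. It is an illustration of the constructive phase plus redundancy removal under our own tie-breaking and naming choices, not the authors' exact implementation, so the intermediate selections may differ from the narrative above.

#include <stdio.h>

#define ROWS 6
#define COLS 7

/* Table 3: disjoint matrix [Ai,j], rows = pairs of different classes */
static const int A[ROWS][COLS] = {
    {0,1,1,0,0,1,1}, {0,1,1,1,1,0,0}, {1,1,1,1,1,1,0},
    {0,0,0,0,0,0,1}, {0,0,0,1,1,1,0}, {1,0,0,1,1,0,0}
};
static const double cost[COLS] = {1.0, 1/3.0, 1/3.0, 0.5, 0.5, 1.0, 1.0};

/* does the selection still cover every row if column 'skip' is dropped? */
static int covers_all(const int sel[], int skip) {
    for (int i = 0; i < ROWS; i++) {
        int ok = 0;
        for (int j = 0; j < COLS; j++)
            if (sel[j] && j != skip && A[i][j]) { ok = 1; break; }
        if (!ok) return 0;
    }
    return 1;
}

int main(void) {
    int covered[ROWS] = {0}, sel[COLS] = {0};

    /* constructive phase: scarcest uncovered row, then cheapest column */
    for (;;) {
        int best_row = -1, best_size = COLS + 1;
        for (int i = 0; i < ROWS; i++) {
            if (covered[i]) continue;
            int size = 0;
            for (int j = 0; j < COLS; j++) size += A[i][j];
            if (size < best_size) { best_size = size; best_row = i; }
        }
        if (best_row < 0) break;                 /* every row is covered */
        int best_col = -1;
        for (int j = 0; j < COLS; j++)
            if (A[best_row][j] && (best_col < 0 || cost[j] < cost[best_col]))
                best_col = j;
        sel[best_col] = 1;
        for (int i = 0; i < ROWS; i++)
            if (A[i][best_col]) covered[i] = 1;
    }

    /* removal phase: drop redundant columns, most expensive first */
    for (;;) {
        int worst = -1;
        for (int j = 0; j < COLS; j++)
            if (sel[j] && (worst < 0 || cost[j] > cost[worst]) && covers_all(sel, j))
                worst = j;
        if (worst < 0) break;
        sel[worst] = 0;
    }

    printf("selected features:");
    for (int j = 0; j < COLS; j++) if (sel[j]) printf(" X%d", j + 1);
    printf("\n");                                /* prints: X4 X7 */
    return 0;
}

With this particular line-selection and tie-breaking order, the constructive phase already returns {X7, X4} and the removal phase keeps both, so the final cover coincides with the {X4, X7} obtained above.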
3.4. Leave-one-out cross-validation

The main method for supervised learning model validation is cross-validation. The cross-validation technique involves the partition of the sample dataset into n subsamples, using (n-1) of them for repeated training and the remaining one for testing. In this work, we adopt the special case of Leave-One-Out cross-validation, which consists in removing one observation from the original sample, training the algorithm with the remaining observations and then testing the removed observation with the resulting model. The input for the Leave-One-Out cross-validation is the reduced dataset and not the original one, which results in a reduced computational effort.

To compare an observation i with the testing observation x, we use the Hamming distance H(i,x). The Hamming distance between the known observation O1=(1,1,0,0) and the testing observation x=(1,0,1,0) is H(1,x)=2, since there are two differences, namely in the second and third elements.

In the predicting process, the original LAD algorithm, which compares observations belonging to two disjoint classes, uses the so-called discriminant function, which is similar to the k-nearest neighbor classification algorithm. In the k-nearest neighbor prediction algorithm, the k observations with the lowest Hamming distance to the testing observation are chosen. As the class of each of these observations is known, we can count the number of observations of class 0 and class 1. The prediction of the class of x is given by the most frequent class, which in statistics is the mode. Therefore, the predicted class of x is the mode of the classes of the k observations:

predicted_class(x) = mode(class(O1), class(O2), …, class(Ok))

The performance of a classification model is based on the total counts of correct and incorrect predictions. These counts are tabulated in a table known as the Confusion Matrix. The Confusion Matrix is a square table, actual class versus predicted class, which allows the visualization of the performance of an algorithm. Based on the Confusion Matrix, many performance measures can be extracted. In this work the accuracy, or hit-rate, measure is used, expressed by:

accuracy = (number of correct predictions) / (total number of predictions)

The sub-problem algorithm returns the pair (number of features, accuracy). The number of features is given by Algorithm 4, the Set Covering Problem, and the accuracy by the Leave-one-out cross-validation, sketched below. This sub-problem algorithm is used by the Heuristic Decomposition presented in the next section.
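The fragment below sketches Leave-One-Out with a Hamming-distance k-nearest neighbor vote (k = 3, as in the experiments of Section 5) over the running example. The identifiers, the tie handling and the toy data are our assumptions; with only five observations the resulting estimate is obviously not meaningful and only illustrates the mechanics of the accuracy computation.

#include <stdio.h>

#define N_OBS  5
#define N_FEAT 7
#define K      3

static const int x[N_OBS][N_FEAT] = {
    {1,1,0,1,1,1,0}, {1,0,1,1,1,0,0}, {1,0,1,1,1,0,1},
    {1,0,1,0,0,1,0}, {0,0,1,0,0,0,0}
};
static const int cls[N_OBS] = {0,0,1,1,1};

static int hamming(const int *a, const int *b) {
    int d = 0;
    for (int j = 0; j < N_FEAT; j++) d += (a[j] != b[j]);
    return d;
}

/* predict the class of observation t from the remaining observations */
static int knn_predict(int t) {
    int used[N_OBS] = {0}, votes[2] = {0};
    for (int n = 0; n < K; n++) {              /* pick the K nearest */
        int best = -1;
        for (int i = 0; i < N_OBS; i++) {
            if (i == t || used[i]) continue;
            if (best < 0 || hamming(x[i], x[t]) < hamming(x[best], x[t]))
                best = i;
        }
        used[best] = 1;
        votes[cls[best]]++;
    }
    return (votes[1] > votes[0]) ? 1 : 0;      /* mode of the K classes */
}

int main(void) {
    int correct = 0;
    for (int t = 0; t < N_OBS; t++)            /* leave-one-out loop */
        if (knn_predict(t) == cls[t]) correct++;
    printf("accuracy = %.2f\n", (double)correct / N_OBS);
    return 0;
}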
4. Heuristic Decomposition and Master-problem Algorithm

In many Linear Programming optimization problems, the constraints and variables may be decomposed into blocks, where each constraint set only involves one variable set, as in the following structure:

Minimize   x1 + x2 + x3 + x4 + x5 + x6
Subject to x1 + x2 ≥ 1
           x3 + x4 ≥ 1
           x5 + x6 ≥ 1

As our problem is not so regular, we must resort to a Heuristic Decomposition for the Set Covering problem. The matrix [Ai,j] is divided into max_k sub-problems, resulting in max_k sub-matrices [Ai,j]^k. The structure of the problem is maintained since each sub-problem returns a feasible (or admissible) solution. Comparing this approach with meta-heuristics, the sub-matrices correspond to different neighborhoods in Variable Neighborhood Search or to new regions in the diversification process of Tabu Search, where feasible solutions can be found. Each sub-problem returns the pair (number of features, accuracy) for the bi-criteria feature selection, where the goal is to minimize the number of features and to maximize the accuracy.

4.1. Heuristic Decomposition

The original problem with one million features is very hard to solve, given its dimensionality. In the decomposition each sub-problem should return an admissible solution, that is, all lines should be covered by the chosen features/columns. Given the dimension of each sub-problem, the value of max_k can be found. As each sub-problem returns a feasible solution, the problem is perfectly separable, or independent. A parallel version of the Heuristic Decomposition for the bi-criteria feature selection is implemented. Algorithm 5 reuses the LAID procedure, described in Algorithm 3, and we consider the following notation:
nf – number of features
ar – accuracy

Algorithm 5: Heuristic Decomposition for the Bi-criteria Feature Selection
Input: sub-problems A[i,j]^1..max_k
Output: Pareto front with pairs (nf, ar)
1. solve the sub-problems:
   sub-problem 1: (nf^1, ar^1) = LAID(A[i,j]^1)
   sub-problem 2: (nf^2, ar^2) = LAID(A[i,j]^2)
   ...
   sub-problem max_k: (nf^max_k, ar^max_k) = LAID(A[i,j]^max_k)
2. solve the master-problem:
   find the Pareto optimal set of ((nf^1, ar^1), (nf^2, ar^2), …, (nf^max_k, ar^max_k))

In our Heuristic Decomposition for feature selection, using parallel processing, the matrix A[i,j] is divided into max_k sub-matrices. Feasible sub-solutions are returned by the LAID procedure with the metrics number of features and accuracy. Each pair (number of features, accuracy) corresponds to a point in a bi-objective optimization problem. To find the Pareto optimal set, the master problem bi-criteria decision making is detailed in sub-section 4.2.

4.2. Bi-objective Master-problem Algorithm

Multi-objective optimization (MOO), or multi-criteria optimization, deals with the optimization of two or more conflicting objectives, subject to a set of constraints. Given the vector x=(x1, …, xm) of decision variables, M the number of objectives and J the number of constraints, multi-objective optimization can be stated as follows [Collette, Siarry 2011]:

Minimize / Maximize fm(x), m = 1, 2, …, M (objectives)
Subject to cj(x) {≤, =, ≥} 0, j = 1, 2, …, J (constraints)
and xi ≥ 0

In multi-objective optimization more than one optimal solution can be obtained, whereas classic optimization has only one objective. The variable S represents the set of feasible solutions associated with the equality and inequality constraints. F(x) = (f1(x), f2(x), ..., fm(x)) is the vector of objectives to be optimized.

In multi-objective optimization the dominance concept is central. In a maximization problem, the objective vector u=(u1, …, um) dominates v=(v1, …, vm), denoted u > v, if and only if ui ≥ vi for all i and ui > vi for at least one i. A solution is non-dominated, or a Pareto solution, if and only if there is no solution that dominates it. In other words, a solution x* ∈ S is Pareto optimal if for every x ∈ S, F(x) does not dominate F(x*). The Pareto optimal set, P*, includes all the Pareto solutions, i.e., the set of all solutions whose associated vectors are non-dominated. When plotted in objective space, the non-dominated vectors are collectively known as the Pareto front, PF*. The procedure to generate the Pareto front is to compute as many points as possible and then build a surface that includes those points. The surface created by the Pareto front can be linear, convex or concave. The Pareto front is defined as PF* = {F(x), x ∈ P*}. The ideal vector contains the best solutions considering the m objectives separately, at the same point. A point y* = (y*1, y*2, ..., y*m) is an ideal vector if it optimizes each objective function fi in F(x).

Given the large number of feasible solutions in MOO problems, it is important to differentiate two stages: the optimization of the objective functions and the decision-making process. Three strategies are commonly used: (i) a priori, making decisions before optimizing; (ii) a posteriori, optimizing before making decisions; and (iii) compromising between optimizing and decision making.

In this work, given the criteria number of features and accuracy, the MOO is reduced to a bi-objective optimization problem. Our approach is formulated as a Bi-objective Feature Selection, such that the aim is to:
- minimize the number of features (f1) and
- maximize the accuracy of the reduced set of features (f2).

Figure 1, adapted from [Talbi 2009], shows a bi-objective optimization. The black circles of the Pareto solution dominate the solutions represented by triangles. The efficient front is given by the curve that includes all the Pareto solutions. In this case we want to minimize f1 and maximize f2. The ideal solution is obtained by the combination of the solution that minimizes f1 with a second solution that maximizes f2.

Figure 1. Pareto set to minimize f1 (number of features) and maximize f2 (accuracy)

After the generation of the bi-objective sub-solutions by the sub-problems, the master problem chooses the non-dominated solutions and includes them in the Pareto optimal set. The Pareto set contains only the solutions with the least number of features and the best accuracy, which substantially reduces the number of bi-criteria solutions that will be analyzed by the decision maker, in this case the omics dataset analyst. Regarding the optimization and decision-making stages, we select the a posteriori strategy, which does not require previous information from the decision maker. A sketch of this non-dominance filter is given below.
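The master step can be sketched as a plain pairwise dominance filter over the (nf, ar) pairs returned by the sub-problems, minimizing nf and maximizing ar. The sample pairs below are invented for illustration; they merely echo the shape of the results reported in Section 5.

#include <stdio.h>

#define MAX_K 6

/* sub-solutions (nf = number of features, ar = accuracy); invented values */
static const int    nf[MAX_K] = {19,   25,   19,   30,   19,   22};
static const double ar[MAX_K] = {0.84, 0.80, 0.84, 0.83, 0.84, 0.82};

/* does solution a dominate solution b? (minimize nf, maximize ar) */
static int dominates(int a, int b) {
    if (nf[a] > nf[b] || ar[a] < ar[b]) return 0;
    return (nf[a] < nf[b] || ar[a] > ar[b]);
}

int main(void) {
    printf("Pareto optimal set:\n");
    for (int i = 0; i < MAX_K; i++) {
        int dominated = 0;
        for (int j = 0; j < MAX_K; j++)
            if (j != i && dominates(j, i)) { dominated = 1; break; }
        if (!dominated)
            printf("  (nf=%d, ar=%.2f)\n", nf[i], ar[i]);
    }
    return 0;   /* keeps only the three (19, 0.84) points */
}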
4.3. Parallel processing for heuristic decomposition and MOO

The rapid increase of powerful processors and fast networks has enabled the emergence of computer clusters, large-scale networks of machines (grids) and platforms for high-performance computing (HPC). In terms of designing parallel heuristics, three major parallel levels are identified: the algorithmic level, the interaction level and the solution level [Talbi 2009].

- At the algorithm level, different instances of the heuristic run independently on different processors, and the search is equivalent to the sequential execution. The sub-problems are independent, and the sub-solutions are not shared among the processors. The master problem collects the sub-solutions returned by the sub-problems.
- At the interaction level, the sub-problems are independent, although the master problem reuses the sub-solutions by sharing them with the different processors. The main objective of this level is to speed up the algorithm by reducing the search time. This approach can be applied in population-based heuristics.
- Finally, at the solution level, the problem is dependent, i.e., not separable, and so processors return partial (or non-feasible) sub-solutions. The master algorithm should ensure the feasibility of the final solution.

Regarding the parallel levels, we opt for the algorithm level because only feasible sub-solutions are returned from the sub-problems; a minimal threaded sketch is given at the end of this section.

In this work, parallel processing perfectly matches large volumes of data and decomposition techniques applied to the feature selection objectives and to MOO. To deal with large volumes of omics data, decomposition is a strategy that combines with parallel processing. Large omics datasets are parceled out by the heuristic decomposition approach. Then the heuristic sub-problems can run in parallel, resulting in the sequence schematized as follows:

large omics datasets → heuristic decomposition → parallel processing

On the other hand, the feature selection with two objectives influences the MOO approach. MOO collects multiple solutions, which combines perfectly with parallel processing, which also deals with many solutions simultaneously. Given the pair (number of features, accuracy), MOO can be applied. Then MOO can run in parallel processing, resulting in the following sequence:

pair (number of features, accuracy) → MOO → parallel processing

Parallel processing integrates the two concepts, computing multiple solutions for the decomposed heuristic and for MOO. The result is unified in the Pareto optimal set computed by the master algorithm.
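Algorithm-level parallelism can be sketched with POSIX threads: each thread solves one independent sub-problem and the master thread only joins the threads and collects the (nf, ar) pairs. The sub-problem solver is a placeholder stub here, and the identifiers are ours; this is an assumption-laden illustration, not the cluster implementation used in Section 5.

/* compile with: gcc -pthread ... */
#include <pthread.h>
#include <stdio.h>

#define MAX_K 4

typedef struct { int k; int nf; double ar; } sub_result;

/* stub for the LAID sub-problem solver: returns (nf, ar) for block k */
static void *solve_sub(void *arg) {
    sub_result *r = (sub_result *)arg;
    r->nf = 19 + r->k;               /* placeholder values */
    r->ar = 0.80 + 0.01 * r->k;
    return NULL;
}

int main(void) {
    pthread_t tid[MAX_K];
    sub_result res[MAX_K];

    /* algorithm level: sub-problems run independently, nothing is shared */
    for (int k = 0; k < MAX_K; k++) {
        res[k].k = k;
        pthread_create(&tid[k], NULL, solve_sub, &res[k]);
    }
    /* master step: join and collect the bi-objective sub-solutions */
    for (int k = 0; k < MAX_K; k++) {
        pthread_join(tid[k], NULL);
        printf("sub-problem %d: (nf=%d, ar=%.2f)\n", k, res[k].nf, res[k].ar);
    }
    return 0;
}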
5. Computational results

To obtain the computational results of this algorithm, some choices had to be made, such as the computational environment, the datasets and the performance measures. The computer programs were written in the C language and the GCC/Dev-C++ compiler was used. The computational results were obtained on an Intel Core Duo CPU 3.0 GHz processor with 4.0 GB of main memory running under the Windows 10 operating system. The INCD (National Infrastructure for Distributed Computing, Portugal) computer cluster was used to run the LAID algorithm in parallel.

As already stated, since omics datasets with rare disease information are subject to restricted access, we generated artificial datasets with similar characteristics using SeqSIMLA. The dataset under study comprises 1700 observations in the affected group, with class 1, associated with rare disease patients, and 300 observations in the control group, with class 0, referring to healthy people. A disease prevalence of 0.015 was chosen. Reference haplotypes and recombination rates from chromosome 1 of the European population of the 1000 Genomes Project were used for data generation [Chung et al. 2014]. Initially 30 variants were chosen as disease sites, of which 19 had allele frequencies larger than 0 (ranging from 0.0001 to 0.447). We used a fixed odds ratio of 2.0 for the marginal association with the phenotype for all variants, and all non-variant features were removed.

In the original dataset the number of attributes is close to one million. The number of patient observations, Oa, is 1700 and the number of control observations, Ob, is 300. In the generation of matrix [Ai,j], each observation Oa is compared with all the observations Ob, resulting in a number of lines Oa×Ob, as exemplified in Table 3. The number of lines in the matrix [Ai,j] is 510,000 (1700 × 300) and the final dimension of the matrix is 510,000 × 1,000,000.

Two performance measures are used to evaluate the results: the computational time and the quality of the solutions. The solution quality is given by two criteria, the minimum number of attributes that better explains the dataset and the accuracy of the reduced dataset.

5.1. Computational runtime

To study the performance with large matrices, the time complexity should be considered to estimate the total runtime. For a two-class problem the number of observations is given by the tuple (O0, O1): (100, 100), (500, 500), (700, 700) and (1700, 300). The number of columns is tested from 100 to 5000 variables.

Figure 2. Computational time varying the number of columns and observations

In the Decomposition Heuristic for the Set Covering problem, the computational time for each sub-matrix [Ai,j]^k is shown in Figure 2, where a linear behavior can be observed between the runtime and the number of columns. The total runtime for the matrix [Ai,j] with dimension 510,000 × 1,000,000 and k=200, that is, 200 sub-matrices [Ai,j]^k of 510,000 × 5,000 each, can be estimated at 570,800 seconds, or approximately seven days. Since it is possible to decompose the problem, it can run in a parallel computer environment. In the implementation we distributed the task over 10 machines, resulting in a runtime of 58,000 seconds, or approximately 16 hours.

The speedup SN is defined as the time T1 taken to complete a program with one processor divided by the time TN taken to complete the same program with N processors:

SN = T1 / TN

The efficiency EN using N processors is defined as the speedup SN divided by the number of processors N:

EN = SN / N
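As a small worked illustration of these two measures, the fragment below computes SN and EN for a perfectly linear case with 10 machines; the timing values are illustrative, not measurements.

#include <stdio.h>

static double speedup(double t1, double tn)           { return t1 / tn; }
static double efficiency(double t1, double tn, int n) { return speedup(t1, tn) / n; }

int main(void) {
    /* illustrative values only: a perfectly linear case on 10 machines */
    double t1 = 1000.0, t10 = 100.0;
    printf("S_10 = %.1f  E_10 = %.0f%%\n",
           speedup(t1, t10), 100.0 * efficiency(t1, t10, 10));  /* 10.0, 100% */
    return 0;
}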
In our work we opted for the algorithm level, where instances of the heuristic run independently on different processors; the speedup is then linear, with an efficiency EN of 100%, meaning all the processors are fully used all the time.

5.2. Quality of the solutions

The challenge of this work is to solve a feature selection problem with a dataset involving one million attributes. In the computational experiments performed with sub-datasets of fewer than 1000 columns, we verified that the admissibility of the solutions is obtained by merging 3 sub-matrices [Ai,j]^k. For sub-datasets with 5000 columns, feasible solutions are found for each sub-matrix. Running 200 times, hundreds of possibly good solutions are found and then used in the master problem phase. Keeping this in mind, we add the second criterion, the accuracy.

As previously stated, the solution quality is given by two objectives: the number of attributes of the feasible reduced set and the accuracy in the cross-validation. The number of attributes is given by the Set Covering heuristic, where there are no parameters to point out. To obtain the accuracy in the Leave-One-Out cross-validation, we use k=3 in the k-nearest neighbor prediction algorithm.

The Pareto optimal set is the result of the master-problem algorithm. Figure 3 represents the Pareto front in a chart whose axes are the number of features and the accuracy. The Pareto front contains the set of efficient points. However, in this case only the point (19, 0.84) belongs to the efficient set; all the other points are dominated by this one. After analyzing the results, three solutions are found with the performance (19, 0.84). The feature selection was accomplished with a reduction from 5000 features to 19 features, which is quite expressive. A scale for classifying the accuracy is: excellent (0.91-1.00), good (0.81-0.90), fair (0.71-0.80), poor (0.61-0.70) and fail (0.51-0.60). So, the accuracy of 0.84 is considered not excellent but good.

Figure 3. Pareto front of the objectives, accuracy and number of features

Definitive quality assessment of the sub-datasets must be made by human specialists on rare diseases. A huge reduction is attained, from one million features to three reduced sub-datasets with 19 features each.

5.3. Final remarks on the computational experiment

The final remarks of this study regarding large volumes of data can be summarized considering three aspects: the data, the algorithm and the environment.

Regarding the data, large datasets with many dimensions are becoming the new normal, where in-memory data access should be replaced by on-disk data access. The Hierarchical Data Format (HDF5), originally developed at the National Center for Supercomputing Applications, is designed to store and organize large amounts of data that can be accessed on-disk in multiple ways.

The new algorithms need low time complexity, and problem decomposition is essential to parallelize the problem. The intense use of computationally expensive algorithms, like meta-heuristics, should be reduced, giving way to simple heuristics. Furthermore, with meta-heuristics the setting of multiple parameters is nontrivial, while with heuristics the number of parameters is substantially reduced.

Finally, the desired environment is in the cloud, running the programs in parallel in High-Performance Computing. The European research roadmap High-Performance Computing 2030 [European Commission 2018] should open new perspectives.
6. Conclusions

Given the large datasets available in bioinformatics, the aim of this work is to present a feature selection method able to deal with thousands of observations and millions of attributes. In feature selection, two performance measures are considered: the predictive accuracy and the result comprehensibility, using the parsimony principle. To preserve the data semantics, the minimum number of attributes that better explain the dataset is found, achieving the maximal actionable knowledge. On the one hand, accuracy is essential to measure the quality of the solution. On the other hand, the consistency measure is a core issue in the result comprehensibility. In this work, we opt for Filter methods for two main reasons: the large volumes of data that should be computed and the preservation of the data semantics.

This document extends the previous work on the LAID algorithm to large datasets, using a bi-objective approach that includes the two performance measures. The sub-problem algorithm returns the performance metric (number of features, accuracy). We present LAID in four steps. Initially, the consistency is assured by the inclusion of the dummy variable "je ne sais quoi". In the set covering problem, the disjoint matrix [Ai,j] and the cost vector [cj] reuse the distance measure concept for the independent criteria in feature selection. Therefore, the algorithm combines the consistency measure and the distance measure approaches. To compute the accuracy, the Leave-One-Out cross-validation is used.

In the Decomposition Heuristic algorithm, the disjoint matrix is divided into sub-matrices [Ai,j]^k, k = 1, …, max_k. Each sub-problem algorithm returns a pair of measures, which allows the bi-objective optimization. The algorithm finds multiple feasible solutions that are gathered in the master phase. In the master problem a Pareto optimal set is found by selecting the subset of non-dominated solutions and reducing the number of feasible sub-solutions, which facilitates the work of the end user.

The parallel processing integrates two concepts, computing multiple solutions for the decomposed heuristic and for MOO. The parallel techniques work perfectly to decompose the large volumes of omics datasets. The parallel approach also combines with the two objectives of feature selection and the diverse sub-solutions of the bi-objective optimization.

The computational runtime for the feature selection with a dimension of 2,000 observations and 1,000,000 features, running in parallel, seems suitable for the user requirements. In the experiment the Pareto optimal set includes only one point, where 19 features are chosen and the accuracy achieves 84%.

In future work, following the stabilization of the requirements of the end users, we intend to compare our performance measures with other algorithms.

Acknowledgements

The authors would like to thank the FCT support UID/Multi/04046/2013. This work used the EGI, European Grid Infrastructure, with the support of IBERGRID, the Iberian Grid Infrastructure, and INCD (Portugal).

REFERENCES

1. Almuallim, H., Dietterich, T.G. (1991). Learning with many irrelevant features. In: Proceedings of the 9th National Conference on Artificial Intelligence, MIT Press, pp. 547-552.
2. Boros, E., Hammer, P.L., Ibaraki, T., Kogan, A., Mayoraz, E., Muchnik, I. (2000). An Implementation of Logical Analysis of Data. IEEE Transactions on Knowledge and Data Engineering, vol. 12(2), pp. 292-306.
3. Boyd, S., Xiao, L., Mutapcic, A., Mattingley, J. (2008). Notes on decomposition methods. Notes for EE364B, Stanford University, pp. 1-36.
4. Cavique, L., Mendes, A.B., Funk, M. (2011). Logical analysis of inconsistent data (LAID) for a paremiologic study. In: Proceedings of the 15th Portuguese Conference on Artificial Intelligence, EPIA.
5. Cavique, L., Mendes, A.B., Funk, M., Santos, J.M.A. (2013). A feature selection approach in the study of azorean proverbs. In: Exploring Innovative and Successful Applications of Soft Computing, Advances in Computational Intelligence and Robotics (ACIR) Book Series, IGI Global, pp. 38-58.
6. Cavique, L., Mendes, A.B., Martiniano, H.F.M.C. (2017). A Feature Selection Algorithm Based on Heuristic Decomposition. In: Oliveira, E., Gama, J., Vale, Z., Lopes Cardoso, H. (eds) Progress in Artificial Intelligence, EPIA 2017, Porto, Portugal, Lecture Notes in Computer Science, Springer, vol. 10423, pp. 525-536.
7. Chandrashekar, G., Sahin, F. (2014). A survey on feature selection methods. Computers and Electrical Engineering, vol. 40(1), pp. 16-28.
8. Chvatal, V. (1979). A greedy heuristic for the set-covering problem. Mathematics of Operations Research, vol. 4, pp. 233-235.
9. Chung, R.H., Tsai, W.Y., Hsieh, C.H., Hung, K.Y., Hsiung, C.A., Hauser, E.R. (2014). SeqSIMLA2: Simulating Correlated Quantitative Traits Accounting for Shared Environmental Effects in User-Specified Pedigree Structure. Genetic Epidemiology, vol. 39(1), pp. 20-24.
10. Collette, Y., Siarry, P. (2011). Multiobjective Optimization, Principles and Case Studies. Decision Engineering Series, Springer.
11. Crama, Y., Hammer, P.L., Ibaraki, T. (1988). Cause-effect relationships and partially defined Boolean functions. Annals of Operations Research, vol. 16, pp. 299-326.
12. European Commission (2018). The European declaration on High-Performance Computing. url: https://ec.europa.eu/digital-single-market/en/news/european-declaration-high-performance-computing
13. John, G.H., Kohavi, R., Pfleger, K. (1994). Irrelevant Features and the Subset Selection Problem. In: Proceedings of the 11th International Conference on Machine Learning, ICML 94, pp. 121-129.
14. Joncour, C., Michel, S., Sadykov, R., Sverdlov, D., Vanderbeck, F. (2010). Column generation based primal heuristics. Electronic Notes in Discrete Mathematics, Elsevier, vol. 36, pp. 695-702.
15. Kira, K., Rendell, L.A. (1992). The feature selection problem: traditional methods and a new algorithm. In: Proceedings of the 9th National Conference on Artificial Intelligence, pp. 129-134.
16. Liu, H., Yu, L. (2005). Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on Knowledge and Data Engineering, vol. 17(4), pp. 491-502.
17. Pawlak, Z. (1982). Rough Sets. International Journal of Computer and Information Science, vol. 1, pp. 341-356.
18. Pawlak, Z. (1991). Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, Boston.
19. Peters, J.F., Skowron, A. (2010). Transactions on Rough Sets XI. Lecture Notes in Computer Science / Transactions on Rough Sets, Springer.
20. Polkowski, L. (2002). Rough Sets, Mathematical Foundations. Advances in Soft Computing, Physica-Verlag, Heidelberg, Germany.
21. Smet, P., Ernst, A., Vanden Berghe, G. (2016). Heuristic decomposition approaches for an integrated task scheduling and personnel rostering problem. Computers and Operations Research, vol. 76, pp. 60-72.
22. Stephens, Z.D., Lee, S.Y., Faghri, F., Campbell, R.H., Zhai, C., Efron, M.J., Iyer, R., Schatz, M.C., Sinha, S., Robinson, G.E. (2015). Big Data: Astronomical or Genomical? PLoS Biology, vol. 13(7): e1002195, doi:10.1371/journal.pbio.1002195.
23. Talbi, E.G. (2009). Metaheuristics: From Design to Implementation. John Wiley & Sons, Inc.
24. The 1000 Genomes Project Consortium (2015). A global reference for human genetic variation. Nature, vol. 526, pp. 68-74.
25. Yao, P.J., Chung, R.H. (2016). SeqSIMLA2_exact: simulate multiple disease sites in large pedigrees with given disease status for diseases with low prevalence. Bioinformatics, vol. 32(4), pp. 557-562.