MKL-GRNI: A parallel multiple kernel learning approach for supervised inference of large-scale gene regulatory networks


Nisar Wani1 and Khalid Raza2

1 Govt. Degree College Baramulla, Jammu & Kashmir, India
2 Department of Computer Science, Jamia Millia Islamia, New Delhi, India

ABSTRACT
High throughput multi-omics data generation coupled with heterogeneous
genomic data fusion are defining new ways to build computational inference
models. These models are scalable and can support very large genome sizes with
the added advantage of exploiting additional biological knowledge from the
integration framework. However, the limitation with such an arrangement is the
huge computational cost involved when learning from very large datasets in a
sequential execution environment. To overcome this issue, we present a multiple
kernel learning (MKL) based gene regulatory network (GRN) inference approach
wherein multiple heterogeneous datasets are fused using the MKL paradigm.
We formulate the GRN learning problem as a supervised classification problem,
whereby genes regulated by a specific transcription factor are separated from other
non-regulated genes. A parallel execution architecture is devised to learn a large
scale GRN by decomposing the initial classification problem into a number of
subproblems that run as multiple processes on a multi-processor machine.
We evaluate the approach in terms of increased speedup and inference potential
using genomic data from Escherichia coli, Saccharomyces cerevisiae and Homo
sapiens. The results thus obtained demonstrate that the proposed method
exhibits better classification accuracy and enhanced speedup compared to other
state-of-the-art methods while learning large scale GRNs from multiple and
heterogeneous datasets.

Subjects Bioinformatics, Computational Biology, Data Mining and Machine Learning
Keywords Gene regulatory networks, GRN inference, large-scale GRN, Systems biology,
Network biology

INTRODUCTION
The problem of understanding gene interactions and their influence through network
inference and analysis is of great significance in systems biology (Albert, 2007). The aim
of this inference process is to establish relationships between genes and construct a
network topology based on the evidence provided by different data types. Among various
network inference studies, gene regulatory network inference (GRNI) has remained of
particular interest to researchers with extensive scientific literature generated in this
domain. Gene regulatory networks (GRNs) are biological networks where genes
serve as nodes and the edges connecting them serve as regulatory relations (Lee et al., 2002;
Raza & Alam, 2016). Standard methods for GRN inference such as RELNET

How to cite this article Wani N, Raza K. 2021. MKL-GRNI: A parallel multiple kernel learning approach for supervised inference of large-
scale gene regulatory networks. PeerJ Comput. Sci. 7:e363 DOI 10.7717/peerj-cs.363

Submitted 19 October 2020
Accepted 29 December 2020
Published 28 January 2021

Corresponding author
Khalid Raza, kraza@jmi.ac.in

Academic editor
Othman Soufan

Additional Information and
Declarations can be found on
page 17

DOI 10.7717/peerj-cs.363

Copyright
2021 Wani and Raza

Distributed under
Creative Commons CC-BY 4.0

(Butte & Kohane, 1999), ARACNE (Margolin et al., 2006), CLR (Faith et al., 2007),
SIRENE (Mordelet & Vert, 2008) and GENIE3 (Huynh-Thu et al., 2010) mostly use
transcriptomic data for GRN inference. Among these methods, our approach is modeled
along the same principle as SIRENE. SIRENE is a general method to infer unknown
regulatory relationships between known transcription factors (TFs) and all the genes of
an organism. It uses a vector of gene expression data and a list of known regulatory
relationships between known TFs and their target genes. However, integration of this data
with other genomic data types such as protein–protein interaction (PPI), methylation
expression, sequence similarity and phylogenetic profiles has drastically improved GRN
inference (Hecker et al., 2009). State-of-the-art data integration techniques for GRN
inference are reviewed comprehensively in Wani & Raza (2019a).

In this article, we aim to integrate gene expression, methylation expression and TF-DNA
interaction data using the multiple kernel learning (MKL) library provided by the
Shogun machine learning toolbox (Sonnenburg et al., 2010) and design an algorithm to
infer gene regulatory networks (GRNs). We also integrate PPI data and other
data such as gene ontology information as a source of prior knowledge to enhance the
accuracy of network inference. The problem of network inference is modeled as a
binary classification problem whereby a gene being regulated by a given TF is treated as a
positive label and negative otherwise. To infer a large-scale network, the MKL model
needs to be trained for each TF with a set of known regulations for the whole genome.
Given N TFs, we need to train N different classification models individually and then
combine the results from these models for a complete network inference task. As the
number of TFs increases, the number of classification models also increases, straining
computational resources and lengthening the execution time of the inference algorithm.
The proposed approach addresses this problem by distributing these classification
models to different processors on a multi-processor hardware platform using the
multiprocessing library of Python. The results from these models are stored in a shared queue
object, which is later used for network inference. A detailed description of the model is
given in the Materials and Methods section.

RELATED LITERATURE
An early attempt to learn and classify gene function from integrated datasets using
kernel methods was carried out in Pavlidis et al. (2002). They trained a support vector
machine (SVM) for gene function classification with a heterogeneous kernel derived from
a combination of two different types of data (e.g., gene expression and phylogenetic
profiles). Since SVM does not learn from multiple kernel matrices simultaneously, they
proposed three different ways to fuse two datasets and referred to these fusion methods as
(i) early integration, (ii) intermediate integration and (iii) late integration approaches.
In early integration, feature vectors from heterogeneous data types are concatenated to
build a single, longer feature vector for each gene. This extended dataset is then
transformed into a kernel matrix using an appropriate kernel function and serves as
input to the SVM model, from which biological inferences can be drawn. In the case of
intermediate integration, the two datasets are first transformed into their respective kernel

Wani and Raza (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.363 2/20

matrices; subsequently these kernel matrices are added together to yield an integrated
kernel for SVM training. For late integration, the authors trained the SVM models
individually using the heterogeneous datasets. The probability scores which act as
discriminant values obtained from separate SVM models are then added together for gene
function prediction.
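
The difference between early and intermediate integration can be illustrated with a minimal Python sketch. The toy data values and the helper names (linear_kernel, add_kernels) are hypothetical, and a linear kernel is assumed purely for simplicity; this is not the code used by Pavlidis et al. (2002). With a linear kernel the two routes give the same matrix, because the inner product of concatenated vectors equals the sum of the per-dataset inner products; with non-linear kernels they generally differ.

```python
# Toy feature vectors for three genes from two data types (hypothetical values).
expression = [[1.0, 2.0], [0.5, 1.5], [2.0, 0.0]]
phylo = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0], [1.0, 1.0, 0.0]]


def linear_kernel(rows):
    """Return the Gram matrix of pairwise inner products between rows."""
    return [[sum(a * b for a, b in zip(x, y)) for y in rows] for x in rows]


def add_kernels(k1, k2):
    """Element-wise sum of two kernel matrices of the same shape."""
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(k1, k2)]


# Early integration: concatenate the feature vectors, then compute one kernel.
concatenated = [e + p for e, p in zip(expression, phylo)]
k_early = linear_kernel(concatenated)

# Intermediate integration: one kernel per dataset, then add the kernels.
k_intermediate = add_kernels(linear_kernel(expression), linear_kernel(phylo))
```

Late integration would instead train one model per dataset and add their discriminant values, so it has no single fused kernel to compare against.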

In fact, kernel-based methods as effective integration techniques were first proposed in
Lanckriet et al. (2004), wherein a 1-norm soft margin SVM is trained for a classification
problem, separating membrane proteins from ribosomal proteins. They combined
heterogeneous biological datasets such as PPI, amino acid sequences and gene expression
data characterizing different proteins by transforming them into multiple positive
semidefinite kernel matrices using different kernel functions. Their findings reveal an
improved classifier performance when all datasets are integrated as a unit compared to
testing the classifier on individual datasets. In an earlier study (Lanckriet et al., 2003)
on function prediction for baker’s yeast proteins, they trained an SVM classifier with
multiple datasets of different types and achieved an improved performance over a classifier
trained using a single data type.

In yet another study for network inference using kernel data integration (Yamanishi,
Vert & Kanehisa, 2004), the authors fused four different datasets, namely gene
expression data, protein interaction data, protein localization data and data from
phylogenetic profiles. These datasets are transformed into different kernel matrices.
Datasets comprising gene expression, protein localization and data from phylogenetic
profiles were kernelized using Gaussian, polynomial and linear kernel functions. Graph
datasets were kernelized using a diffusion kernel (Kondor & Lafferty, 2002). This study
compared both unsupervised and supervised inference methods on single and integrated
datasets. To assess the accuracy of the methods, the inferred networks are compared
with a gold standard protein network. Contrary to the unsupervised approaches, the
supervised approach seems to make interesting predictions and capture most of the
information from the gold standard. They observed that data from transcriptomic and
phylogenetic profiles seem to contribute roughly equal amounts of information, followed
by noisy PPI and localization data. Applying a supervised approach to integrated datasets
seems to produce overall best results, therefore highlighting the importance of guided
network inference from integrated prior biological knowledge.

In another study, Ben-Hur & Noble (2005) applied kernel methods to PPI studies
and proposed a pairwise kernel between two pairs of proteins in order to construct a
similarity matrix. This pairwise kernel is based on three sequence kernels: a spectrum
kernel, a motif kernel and a Pfam kernel. They further extended this experiment to explore
the effect of adding kernels from non-sequence data, such as gene ontology annotations,
homology scores and the mutual clustering coefficient (MCC) derived from protein
interactions computed in each cross-validation fold. Integrating these non-sequence
features with the pairwise kernel resulted in better performance than any single method by
itself.

Another integration and supervised learning method that uses MKL is the Feature
Selection Multiple Kernel Learning (FSMKL) proposed by Seoane et al. (2013). Feature
selection is performed on a variable number of features per kernel, separating the feature
sets from each data type that are most relevant to the given problem. The selection criteria
combine statistical scoring, which ranks features that are statistically aligned with the class
labels, with biological insight, whereby genes that are present in a specific pathway are chosen.
They integrate gene expression, copy number variation and other genomic data from
KEGG pathways. These data are transformed into their base kernels and integrated
using the MKL framework into a combined kernel. The prior biological knowledge in the
form of pathway information serves as the central criterion for FSMKL to cluster samples.
The authors claim that FSMKL performance is comparable to the other state-of-the-art
breast cancer prognosis methods from the DREAM challenge. Speicher & Pfeifer (2015)
adopted an unsupervised approach to discover cancer subtypes from an integrated kernel
using MKL. The proposed method called Regularized MKL Locality Preserving Projections
(rMKL-LPP) integrates multi-omics data such as gene expression, DNA methylation
and miRNA expression profiles of multiple cancer types from TCGA (Tomczak,
Czerwińska & Wiznerowicz, 2015). This regularized version extends the dimensionality
reduction variant of the MKL technique (MKL-DR) proposed by Yan et al. (2007).
The regularization term allows different types of kernels to be used during the optimization
process and also avoids overfitting. They cluster the samples by applying k-means on the
distance summation of each sample’s k-nearest neighbors by applying Locality Preserving
Projections (LPP). Many approaches have also been proposed for parameter estimation
of such large-scale and integrated models. Besides cross-validation, grid search and
randomised parameter optimization methods, Remli et al. (2019) have proposed a
cooperative enhanced scatter search for parameter estimation in high-dimensional biological models.
Their proposed method is executed in a parallel environment and can be faster than other
methods in providing accurate estimates of model parameters.

The multiple kernel learning approach has also been applied to drug-target
interaction network inference and drug bioactivity prediction. For drug-target
interaction prediction, Nascimento, Prudêncio & Costa (2016) proposed a new MKL-based
algorithm that selects and combines kernels automatically on a bipartite drug-protein
prediction problem. Their proposed method extends the Kronecker regularized least
squares approach (KronRLS) (Van Laarhoven, Nabuurs & Marchiori, 2011) to fit in an
MKL setting. The method uses L2 regularization to produce a non-sparse combination
of base kernels. The proposed method can cope with large drug vs. target interaction
matrices, does not require sub-sampling of the drug-target network, and is also able to
combine and select relevant kernels. They compared their
proposed method with top performers from single and integrative kernel approaches and
demonstrated the competitiveness of KronRLS-MKL in all the evaluated scenarios.
Similarly, for drug bioactivity prediction, Cichonska et al. (2018) proposed a pairwise MKL
method to address the scalability issues in handling massive pairwise kernel
matrices, in terms of both the computational complexity and the memory demands of such
prediction problems. The proposed method has been successfully applied to drug
bioactivity inference problems and provides a general approach to other pairwise MKL
spaces.

Since MKL is applied to solve large-scale learning problems, various efforts have been
undertaken to devise schemes whereby the MKL algorithm can be run in multiprocessor
and distributed computational environments. Chen & Fan (2014)
proposed a parallel multiple kernel learning (PMKL) method using the hybrid alternating direction
method of multipliers (H-ADMM). The proposed method makes the local processors
coordinate with each other to achieve the global solution. The results of their experiments
demonstrated that PMKL displays fast execution times and higher classification accuracies.
Another important study addressing the scalability and computational requirements in
the domain of large-scale learning was carried out by Alioscha-Perez, Oveneke &
Sahli (2019). They proposed SVRG-MKL, an MKL solution with inherent scalability
properties that can combine multiple descriptors involving millions of samples.
They conducted extensive experimental validation of their proposed method on several
benchmarking datasets confirming a higher accuracy and significant speedup for
SVRG-MKL. In one of our recent works, we proposed a data fusion and inference model,
called iMTF-GRN, based on Non-negative Matrix Tri-factorization that integrates the
diverse types of biological data (Wani & Raza, 2019b). The advantage of our proposed
parallel MKL-GRNI approach is that it is simple to implement and does not need complex
coding to distribute multiple classification problems in a multiprocessor environment.
Our method employs shared queue objects for distributing inputs and collecting outputs
from multiple processors, in contrast to PMKL (Chen & Fan, 2014), where multiple
processors are explicitly made to coordinate using the hybrid alternating direction
method of multipliers (H-ADMM), introducing complexity and added computational
overhead. Also, we chose a basic addition operation to fuse multiple kernels, in contrast to the
Kron-RLS MKL (Cichonska et al., 2018) method, where the fusion of multiple kernels is
achieved by performing a Kronecker product operation, which requires calculating the
inverse of individual kernels and hence incurs a computational overhead compared to a basic
arithmetic operation. For the MKL implementation, we used the Shogun toolbox, which
is a highly optimized, stable and efficient tool developed in C++ by Sonnenburg et al.
(2010), making it a suitable candidate for computing-intensive and large-scale learning
problems.

MATERIALS AND METHODS
The proposed method adopts a supervised approach to learn new interactions between
a TF and the whole genome of an organism. The algorithm operates on multiple
datasets that characterize the genes of an organism. Since we are adopting an integrated
approach, datasets such as gene expression, known TF-gene regulations, PPI, and
DNA-methylation data can be combined using the MKL approach. All these datasets are
carefully chosen owing to their role in gene regulation. The TF-gene interaction data serves
a dual purpose. It supplies the algorithm with prior knowledge about the regulatory
relationships, and for each TF, the known target gene list also forms the labels for the
MKL classifier. For each TF, the set of known gene targets serves as positive examples.
For negative examples, we divide our input into two subsets; the MKL classifier is trained
using positive examples for which no prediction is needed, and the other subset contains
negative examples. We perform 10-fold cross-validation using the same scheme and obtain
discriminant values for all the genes with no prior regulation knowledge for this TF.
This whole procedure is repeated for all the TFs. The idea here is to identify the set of genes
whose expression profiles match those of positive examples even though the classifier is
supplied with some false negative examples in the training set. A graphical overview of
this architecture is depicted in Fig. 1. GRN inference from integrated
datasets through supervised learning using MKL is not a trivial task. The
complexity rises manifold when inferring GRNs for organisms with large
genome sizes. In this scenario, model training and testing become TF specific.
Therefore, the inference problem is decomposed into a set of classification subproblems,
one for each TF present in the input gene-TF interaction
matrix. A sequential approach to such a problem would require running each
subproblem one after the other in a loop. However, as we increase the number of TFs,
the execution time of the algorithm also increases. To overcome such problems, we devise
a strategy of parallel execution for the algorithm wherein multiple subproblems run
simultaneously across different processors of a multi-processor hardware platform as
explained in Algorithm 1.
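
One possible reading of the per-TF training scheme described above can be sketched in Python as follows. This is an assumption rather than the authors' implementation: genes with no known regulation by the TF are treated as provisional negatives, they are dealt into folds, and each fold is scored by a model trained on all positives plus the remaining provisional negatives. The function names and the toy gene identifiers are hypothetical, and the classifier itself is left out.

```python
import random


def make_folds(items, n_folds, seed=0):
    """Shuffle items and deal them into n_folds folds of nearly equal size."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_folds] for i in range(n_folds)]


def training_plan(genes, known_targets, n_folds=10, seed=0):
    """Yield (train_pos, train_neg, to_score) triples for one TF.

    Assumption: genes without a known regulation are provisional negatives;
    each fold is held out once and scored by a model trained on all positives
    plus the provisional negatives from the other folds.
    """
    positives = [g for g in genes if g in known_targets]
    unlabeled = [g for g in genes if g not in known_targets]
    folds = make_folds(unlabeled, n_folds, seed)
    for i, held_out in enumerate(folds):
        train_neg = [g for j, f in enumerate(folds) if j != i for g in f]
        yield positives, train_neg, held_out


# Hypothetical toy input: 25 genes, three of them known targets of the TF.
genes = ["g%d" % i for i in range(25)]
known_targets = {"g0", "g3", "g7"}
plan = list(training_plan(genes, known_targets, n_folds=10))
```

Under this reading every gene without prior regulation knowledge receives exactly one discriminant value, from a model that never saw it as a negative example.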

Outputs generated by each model in the form of confidence scores (probability that a
given TF regulates a gene) are stored in a shared queue object. Once all the subproblems
finish their execution, the shared object is iterated to collect the results generated by
all the models in order to build a single output matrix. In case the number of TFs is
more than the number of available processors, the TFs are split into multiple groups and
dispatched to the processors such that all the processors receive an equal number of
classification models.
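
The dispatch scheme described above can be sketched with the Python standard library. This is a minimal illustration under stated assumptions, not the released MKL-GRNI code: score_tf is a hypothetical stand-in for training and applying one per-TF MKL model, and the helper names split_tfs, worker and infer_parallel are invented for the example.

```python
import multiprocessing as mp


def split_tfs(tfs, n_workers):
    """Split the TF list into n_workers groups whose sizes differ by at most one."""
    return [tfs[i::n_workers] for i in range(n_workers)]


def score_tf(tf, genes):
    """Hypothetical stand-in for training and applying one per-TF MKL model."""
    return {gene: 0.0 for gene in genes}


def worker(tf_group, genes, queue):
    """Run the classification subproblems of one group and queue the scores."""
    for tf in tf_group:
        queue.put((tf, score_tf(tf, genes)))


def infer_parallel(tfs, genes, n_workers):
    """Dispatch TF groups to worker processes and collect a TF -> scores map."""
    queue = mp.Queue()
    groups = [g for g in split_tfs(tfs, n_workers) if g]
    procs = [mp.Process(target=worker, args=(g, genes, queue)) for g in groups]
    for p in procs:
        p.start()
    # Drain the queue before joining so that workers are never blocked on put().
    results = dict(queue.get() for _ in tfs)
    for p in procs:
        p.join()
    return results
```

With 32 TFs and 8 workers, split_tfs yields eight groups of four TFs each. On platforms that start processes by spawning, infer_parallel should be called from under an `if __name__ == "__main__":` guard.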

Figure 1 Application architecture of MKL-GRNI (A) Combined kernel (B) Decomposed regulation
matrices (C) Parallel distribution and model building (D) Model execution (E) Writing results to
shared object. Full-size DOI: 10.7717/peerj-cs.363/fig-1

Kernel methods for genomic data fusion
Kernel methods represent a mathematical framework which embeds data points
(genes, proteins, drugs, etc.) from an input space I to a feature space F by employing a kernel
function. Genomic datasets, viz. mRNA expression levels from RNA-seq, DNA
methylation profiles and the TF-gene regulation matrix obtained from different databases,
comprise heterogeneous datasets that can be fused using kernel methods and serve as
the building blocks for inference of gene regulatory networks. A modular and generic
approach to pattern analysis, kernel methods can operate on very high dimensional
data in feature space by performing an inner product on the input data using a kernel
function (Shawe-Taylor & Cristianini, 2004). An algorithm is devised that can work with

Algorithm 1 MKL-GRNI: parallel approach for supervised inference of large-scale gene regulatory
networks.

Input: k datasets D1, D2, ..., Dk
Input: Regulation binary matrix R for classification labels
Output: A matrix of decision scores DS for TF-gene interaction

begin
  Transform D1, D2, ..., Dk into k1, k2, ..., kn kernels using appropriate kernel function
  Fuse n kernels as K = k1 + k2 + ... + kn
  define mkl parameters params (C, norm, epsilon)
  /* Distribute source TFs among multiple CPUs */
  foreach cpu in the cpu list do
    do in parallel
      foreach TF in source TF list do
        /* Set MKL parameters and data */
        set mkl.kernel ← K
        set mkl.labels ← R
        set mkl.parameters ← params
        /* Obtain decision scores for MKL algorithm between each TF and all
           genes in the genome */
        DSTF ← ApplyMKL()
      end
      put DSTFk in queue Q
    end
  end
  foreach q in Q do
    DSTFk ← q.val
  end
end

such data and learn patterns. Such an algorithm is more generic as it operates on
any data type that can be kernelized. The kernels are data specific, such as Gaussian,
polynomial and sigmoid kernels for vectorial data, diffusion kernels for graph data,
and string kernels for different types of sequence data. Because only the kernel part is data
specific, this creates a flexible and modular approach in which multiple modules can be
combined to obtain complex learning systems. A graphical depiction of this fusion technique is shown in Fig. 2.
The choice of different kernel functions for transforming datasets into their respective
kernel matrices is made after a thorough analysis of the literature in the field of kernel methods
and MKL methods.
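
As a minimal sketch of this modularity, the following Python code kernelizes two toy vectorial datasets with a Gaussian kernel and fuses them by addition. The data values, the bandwidth sigma = 0.5 and the helper names are hypothetical; the example uses plain Python lists rather than the Shogun toolbox.

```python
import math


def gaussian_kernel(rows, sigma=1.0):
    """Gaussian (RBF) Gram matrix: exp(-||x - y||^2 / (2 * sigma^2))."""
    def sq_dist(x, y):
        return sum((a - b) ** 2 for a, b in zip(x, y))
    return [[math.exp(-sq_dist(x, y) / (2.0 * sigma ** 2)) for y in rows]
            for x in rows]


def fuse(kernels):
    """Fuse kernel matrices by element-wise addition, K = k1 + k2 + ... + kn."""
    n = len(kernels[0])
    return [[sum(k[i][j] for k in kernels) for j in range(n)] for i in range(n)]


# Hypothetical toy profiles for the same three genes in two data types.
expression = [[0.1, 0.9, 0.4], [0.2, 0.8, 0.5], [0.9, 0.1, 0.7]]
methylation = [[0.3, 0.3], [0.3, 0.4], [0.8, 0.1]]

K = fuse([gaussian_kernel(expression, sigma=0.5),
          gaussian_kernel(methylation, sigma=0.5)])
```

Because each dataset is reduced to a gene-by-gene matrix before fusion, datasets with different numbers of features (three and two columns here) can be combined as long as they describe the same genes in the same order.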

MKL model
Multiple kernel learning is based on integrating many features of objects such as genes,
proteins, drugs, etc., via their kernel matrices and represents a paradigm shift from

Figure 2 Genomic data fusion by combining kernel matrices from multiple kernels into a single
combined kernel. Full-size DOI: 10.7717/peerj-cs.363/fig-2

machine learning models that use single object features (Sonnenburg et al., 2006).
This combined information from multiple kernel matrices is provided as input to the
MKL algorithm to perform classification/regression tasks on unseen data. Information
represented by the kernel matrices can be combined by applying basic algebraic
operations, such as addition, multiplication, and exponentiation, such that the positive
semi-definiteness of the candidate kernels is preserved in the final kernel matrix.
The resultant kernel can be defined by the following equations, using k1 and k2 as candidate
kernel matrices and ϕ1(x) and ϕ2(x) as their corresponding embeddings in the feature space.

$$K = k_1 + k_2 \qquad (1)$$

with the new induced embedding

$$\phi(x) = [\phi_1(x), \phi_2(x)] \qquad (2)$$

Given a kernel set K = {k1, k2, ..., km}, an affine combination of m parametrized
kernels can be formed as:

$$K = \sum_{i=1}^{m} \mu_i k_i \qquad (3)$$

subject to the constraint that the weights μi are non-negative, that is, μi ≥ 0, i = 1, ..., m. With
these kernel matrices as input, a statistical classifier such as SVM separates the two classes
using a linear discriminant by inducing a margin in the feature space. To find this
discriminant, an optimization problem, known as a quadratic program (QP) needs to
be solved. QP belongs to a class of convex optimization problems, which can be solved efficiently.
The Shogun toolbox solves this MKL optimization problem using semidefinite programming
(SDP), first implemented for MKL learning by Lanckriet et al. (2004). Based on the margin,
SVM algorithms are classified into hard, 1-norm soft and 2-norm soft margin SVMs.
Here we use the 1-norm soft margin SVM and SDP for MKL optimization and
classification from heterogeneous datasets, as explained in our earlier work on MKL for
biomedical image analysis (Wani & Raza, 2018). A detailed treatment of SVM algorithms
is given in Scholkopf & Smola (2001).
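
The weighted combination in Eq. (3) can be illustrated with a short Python sketch. The matrices and weights below are hypothetical and the weights are fixed by hand, whereas in MKL they are learned by the optimizer; the helper names are invented and nothing here is a Shogun call. The quadratic form is included only as a spot check of positive semidefiniteness for particular vectors, not as a proof.

```python
def combine_kernels(kernels, weights):
    """Return sum_i mu_i * k_i for kernel matrices k_i and weights mu_i >= 0."""
    if len(kernels) != len(weights):
        raise ValueError("need exactly one weight per kernel")
    if any(mu < 0 for mu in weights):
        raise ValueError("kernel weights must be non-negative")
    n = len(kernels[0])
    return [[sum(mu * k[i][j] for mu, k in zip(weights, kernels))
             for j in range(n)] for i in range(n)]


def quadratic_form(kernel, v):
    """Compute v^T K v, which is >= 0 for every v when K is positive semidefinite."""
    return sum(v[i] * kernel[i][j] * v[j]
               for i in range(len(v)) for j in range(len(v)))


# Two small positive semidefinite matrices (hypothetical values).
k1 = [[2.0, 1.0], [1.0, 2.0]]
k2 = [[1.0, -1.0], [-1.0, 1.0]]
K = combine_kernels([k1, k2], [0.5, 2.0])
```

Rejecting negative weights mirrors the constraint μi ≥ 0, which is what keeps the combined matrix positive semidefinite when every candidate kernel is.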

Datasets
To test the parallel MKL algorithm on multiple datasets, we downloaded gene expression
data of Escherichia coli and Saccharomyces cerevisiae from the DREAM5 network inference
challenge (Marbach et al., 2012) along with their gold standard networks, and human
breast cancer transcriptomic data from TCGA. Some prominent features of these data are
shown in Table 1.

Because the MKL paradigm provides the platform to fuse heterogeneous datasets, we
downloaded PPI data for both E. coli and S. cerevisiae from the STRING database (Szklarczyk
et al., 2011). The PPI data is supplied as prior biological knowledge to the algorithm
in order to improve its inference accuracy as MKL can learn from multiple datasets.
To supplement the human transcriptome with additional biological knowledge,

Wani and Raza (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.363 9/20

http://dx.doi.org/10.7717/peerj-cs.363
https://peerj.com/computer-science/


we download DNA methylation expression data for all the genes in the transcriptome from
the TCGA broad institute data portal (https://gdac.broadinstitute.org/). The regulation
data (i.e., known interaction between genes and TFs) for E.coli and S. cerevisiae were
extracted from the gold standard network provided in the DREAM dataset, however,
for GRN inference in humans, the regulation data has been collected from a number of
databases that store TF-gene interaction data derived from ChIP-seq and ChIP-ChIP
experiments. We collected a list of 66 TFs from the ENCODE data portal (https://www.
encodeproject.org/) for which ChIP-seq experiments were carried out on MCF7 breast
cancer cell lines across different experimental labs. The targets of these TFs were extracted
from ENCODE (ENCODE Project Consortium, 2004), TRED (Jiang et al., 2007) and
TRRUST (Han et al., 2015) databases.

Hardware and software requirements
The hardware platform used in this study is an IBM System X3650 M4 server model that
includes an Intel Xeon processor having 24 cores and a primary memory of 32 GB
with extendable option of 64 GB. The system supports a 64-bit memory addressing
scheme having powerful 3.2 GHz/1066 MHz Intel Xeon processors with 1066 MHz front-
side bus (FSB) and 4 MB L2 cache (each processor is dual core and comes with 2 × 2 MB
(4 MB) L2 cache). The system also supports hyper-threading for more efficient
program execution. To exploit the multi-core and multithreading features present
in the hardware system, we used the Python multiprocessing package to dispatch different
sub-problems across multiple cores of the computing system. The distribution
of different learning sub-problems among the cores of a multi-core machine is
shown in Fig. 1. For fusion of multiple datasets we use the MKL approach,
whereby different datasets are first converted into similarity matrices (kernels) and then
joined to generate a final integrated matrix for learning TF-gene targets. We use the MKL
Python library provided by the Shogun Machine Learning toolbox to implement the
proposed algorithm.

RESULTS
All the genomic datasets are transformed into their respective kernel matrices by using an
appropriate kernel function. For example, datasets such as gene expression and DNA
methylation expression data are transformed using a Gaussian radial basis function.
The PPI data is converted into a diffusion kernel, $K = e^{\beta H}$, where H is the negative
Laplacian of the PPI graph, derived from its adjacency matrix A and degree matrix D as H = A − D.
The TF-target gene regulation data is organized as a binary matrix of labels (i.e., 1 and −1)

Table 1 Dataset description of different organisms for supervised GRN inference.

Organism        Genes    Samples   Transcription factors   Known regulations   Known targets
E. coli         4,297    805       140                     1,979               953
S. cerevisiae   5,657    536       120                     4,000               2,721
Homo sapiens    19,201   1,212     66                      73,052              12,028

with genes in rows and TFs in columns. The number of rows corresponds to the genome
size of the organism and the number of columns corresponds to the total number of
TFs being used for GRN inference. An element with value 1 signifies that
gene gi is regulated by TFj, and an element with value −1 that it is not. Such an organization of the regulation
data allows us to use each column of the matrix as a label for individual classification
problems in a supervised learning environment.
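
The two data preparations described in this paragraph, the diffusion kernel for graph data and the ±1 label matrix, can be sketched in plain Python as follows. The toy PPI graph, the value β = 0.5, the truncated Taylor series used for the matrix exponential and all helper names are assumptions made for illustration; a production implementation would normally rely on a numerical library.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def diffusion_kernel(adjacency, beta, terms=30):
    """Approximate K = exp(beta * H), H = A - D, by a truncated Taylor series."""
    n = len(adjacency)
    h = [[beta * (adjacency[i][j] - (sum(adjacency[i]) if i == j else 0.0))
          for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for order in range(1, terms):
        # term now holds (beta * H)^order / order!
        term = [[x / order for x in row] for row in mat_mul(term, h)]
        result = [[r + t for r, t in zip(rr, tr)] for rr, tr in zip(result, term)]
    return result


def label_matrix(genes, tfs, known_pairs):
    """Genes in rows, TFs in columns; 1 for a known regulation, -1 otherwise."""
    return [[1 if (tf, g) in known_pairs else -1 for tf in tfs] for g in genes]


# Hypothetical toy inputs: a three-protein path graph and two known regulations.
ppi = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
K_ppi = diffusion_kernel(ppi, beta=0.5)
R = label_matrix(["g1", "g2", "g3"], ["TF1", "TF2"],
                 {("TF1", "g1"), ("TF2", "g3")})
labels_tf1 = [row[0] for row in R]  # column used as labels for the TF1 model
```

Taking one column of R, as in labels_tf1, gives the label vector for a single per-TF classification problem.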

We perform two sets of experiments with our proposed approach in order to evaluate
the scalability and the inference potential of supervised learning from heterogeneous
datasets using the MKL paradigm. Our first experiment records the execution times required
to learn from varying genome and sample sizes on single and multi-processor
architectures, given a set of TFs. Our second experiment focuses on the evaluation of the
inference potential of this approach on different genome and sample sizes. Since our
problem of GRN inference is complex, the experiment aims to evaluate the parallel
nature of the MKL algorithm by decomposing supervised inference of GRNs for multiple
TFs into a number of subproblems and distributing them to multiple processors for parallel
execution. The genome and sample sizes are varied in these experiments to evaluate
how efficiently MKL-based models scale to large genomes, where most of the GRN models
developed to date do not perform optimally, as reported in Marbach et al. (2012).
The proposed method is implemented in Python and the code along with the data is available
at https://github.com/waninisar/MKL-GRNI.

To assess the performance of the parallel MKL-GRNI on the different genomes
characterized by the datasets in Table 1, we execute the algorithm with the code required
for the evaluation metrics embedded in it. Once the algorithm completes its execution run, all
the essential metrics are recorded for further analysis. The metrics are computed to
evaluate the capacity of our approach in terms of reduced computational cost and
enhanced inference accuracy when dealing with complex and large-scale inference tasks.
Initially the algorithm is run in sequential mode for all the organisms for a set of
32 TFs, and later on in parallel mode on 8 and 16 CPUs. Performance metrics for all the
datasets are plotted in Fig. 3. A brief description of these important performance metrics is
given below:

SPEEDUP
We calculate speedup as a measure of the relative performance of executing our algorithm
in sequential and parallel processing environments. The speedup is calculated as:

$$S(j) = \frac{T(1)}{T(j)} \tag{4}$$

where S(j) is the speedup on j processors, T(1) is the time the program takes on a single
processor and T(j) is the time it takes on j processors.

EFFICIENCY
Efficiency is defined as the ratio of speedup to the number of processing elements (j CPUs
in our case). It measures the fraction of time for which the computational resources are
utilized. Ideally, in a parallel system, speedup is equal to j and efficiency is equal to 1.
However, in practice, speedup is less than j and efficiency is between zero and one,
depending on the effectiveness with which the processing elements are utilized.
We calculate the efficiency E(j) on j processors as given below:

$$E(j) = \frac{S(j)}{j} \tag{5}$$

Wani and Raza (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.363

REDUNDANCY
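Redundancy is computed as the ratio between the number of operations executed in parallel
and in sequential mode. It measures the increase in the number of computations required
when the algorithm is run on multiple processors.

$$R(j) = \frac{O(j)}{O(1)} \tag{6}$$

where O(j) is the number of operations executed on j processors and O(1) is the number
executed in sequential mode.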

QUALITY
Quality measures the relevance of using parallel computation and is defined as the ratio
of the product of speedup and efficiency to the redundancy.

$$Q(j) = \frac{S(j) \times E(j)}{R(j)} \tag{7}$$
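
The four metrics defined in Eqs. (4) to (7) can be computed directly from measured run times and operation counts. The sketch below is only illustrative: the timings and operation counts are made-up numbers, not measurements from this study, and the function names are our own.

```python
def speedup(t1, tj):
    """Eq. (4): S(j) = T(1) / T(j)."""
    return t1 / tj


def efficiency(t1, tj, j):
    """Eq. (5): E(j) = S(j) / j."""
    return speedup(t1, tj) / j


def redundancy(ops_parallel, ops_sequential):
    """Eq. (6): R(j) = O(j) / O(1)."""
    return ops_parallel / ops_sequential


def quality(t1, tj, j, ops_parallel, ops_sequential):
    """Eq. (7): Q(j) = S(j) * E(j) / R(j)."""
    return (speedup(t1, tj) * efficiency(t1, tj, j)
            / redundancy(ops_parallel, ops_sequential))


# Hypothetical measurements: 120 s sequentially, 20 s on 8 CPUs,
# and 10% more operations in parallel mode.
t1, t8, j = 120.0, 20.0, 8
ops_seq, ops_par = 1000, 1100

s8 = speedup(t1, t8)                          # 6.0
e8 = efficiency(t1, t8, j)                    # 0.75
r8 = redundancy(ops_par, ops_seq)             # 1.1
q8 = quality(t1, t8, j, ops_par, ops_seq)     # about 4.09
```

With these numbers the program runs six times faster on eight CPUs, but each CPU is usefully busy only three quarters of the time.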

Figure 3 Performance metrics for the parallel MKL-GRNI algorithm: (A) Speedup, (B) Efficiency,
(C) Redundancy, (D) Quality.

It is evident from Fig. 3 that there is a marked increase in speedup as we
move from sequential (single CPU) to parallel execution (i.e., 8 and 16 CPUs). For the
E. coli genome with a sample size of 500 and 32 TFs used for inference, the algorithm
shows a sharp speedup as we move from sequential execution to parallel execution on
8 processors; however, when the number of processors is increased to 16, there is only a
marginal further increase in speedup for E. coli. On the other hand, a considerable increase
in speedup is recorded on both 8 and 16 processors for the larger genomes, S. cerevisiae and
Homo sapiens, suggesting that the capacity of the parallel algorithm to reduce execution
times grows with genome size. With respect to resource utilization, the efficiency metric
shows a considerable drop in the utilization of compute resources for all three datasets,
because only a section of the algorithm runs in parallel. This can be inferred
from the redundancy computed for sequential and parallel executions. The redundancy
plot shows only a slight increase in the computational cost incurred when running
our problem in parallel, thereby suggesting little computational overhead as
we switch from sequential to parallel execution. To evaluate the relevance of
parallel execution to our problem, we calculate the quality metric for all three datasets.
From the bar plots we can observe that parallel execution is less relevant when applied to
smaller genomes, as is evident in the case of E. coli. However, there is a steady improvement
in the quality metric as we move from S. cerevisiae to Homo sapiens, with the relevance
indicator being high when the yeast dataset is run on 8 processors and the human dataset
on 16 processors. These improvements in the speedup and quality metrics when running the
algorithm in parallel provide us with a framework to work with more complex datasets and
organisms with large genome sizes, and to infer very large scale GRNs using a supervised
approach.

To assess the inference potential of this supervised method, we compare the proposed
approach with other methods that infer gene interactions from single and integrated
datasets. We first apply MKL-GRNI to the DREAM5 E. coli data and perform a 10-fold
cross-validation to make sure that the model is trained on all the known regulations.
At each cross-validation step, performance metrics such as precision, recall and
F1 score are recorded and then averaged over the whole cross-validation procedure.
We then compare our network inference method with inference methods that predict
TF-target gene regulations, such as CLR (Faith et al., 2007) and SIRENE (Mordelet & Vert,
2008). The results are recorded in Table 2.

After running all the inference procedures, we observe that the average precision,
recall and F1 scores obtained with MKL-GRNI are higher than those obtained with the
other methods (Table 2), most notably for recall. The improvement with MKL-GRNI can be
attributed to the additional biological knowledge, in the form of protein-protein
interactions between E. coli genes, that aids the inference process.

To test the proposed method on integrated data, we perform a 10-fold cross-validation
procedure on the input data. In this experiment, the known target genes of each organism,
as listed in Table 1, are split into training and test sets. The model is trained on the
features from the training set, and network inference is performed between the genes
in the test set. Evaluation metrics such as precision, recall and F1 score are
recorded for each iteration and averaged across cross-validation runs. Table 3 summarizes
these metrics for varying genome and sample sizes for the human breast cancer dataset, and
Table 4 contains the results for all three genomes.
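
The cross-validation bookkeeping described above can be sketched as follows. This is a generic illustration rather than the evaluation code of MKL-GRNI: the fold assignment is a simple shuffled split, `train_and_predict` is a hypothetical callable standing in for training the classifier on the training genes and predicting +1/−1 labels for the test genes, and a metric is reported as 0 for a fold in which it is undefined.

```python
import random


def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for +1/-1 labels (0.0 when undefined)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p != 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(y, train_and_predict, k=10, seed=0):
    """Average precision, recall and F1 over k folds.

    train_and_predict(train_idx, test_idx) is a hypothetical callable that
    returns one predicted label per index in test_idx.
    """
    scores = []
    for test_idx in k_fold_indices(len(y), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        y_pred = train_and_predict(train_idx, test_idx)
        y_true = [y[i] for i in test_idx]
        scores.append(precision_recall_f1(y_true, y_pred))
    return tuple(sum(s[m] for s in scores) / len(scores) for m in range(3))
```

Because known regulations are scarce, a stratified split that keeps some positives in every fold would usually be preferable to the plain shuffle used here.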

It is evident from these results that the MKL-GRNI algorithm scales well to larger
genome sizes. These metrics highlight the learning and inference potential of MKL.
In Table 3 we observe an average recall of 80%, an average precision of 58% and
an average F1 measure of 66% for a genome size of 5,000 and a sample size of 100,
with precision and F1 increasing as the sample size grows to 500 and 1,000, and recall
increasing at 1,000 samples. However, as we increase the size of the genome, these metrics
gradually decline for the smaller sample sizes and again show a marginal increase as we
increase the sample size for a fixed genome size. Although there is no direct rule for
determining the number of samples corresponding to the size of the genome in omics studies,
the improvements in the precision, recall and F1 measures suggest an improvement in the
learning and inference potential of the MKL algorithm with an increase in the number of
samples. The tabulated metrics for all three genomes in Table 4 also show a considerable
decline in the evaluation metrics as we move from smaller to larger genomes, suggesting a
decrease in the inference potential of the algorithm for larger datasets. This decline in the
performance metrics can be attributed to the increase in genome size as we move from
simple prokaryotic to more complex eukaryotic genomes. The increase in genome size
relative to sample size leads to the curse of dimensionality, making it difficult
to learn properly from skewed datasets.

Table 2 Average precision, recall and F1 measures for various inference methods.

| Method | Average precision | Average recall | Average F1 score |
| --- | --- | --- | --- |
| CLR | 0.275 | 0.55 | 0.36 |
| SIRENE | 0.445 | 0.73 | 0.55 |
| MKL-GRNI | 0.46 | 0.97 | 0.62 |

Table 3 Precision, recall and F1 measure recorded for different combinations of genome and sample
sizes for breast cancer data.

| No. of genes | No. of samples | Average recall | Average precision | Average F1 measure |
| --- | --- | --- | --- | --- |
| 5,000 | 100 | 0.8005 | 0.5817 | 0.6582 |
| 5,000 | 500 | 0.8005 | 0.6169 | 0.6848 |
| 5,000 | 1,000 | 0.8354 | 0.6347 | 0.6968 |
| 10,000 | 100 | 0.7350 | 0.4406 | 0.5509 |
| 10,000 | 500 | 0.7660 | 0.4537 | 0.5699 |
| 10,000 | 1,000 | 0.7860 | 0.4937 | 0.6065 |
| 19,201 | 100 | 0.7499 | 0.3746 | 0.4996 |
| 19,201 | 500 | 0.7444 | 0.3893 | 0.5112 |
| 19,201 | 1,000 | 0.7499 | 0.4246 | 0.5422 |

Table 4 Precision, recall and F1 measures averaged across cross-validation runs for complete
genomes.

| Organism | No. of genes | No. of samples | Avg. precision | Avg. recall | Avg. F1 measure |
| --- | --- | --- | --- | --- | --- |
| E. coli | 4,297 | 802 | 0.46 | 0.97 | 0.62 |
| S. cerevisiae | 5,657 | 536 | 0.42 | 0.84 | 0.56 |
| Homo sapiens | 19,201 | 1,012 | 0.37 | 0.73 | 0.49 |

We also compare MKL-GRNI with a recently developed integrative random forest
for gene regulatory network inference (iRafNet) (Petralia et al., 2015). We select the DREAM5
datasets of E. coli and S. cerevisiae and integrate PPI and gene expression data for
both organisms. For MKL we build Gaussian and diffusion kernels from the expression and
PPI data, respectively. For iRafNet, the expression data serves as the main data and the PPI
data is used as support data. Sampling weights are derived from the PPI data by building a
diffusion kernel $K = e^{H}$, where H is a graph Laplacian for the PPI data. The sampling
weights are taken as $W^{PPI}_{i,j} = K(i, j)$, that is, the (i, j)-th element of K. The sampling
weights thus obtained are then integrated with the main dataset (i.e., gene expression data).
Putative regulatory links are then predicted using importance scores generated by
the iRafNet R package. The AUC and AUPR scores obtained using iRafNet and
MKL-GRNI are listed in Table 5.
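
A minimal sketch of how such a diffusion kernel could be computed from a PPI adjacency matrix is given below. It assumes the common convention of Kondor & Lafferty (2002), in which the exponent is the negative graph Laplacian scaled by a bandwidth β; the text above writes this simply as K = e^H. The toy adjacency matrix, the value of `beta` and the function name `diffusion_kernel` are illustrative assumptions and are not taken from the MKL-GRNI or iRafNet code.

```python
import numpy as np


def diffusion_kernel(adjacency, beta=1.0):
    """Diffusion kernel K = exp(-beta * L) of an undirected graph,
    where L = D - A is the graph Laplacian (assumed convention)."""
    A = np.asarray(adjacency, dtype=float)
    laplacian = np.diag(A.sum(axis=1)) - A
    # L is symmetric, so the matrix exponential follows from its
    # eigendecomposition: exp(-beta * L) = V diag(exp(-beta * w)) V^T.
    eigvals, eigvecs = np.linalg.eigh(laplacian)
    return (eigvecs * np.exp(-beta * eigvals)) @ eigvecs.T


# Hypothetical 4-gene PPI graph: a path g0 - g1 - g2 - g3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])

K = diffusion_kernel(A, beta=0.5)
W_ppi = K  # sampling weights: W_ppi[i, j] = K(i, j)
```

In this toy graph, genes that are closer in the PPI network receive larger weights, so a direct interaction partner is favored over a gene three steps away.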

The AUC and AUPR scores of MKL-GRNI are comparable to those of iRafNet
for both datasets. On the E. coli data, iRafNet reports a lower AUC and a higher AUPR
than MKL-GRNI. As we move to the larger yeast genome, these scores drop for both
iRafNet and MKL-GRNI. The slightly higher AUC scores of MKL-GRNI can be attributed,
to some extent, to the skewed class label distribution, wherein negative labels far
outnumber the positive ones because of the limited number of known regulations. This
class imbalance leads to higher predictive accuracy (AUC) but lower precision-recall
scores (AUPR). On the other hand, regression-based GRN inference techniques have been
reported to perform well for smaller genomes, with GENIE3 (Huynh-Thu et al., 2010) being
a star performer in the DREAM5 network inference challenge. The higher AUPR obtained by
iRafNet on E. coli can be attributed to the way potential regulators are sampled using prior
information from the sampling weights (PPI), which decreases false positives and
increases precision and recall. For larger genomes (i.e., yeast in our case), however, the
performance of both approaches begins to fall, as reported by Mordelet & Vert (2008).
The present implementation of iRafNet does not provide the ability to run the random forest
algorithm in parallel. Therefore, using iRafNet for GRNI of larger genomes can incur a
huge computational cost by running thousands of decision trees in sequential mode.

Table 5 AUC and AUPR scores for E. coli and S. cerevisiae using iRafNet and MKL-GRNI.

| Dataset | iRafNet AUC | iRafNet AUPR | MKL-GRNI AUC | MKL-GRNI AUPR |
| --- | --- | --- | --- | --- |
| E. coli | 0.901 | 0.552 | 0.925 | 0.44 |
| S. cerevisiae | 0.833 | 0.39 | 0.89 | 0.42 |

Since our main motive in this study is to parallelize the inference algorithm for large-scale
GRNI, the higher speedup and quality obtained by running MKL-GRNI in parallel can be
traded off against the lower AUPR it obtains on E. coli compared with iRafNet, which runs
in sequential mode.
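
The effect of class imbalance on the two scores can be seen in a small worked example. The sketch below computes AUC as the fraction of positive-negative pairs that are ranked correctly, and estimates AUPR by average precision, which is one common summary of the precision-recall curve and not necessarily the estimator used for Table 5. The labels and scores are invented for illustration, and the functions assume at least one positive and one negative example.

```python
def roc_auc(y_true, scores):
    """AUC: probability that a random positive is scored above a random
    negative (ties count as one half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y != 1]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """AUPR estimated as the mean of the precision at each positive's rank."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Hypothetical ranking of 20 genes for one TF: 2 true targets, 18 non-targets.
# The targets are ranked 2nd and 4th, each just behind a non-target.
scores = [20 - i for i in range(20)]   # 20, 19, ..., 1
y_true = [-1] * 20
y_true[1] = 1
y_true[3] = 1

auc = roc_auc(y_true, scores)             # 33 of 36 pairs correct
aupr = average_precision(y_true, scores)  # (1/2 + 2/4) / 2
```

Here the AUC is about 0.92 while the average precision is only 0.5: two misplaced non-targets barely affect the AUC when negatives are abundant, but they halve the precision at the top of the ranking.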

DISCUSSION AND CONCLUSION
Here we present a scalable and parallel approach to GRN inference using MKL as the
integration and supervised learning framework. The algorithm has been implemented in
Python using the Python interface to MKL provided by the Shogun machine learning toolbox
(Sonnenburg et al., 2010). The ability of kernel methods to discover patterns and learn
from the fusion of multi-omics genomic data using MKL has already been demonstrated
in a number of inference studies. Our focus here is to explore scalability
for large-scale GRN inference in a supervised machine learning setting, besides assessing
the inference potential across different genomes.

The approach undertaken can be considered a parallel extension of SIRENE
(Mordelet & Vert, 2008). Although SIRENE performs better than other unsupervised and
information-theoretic inference methods, as reported by Mordelet & Vert (2008),
it lacks the ability to learn from heterogeneous genomic datasets that can
provide essential and complementary information for GRN inference. Another limitation
is the sequential execution of the TF-specific classification problems, which incurs a
huge cost in execution time as we move from E. coli to the larger and more complex
genomes of mice and humans. Therefore, to facilitate very large scale GRN
inference using a supervised learning approach, we decompose the
initial problem of learning a GRN into many subproblems, where each subproblem
aims to infer the GRN for a specific TF. Our algorithm distributes all such learning
problems to different processors on a multi-processor hardware platform and dispatches
them for simultaneous execution, thereby reducing the execution time of the inference
process substantially. The results from each execution are written to a shared queue object;
once all the child processes complete their execution, the queue object is iterated over to
build a single output matrix for genome-scale GRN inference. We also assess the inference
potential of our MKL-based parallel GRN inference approach by computing standard
evaluation metrics for machine learning based approaches. A survey of the scientific
literature on GRN inference methods shows that the results obtained by our approach
are comparable to those of other state-of-the-art methods in this domain and, in some
cases, better than those of inference methods that employ only gene expression data
(e.g., CLR, ARACNE, SIRENE, etc.). A drawback of our approach is that only TFs with known
targets can be used to train the inference model. Also, the performance of the algorithm
tends to decrease if the model is trained using TFs with few known targets; this leads to a
bias in favor of TFs with many known neighbors (i.e., hubs), and the model is less likely to
predict new associations for TFs with very few neighbors. Besides, we are not able to identify
new TFs among the newly learned interactions, nor can the model predict whether a given
gene is upregulated or downregulated by a particular TF.
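
The decomposition into TF-specific subproblems and the use of a shared queue can be sketched with Python's standard `multiprocessing` module. This is a simplified illustration under stated assumptions, not the MKL-GRNI implementation: `score_targets` is a hypothetical stand-in for the MKL classifier (it scores genes by kernel similarity to known targets), the kernel is a plain list of lists, and the parent process drains the queue before joining the children so that no child blocks while writing its results.

```python
import multiprocessing as mp


def score_targets(kernel, labels):
    """Stand-in for one TF-specific classifier (NOT the MKL solver used in
    the paper): mean kernel similarity to known targets minus mean
    similarity to the remaining genes."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y != 1]
    scores = []
    for row in kernel:
        s_pos = sum(row[i] for i in pos) / len(pos) if pos else 0.0
        s_neg = sum(row[i] for i in neg) / len(neg) if neg else 0.0
        scores.append(s_pos - s_neg)
    return scores


def worker(tf_indices, kernel, label_matrix, out_queue):
    """Solve the subproblems of the given TFs; put (tf_index, scores) on the queue."""
    for j in tf_indices:
        labels = [row[j] for row in label_matrix]
        out_queue.put((j, score_targets(kernel, labels)))


def infer_network(kernel, label_matrix, n_procs=2):
    """Distribute one subproblem per TF over n_procs processes and assemble
    a genes x TFs score matrix from the shared queue."""
    n_genes, n_tfs = len(label_matrix), len(label_matrix[0])
    out_queue = mp.Queue()
    chunks = [list(range(p, n_tfs, n_procs)) for p in range(n_procs)]
    procs = [mp.Process(target=worker,
                        args=(chunk, kernel, label_matrix, out_queue))
             for chunk in chunks if chunk]
    for proc in procs:
        proc.start()
    # Collect exactly one result per TF, then wait for the children to exit.
    results = [out_queue.get() for _ in range(n_tfs)]
    for proc in procs:
        proc.join()
    scores = [[0.0] * n_tfs for _ in range(n_genes)]
    for j, column in results:
        for i, value in enumerate(column):
            scores[i][j] = value
    return scores
```

On platforms that start child processes by spawning a fresh interpreter, `infer_network` should be called from code guarded by `if __name__ == "__main__":`.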

Therefore, additional work is needed to improve the efficiency of the parallel
algorithm and the inference potential of MKL-GRNI. In our current implementation,
we integrate only two datasets for GRNI, leaving scope to integrate more omics
sources for improved performance of the inference model. Also,
the MKL framework provides a mechanism to weigh the contribution of individual
datasets, which can be used to select informative datasets for integration. Further, we do
not identify TFs from the predicted target genes; this can be considered in a future
extension of this work. Besides, novel techniques for choosing negative examples for
training our parallel MKL-GRNI model can be incorporated to decrease the number of false
positives and improve the overall precision/recall scores for genomes of higher organisms.

ADDITIONAL INFORMATION AND DECLARATIONS

Funding
Nisar Wani is supported by Teacher Fellowship of University Grants Commission,
Ministry of Human Resources Development, Govt. of India vide letter No. F.B No.
27-(TF-45)/2015 under Faculty Development Programme. The funders had no role in
study design, data collection and analysis, decision to publish, or preparation of the
manuscript.

Grant Disclosures
The following grant information was disclosed by the authors:
University Grants Commission, Ministry of Human Resources Development, Govt. of
India: F.B No. 27-(TF-45)/2015.

Competing Interests
The authors declare that they have no competing interests.

Author Contributions
• Nisar Wani conceived and designed the experiments, performed the experiments,
analyzed the data, performed the computation work, prepared figures and/or tables,
authored or reviewed drafts of the paper, and approved the final draft.

• Khalid Raza conceived and designed the experiments, analyzed the data, prepared
figures and/or tables, authored or reviewed drafts of the paper, and approved the final
draft.

Data Availability
The following information was supplied regarding data availability:

The code is available at GitHub: https://github.com/waninisar/MKL-GRNI.

REFERENCES
Albert R. 2007. Network inference, analysis, and modeling in systems biology. Plant Cell
19(11):3327–3338 DOI 10.1105/tpc.107.054700.

Alioscha-Perez M, Oveneke MC, Sahli H. 2019. Svrg-mkl: a fast and scalable multiple kernel learning
solution for features combination in multi-class classification problems. IEEE Transactions on
Neural Networks and Learning Systems 31(5):1710–1723 DOI 10.1109/TNNLS.2019.2922123.

Ben-Hur A, Noble WS. 2005. Kernel methods for predicting protein-protein interactions.
Bioinformatics 21(Suppl. 1):i38–i46 DOI 10.1093/bioinformatics/bti1016.

Butte AJ, Kohane IS. 1999. Mutual information relevance networks: functional genomic clustering
using pairwise entropy measurements. In: Biocomputing 2000. Singapore: World Scientific,
418–429.

Chen Z-Y, Fan Z-P. 2014. Parallel multiple kernel learning: a hybrid alternating direction method
of multipliers. Knowledge and Information Systems 40(3):673–696
DOI 10.1007/s10115-013-0655-5.

Cichonska A, Pahikkala T, Szedmak S, Julkunen H, Airola A, Heinonen M, Aittokallio T,
Rousu J. 2018. Learning with multiple pairwise kernels for drug bioactivity prediction.
Bioinformatics 34(13):i509–i518 DOI 10.1093/bioinformatics/bty277.

ENCODE Project Consortium. 2004. The ENCODE (ENCyclopedia Of DNA Elements) Project.
Science 306(5696):636–640 DOI 10.1126/science.1105136.

Faith JJ, Hayete B, Thaden JT, Mogno I, Wierzbowski J, Cottarel G, Kasif S, Collins JJ, Gardner
TS. 2007. Large-scale mapping and validation of escherichia coli transcriptional regulation from
a compendium of expression profiles. PLOS Biology 5(1):e8 DOI 10.1371/journal.pbio.0050008.

Han H, Shim H, Shin D, Shim JE, Ko Y, Shin J, Kim H, Cho A, Kim E, Lee T, Kim H, Kim K,
Yang S, Bae D, Yun A, Kim S, Kim CY, Cho HJ, Kang B, Shin S, Lee I. 2015. TRRUST: a
reference database of human transcriptional regulatory interactions. Scientific Reports
5(1):11432 DOI 10.1038/srep11432.

Hecker M, Lambeck S, Toepfer S, Van Someren E, Guthke R. 2009. Gene regulatory network
inference: data integration in dynamic models: a review. Biosystems 96(1):86–103
DOI 10.1016/j.biosystems.2008.12.004.

Huynh-Thu VA, Irrthum A, Wehenkel L, Geurts P. 2010. Inferring regulatory networks from
expression data using tree-based methods. PLOS ONE 5(9):e12776
DOI 10.1371/journal.pone.0012776.

Jiang C, Xuan Z, Zhao F, Zhang MQ. 2007. Tred: a transcriptional regulatory element database,
new entries and other development. Nucleic Acids Research 35(Suppl. 1):D137–D140
DOI 10.1093/nar/gkl1041.

Kondor RI, Lafferty J. 2002. Diffusion kernels on graphs and other discrete structures. Proceedings
of the 19th International Conference on Machine Learning 2002:315–322.

Lanckriet GR, De Bie T, Cristianini N, Jordan MI, Noble WS. 2003. Kernel-based data fusion
and its application to protein function prediction in yeast. In: Biocomputing 2004. Singapore:
World Scientific, 300–311.

Lanckriet GR, De Bie T, Cristianini N, Jordan MI, Noble WS. 2004. A statistical framework for
genomic data fusion. Bioinformatics 20(16):2626–2635 DOI 10.1093/bioinformatics/bth294.

Lee TI, Rinaldi NJ, Robert F, Odom DT, Bar-Joseph Z, Gerber GK, Hannett NM, Harbison CT,
Thompson CM, Simon I, Zeitlinger J, Jennings EG, Murray HL, Gordon DB, Ren B, Wyrick
JJ, Tagne J-B, Volkert TL, Fraenkel E, Gifford DK, Young RA. 2002. Transcriptional
regulatory networks in saccharomyces cerevisiae. Science 298(5594):799–804
DOI 10.1126/science.1075090.

Marbach D, Costello JC, Küffner R, Vega NM, Prill RJ, Camacho DM, Allison KR, Consortium
D, Kellis M, Collins JJ, Stolovitzky G. 2012. Wisdom of crowds for robust gene network
inference. Nature Methods 9(8):796.

Margolin AA, Nemenman I, Basso K, Wiggins C, Stolovitzky G, Dalla Favera R, Califano A.
2006. Aracne: an algorithm for the reconstruction of gene regulatory networks in a mammalian
cellular context. BMC Bioinformatics 7(1):S7 DOI 10.1186/1471-2105-7-S1-S7.

Mordelet F, Vert J-P. 2008. SIRENE: supervised inference of regulatory networks. Bioinformatics
24(16):i76–i82 DOI 10.1093/bioinformatics/btn273.

Nascimento AC, Prudêncio RB, Costa IG. 2016. A multiple kernel learning algorithm for drug-
target interaction prediction. BMC Bioinformatics 17(1):46 DOI 10.1186/s12859-016-0890-3.

Pavlidis P, Weston J, Cai J, Noble WS. 2002. Learning gene functional classifications from
multiple data types. Journal of Computational Biology 9(2):401–411
DOI 10.1089/10665270252935539.

Petralia F, Wang P, Yang J, Tu Z. 2015. Integrative random forest for gene regulatory network
inference. Bioinformatics 31(12):i197–i205 DOI 10.1093/bioinformatics/btv268.

Raza K, Alam M. 2016. Recurrent neural network based hybrid model for reconstructing gene
regulatory network. Computational Biology and Chemistry 64:322–334
DOI 10.1016/j.compbiolchem.2016.08.002.

Remli MA, Mohamad MS, Deris S, Samah AA, Omatu S, Corchado JM. 2019. Cooperative
enhanced scatter search with opposition-based learning schemes for parameter estimation in
high dimensional kinetic models of biological systems. Expert Systems with Applications
116:131–146 DOI 10.1016/j.eswa.2018.09.020.

Scholkopf B, Smola AJ. 2001. Learning with kernels: support vector machines, regularization,
optimization, and beyond. Cambridge: MIT Press.

Seoane JA, Day IN, Gaunt TR, Campbell C. 2013. A pathway-based data integration framework
for prediction of disease progression. Bioinformatics 30(6):838–845
DOI 10.1093/bioinformatics/btt610.

Shawe-Taylor J, Cristianini N. 2004. Kernel methods for pattern analysis. Cambridge: Cambridge
University Press.

Sonnenburg S, Henschel S, Widmer C, Behr J, Zien A, Bona Fd, Binder A, Gehl C, Franc V.
2010. The shogun machine learning toolbox. Journal of Machine Learning Research
11:1799–1802.

Sonnenburg S, Rätsch G, Schäfer C, Schölkopf B. 2006. Large scale multiple kernel learning.
Journal of Machine Learning Research 7:1531–1565.

Speicher NK, Pfeifer N. 2015. Integrating different data types by regularized unsupervised
multiple kernel learning with application to cancer subtype discovery. Bioinformatics
31(12):i268–i275 DOI 10.1093/bioinformatics/btv244.

Szklarczyk D, Franceschini A, Kuhn M, Simonovic M, Roth A, Minguez P, Doerks T, Stark M,
Muller J, Bork P, Jensen LJ, Mering C. 2011. The STRING database in 2011: functional
interaction networks of proteins, globally integrated and scored. Nucleic Acids Research
39(Suppl. 1):D561–D568 DOI 10.1093/nar/gkq973.

Tomczak K, Czerwińska P, Wiznerowicz M. 2015. The cancer genome atlas (tcga): an
immeasurable source of knowledge. Contemporary Oncology 19(1A):A68.

Van Laarhoven T, Nabuurs SB, Marchiori E. 2011. Gaussian interaction profile kernels for
predicting drug-target interaction. Bioinformatics 27(21):3036–3043
DOI 10.1093/bioinformatics/btr500.

Wani N, Raza K. 2018. Multiple kernel-learning approach for medical image analysis. In:
Soft Computing Based Medical Image Analysis. Amsterdam: Elsevier, 31–47.

Wani N, Raza K. 2019a. Integrative approaches to reconstruct regulatory networks from
multi-omics data: a review of state-of-the-art methods. Computational Biology and Chemistry
83:107120 DOI 10.1016/j.compbiolchem.2019.107120.

Wani N, Raza K. 2019b. iMTF-GRN: integrative matrix tri-factorization for inference of gene
regulatory networks. IEEE Access 7:126154–126163 DOI 10.1109/ACCESS.2019.2936794.

Yamanishi Y, Vert J-P, Kanehisa M. 2004. Protein network inference from multiple genomic data:
a supervised approach. Bioinformatics 20(Suppl. 1):i363–i370
DOI 10.1093/bioinformatics/bth910.

Yan S, Xu D, Zhang B, Zhang H-J, Yang Q, Lin S. 2007. Graph embedding and extensions: a
general framework for dimensionality reduction. IEEE Transactions on Pattern Analysis and
Machine Intelligence 29(1):40–51 DOI 10.1109/TPAMI.2007.250598.

    /FRA <FEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002e>
    /ITA <FEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002e>
    /JPN <FEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002>
    /KOR <FEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002e>
    /NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers. De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 5.0 en hoger.)
    /NOR <FEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002e>
    /PTB <FEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002e>
    /SUO <FEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002e>
    /SVE <FEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002e>
    /ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers.  Created PDF documents can be opened with Acrobat and Adobe Reader 5.0 and later.)
  >>
  /Namespace [
    (Adobe)
    (Common)
    (1.0)
  ]
  /OtherNamespaces [
    <<
      /AsReaderSpreads false
      /CropImagesToFrames true
      /ErrorControl /WarnAndContinue
      /FlattenerIgnoreSpreadOverrides false
      /IncludeGuidesGrids false
      /IncludeNonPrinting false
      /IncludeSlug false
      /Namespace [
        (Adobe)
        (InDesign)
        (4.0)
      ]
      /OmitPlacedBitmaps false
      /OmitPlacedEPS false
      /OmitPlacedPDF false
      /SimulateOverprint /Legacy
    >>
    <<
      /AddBleedMarks false
      /AddColorBars false
      /AddCropMarks false
      /AddPageInfo false
      /AddRegMarks false
      /ConvertColors /NoConversion
      /DestinationProfileName ()
      /DestinationProfileSelector /NA
      /Downsample16BitImages true
      /FlattenerPreset <<
        /PresetSelector /MediumResolution
      >>
      /FormElements false
      /GenerateStructure true
      /IncludeBookmarks false
      /IncludeHyperlinks false
      /IncludeInteractive false
      /IncludeLayers false
      /IncludeProfiles true
      /MultimediaHandling /UseObjectSettings
      /Namespace [
        (Adobe)
        (CreativeSuite)
        (2.0)
      ]
      /PDFXOutputIntentProfileSelector /NA
      /PreserveEditing true
      /UntaggedCMYKHandling /LeaveUntagged
      /UntaggedRGBHandling /LeaveUntagged
      /UseDocumentBleed false
    >>
  ]
>> setdistillerparams
<<
  /HWResolution [2400 2400]
  /PageSize [612.000 792.000]
>> setpagedevice