A novel feature fusion based on the evolutionary features for protein fold recognition using support vector machines


A novel feature fusion based on the evolutionary features for protein fold recognition
using support vector machines

Mohammad Saleh Refahia, A. Mir1, Jalal A. Nasiri1,∗

aDepartment of Electrical Engineering, Amirkabir University of Technology,Tehran,Iran
bIranian Research Institute for Information Science and Technology (IranDoc), Tehran, Iran

Abstract

Protein fold recognition plays a crucial role in discovering three-dimensional structure of proteins and protein functions.
Several approaches have been employed for the prediction of protein folds. Some of these approaches are based on
extracting features from protein sequences and using a strong classifier. Feature extraction techniques generally utilize
syntactical-based information, evolutionary-based information and physiochemical-based information to extract features.
In recent years, Finding an efficient technique for integrating discriminate features have been received advancing atten-
tion. In this study, we integrate Auto-Cross-Covariance (ACC) and Separated dimer (SD) evolutionary feature extraction
methods. The results features are scored by Information gain (IG) to define and select several discriminated features.
According to three benchmark datasets, DD, RDD and EDD, the results of the support vector machine (SVM) show
more than 6% improvement in accuracy on these benchmark datasets.

Keywords: Protein Fold Recognition, Feature fusion, Evolutionary method, IG, Support Vector Machine

1. Introduction

Proteins are Jack of all trades biological macromolecules.
They are involved in almost every biological reaction; Pro-
tein plays a critical roll in many different areas such as
building muscle, hormone production, enzyme, immune5
function, and energy.

Typically more than 20,000 proteins exist in human
cells, to acquire knowledge about the protein function and
interactions, the prediction of protein structural classes is
extremely useful [1]. Fold recognition is one of the funda-10
mental methods in protein structure and function predic-
tion.

Protein can be demonstrated as a chain of amino acids.
Proteins with unique lengths and similarities are part of
the same fold. They also have identical protein secondary15
structure in the same topology. Certainly, they have a
regular origin [2].

One of the main steps which can be assumed as a
vital stage for predicting protein fold is feature extrac-
tion. Computational feature extraction methods are di-20
vided into syntactical, physiochemical and evolutionary
methods. Syntactical methods pay attention only to the
protein sequence, like composition and occurrence [3, 4].
Physiochemical methods consider some physical and chem-
ical properties of protein sequences. Evolutionary meth-25

∗Corresponding author
Email addresses: msaleh.refahi@aut.ac.ir (Mohammad

Saleh Refahi), mir-am@hotmail.com (A. Mir),
j.nasiri@irandoc.ac.ir (Jalal A. Nasiri)

ods extract features from Basic Local Alignment Search
Tool(BLAST).

When attempting to solve many biological problems it
is obvious that a single data source might not be informa-
tive, and combining several complementary biological data30
sources will lead to a more accurate result. When we study
methods of protein fold recognition, we found that less at-
tention has been paid to the fusion of features to get more
comprehensive features. In recent studies, researchers at-
tempted to find new feature extraction methods[[5, 6, 7, 8,35
9]] or train different classifiers to achieve high accuracy[[10,
11, 12, 13]], even though some problems like incomplete
data sources, false positive information, multiple aspect
problem,. . . encourage us to combine data sources.

Hence, to prepare more informative and discriminative40
features, we use Auto-Cross-Covariance(ACC)[8] and Sep-
arated dimer(SD)[7] methods. Because SD explores some
amino acid dimers that may be non-adjacent in sequence
[7] and ACC method measures the correlation between the
same and different properties of amino acids [8]. One of the45
main advantages of ACC and SD is to find a fixed length
vector from a variable protein length. The performance of
the proposed method is evaluated using three benchmark
datasets DD[3] , RDD[14] and EDD[8].

In this paper, we focus on fusing ACC and SD feature50
extraction methods based on Position Specific Scoring Ma-
trix(PSSM) generated by using the Position-Specific Iter-
ated BLAST(PSI-BLAST) profile to predict protein fold.

The 1600 ACC features and the 400 SD features are
extracted based on the PSSM. Finally, we construct a55
reduced-dimensional feature vector for the Support Vec-

Preprint submitted to Journal of Theoretical Biology August 27, 2019

.CC-BY-NC-ND 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprint (whichthis version posted November 23, 2019. ; https://doi.org/10.1101/845727doi: bioRxiv preprint 

https://doi.org/10.1101/845727
http://creativecommons.org/licenses/by-nc-nd/4.0/


tor Machine (SVM) classifier by using the Information
Gain(IG).

The remaining sections of the paper are organized as
follows. Section 2 describes the related works of the exist-60
ing techniques. The methodology is explained in Section
3. Section 4 shows the experimental results and discus-
sion. Finally, the conclusion and future works are given in
Section5.

2. Related work65

In 1997, Dubchak et al. studied syntactical and phys-
iochemical method [15]. In which they assumed five prop-
erties of amino acid like hydrophobicity (H), frequency of
α helix (X), polarity (P), polarizability (Z) and van der
Waals volume (V). In [16] Forward Consecutive Search70
(FCS) scheme which trains physiochemical attributes for
protein fold recognition.

Another solution to find similarity between protein se-
quences is based on the BLAST. Many feature extraction
methods use BLAST alignment to extract the possibil-75
ity of amino acid in specific positions called as PSSM. In
2009, pairwise frequencies of amino acids separated by
one residue (PF1) and pairwise frequencies of adjacent
amino acid residues (PF2) were proposed by Ghatny and
Pal [5]. The bigram feature extraction method was in-80
troduced by Sharma et al. [6], bigram feature vector is
computed by counting the bigram frequencies of occur-
rence from PSSM. In 2011, combination of PSSM with
Auto Covariance (AC) transformation, was introduced as
feature extraction method [17].85

Another method introduced by Saini et al. [7] is sep-
arated dimers(SD); they used probabilistic expressions of
amino acid dimer occurrence that have varying degrees of
spatial separation in the protein sequence to predict pro-
tein fold. Dong et al. [8] proposed autocross-covariance90
(ACC) transformation for protein fold recognition. More-
over, Pailwal et al. [9] proposed the ability of trigram
to extract features from the neighborhood information of
amino acid.

In addition to the feature extraction methods, some95
researchers have paid attention to classification methods
for protein fold recognition. In [10] Kohonens selforgani-
zation neural network is used and showed the structural
class of protein is considerably correlated with its amino
acid composition features.100

Baldi et al.[18] employed Recurrent and Recursive Ar-
tificial Neural Networks (RNNs) and mixed it by directed
acyclic graphs (DAGs) to predict protein structure.

In [12], classwise optimized feature sets are used. SVM
classifiers are coupled with probability estimates to make105
the final prediction. Linear discriminant analysis(LDA)
was employed to evaluate the contribution of sequence
parameters in determining the protein structural class.
Parameters were used as inputs of the artificial neural
networks[19]. The composition entropy was proposed to110

represent apoptosis protein sequences, and an ensemble
classifier FKNN (fuzzy K-nearest neighbor) was used as a
predictor[13].

TAXFOLD [20]method extract sequence evolution fea-
tures from PSI-BLAST profiles and the secondary struc-115
ture features from PSIPRED profiles, and finally a set of
137 features is constructed to predict protein folds.

Sequence-Based Prediction of ProteinPeptide(SPRINT)
method is used to the prediction of Proteinpeptide Residue-
level Interactions by SVM [11]. SVM implements the struc-120
tural risk minimization (SRM) that minimizes the upper
bound of generation error [21, 22].

The DeepCov method proposed in [23], this method
uses convolutional neural networks to work on amino acid
pair frequency or covariance data extract from sequence125
alignments.

In [24] is attempted to show Artificial Neural Network
(ANN) with different feature extraction method is more
accurate than other classifier methods.

3. Methodology130

This section illustrates the step-by-step of the pro-
posed method for protein fold recognition. In the first
step, sequence alignments are found for each protein using
BLAST. To show improvements in protein fold recogni-
tion using evolutionary information that are presented in135
PSSM, therefore ACC [8]and SD [7] features are extracted
from PSSM. In the next step, the features are combined
and selected by the IG. In the last step, the SVM algo-
rithm is trained to classify proteins. A comprehensive view
of this approach can be found in Figure1.140

3.1. Preprocessing

3.1.1. BLAST

Similarity is used here to mention the resemblance or
percentage of identity between two protein sequences [25].
The similarity search depends on the bioinformatics al-145
gorithm. Basic Local Alignment Search Tool(BLAST) is
a tool that helps researchers to compare a query sequence
with a database of sequences and identify specific sequences
that resemble the query sequence above a certain thresh-
old. BLAST is a local alignment algorithm that means150
to find the region (or regions) of the highest similarity
between two sequences and build the alignment outward
from there [26].

3.1.2. PSSM

Position Specific Scoring Matrix(PSSM) is used to ex-155
press motif in a protein sequence. P-BLAST searches in
which amino acid substitution scores are given separately
for each position in a protein multiple sequence alignment.
In this paper, PSSM is used to extract features by ACC
and SD methods.160

2

.CC-BY-NC-ND 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprint (whichthis version posted November 23, 2019. ; https://doi.org/10.1101/845727doi: bioRxiv preprint 

https://doi.org/10.1101/845727
http://creativecommons.org/licenses/by-nc-nd/4.0/


Figure 1: The flowchart of the proposed method pipeline. Sequence alignments are found for each protein by BLAST. PSSM is
calculated to extract the ACC and the SD.The features are selected by the IG. The SVM algorithm is trained to classify proteins.

3.2. Feature Extraction

3.2.1. ACC

ACC fold [8] utilizes autocross-covariance transforma-
tion that convert the PSSMs of different lengths into fixed-
length vectors. The ACC separates two kinds of features:165
AC between the same properties, cross-covariance (CC)
between two different properties. The AC variable mea-
sures the correlation of the same property between two
properties separated by LG, distance along the sequence:

AC(i,LG) =

L−LG∑
j=1

(Pi,j−Pi)(Pi,j+LG−Pi)\(L−LG) (1)

Where Pi,j is the PSSM score of amino acid i at position170

j,and Pi =
∑L
j=1 Pi,j \ L, the average score of an amino

acid i in the total protein sequence. The number of fea-
tures which are calculated from AC is 20 ∗ LG. The CC
measures the correlation of two different properties be-
tween the distances of LG along the sequence:175

CC(i1, i2,LG) =

L−LG∑
j=1

(Pi1,j−Pi1)(Pi2,j+LG−Pi2)\(L−LG)

(2)

The CC variables are not symmetric. The total number of
CC variables is 380 ∗LG.The combination of AC and CC
features make 400 ∗LG feature vectors.

3.2.2. SD

Separated Dimer(SD) method was introduced by Saini
et al.[7]. It attempts to extract features from amino acids
that may or may not be adjacent in the protein sequence.
The SD demonstrates the probabilities of the occurrence
of amino acid. SD generates 400 features.

F(k) = [F1,1(k),F1,2(k), ....,F20,19(k),F20,20(k)] (3)

F(k) is computed as the feature sets for probabilistic oc-180
currence of amino acid dimers with different values of k
which is a special distance between dimers. It is obvious
if P is in the PSSM matrix for a protein sequences, it is
L×20 matrix where L is the length of the protein sequence:

Fm,n(k) =

L−k∑
i=0

Pi,mPi+k,n (4)

in which m,n (1 ≤ m,n ≤ 20) is the score of two185
selective amino acid in PSSM.

3

.CC-BY-NC-ND 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprint (whichthis version posted November 23, 2019. ; https://doi.org/10.1101/845727doi: bioRxiv preprint 

https://doi.org/10.1101/845727
http://creativecommons.org/licenses/by-nc-nd/4.0/


Figure 2: The AC features of the ACC, measures the corre-
lation of the same property between two properties separated
by a distance of LG along the sequence

3.3. Fusion hypothesis

More attention needs to be paid to find an efficient
technique for integrate distinct data sources for the pro-
tein fold recognition problem[27]. Various techniques have190
been employed based on the features which are extracted
from protein sequences.

These techniques investigate different aspects of a se-
quence like the study of possible position of amino acids,
protein chemical characteristics and syntactical features195
. . . . Hence, integrating them can model the folding prob-
lem more accurate.

In this study three hypotheses have been considered for
fusion data sources. The first, only evolutionary features
are used since integrating different types of features may200
have an undesirable effect on each other.

The next assumption is considered choosing the ACC
and SD methods. When we studied the recent paper, we
observed the recall and precision of some protein folds
which were almost equal to high value or one was a comple-205
ment of the other. Hence, when the recall of the ACC(SD)
is low, then the recall of the SD(ACC) is high, and also
for the precision, we observe this behavior, in almost every
fold.

The last hypothesis is that the ACC and the SD fea-210
tures showed a relationship between amino acids which
may or may not be adjacent. In this approach, three dif-
ferent characters are defined which show each amino acid
in a specific position what relation has with others. These
characters are shown in Figures234.215

3.4. Information Gain

Feature selection is a common stage in classification
problems. It can improve the prediction accuracy of clas-
sifiers by identifying relevant features. Moreover, feature
selection often reduces the training time of a classifier by220
reducing the number of features which are going to be an-
alyzed.

Figure 3: The CC features of the ACC, measures the corre-
lation of two different properties between the distances of LG
along the sequence

Figure 4: The SD consist of aminoacid dimers with proba-
bilistic expressions that have k separation.

Information gain (IG) is a popular feature selection
method. It ranks features by considering their presence
and absence in each class [28]. The IG method gives a225
high score to the features that occur frequently in a class
and rarely in other classes.

Given T the set of training samples, xi the vector of ith
variables in this set, |Txi=v|/|Tx| the fraction of samples
of the ith variable having value v. The IG method can be230
computed as follows:

IG(Tx,xi) = H(Tx) −

|Txi=v|
|Tx|∑

v=values(xi)

H(Txi=v)

H(T) = −p+(T)log2p+(T) −p−(T)log2p−(T)

(5)

where p± denotes the probability of a sample in the set T
to be of the positive or negative class.

3.5. Support Vector Machine

Support Vector Machine (SVM) was proposed by Vap-
nik and Cortes in 1995 [29]. It is a powerful tool for bi-

4

.CC-BY-NC-ND 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprint (whichthis version posted November 23, 2019. ; https://doi.org/10.1101/845727doi: bioRxiv preprint 

https://doi.org/10.1101/845727
http://creativecommons.org/licenses/by-nc-nd/4.0/


nary classification. SVM is on the basis of Structural Risk
Minimization (SRM) and Vapnik-Chervonenkis (VC) di-
mension. The central idea of SVM is to find the optimal
separating hyperplane with the largest margin between the
classes. Due to the SRM principle, SVM has great gener-
alization ability. Moreover, the parameters of the optimal
separating hyperplane can be obtained by solving a convex
quadratic programming problem (QPP), which is defined
as follows:

min
w

1

2
‖w‖2 + C

n∑
i=1

ξi

s.t. yi(w
Txi + b) + ξi ≥ 1,∀i

(6)

where ξ is the slack variable associated with xi sample235
and C is a penalty parameter. Note that the optimiza-
tion problem can be solved when the classification task
is linearly separable. In the case of nonlinear problems,
the input data is transformed into a higher-dimensional
feature space in order to make data linearly separable. It240
makes possible to find a nonlinear decision boundary with-
out computing the parameters of the optimal hyperplane
in a high dimensional feature space [30].

As mentioned in this subsection, SVM is designed to
solve binary classification problems. However, there are245
multi-class approaches such as One-vs-One (OVO) and
One-vs-All (OVA) [31], which can be used for solving multi-
class classification problems. In this paper, we used OVO
strategy.

4. Experimental Result250

4.1. Dataset

Three popular datasets which were employed in this
study are DD dataset [3], EDD dataset [8], and RDD
dataset[14]. DD dataset contains 27 folds which repre-
sent four major structure classes:α,β,α

β
,α + β. The train-255

ing set and the testing set contains 311 training sequences
and 383 testing sequences whose sequence similarity is less
than 35%[3]. The EDD dataset consists of 3418 proteins
with less than 40% sequential similarity belonging to the
27 folds that originally are adopted from the DD dataset.260
The RDD dataset consists of 311 protein sequences in the
training and 380 protein sequences in testing datasets with
a similarity lower than 37% [14].

4.2. Result

The experiments were performed on the benchmark265
datasets to evaluate the performance of the classification
due to our fusion method. we also adopted the 10-fold
cross-validation in this study, which has done by many
researchers to examine predictive potency.

In this study LibSVM [32] with RBF (Radial Basis270
Function) as the kernel functions has been used. The C
parameter was optimized by search between {2−14, 2−13,
. . . , 213, 214} and also Γ parameter of RBF was considered

between {2−14, 2−13, . . . , 213, 214}. The SVM was origi-
nally designed for binary data classification.This study275
used one-versus-one method to approach a multi-class clas-
sifier.

The details of the feature extraction method are ex-
plained in methodology, but it is important to know how
far is assumed between aminoacids, for each ACC and SD280
methods. In developing the algorithm to extract features
from PSSM ,LG and k parameters have been assumed like
ACC and SD papers values[7, 8]. We considered both LG
and k equals to 4.

The IG [28] makes our method safe from noisy fea-285
tures. In this approach, we considered the features ranked
between [ 1

2
maxIG, maxIG] for each dataset. The results

of IG for each dataset are exhibited in Table2.

4.3. Discussion

Table1 illustrates the total prediction accuracies of the290
existing approaches for classification of protein folds in the
DD,RDD and EDD datasets. Table1 also shows the suc-
cess rates of our proposed fusion approach. According to
Table1, classification results of the combined ACC and SD
followed by selection of best features by IG show consid-295
erable improvement compared to the state of art.

Figure9 has been shown to figure out the result dis-
tribution of feature selection method. Even though the
number of ACC in the three datasets is more but all of
the SD features exist in selected features. However, we300
studied and compared SD and ACC methods separately,
we found that the fusion of them can make more informa-
tive data which cover all characteristics of folds.

It is evident in Figure6, Figure7, and also Figure8, only
”FAD-BINDING MOTIF” protein fold is not well recog-305
nized, and these confusion matrices show the power of
proposed method for predicting the other folds in these
datasets.

The Figure5 has been shown to evaluate the IG. It is
obvious that maximum accuracy of classification for each310
dataset has been achieved when we consider ranking fea-
tures higher than 1

2
maxIG for these datasets.

Sensitivity measures the ratio of correctly classified
samples to the whole number of test samples for each class
which is classified as correct samples and calculated as fol-315
lows:

Sensitivity =
TP

TP + FN
× 100 (7)

TP represents true positive and FN represents false nega-
tive samples. Precision represents, how relevant the num-
ber of TP is to the whole number of positive prediction
and is calculated as follows:

Precision =
TP

TP + FP
× 100 (8)

FP denotes false positive. F1 Score is the weighted average
of Precision and Recall. F1 score, as other evaluation cri-
teria which are used in this study measures, is calculated

5

.CC-BY-NC-ND 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprint (whichthis version posted November 23, 2019. ; https://doi.org/10.1101/845727doi: bioRxiv preprint 

https://doi.org/10.1101/845727
http://creativecommons.org/licenses/by-nc-nd/4.0/


Figure 5: Comparison of number of features and accuracy for DD,RDD and EDD datasets to evaluate the IG method

Figure 9: Comparison of the ACC and the SD in DD,RDD
and EDD datasets.

as follows:

F1score =
2TP

2TP + FP + FN
× 100 (9)

The sensitivity, precision, and F1 score are computed
for each class and then averaged over all the classes which
are calculated and published in Table2.

5. Conclusion320

This study aims to improve protein fold recognition
accuracy by fusing information that are extracted from
the PSSM matrix. In this approach, we used ACC and
SD feature extraction methods. It was observed that the
proposed technique eventuates to 6% improvement for the325
accuracy of these three benchmark datasets.

In the future, classification can be done by combining
more syntactical,physiochemical or evolutionary features.

Table 1: Comparison of the proposed method with the exist-
ing predictor and Meta-predictors for the DD, RDD and EDD.

Methods Reference DD RDD EDD
occurrence [4] 42 56.6 70.0

ACC [8] 68.0 73.8 85.9
PF1 [5] 50.6 53.3 63.0
PF2 [5] 48.2 NA 49.9

TAXFOLD [20] 71.5 83.2 NA
Bigram [6] 79.3 59.6 79.9

SD [7] 86.3 72.1 90.0
Trigram [7] 73.4 60.0 80.0
MF-SRC [33] 78.6 NA 86.2

Enhanced-SD [24] 90.0 75.4 93.0*

Proposed Method - 91.31 91.64 91.2
* The evaluation method not defined in [24] approach.

Table 2: F1 score, Sensitivity and Precision, Measurement
tools to evaluate the proposed method.

Data set F1 score Sensitivity Precision Number of features

DD 0.98 0.92 0.93 1300
RDD 0.98 0.92 0.93 1416
EDD 0.96 0.91 0.93 900

To achieve more accuracy, future studies should be concen-
trate on ”FAD-BINDING MOTIF” protein fold that has330
less discriminative features in the SD and the ACC. Boost-
ing classifier may be employed to find better solutions for
protein fold recognition.

6

.CC-BY-NC-ND 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprint (whichthis version posted November 23, 2019. ; https://doi.org/10.1101/845727doi: bioRxiv preprint 

https://doi.org/10.1101/845727
http://creativecommons.org/licenses/by-nc-nd/4.0/


Figure 6: Confusion matrix of DD dataset(91.31%)

7

.CC-BY-NC-ND 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprint (whichthis version posted November 23, 2019. ; https://doi.org/10.1101/845727doi: bioRxiv preprint 

https://doi.org/10.1101/845727
http://creativecommons.org/licenses/by-nc-nd/4.0/


Figure 7: Confusion matrix of RDD dataset(91.64%)

8

.CC-BY-NC-ND 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprint (whichthis version posted November 23, 2019. ; https://doi.org/10.1101/845727doi: bioRxiv preprint 

https://doi.org/10.1101/845727
http://creativecommons.org/licenses/by-nc-nd/4.0/


Figure 8: Confusion matrix of EDD dataset(91.2%)

9

.CC-BY-NC-ND 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprint (whichthis version posted November 23, 2019. ; https://doi.org/10.1101/845727doi: bioRxiv preprint 

https://doi.org/10.1101/845727
http://creativecommons.org/licenses/by-nc-nd/4.0/


References

[1] J.-Y. Yang, Z.-L. Peng, X. Chen, Prediction of protein struc-335
tural classes for low-homology sequences based on predicted sec-
ondary structure, BMC bioinformatics 11 (1) (2010) S9.

[2] T. Yang, V. Kecman, L. Cao, C. Zhang, J. Z. Huang, Margin-
based ensemble classifier for protein fold recognition, Expert
Systems with Applications 38 (10) (2011) 12348–12355.340

[3] C. H. Ding, I. Dubchak, Multi-class protein fold recognition
using support vector machines and neural networks, Bioinfor-
matics 17 (4) (2001) 349–358.

[4] Y. Taguchi, M. M. Gromiha, Application of amino acid occur-
rence for discriminating different folding types of globular pro-345
teins, BMC bioinformatics 8 (1) (2007) 404.

[5] P. Ghanty, N. R. Pal, Prediction of protein folds: extraction
of new features, dimensionality reduction, and fusion of hetero-
geneous classifiers, IEEE transactions on nanobioscience 8 (1)
(2009) 100–110.350

[6] A. Sharma, J. Lyons, A. Dehzangi, K. K. Paliwal, A feature
extraction technique using bi-gram probabilities of position spe-
cific scoring matrix for protein fold recognition, Journal of the-
oretical biology 320 (2013) 41–46.

[7] H. Saini, G. Raicar, A. Sharma, S. Lal, A. Dehzangi, J. Lyons,355
K. K. Paliwal, S. Imoto, S. Miyano, Probabilistic expression of
spatially varied amino acid dimers into general form of chou’s
pseudo amino acid composition for protein fold recognition,
Journal of theoretical biology 380 (2015) 291–298.

[8] Q. Dong, S. Zhou, J. Guan, A new taxonomy-based protein360
fold recognition approach based on autocross-covariance trans-
formation, Bioinformatics 25 (20) (2009) 2655–2662.

[9] K. K. Paliwal, A. Sharma, J. Lyons, A. Dehzangi, A tri-gram
based feature extraction technique using linear probabilities of
position specific scoring matrix for protein fold recognition,365
IEEE transactions on nanobioscience 13 (1) (2014) 44–50.

[10] Y.-D. Cai, X.-J. Liu, X.-b. Xu, K.-C. Chou, Prediction of pro-
tein structural classes by support vector machines, Computers
& chemistry 26 (3) (2002) 293–296.

[11] G. Taherzadeh, Y. Yang, T. Zhang, A. W.-C. Liew, Y. Zhou,370
Sequence-based prediction of protein–peptide binding sites us-
ing support vector machine, Journal of computational chemistry
37 (13) (2016) 1223–1229.

[12] A. Anand, G. Pugalenthi, P. Suganthan, Predicting protein
structural class by svm with class-wise optimized features and375
decision probabilities, Journal of theoretical biology 253 (2)
(2008) 375–380.

[13] Y.-S. Ding, T.-L. Zhang, Using chous pseudo amino acid com-
position to predict subcellular localization of apoptosis proteins:
an approach with immune genetic algorithm-based ensemble380
classifier, Pattern Recognition Letters 29 (13) (2008) 1887–1892.

[14] J. Xia, Z. Peng, D. Qi, H. Mu, J. Yang, An ensemble approach
to protein fold classification by integration of template-based as-
signment and support vector machine classifier, Bioinformatics
33 (6) (2016) 863–870.385

[15] I. Dubchak, I. B. Muchnik, S.-H. Kim, Protein folding class
predictor for scop: approach based on global descriptors., in:
Ismb, 1997, pp. 104–107.

[16] G. Raicar, H. Saini, A. Dehzangi, S. Lal, A. Sharma, Improving
protein fold recognition and structural class prediction accura-390
cies using physicochemical properties of amino acids, Journal of
theoretical biology 402 (2016) 117–128.

[17] T. Liu, X. Geng, X. Zheng, R. Li, J. Wang, Accurate prediction
of protein structural class using auto covariance transformation
of psi-blast profiles, Amino acids 42 (6) (2012) 2243–2249.395

[18] P. Baldi, G. Pollastri, The principled design of large-scale re-
cursive neural network architectures–dag-rnns and the protein
structure prediction problem, Journal of Machine Learning Re-
search 4 (Sep) (2003) 575–602.

[19] S. Jahandideh, P. Abdolmaleki, M. Jahandideh, E. B. Asad-400
abadi, Novel two-stage hybrid neural discriminant model for
predicting proteins structural classes, Biophysical chemistry
128 (1) (2007) 87–93.

[20] J.-Y. Yang, X. Chen, Improving taxonomy-based protein fold
recognition by using global and local features, Proteins: Struc-405
ture, Function, and Bioinformatics 79 (7) (2011) 2053–2064.

[21] M. S. Refahi, J. A. Nasiri, S. Ahadi, Ecg arrhythmia classi-
fication using least squares twin support vector machines, in:
Electrical Engineering (ICEE), Iranian Conference on, IEEE,
2018, pp. 1619–1623.410

[22] M. Rahmanimanesh, J. A. Nasiri, S. Jalili, N. M. Charkari,
Adaptive three-phase support vector data description, Pattern
Analysis and Applications 22 (2) (2019) 491–504.

[23] D. T. Jones, S. M. Kandathil, High precision in protein contact
prediction using fully convolutional neural networks and mini-415
mal sequence features, Bioinformatics 34 (19) (2018) 3308–3315.

[24] P. Sudha, D. Ramyachitra, P. Manikandan, Enhanced artificial
neural network for protein fold recognition and structural class
prediction, Gene Reports 12 (2018) 261–275.

[25] Blast and multiple sequence alignment (msa) programs,420
https://viralzone.expasy.org/e_learning/alignments/

description.html, accessed: 2019-01-17.
[26] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, D. J. Lipman,

Basic local alignment search tool, Journal of molecular biology
215 (3) (1990) 403–410.425

[27] P. Zakeri, J. Simm, A. Arany, S. ElShal, Y. Moreau, Gene
prioritization using bayesian matrix factorization with genomic
and phenotypic side information, Bioinformatics 34 (13) (2018)
i447–i456.

[28] Q. Zou, J. Zeng, L. Cao, R. Ji, A novel features ranking met-430
ric with application to scalable visual and bioinformatics data
classification, Neurocomputing 173 (2016) 346–354.

[29] C. Cortes, V. Vapnik, Support-vector networks, Machine learn-
ing 20 (3) (1995) 273–297.

[30] B. Schölkopf, A. J. Smola, F. Bach, et al., Learning with ker-435
nels: support vector machines, regularization, optimization, and
beyond, MIT press, 2002.

[31] C.-W. Hsu, C.-J. Lin, A comparison of methods for multiclass
support vector machines, IEEE transactions on Neural Net-
works 13 (2) (2002) 415–425.440

[32] C.-C. Chang, C.-J. Lin, Libsvm: A library for support vector
machines, ACM transactions on intelligent systems and tech-
nology (TIST) 2 (3) (2011) 27.

[33] K. Yan, Y. Xu, X. Fang, C. Zheng, B. Liu, Protein fold recog-
nition based on sparse representation based classification, Arti-445
ficial intelligence in medicine 79 (2017) 1–8.

10

.CC-BY-NC-ND 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprint (whichthis version posted November 23, 2019. ; https://doi.org/10.1101/845727doi: bioRxiv preprint 

https://viralzone.expasy.org/e_learning/alignments/description.html
https://viralzone.expasy.org/e_learning/alignments/description.html
https://viralzone.expasy.org/e_learning/alignments/description.html
https://doi.org/10.1101/845727
http://creativecommons.org/licenses/by-nc-nd/4.0/

	Introduction
	Related work
	Methodology
	Preprocessing
	BLAST
	PSSM

	Feature Extraction
	ACC
	SD

	Fusion hypothesis
	Information Gain
	Support Vector Machine

	Experimental Result
	Dataset
	Result
	Discussion

	Conclusion