key: cord-265087-g4k6pc82 authors: Munteanu, Cristian Robert; González-Díaz, Humberto; Borges, Fernanda; de Magalhães, Alexandre Lopes title: Natural/random protein classification models based on star network topological indices date: 2008-10-21 journal: Journal of Theoretical Biology DOI: 10.1016/j.jtbi.2008.07.018 sha: doc_id: 265087 cord_uid: g4k6pc82 Abstract The development of complex network graphs permits us to describe any real system, such as social, neural, computer or genetic networks, by transforming real properties into topological indices (TIs). This work uses Randic's star networks to convert protein primary structure data into specific topological indices that are used to construct a natural/random protein classification model. The set of natural proteins contains 1046 protein chains selected from the pre-compiled CulledPDB list from PISCES Dunbrack's Web Lab. This set is characterized by a protein homology of 20%, a structure resolution of 1.6 Å and an R-factor lower than 25%. The set of random amino acid chains contains 1046 sequences generated by a Python script using the same residue types and the average chain length found in the natural set. A new Sequence to Star Networks (S2SNet) wxPython GUI application (with a Graphviz graphics back-end) was designed by our group in order to transform any character sequence into the following star network topological indices: Shannon entropy of Markov matrices, trace of connectivity matrices, Harary number, Wiener index, Gutman index, Schultz index, Moreau–Broto indices, Balaban distance connectivity index, Kier–Hall connectivity indices and Randic connectivity index. The model was constructed with the General Discriminant Analysis methods from the STATISTICA package and gave an overall accuracy of 90.77% over the training and prediction sets for the forward stepwise model type.
In conclusion, this study extends for the first time the classical TIs to protein star network TIs by proposing a model that can predict whether a protein or protein fragment is natural or random using only the amino acid sequence data. This classification can be used in studies of protein function by replacing some fragments with random amino acid sequences, or to detect fake amino acid sequences or errors in proteins. These results promote the use of the S2SNet application not only for protein structure analysis but also for mass spectroscopy, clinical proteomics and imaging, or DNA/RNA structure analysis. One of the widely used methods for predicting protein properties is the quantitative structure activity relationship (QSAR) (Devillers and Balaban, 1999). Graph theory can be used to obtain macromolecular descriptors named topological indices (TIs). The branch of mathematical chemistry dedicated to encoding DNA/protein information in graph representations by the use of TIs has become an intense research area, with interesting works by Liao (Liao and Wang, 2004a, b; Liao and Ding, 2005; Liao et al., 2006), Randic, Nandy, Balaban, Basak, and Vracko (Randic, 2000; Randic et al., 2000; Randic and Basak, 2001; Randic and Balaban, 2003), the Bielinska-Waz team (Bielinska-Waz et al., 2007) and our group (Perez et al., 2004; Aguero-Chapin et al., 2006).
Using graphic approaches to study biological systems can provide useful insights, as indicated by many previous studies on a series of important biological topics, such as enzyme-catalyzed reactions (Andraos, 2008; Chou, 1989; Forsen, 1980, 1981; Chou and Liu, 1981; Chou et al., 1979; King and Altman, 1956; Kuzmic et al., 1992; Myers and Palmer, 1985; Zhou and Deng, 1984), protein folding kinetics (Chou, 1990), inhibition kinetics of processive nucleic acid polymerases and nucleases (Althaus et al., 1993a, b, c, 1994a, b, 1996; Chou et al., 1994), analysis of codon usage (Chou and Zhang, 1992; Chou, 1993, 1994), base frequencies in the anti-sense strands, and analysis of DNA sequence (Qi et al., 2007). Moreover, graphical methods have been introduced for QSAR study (Prado-Prado et al., 2008) as well as utilized to deal with complicated network systems (Diao et al., 2007; Gonzalez-Diaz et al., 2007a). Recently, the "cellular automaton image" (Wolfram, 1984, 2002) has also been applied to study hepatitis B viral infections (Xiao et al., 2006a), HBV virus gene missense mutation (Xiao et al., 2005b), and visual analysis of SARS-CoV (Gao et al., 2006; Wang et al., 2005), as well as to represent complicated biological sequences (Xiao et al., 2005a) and to help identify protein attributes (Xiao and Chou, 2007; Xiao et al., 2006b). The present work presents for the first time a natural/random protein classification using only the chain sequence and amino acid connectivity protein structural data. The data are transformed into sequence and connectivity Star Graph TIs, which are then used as input for a statistical linear method in the construction of a simple classification model.
Two sets of proteins are compared in the new classification model: a set (Nat) of 1046 natural protein chains as defined in the pre-compiled CulledPDB list from PISCES Dunbrack's Web Lab (Wang and Dunbrack, 2003) and a second set (Rnd) of the same size formed by random amino acid sequences generated with Python scripts (Rossum, 2006). The natural set is characterized by a homology of 20%, a structure resolution of 1.6 Å and an R-factor lower than 25%. The random set is composed of the same standard amino acid types, and the average length of its chains is the same as that of the natural set. Python scripts are used to download PDB files from the PDB data bank (Berman et al., 2000) and to create the corresponding DSSP files with the DSSP application (Kabsch and Sander, 1983). The chain sequences were extracted with a Python script from these DSSP files and were filtered with our Prot-2S Web Tool (http://www.requimte.pt:8080/Prot-2S/) by removing the chains that contain nonstandard amino acids (usually labelled X). Each protein can be considered as a real network where the amino acids are the vertices (nodes), connected in a specific sequence by the peptide bonds. The graph is the abstract representation of the network and is a collection of N vertices and the connections between them. The star graph is a special case of a tree with N vertices in which one vertex has N-1 degrees of freedom and the remaining N-1 vertices have a single degree of freedom (Harary, 1969). In addition, as a general property, there is a unique path between any pair of vertices. For proteins, each of the 20 possible branches ("rays") of the star contains the same amino acid type and the star centre is a non-amino acid vertex. The same protein can be represented by different forms which are associated with distinct distance matrices (Randic et al., 2007).
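The construction of the random set and the removal of chains with nonstandard residues can be illustrated with a minimal Python sketch. It is not the authors' script: the function names, the uniform sampling of residues and the choice to give every random chain exactly the rounded mean length of the natural set are assumptions made only for illustration, and the three input sequences are toy data.

```python
import random

# The 20 standard amino acids in one-letter code.
STANDARD_AA = "ACDEFGHIKLMNPQRSTVWY"

def is_standard(seq):
    """True if the chain is non-empty and contains only standard residues."""
    return len(seq) > 0 and all(res in STANDARD_AA for res in seq)

def random_chains(natural, seed=0):
    """Return one random chain per natural chain (assumption: uniform residue
    sampling, every chain as long as the rounded mean natural length)."""
    rng = random.Random(seed)
    mean_len = round(sum(len(s) for s in natural) / len(natural))
    return ["".join(rng.choice(STANDARD_AA) for _ in range(mean_len))
            for _ in natural]

# Toy input: the third chain contains the nonstandard label X and is dropped.
natural = [s for s in ["ACADCEFDGH", "MKTAYIAKQR", "GSXHMLE"] if is_standard(s)]
rnd = random_chains(natural)
```

Fixing the seed keeps the random set reproducible, which matters when the same Nat/Rnd sets are reused across several model types.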
If the vertices do not carry a label, the sequence information will be lost; for that reason, the best method is to construct a standard star graph where each amino acid/vertex holds its position in the original sequence and the branches are labelled in alphabetical order of the three-letter amino acid code (Randic et al., 2007). In the present study we use the alphabetical order of the one-letter amino acid code. The standard star graph for a random virtual decapeptide (ACADCEFDGH) is illustrated in Fig. 1. If the initial connectivity in the protein chain is included, the graph is embedded (Fig. 2). In order to compare the graphs, it is necessary to transform the graphical representation into a connectivity matrix, a distance matrix and a degree matrix. In the case of the embedded graph, the matrices of the connectivity in the sequence and in the star graph are combined. These matrices and the normalized ones are the basis for the TIs calculation. The protein chain sequences are transformed into Star Graph representations and then characterized by several TIs using our new Sequence to Star Networks (S2SNet) application. S2SNet is a wxPython (Noel Rappin, 2006) GUI application with Graphviz (Koutsofios, 1993) as a graphics back-end. The user of this interactive tool is able to choose the level of calculations, such as: embedded graph, additional weights for each amino acid, Markov normalization, power of the matrix connectivity, the input files (files with sequences, groups and weights), the output files, the level of detail (files for summary and detailed results) and the type of graph visualization (dot, neato, fdp, twopi, circo). In particular, the calculations presented in this work are characterized by embedded and non-embedded TIs, no weights, Markov normalization and power of matrices/indices (n) up to 5.
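The transformation of a sequence into star graph matrices can be sketched as follows for the decapeptide ACADCEFDGH. This is an illustrative reading of the description above, not the S2SNet code: all function names are hypothetical, residues of the same type are assumed to be chained along their ray in sequence order, node 0 stands for the star centre, and the Wiener index is computed from its textbook definition (sum of distances over all node pairs) without weights.

```python
from collections import deque

def star_graph_edges(seq, embedded=False):
    """Edge set of the star graph; node 0 is the centre, node i is residue i."""
    edges = set()
    last_on_ray = {}  # residue type -> last node already placed on that ray
    for pos, res in enumerate(seq, start=1):
        # Attach the residue to the end of its ray (or to the centre).
        edges.add((last_on_ray.get(res, 0), pos))
        last_on_ray[res] = pos
    if embedded:
        # Embedded graph: add the peptide-bond connectivity of the chain.
        edges.update((i, i + 1) for i in range(1, len(seq)))
    return edges

def connectivity_matrix(edges, n_nodes):
    m = [[0] * n_nodes for _ in range(n_nodes)]
    for a, b in edges:
        m[a][b] = m[b][a] = 1
    return m

def distance_matrix(m):
    """Topological distances by breadth-first search from every node."""
    n = len(m)
    dist = [[0] * n for _ in range(n)]
    for start in range(n):
        seen = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in range(n):
                if m[u][v] and v not in seen:
                    seen[v] = seen[u] + 1
                    queue.append(v)
        for v, d in seen.items():
            dist[start][v] = d
    return dist

def markov_normalize(m):
    """Divide every row by its sum so that each row adds up to 1."""
    return [[x / sum(row) for x in row] for row in m]

def trace_of_power(p, n):
    """Trace of the n-th power (n >= 1) of a square matrix."""
    size = len(p)
    result = p
    for _ in range(n - 1):
        result = [[sum(result[i][k] * p[k][j] for k in range(size))
                   for j in range(size)] for i in range(size)]
    return sum(result[i][i] for i in range(size))

seq = "ACADCEFDGH"
plain = connectivity_matrix(star_graph_edges(seq), len(seq) + 1)
emb = connectivity_matrix(star_graph_edges(seq, embedded=True), len(seq) + 1)
degrees = [sum(row) for row in plain]  # diagonal of the degree matrix
wiener = sum(d for i, row in enumerate(distance_matrix(plain))
             for d in row[i + 1:])
traces = [trace_of_power(markov_normalize(emb), n) for n in range(1, 6)]
```

The embedded and non-embedded graphs share the same nodes and differ only in the edge set, so the same matrix routines serve both; the powers run from 1 to 5 to mirror the power limit used in this work.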
The summary file contains the following TIs (Todeschini and Consonni, 2002):
Shannon entropy of the n powered Markov matrices ($Sh_n$), where $p_i$ are the $n_i$ elements of the vector $p$ that results from multiplying the powered Markov normalized matrix ($n_i \times n_i$) by a vector ($n_i \times 1$) with each element equal to $1/n_i$;
the trace of the n connectivity matrices ($Tr_n$);
Harary number ($H$), where $d_{ij}$ are the elements of the distance matrix, $m_{ij}$ are the elements of the connectivity matrix $M$, $w_j$ are the weight elements and $nw$ is a switch to select (1) or not select (0) weights calculations;
Wiener index ($W$);
Gutman topological index ($S_6$), where $deg_i$ are the elements of the degree matrix;
Schultz topological index (non-trivial part) ($S$);
Moreau-Broto autocorrelation of topological structure ($ATS_n$, n = 1 to the power limit), only with weights included, where $dp_{ij}^{n}$ are the elements of the pair distance matrix when the distance is n;
Balaban distance connectivity index ($J$), where nodes + 1 = AA numbers/node number in the Star Graph + origin, and $\sum_k d_{ik}$ is the node distance degree;
Kier-Hall connectivity indices (${}^{n}X$);
Randic connectivity index (${}^{1}X$).
All these TIs will be used to construct a natural/random classification model by statistical methods. General discriminant analysis (GDA) (Kowalski and Wold, 1982; Van Waterbeemd, 1995) (StatSoft.Inc., 2002) has been chosen as the simplest and fastest method. In order to decide whether a protein chain is classified as natural (if it exists in the PDB database) or random, we added an extra dummy variable named Nat/Rnd (binary values of 0/1) and a cross-validation variable (CV). Three cross-validation methods are often used to examine a predictor for its effectiveness in practical application: the independent dataset test, the subsampling test, and the jackknife test (Chou and Zhang, 1995). Through a crystal-clear analysis, Shen (2007, 2008) have shown that only the jackknife test has the least arbitrariness.
Therefore, the jackknife test has been increasingly used by investigators to examine the accuracy of various predictors (Chen and Li, 2007a, b; Diao et al., 2007; Ding et al., 2007; Jiang et al., 2008; Jin et al., 2008; Li and Li, 2008; Lin, 2008; Lin et al., 2008; Niu et al., 2006, 2008; Wang et al., 2008; Xiao and Chou, 2007; Zhou et al., 2007; Zhang et al., 2008). In the present work, the independent data test is used by splitting the data at random into a training series (train, 75%) used for model construction and a prediction series (val, 25%) used for model validation (the CV column is filled by repeating 3 train and 1 val). All independent variables are standardized prior to model construction. Using the S2SNet methodology as defined previously, we can attempt to develop a simple linear QSAR with the general formula
$\text{Nat/Rnd-score} = c_0 + C_1 T_1 + C_2 T_2 + \dots + C_n T_n$
where Nat/Rnd-score is the continuous score value for the Nat/Rnd classification, $T_i$ are the TIs described above, $C_1$ to $C_n$ are the TI coefficients, n is the number of indices and $c_0$ is the independent term. The quality of the GDA models was determined by examining Wilk's U statistics, the Fisher ratio (F), the p-level (p), and the canonical regression coefficient ($R_c$). We also inspected the percentage of good classification, the cases/variables ratios, and the number of variables to be explored in order to avoid over-fitting or chance correlation. The forward, backward and best subset model types are tested for the embedded data, the non-embedded data and both together. Eight variable selection methods were applied in order to find the best GDA equation able to discriminate between natural and random protein chains. Eight models were constructed using the embedded/non-embedded Star Graph TIs obtained with the S2SNet application and the forward, backward and best subset model types. The values obtained for the training/predicting accuracies are presented in Table 1.
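The 3:1 train/val labelling and the standardization of the independent variables described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the cases are assumed to be shuffled before the repeating train, train, train, val pattern is applied, and standardization is taken to be the usual z-score with the sample standard deviation; the settings actually used in STATISTICA are not given in the text.

```python
import random
import statistics

def cv_labels(n_cases, seed=0):
    """Label cases by repeating 3 x 'train', 1 x 'val' over a random order."""
    order = list(range(n_cases))
    random.Random(seed).shuffle(order)  # assumption: shuffle before labelling
    labels = [None] * n_cases
    for k, idx in enumerate(order):
        labels[idx] = "val" if k % 4 == 3 else "train"
    return labels

def standardize(column):
    """Z-score a column of TI values (assumption: sample standard deviation)."""
    mean = statistics.mean(column)
    sd = statistics.stdev(column)
    return [(x - mean) / sd for x in column]

# 1046 natural + 1046 random chains give 2092 cases in total.
labels = cv_labels(2092)
n_train = labels.count("train")
n_val = labels.count("val")
z = standardize([1.0, 2.0, 3.0])
```

With 2092 cases the repeating pattern yields exactly 75% train and 25% val labels, matching the proportions stated for the two series.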
The forward stepwise variable selection method combined with the nE and E TIs provides the best results for our data set, with 91.01%, 90.06% and 90.77% of the chains correctly classified for the training, cross-validation and full sets, respectively, using a minimum number of 12 parameters (Eq. (15)). The embedded TIs have the name of the non-embedded ones plus "e" as suffix:
$\text{Nat/Rnd-score} = 0.1 + 4.8\,\mathrm{Sh0} + 254.9\,\mathrm{H} + 1860.2\,\mathrm{W} - 1931.0\,\mathrm{S} + 39.4\,\mathrm{J} - 139.2\,\mathrm{X0} - 73.0\,\mathrm{X3} + 146.7\,\mathrm{X4} - 159.3\,\mathrm{X5} - 6.6\,\mathrm{Tr4e} + 7.1\,\mathrm{X2e}$ (15)
where N is the number of studied protein sequences (Nat+Rnd), $R_c$ is the canonical regression coefficient, U is Wilk's statistic, F is Fisher's statistic and p is the p-level (probability of error). The present $R_c$ value shows a high level of correlation between the input variables and the classification of proteins. Wilk's U is used to measure the statistical significance of the discriminatory power of the model and takes values from 1.0 (no discriminatory power) to 0.0 (perfect discriminatory power). The F value shows the statistical significance of the discrimination between groups, a measure of the extent to which a variable makes a unique contribution to the prediction of group membership. The p-level of Fisher's test for the GDA is less than 0.05, which shows that the hypothesis of group overlapping can be rejected with a 5% error (Hua and Sun, 2001). Accuracies of this level are considered as excellent in the literature for LDA-QSAR models (Garcia-Garcia et al., 2004; Marrero-Ponce et al., 2004). Parametrical assumptions such as normality, homoscedasticity (homogeneity of variances) and non-collinearity are as important in the application of multivariate statistical techniques to QSPR (Bisquerra Alzina, 1989; Stewart, 1998) as the correct specification of the mathematical form. The validity and statistical significance of any model are conditioned by the above-mentioned factors.
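As a worked illustration of how Eq. (15) is applied, the sketch below evaluates the score for a dictionary of TI values. The coefficients are transcribed from Eq. (15); everything else is an assumption for illustration: the function and variable names are hypothetical, the inputs are assumed to be already standardized as stated for the model variables, the example values are made up, and no decision threshold is applied because none is given in the text.

```python
# Coefficients transcribed from Eq. (15); the suffix "e" marks embedded TIs.
EQ15_INTERCEPT = 0.1
EQ15_COEFFS = {
    "Sh0": 4.8, "H": 254.9, "W": 1860.2, "S": -1931.0, "J": 39.4,
    "X0": -139.2, "X3": -73.0, "X4": 146.7, "X5": -159.3,
    "Tr4e": -6.6, "X2e": 7.1,
}

def nat_rnd_score(tis):
    """Linear Nat/Rnd score: intercept plus coefficient-weighted TI values."""
    missing = set(EQ15_COEFFS) - set(tis)
    if missing:
        raise KeyError(f"missing TIs: {sorted(missing)}")
    return EQ15_INTERCEPT + sum(c * tis[name] for name, c in EQ15_COEFFS.items())

# Hypothetical input: every standardized TI at 0 except Sh0.
example = {name: 0.0 for name in EQ15_COEFFS}
example["Sh0"] = 1.0
score = nat_rnd_score(example)
```

Keeping the coefficients in a dictionary keyed by TI name makes it harder to pair a coefficient with the wrong index, which matters here because W and S carry large coefficients of opposite sign.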
In our study, a simple linear mathematical form of the model has been chosen in the absence of prior information. Figs. 3 and 4 show that the training cases plotted against the residuals do not present any characteristic pattern (Dillon and Goldstein, 1984). Proteins no. 632 and 864 are the only two cases not shown in Fig. 4 because the corresponding raw residuals are clearly distinct from the whole set, ca. -7. They correspond to 1QWN, chain A (1014 AAs) and 1JZ8, chain A (1011 AAs). One possible reason for the apparently different statistical behaviour could be a limitation of the model when the length of the chains is greater than 1000 amino acids. It is possible that the star net TIs for large proteins become similar to the TIs of the random proteins. A different and better threshold for the a priori classification probability can be estimated by means of the receiver operating characteristics (ROC) curve (James and Hanley, 1982). As Fig. 5 shows, the model is not a random classifier but a statistically significant one, since the area under the ROC curve (0.98 for training and 0.96 for validation) is significantly higher than the area under the random classifier curve (0.5, the diagonal line) (Morales Helguera et al., 2007). The validity of the GDA models depends on the normal distribution of the sample used as well as on the homogeneity of its variances. Thus, we carried out two significance tests for normality, the chi-square and Kolmogorov-Smirnov tests, and we found statistically significant differences (p < 0.01) in the respective values (chi-square, d). These results allow us to reject the hypothesis of normal distribution of the sample under study (Fig. 6) (Stewart, 1998). The heteroscedasticity of a large set can be detected with a simple graphical method based on the examination of the residuals of the variables included in the model. Fig. 7(a and b) shows that the plots of the Nat/Rnd GDA model variables against the residuals do not present any pattern, which indicates that the homoscedasticity assumption is fulfilled (Stewart, 1998). Due to the robustness of the GDA multivariate statistical techniques, the predictive ability and inference reached by using the proposed model should not be affected (see Fig. 8). This study extends for the first time the classical TIs to protein Star Network TIs by proposing a model that can predict whether a protein chain is natural or random. The results show for the first time the excellent predictive ability (90.77%) of the simple and fast Star Network TIs and GDA linear models in the case of the natural/random protein model. This classification can help the study of protein function by replacing some fragments with random amino acid sequences, or can detect fake amino acid sequences or errors in proteins. The S2SNet application can be very useful for calculating the protein Star Network TIs, which can be the basis of a model for any other protein property. S2SNet can also be used for mass spectroscopy, clinical proteomics and imaging or DNA/RNA structure analysis.
Novel 2D maps and coupling numbers for protein sequences.
The first QSAR study of polygalacturonases; isolation and prediction of a novel sequence from Psidium guajava L
Steady-state kinetic studies with the non-nucleoside HIV-1 reverse transcriptase inhibitor U-87201E
The quinoline U-78036 is a potent inhibitor of HIV-1 reverse transcriptase
Kinetic studies with the non-nucleoside HIV-1 reverse transcriptase inhibitor U-88204E
Steady-state kinetic studies with the polysulfonate U-9843, an HIV reverse transcriptase inhibitor
Kinetic studies with the non-nucleoside HIV-1 reverse transcriptase inhibitor U-90152E
The benzylthiopyrididine U-31,355 is a potent inhibitor of HIV-1 reverse transcriptase
Kinetic plasticity and the determination of product ratios for kinetic schemes leading to multiple products without rate laws: new methods based on directed graphs
The protein data bank
Distribution moments of 2D-graphs as descriptors of DNA sequences
Introducción conceptual al análisis multivariante: Un enfoque informático con los paquetes SPSS
Prediction of apoptosis protein subcellular location using improved hybrid approach and pseudo amino acid composition
Prediction of the subcellular location of apoptosis proteins
Graphical rules in steady and non-steady enzyme kinetics
Review: applications of graph theory to enzyme kinetics and protein folding kinetics. Steady and non-steady state systems
Graphical rules for enzyme-catalyzed rate laws
Graphical rules of steady-state reaction systems
Graphical rules for non-steady state enzyme kinetics
Review: recent progresses in protein subcellular location prediction
Cell-PLoc: a package of web-servers for predicting subcellular localization of proteins in various organisms
Diagrammatization of codon usage in 339 HIV proteins and its biological implication
Review: prediction of protein structural classes
Graph theory of enzyme kinetics: 1. Steady-state reaction system
Review: steady-state inhibition kinetics of processive nucleic acid polymerases and nucleases
Do antisense proteins exist?
Topological Indices and Related Descriptors in QSAR and QSPR. Gordon and Breach
The community structure of human cellular signaling network
Multivariate Analysis: Methods and Applications
Prediction of protein structure classes with pseudo amino acid composition and fuzzy support vector machine network
A novel fingerprint map for detecting SARS-CoV
New agents active against Mycobacterium avium complex selected by molecular topology: a virtual screening method
3D-QSAR study for DNA cleavage proteins with a potential anti-tumor ATCUN-like motif
Medicinal chemistry and bioinformatics-current trends in drugs discovery with networks topological indices
ANN-QSAR model for selection of anticancer leads from structurally heterogeneous series of compounds
Proteomics, networks, and connectivity indices
Graph Theory
Support vector machine approach for protein subcellular localization prediction
The meaning and use of the area under a receiver operating characteristic (ROC) curve
Using the concept of Chou's pseudo amino acid composition to predict apoptosis proteins subcellular location: an approach by approximate entropy
Predicting subcellular localization with AdaBoost learner
Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features
A schematic method of deriving the rate laws for enzyme-catalyzed reactions
Drawing Graphs with Dot
Pattern recognition in chemistry
Kinetic analysis by a recursive rate equation
Predicting protein subcellular location using Chou's pseudo amino acid composition and improved hybrid approach
Graphical approach to analyzing DNA sequences
Analysis of similarity/dissimilarity of DNA sequences based on nonoverlapping triplets of nucleotide bases
New 2D graphical representation of DNA sequences
Coronavirus phylogeny based on 2D graphical representation of DNA sequence
The modified Mahalanobis discriminant for predicting outer membrane proteins by using Chou's pseudo amino acid composition
Predicting subcellular localization of mycobacterial proteins by using Chou's pseudo amino acid composition
3D-chiral quadratic indices of the 'molecular pseudograph's atom adjacency matrix' and their application to central chirality codification: classification of ACE inhibitors and prediction of sigma-receptor antagonist activities
Atom, atom-type and total molecular linear indices as a promising approach for bioorganic and medicinal chemistry: theoretical and experimental assessment of a novel method for virtual screening and rational design of new lead anthelmintic
Probing the anticancer activity of nucleoside analogues: a QSAR model approach using an internally consistent training set
Microcomputer tools for steady-state enzyme kinetics
Predicting protein structural class with AdaBoost learner
Predicting membrane protein types with bagging learner
A topological sub-structural approach for predicting human intestinal absorption of drugs
Unified QSAR approach to antimicrobials. Part 3: first multi-tasking QSAR model for input-coded prediction, structural back-projection, and complex networks clustering of antiprotozoal compounds
New 3D graphical representation of DNA sequence based on dual nucleotides
Condensed representation of DNA primary sequences
On a four-dimensional representation of DNA primary sequences
Characterization of DNA primary sequences based on the average distances between bases
On 3-D graphical representation of DNA primary sequences and their numerical characterization
On representation of proteins by starlike graphs
Python Reference Manual
STATISTICA (data analysis software system), version 6.0, /www
Handbook of Molecular Descriptors
Discriminant analysis for activity prediction
PISCES: a protein sequence culling server
A new nucleotide-composition based fingerprint of SARS-CoV with visualization analysis
Predicting membrane protein types by the LLDA algorithm
Cellular automation as models of complexity
Digital coding of amino acids based on hydrophobic index
Using cellular automata to generate image representation for biological sequences
An application of gene comparative image for predicting the effect on replication ratio by HBV virus gene missense mutation
A probability cellular automaton model for hepatitis B viral infections
Using cellular automata images and pseudo amino acid composition to predict protein subcellular location
Graphic analysis of codon usage strategy in 1490 human proteins
Analysis of codon usage in 1562 E. coli protein coding sequences
Prediction protein structural classes with pseudo amino acid composition: approximate entropy and hydrophobicity pattern
An extension of Chou's graphical rules for deriving enzyme kinetic equations to system involving parallel reaction pathways
Using Chou's amphiphilic pseudoamino acid composition and support vector machine for prediction of enzyme subfamily classes
Cristian R. Munteanu thanks the FCT (Portugal) for support from Grant SFRH/BPD/24997/2005.
González-Díaz Humberto acknowledges an Isidro Parga Pondal research contract supported by Xunta de Galicia, University of Santiago de Compostela (Spain).