key: cord-272260-88l9bq4i authors: Han, L.Y.; Cai, C.Z.; Ji, Z.L.; Chen, Y.Z. title: Prediction of functional class of novel viral proteins by a statistical learning method irrespective of sequence similarity date: 2005-01-05 journal: Virology DOI: 10.1016/j.virol.2004.10.020 sha: doc_id: 272260 cord_uid: 88l9bq4i
The function of a substantial percentage of the putative protein-coding open reading frames (ORFs) in viral genomes is unknown. Because their sequences are not similar to those of proteins of known function, the function of these ORFs cannot be assigned on the basis of sequence similarity. Methods that complement, or can be combined with, sequence similarity-based approaches are being explored. The web-based software SVMProt (http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi) to some extent assigns protein functional family irrespective of sequence similarity and has been found to be useful for studying distantly related proteins [Cai, C.Z., Han, L.Y., Ji, Z.L., Chen, X., Chen, Y.Z., 2003. SVM-Prot: web-based support vector machine software for functional classification of a protein from its primary sequence. Nucleic Acids Res. 31(13): 3692–3697]. Here, 25 novel viral proteins are selected to test the capability of SVMProt for functional family assignment of viral proteins whose function cannot at present be confidently predicted by sequence similarity methods. These proteins have no sequence homolog in the Swissprot database, have their precise function described in the literature, and are not included in the training sets of SVMProt. The predicted functional classes of 72% of these proteins match the literature-described function, compared with the overall accuracy of 87% for SVMProt functional class assignment of 34 582 proteins. This suggests that SVMProt is to some extent capable of functional class assignment irrespective of sequence similarity and is potentially useful for facilitating functional study of novel viral proteins.
The complete genomes of 1536 viruses have been sequenced (viral genomes at NCBI, http://www.ncbi.nlm.nih.gov/genomes/static/vis.html). Knowledge of these genomes has facilitated mechanistic study of viral infections and provided important clues for searching molecular targets of antiviral therapeutics (Herniou et al., 2003; Marra et al., 2003; Miller et al., 2003). The function of over 15% of the putative protein-coding open reading frames (ORFs) in these viral genomes is unknown (Herniou et al., 2003; Marra et al., 2003; Miller et al., 2003). Determination of the function of these unknown ORFs is important for a more comprehensive understanding of the molecular mechanisms of specific viruses and for searching novel targets for antiviral drug development. The sequences of many of these unknown ORFs have no significant similarity to proteins of known function, and their functions are difficult to probe on the basis of sequence similarity. For instance, 50%, 100%, 20%, and 67% of the unknown ORFs in the recently determined genomes of Fer-de-lance virus (Makeyev and Bamford, 2004), Grapevine fleck virus (Sabanadzovic et al., 2001), Indian citrus ringspot virus (Rustici et al., 2002), and SARS coronavirus (He et al., 2004), respectively, are without a homolog in the Swissprot database (Boeckmann et al., 2003) based on a BLAST search against all Swissprot entries as of September 2004. This suggests that a significant percentage of new viral proteins are likely to have no known sequence homolog. It is thus desirable to explore alternative methods or combinations of methods for providing useful hints about the function of unknown viral ORFs. Various alternative methods for probing protein function have been developed.
These include evolutionary analysis (Benner et al., 2000; Eisen, 1998), hidden Markov models (Fujiwara and Asogawa, 2002), structural consideration (Di Gennaro et al., 2001; Teichmann et al., 2001), protein/gene fusion (Enright et al., 1999; Marcotte et al., 1999), protein-protein interactions (Bock and Gough, 2001), motifs (Hodges and Tsai, 2002), family classification by sequence clustering (Enright et al., 2002), and functional family prediction by statistical learning methods (Cai et al., 2003; Han et al., 2004; Jensen et al., 2002; Karchin et al., 2002). In the absence of clear sequence or structural similarities, the criteria for comparison of distantly related proteins become increasingly difficult to formulate (Enright and Ouzounis, 2000). Moreover, not all homologous proteins have analogous functions (Benner et al., 2000). The presence of a shared domain within a group of proteins does not necessarily imply that these proteins perform the same function (Henikoff et al., 1997). Therefore, careful evaluation is needed to determine which method or combination of methods is useful for facilitating functional study of novel proteins with no homology to proteins of known function. The web-based software SVMProt (http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi) has shown some potential for assigning the functional class of distantly related proteins and homologous proteins of different functions as well as homologous proteins (Cai et al., 2003). It classifies proteins into functional classes defined from activities or physicochemical properties rather than sequence similarity (Bock and Gough, 2001; Cai et al., 2003, 2004; Han et al., 2004; Karchin et al., 2002). In developing SVMProt, proteins in a training set, represented by their sequence-derived physicochemical properties, are projected onto a hyperspace where proteins in a class are separated from those outside the class by a hyperplane.
By projecting a new sequence onto the same hyperspace, SVMProt determines whether the corresponding protein is a member of that class based on its location with respect to the hyperplane. The accuracy of SVMProt depends on the diversity of the protein samples, the quality of the representation of protein properties, and the efficiency of the statistical learning algorithm. To some extent, no sequence similarity is required per se. Thus, SVMProt may potentially be explored for facilitating functional assignment of proteins whose function cannot be assigned on the basis of sequence similarity. This work evaluates the usefulness of SVMProt for predicting the functional class of viral ORFs of unknown function. It is assessed by using novel viral proteins that are without a single homolog in the SwissProt database (Boeckmann et al., 2003), have their precise function described in the literature, and are not included in the training sets of SVMProt. These proteins are collected from an unbiased search of Medline (Wheeler et al., 2003) and the SwissProt database (Boeckmann et al., 2003). The SVMProt-predicted functional classes of these proteins are compared with the function described in the literature and databases to evaluate to what extent SVMProt is useful for functional class assignment of novel viral proteins. The prediction accuracy for assignment of these novel proteins is compared with the overall accuracy of the SVMProt assignment of a large number of proteins to examine the level of sequence similarity independence of SVMProt classification. Table 1 gives the SVMProt-ascribed functional classes for each of the 25 novel viral proteins together with the literature-described function. SVMProt may assign more than one class to a protein, and the probability of correct prediction for each class is also given in Table 1.
There are 18 proteins with the top hit of the SVMProt assigned functional class matching the literature-described function, representing 72% of the novel viral proteins studied in this work. These proteins are MotA protein of bacteriophage T4 (Gerber and Hinton, 1996) , outer capsid protein VP4 of bovine rotavirus (serotype 10/strain B223) (Hardy et al., 1992) , ADOMetase of bacteriophage T3 (Hughes et al., 1987) , R.CviJI of chlorella virus IL3A (Skowron et al., 1995) , exonuclease of bacteriophage lambda (Sanger et al., 1982) , R.CviAII of paramecium bursaria chlorella virus 1 (Zhang et al., 1992) , ORF13 of haemophilus phage HP1 (Esposito et al., 1996) , Protein kinase of enterobacteria phage T7 (Dunn and Studier, 1983) , DNA-directed RNA polymerase of African swine fever virus (strain BA71V) (Yanez et al., 1995) , AGT (Miller et al., 2003) , BGT (Miller et al., 2003; Tomaschewski et al., 1985) , DNK (Broida and Abelson, 1985) , Endonuclease II (Sjoberg et al., 1986) , Endonuclease V (Valerie et al., 1984) , Gp61.9 (Valerie et al., 1986) , IRF protein (Chu et al., 1986) , and I-TevII (Tomaschewski and Ruger, 1987) of enterobacteria phage T4. MotA protein of bacteriophage T4 has been found to be a transcription activator that binds to DNA (Gerber and Hinton, 1996) and the far-C-terminal region of the sigma70 subunit of Escherichia coli RNA polymerase (Pande et al., 2002) . The top hit of SVMProt predicted functional class for this protein is the DNA-binding, which matches with literature-described functions. Bovine rotavirus is a double-stranded RNA virus that is naked. Thus, the outer capsid protein VP4 of bovine rotavirus (serotype 10/strain B223) is located at the viral surface acting as part of the viral coat (Hardy et al., 1992) . This protein is predicted by SVMProt as a coat protein that is consistent with literature-described function. The other 14 proteins are enzymes, and these are all correctly assigned by SVMProt to the respective enzyme EC class. 
Because these proteins have no homolog of known function among the entries of the Swissprot database based on PSI-BLAST search, our study suggests that SVMProt has a certain level of capability for providing useful hints about the functional class of novel proteins with no or low homology to known proteins, and this capability is not based on sequence similarity or clustering. The overall accuracy of 72% for the assignment of the novel viral proteins is lower than, but not far from, the 87% for SVMProt functional class assignment of 34 582 proteins. This indicates a certain level of sequence-similarity independence in SVM protein classification. Several factors may affect the accuracy of SVMProt for functional characterization of novel viral proteins. One is the diversity of protein samples used for training SVMProt. It is likely that not all possible types of proteins, particularly those of distantly related members, are adequately represented in some protein classes. This can be improved as more protein data become available. Not all distantly related proteins of the same function have similar structural and chemical features. There are cases in which different functional groups, unconserved with respect to position in the primary sequence, mediate the same mechanistic role because of the flexibility at the active site (Todd et al., 2002). This plasticity is unlikely to be sufficiently described by the physicochemical descriptors currently used in SVMProt. Therefore, SVMProt in its present form is not expected to be capable of classifying these types of distantly related enzymes. Some of the SVMProt functional classes are at the level of families and superfamilies that may include a broad spectrum of proteins. It has been shown that SVM does not work as well as HMM for distinguishing proteins in a superfamily, but may be more accurate with subfamily discrimination (Karchin et al., 2002).
Thus, the use of some large families and superfamilies as the basis for classification may affect the prediction accuracy of SVMProt to some extent. SVMProt prediction may be further improved by using protein subfamilies as the basis of classification, a more comprehensive set of protein samples, and more refined protein descriptors. The SVMProt optimization procedure and feature vector selection algorithm may also be improved by adding constraints, and by incorporating independent component analysis and kernel PCA in the preprocessing steps. SVMProt shows a certain level of capability for predicting the functional class of a number of novel viral proteins. This suggests that SVMProt is potentially useful, to a certain extent, for providing hints about the function of distantly related proteins in viruses as well as in other organisms. Further improvements in protein functional family coverage, sample collections, and the SVM algorithm may enable the development of SVMProt into a practical tool for facilitating functional study of unknown ORFs in virus genomes and other genomes. The key words "novel protein virus" or "novel viral protein" are used to search Medline (Wheeler et al., 2003) and the Swissprot database (Boeckmann et al., 2003) for viral proteins that are both described as novel and have their precise function provided. As the search of Medline is confined to the abstracts, those proteins whose function is not explicitly hinted at in an abstract are not selected. Thus, the selected proteins likely account for only a portion of the known novel viral proteins with available functional information. PSI-BLAST (Altschul et al., 1997) sequence analysis is subsequently conducted on each of these novel viral proteins against all entries in the SwissProt protein database (Boeckmann et al., 2003) so that those with at least one sequence homolog of known function (including that of the same protein in different species) are removed.
The commonly used criterion for homologs, a similarity score e-value below the inclusion threshold value of 0.005 (Altschul et al., 1997), is used in this work. Finally, those proteins that are in the training sets of SVMProt are removed. A total of 25 novel viral proteins are identified in this process; these are given in Table 1 together with their protein accession numbers, literature-described functional indications, and related references.
[Table 1 row fragment, incomplete in the source: Human herpesvirus 6; chemokine like (Luttichau et al., 2003); No function predicted; NM]
SVMProt is based on a statistical learning method, support vector machines (SVM) (Burges, 1998). In addition to the prediction of protein functional class (Cai et al., 2003; Han et al., 2004; Karchin et al., 2002), SVM has also been used for a variety of protein classification problems including fold recognition (Ding and Dubchak, 2001), analysis of solvent accessibility (Yuan et al., 2002), prediction of secondary structures (Hua and Sun, 2001), and protein-protein interactions (Bock and Gough, 2001). As a method that uses sequence-derived physicochemical properties of proteins as the basis for classification, SVM may be particularly useful for functional classification of distantly related proteins and homologous proteins of different functions (Cai et al., 2003). There are 75 protein functional classes currently covered by SVMProt. These include 46 enzyme families, 13 channel/transporter families, 4 RNA-binding protein families, DNA-binding proteins, G-protein-coupled receptors, nuclear receptors, tyrosine receptor kinases, cell adhesion proteins, coat proteins, envelope proteins, outer membrane proteins, structural proteins, and growth factors. Two broadly defined families of antigens and transmembrane proteins are also included. The majority of known types of viral proteins are included in these classes.
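The selection procedure described above (remove any candidate with a homolog of known function below the e-value threshold of 0.005, then remove any candidate present in the SVMProt training sets) can be illustrated with a minimal sketch. The `Hit` record, the function names, and the input layout are hypothetical and are not part of PSI-BLAST or SVMProt; the hits are assumed to have been parsed from search output beforehand.

```python
from dataclasses import dataclass

# Inclusion threshold quoted in the text (Altschul et al., 1997).
E_VALUE_THRESHOLD = 0.005


@dataclass(frozen=True)
class Hit:
    """One database hit for a query protein (hypothetical record layout)."""
    query_id: str
    subject_id: str
    e_value: float
    subject_has_known_function: bool


def has_known_homolog(hits, threshold=E_VALUE_THRESHOLD):
    """True if any hit of known function has an e-value below the threshold."""
    return any(
        h.e_value < threshold and h.subject_has_known_function for h in hits
    )


def select_novel_proteins(candidates, hits_by_query, training_set_ids):
    """Keep candidates with no known-function homolog and not used in training."""
    selected = []
    for protein_id in candidates:
        if has_known_homolog(hits_by_query.get(protein_id, [])):
            continue  # a homolog of known function exists: not novel here
        if protein_id in training_set_ids:
            continue  # already seen by the classifier during training
        selected.append(protein_id)
    return selected
```

A candidate whose only hits are above the threshold, or are to proteins of unknown function, is retained by this sketch, which mirrors the wording "at least one sequence homolog of known function".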
Representative proteins of a particular functional class (positive samples) and those that do not belong to this class (negative samples) are needed to train an SVMProt classifier for this class. The positive samples of a class are constructed by using all of the known distinct protein members in that class. Because of the enormous number of proteins, the size of the negative samples needs to be restricted to a manageable level by using a minimum set of representative proteins. One way of choosing representative proteins is to select one or a few proteins from each protein domain family. The negative samples of a class are selected from seed proteins of the 7316 curated (domain-based) protein families in the Pfam database, excluding those families that have at least one member belonging to the functional class. Pfam families are constructed on the basis of sequence similarity. The purpose of using Pfam proteins is to ensure that the negative samples are evenly distributed in the protein space. Sequence similarity is not required for selecting positive samples. In this sense, SVMProt is to some extent independent of sequence similarity. The SVMProt training system for each family is optimized and tested by using separate testing sets of both positive and negative samples. Where possible, all the remaining distinct proteins in each functional family (not in the training set of that family) are used as positive samples and all the remaining representative seed proteins in Pfam curated families are used to construct negative samples in a testing set. The performance of SVMProt classification is further evaluated by using independent sets of both positive and negative samples. There is no duplicate protein in any training, testing, or independent evaluation set. Data set construction can be demonstrated by an illustrative example of viral coat proteins. The key word "virus coat protein" is used to search Swissprot, which finds 3012 entries.
These entries are checked to remove noncoat proteins, redundant entries, and putative proteins, which gives 848 positive samples. These positive samples cover 140 Pfam families; thus, 14 758 seed proteins of the remaining 7176 Pfam families are used as the negative samples. These positive and negative samples are further divided into 346 and 1474 training, 305 and 8370 testing, and 197 and 4914 independent evaluation samples, respectively, using the procedure described above. Not all of the SVMProt classes are at the same hierarchical level. These classes are mixtures of subfamilies, families, and superfamilies. Some classes, such as antigen, need to be more clearly divided into specific subclasses. While it is desirable to define all of the classes at the same level, this is not yet possible because of insufficient data for the subhierarchies of some families and superfamilies. Effort is being made to collect sufficient data so that SVMProt classification systems can be constructed on the basis of a more evenly distributed family structure. Nonetheless, prediction on the basis of the current structures provides useful hints about the function of a protein.
[Table 1 row fragments, incomplete in the source: (1) Transferase (Wilfred et al., 2002); No function predicted; NM. (2) SPLT13 (NP_258405); SpLtMNPV virus; A novel envelope protein (Yin et al., 2003); No function predicted; NM. (3) TRL10 (AAL27474); Human cytomegalovirus (HCMV); Structural envelope glycoprotein (Spaderna et al., 2002); Transmembrane (entry truncated)]
SVMProt is trained for protein classification in the following manner. First, every protein sequence is represented by a specific feature vector assembled from encoded representations of tabulated residue properties including amino acid composition, hydrophobicity, normalized Van der Waals volume, polarity, polarizability, charge, surface tension, secondary structure, and solvent accessibility for each residue in the sequence (Cai et al., 2003). The feature vectors of the positive and negative samples are used to train an SVMProt classifier.
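To make the idea of a sequence-derived feature vector concrete, the following is a minimal sketch covering only two of the listed properties: amino acid composition and a three-group charge composition. It is not the SVMProt encoding, which combines more properties and is described in Cai et al. (2003). The function names and the charge grouping (K/R positive, D/E negative, all others neutral) are illustrative assumptions.

```python
# The 20 standard amino acids, in a fixed order that defines vector positions.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Illustrative charge grouping (an assumption, not taken from the source):
# positive, negative, neutral.
CHARGE_GROUPS = ("KR", "DE", "ACFGHILMNPQSTVWY")


def amino_acid_composition(sequence):
    """Fraction of each standard amino acid; non-standard letters are ignored."""
    seq = sequence.upper()
    counts = [seq.count(residue) for residue in AMINO_ACIDS]
    total = sum(counts)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [count / total for count in counts]


def group_composition(sequence, groups):
    """Fraction of residues falling in each property group."""
    seq = sequence.upper()
    counts = [sum(seq.count(residue) for residue in group) for group in groups]
    total = sum(counts)
    if total == 0:
        raise ValueError("sequence contains no residues from the given groups")
    return [count / total for count in counts]


def feature_vector(sequence):
    """Concatenate the descriptors into one fixed-length vector (20 + 3 values)."""
    return amino_acid_composition(sequence) + group_composition(
        sequence, CHARGE_GROUPS
    )
```

Because the vector length is fixed regardless of sequence length, proteins of different sizes can be placed in the same hyperspace, which is what the classifier training step requires.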
The trained SVMProt classifier can then be used to classify a protein into either the positive group (protein is predicted to be a member of the class) or the negative group (protein is predicted to not belong to the class). The theory of SVM has been described in the literature (Burges, 1998). Thus, only a brief description is given here. SVM is based on the structural risk minimization (SRM) principle from statistical learning theory (Burges, 1998). In linearly separable cases, SVM constructs a hyperplane that separates two different groups of feature vectors with a maximum margin. A feature vector is represented by $\mathbf{x}_i$, with physicochemical descriptors of a protein as its components. The hyperplane is constructed by finding another vector $\mathbf{w}$ and a parameter $b$ that minimize $\|\mathbf{w}\|^2$ and satisfy the following conditions:

$$\mathbf{w} \cdot \mathbf{x}_i + b \ge +1 \quad \text{for } y_i = +1 \text{ (positive group)} \tag{1}$$

$$\mathbf{w} \cdot \mathbf{x}_i + b \le -1 \quad \text{for } y_i = -1 \text{ (negative group)} \tag{2}$$

where $y_i$ is the group index, $\mathbf{w}$ is a vector normal to the hyperplane, $|b| / \|\mathbf{w}\|$ is the perpendicular distance from the hyperplane to the origin, and $\|\mathbf{w}\|^2$ is the squared Euclidean norm of $\mathbf{w}$. After the determination of $\mathbf{w}$ and $b$, a given vector $\mathbf{x}$ can be classified by:

$$\operatorname{sign}\left[(\mathbf{w} \cdot \mathbf{x}) + b\right] \tag{3}$$

In nonlinearly separable cases, SVM maps the input variable into a high-dimensional feature space using a kernel function $K(\mathbf{x}_i, \mathbf{x}_j)$. An example of a kernel function is the Gaussian kernel, which has been extensively used in different protein classification studies (Bock and Gough, 2001; Burges, 1998; Cai et al., 2002; Ding and Dubchak, 2001; Hua and Sun, 2001; Karchin et al., 2002; Yuan et al., 2002):

$$K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\frac{\|\mathbf{x}_j - \mathbf{x}_i\|^2}{2\sigma^2}\right) \tag{4}$$

A linear support vector machine is applied to this feature space, and the decision function is then given by:

$$f(\mathbf{x}) = \operatorname{sign}\left(\sum_{i=1}^{l} \alpha_i^0 y_i K(\mathbf{x}, \mathbf{x}_i) + b\right) \tag{5}$$

where the coefficients $\alpha_i^0$ and $b$ are determined by maximizing the following Lagrangian expression:

$$\sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j) \tag{6}$$

under the conditions $\alpha_i \ge 0$ and $\sum_{i=1}^{l} \alpha_i y_i = 0$. A positive or negative value from Eq. (3) or Eq. (5) indicates that the vector $\mathbf{x}$ belongs to the positive or negative group, respectively.
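The classification step in Eq. (3) and Eq. (5) can be sketched in a few lines, assuming the trained parameters are already available. The sketch below is illustrative only and does not perform training. It assumes the Gaussian kernel takes its common form exp(-||x_i - x_j||^2 / (2 sigma^2)); all function names and the toy parameter values are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def linear_score(w, b, x):
    """Value inside the sign function of Eq. (3): w . x + b."""
    return dot(w, x) + b


def gaussian_kernel(xi, xj, sigma):
    """Gaussian kernel, assumed form exp(-||xi - xj||^2 / (2 sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def kernel_score(support_vectors, labels, alphas, b, x, sigma):
    """Value inside the sign function of Eq. (5)."""
    return b + sum(
        alpha * y * gaussian_kernel(x, sv, sigma)
        for alpha, y, sv in zip(alphas, labels, support_vectors)
    )


def classify(score):
    """+1 for the positive group, -1 for the negative group."""
    return 1 if score > 0 else -1


# Toy parameters (hypothetical, for illustration only).
support_vectors = [[0.0, 0.0], [2.0, 2.0]]
labels = [1, -1]
alphas = [1.0, 1.0]
near_positive = classify(
    kernel_score(support_vectors, labels, alphas, 0.0, [0.1, 0.1], sigma=1.0)
)
```

In this toy setting a query close to the positive support vector receives a positive score, since its kernel value with that vector dominates the sum.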
To further reduce the complexity of parameter selection, hard margin SVM with threshold, instead of soft margin SVM with threshold, is used in SVMProt. Scoring of SVM classification of proteins has been estimated by a reliability index, and its usefulness has been demonstrated by statistical analysis (Cai et al., 2003; Hua and Sun, 2001). A slightly modified reliability score, the R value, is used in SVMProt. It is a function of d, the distance between the position of the vector of a classified protein and the optimal separating hyperplane in the hyperspace; d > 0 indicates that the sample belongs to the positive group and d < 0 the negative group. There is a statistical correlation between the R value and the expected classification accuracy (probability of correct classification) (Cai et al., 2003; Hua and Sun, 2001). Thus, another quantity, the P value, is introduced to indicate the expected classification accuracy. The P value is derived from the statistical relationship between the R value and the actual classification accuracy based on the analysis of 9932 positive and 45,999 negative samples of proteins (Cai et al., 2003).
Identification and characterization of a filament-associated protein encoded by Amsacta moorei entomopoxvirus
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs
Functional inferences from reconstructed evolutionary biology involving rectified databases-an evolutionarily grounded approach to functional genomics
Predicting protein-protein interactions from primary structure
The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003
Sequence organization and control of transcription in the bacteriophage T4 tRNA region
A tutorial on support vector machine for pattern recognition
Support vector machines for predicting HIV protease cleavage sites in protein
SVM-Prot: web-based support vector machine software for functional classification of a protein from its primary sequence
Enzyme family classification by support vector machines
Characterization of the intron in the phage T4 thymidylate synthase gene and evidence for its self-excision from the primary transcript
Enhanced functional annotation of protein sequences via the use of structural descriptors
Multi-class protein fold recognition using support vector machines and neural networks
Complete nucleotide sequence of bacteriophage T7 DNA and the locations of T7 genetic elements
Phylogenomics: improving functional predictions for uncharacterized genes by evolutionary analysis
GeneRAGE: a robust algorithm for sequence clustering and domain detection
Protein interaction maps for complete genomes based on gene fusion events
An efficient algorithm for large-scale detection of protein families
The complete nucleotide sequence of bacteriophage HP1 DNA
Identification of a novel protein encoded by the BamHI A region of the Epstein-Barr virus
Protein function prediction using hidden Markov models and neural networks
An N-terminal mutation in the bacteriophage T4 motA gene yields a protein that binds DNA but is defective for activation of transcription
Prediction of RNA-binding proteins from primary sequence by a support vector machine approach
Amino acid sequence analysis of bovine rotavirus B223 reveals a unique outer capsid protein VP4 and confirms a third bovine VP4 type
Analysis of multimerization of the SARS coronavirus nucleocapsid protein
Gene families: the taxonomy of protein paralogs and chimeras
The genome sequence and evolution of baculoviruses
3D-Motifs: an informatics approach to protein function prediction
A novel method of protein secondary structure prediction with high segment overlap measure: support vector machine approach
Nucleotide sequence and analysis of the coliphage T3 S-adenosylmethionine hydrolase gene and its surrounding ribonuclease III processing sites
Prediction of human protein function from post-translational modifications and localization features
Classifying G-protein coupled receptors with support vector machines
A highly selective CCR2 chemokine agonist encoded by human herpesvirus 6
Evolutionary potential of an RNA virus
Detecting protein function and protein-protein interactions from genome sequences
The bacteriophage T4 transcription activator MotA interacts with the far-C-terminal region of the sigma70 subunit of Escherichia coli RNA polymerase
Nucleotide sequence, genome organisation and phylogenetic analysis of Indian citrus ringspot virus
Complete nucleotide sequence and genome organization of Grapevine fleck virus
Nucleotide sequence of bacteriophage lambda DNA
The bacteriophage T4 gene for the small subunit of ribonucleotide reductase contains an intron
Cloning and applications of the two/three-base restriction endonuclease R.CviJI from IL-3A virus-infected Chlorella
Identification of glycoprotein gpTRL10 as a structural component of human cytomegalovirus
Determination of protein function, evolution and interactions by structural genomics
Plasticity of enzyme active sites
Nucleotide sequence and primary structures of gene products coded for by the T4 genome between map positions 48.266 kb and 39.166 kb
T4-induced alpha- and beta-glucosyltransferase: cloning of the genes and a comparison of their products based on sequencing data
Identification, physical map location and sequence of the denV gene from bacteriophage T4
Nucleotide sequence and analysis of the 58.3 to 65.5-kb early region of bacteriophage T4
Characterization of Spodoptera exigua multicapsid nucleopolyhedrovirus ORF17/18, a homologue of Xestia c-nigrum granulovirus ORF129
Database resources of the National Center for Biotechnology
Analysis of the complete nucleotide sequence of African swine fever virus
Identification of a novel protein associated with envelope of occlusion-derived virus in Spodoptera litura multicapsid nucleopolyhedrovirus
Prediction of protein solvent accessibility using support vector machines
Characterization of Chlorella virus PBCV-1 CviAII restriction and modification system