key: cord-0000754-g1ob16b9 authors: Xie, Xiao-li; Zheng, Li-fei; Yu, Ying; Liang, Li-ping; Guo, Man-cai; Song, John; Yuan, Zhi-fa title: Protein sequence analysis based on hydropathy profile of amino acids date: 2012-01-27 journal: Journal of Zhejiang University SCIENCE B DOI: 10.1631/jzus.b1100052 sha: fb49968febfe6d5e4103cf3d834bb22ea994a341 doc_id: 754 cord_uid: g1ob16b9

Biological sequence comparison is a fundamental task in computational biology. According to the hydropathy profile of amino acids, a protein sequence is treated as a string over three letters. Three curves are defined on this reduced sequence to describe the protein. A new method for analyzing the similarity/dissimilarity of protein sequences is proposed, based on the conditional probabilities of the reduced sequence. Finally, the ND6 (NADH dehydrogenase subunit 6) protein sequences of eight species are taken as an example to illustrate the approach. The results demonstrate that the method is convenient and efficient.

Comparing biological sequences is a central problem in bioinformatics when analyzing similarities in the function and properties of different sequences. Evolutionary homology is likewise analyzed by comparing DNA and protein sequences. In general, there are two types of methods for such comparisons: alignment-based and alignment-free. Sequence alignment relies on computer-oriented, computation-intensive comparisons of sequences, from which a distance function or a score function is obtained. Using the distance function, one can compare biological sequences. However, multiple sequence alignment of several hundred sequences becomes a bottleneck, firstly because of the long computation time, and secondly because of the possible bias of multiple sequence alignments when highly similar sequences occur many times (Pham and Zuegg, 2004).
This has motivated the study of alignment-free sequence analysis, which is still at an early stage of development. In most alignment-free methods, a biological sequence is transformed into an object for which linear algebra and statistical theory already provide useful analytical tools. Since 1983, DNA sequences have been represented in spaces of different dimensions (Hamori and Ruskin, 1983; Hamori, 1985; Nandy, 1994; 1996; Nandy and Basak, 2000; Randić et al., 2001; Randić, 2003; Randić and Balaban, 2003; Zhang et al., 2003; Liao and Wang, 2004; Liao et al., 2005; Nandy et al., 2006; Bai et al., 2007; Feng and Wang, 2008). Each nucleotide of a given DNA sequence is mapped to a point in such a space; these graphical representations allow DNA sequences to be analyzed qualitatively and provide a way of viewing, sorting, and comparing various genomic sequences. Based on the graphical representation, it is possible to characterize a DNA sequence numerically and then to measure the similarity of different DNA sequences quantitatively. Although protein and DNA sequences are both symbolic sequences, there are fewer methods for the graphical representation of protein sequences than for DNA sequences. This is mainly because extending DNA graphical representations to protein sequences would enormously increase the number of possible alternative assignments for the 20 amino acids. The amino acid sequence is the key to understanding protein structure and function in the cell, so its analysis is an important part of post-genomic studies. Recently, several schemes have been proposed for protein graphical representation (Randić and Krilov, 1997; Vinga and Almeida, 2003; Bai and Wang, 2005; Li J. et al., 2006; Li C. et al., 2008; Munteanu et al., 2008; Yau et al., 2008; Yao et al., 2008; Wen and Zhang, 2009).
To plot an amino acid sequence, the 20 amino acids are divided into a small number of types, so that a protein sequence is regarded as a word over three, four, or five different letters. Since ordering amino acids by their physicochemical properties may offer better insight into the comparative study of proteins than a random ordering, Randić (2007) and Yao et al. (2008) outlined different 2D graphical representations of protein sequences based on different physicochemical properties. A graphical representation of a protein sequence can not only describe the amino acid sequence, but also measure the similarity/dissimilarity of different protein sequences. However, these methods consider only the information carried by individual letters of the string, and not the information carried by adjacent letters of the amino acid sequence. Here, we use conditional probability to capture the information in adjacent letters.

In this paper, we convert a protein sequence into a three-letter sequence based on the hydropathy profile of amino acids and define three curves to represent different hydropathy features. We then select conditional probability as a new invariant for protein sequences. To illustrate the proposed method, we compare the sequences of eight ND6 (NADH dehydrogenase subunit 6) proteins from http://www.ncbi.nlm.nih.gov/: human, gorilla, chimpanzee, wallaroo, G. seal, H. seal, rat (AP_004903), and mouse (NP_904339).

According to the hydropathy profile of amino acids, the amino acids can be classified into three groups (Nei and Kumar, 2002; Liu and Wang, 2006): the internal group (F, I, L, M, V), the external group (D, E, H, K, N, Q, R), and the ambivalent group (S, T, Y, C, W, G, P, A). Amino acids of the internal group tend to occur on the inner side of the protein's spatial structure, while amino acids of the external group tend to appear at the surface.
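The three-group classification above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `to_hydropathy_string`, the use of the output letters I, E, and A for the internal, external, and ambivalent groups, and the decision to reject unknown symbols are assumptions made for the example.

```python
# Hydropathy groups as listed in the text (Nei and Kumar, 2002; Liu and Wang, 2006).
INTERNAL = set("FILMV")
EXTERNAL = set("DEHKNQR")
AMBIVALENT = set("STYCWGPA")


def to_hydropathy_string(protein: str) -> str:
    """Replace each residue by I, E, or A according to its hydropathy group."""
    letters = []
    for residue in protein.upper():
        if residue in INTERNAL:
            letters.append("I")
        elif residue in EXTERNAL:
            letters.append("E")
        elif residue in AMBIVALENT:
            letters.append("A")
        else:
            # Assumption: symbols outside the 20 standard amino acids
            # (for example X, B, or Z) are rejected rather than skipped.
            raise ValueError(f"unknown residue: {residue!r}")
    return "".join(letters)


# A short illustrative input; any one-letter protein string works the same way.
example = to_hydropathy_string("MMYALFLLSVGLVMGFVGFS")
print(example)
```

Using sets keeps each lookup constant-time, and raising on unknown symbols makes it obvious when an input contains ambiguity codes that the three groups do not cover.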
To characterize the hydropathicity of a protein primary structure, we define a primary protein sequence as a symbolic sequence over three letters according to the following rule:

$$F(S(i)) = \begin{cases} \mathrm{I}, & S(i) \in \{\mathrm{F, I, L, M, V}\}, \\ \mathrm{E}, & S(i) \in \{\mathrm{D, E, H, K, N, Q, R}\}, \\ \mathrm{A}, & S(i) \in \{\mathrm{S, T, Y, C, W, G, P, A}\}, \end{cases}$$

where S(i) is the letter in the ith position of the protein primary sequence, and F(S(i)) is the substitution for S(i). Since the hydropathy profile can detect more evolutionary relationships, in the next section we analyze the new three-letter protein sequence with different mathematical methods.

Given a protein primary sequence of length N, we transform it into a new sequence according to the above definition. For example, for the protein sequence S=MMYALFLLSVGLVMGFVGFS, F(S)=IIAAIIIIAIAIIIAIIAIA. To obtain more information, we define three curves of the sequence. Firstly, for each pair of letters $\alpha\beta \in \{\mathrm{IE}, \mathrm{IA}, \mathrm{EA}\}$ we let

$$u_i^{\alpha\beta} = \begin{cases} 1, & \text{if } F(S(i)) = \alpha, \\ -1, & \text{if } F(S(i)) = \beta, \\ 0, & \text{otherwise}, \end{cases}$$

where i ranges from 1 to N. Then we let

$$Y_n^{\alpha\beta} = \sum_{i=1}^{n} u_i^{\alpha\beta}, \quad n = 1, 2, \ldots, N.$$

Taking $Y_n$ and n as the Y axis and X axis, respectively, we can draw three different curves, named the IE, IA, and EA curves of the protein sequence. The three curves give us some information about the protein sequence. The IE curve compares the numbers of amino acids belonging to the internal group and the external group at different positions. The IA curve compares the numbers of amino acids belonging to the internal group and the ambivalent group at different positions. Finally, the EA curve compares the numbers of amino acids of the external group and the ambivalent group at different positions.

According to the above definitions, we drew the three curves of the ND6 proteins for the eight species (Fig. 1). Fig. 1 shows that the ND6 protein sequences contain more amino acids of the internal group than of the external group, and more amino acids of the ambivalent group than of the external group. Furthermore, G. seal and H. seal have similar curves, the curves of rat and mouse are almost identical, and the three curves of human, gorilla, and chimpanzee are similar, whereas the curves of wallaroo differ from those of the other species.

A protein sequence is composed of three parts, the internal, external, and ambivalent groups, so we regard it as a random numerical sequence over three values (+1, 0, −1). We calculate conditional probabilities, which serve as an invariant to quantify protein sequences. For example, let $X_i^{\mathrm{IE}}$ represent the state at the ith (i=1, 2, ..., N) moment, with state space S={+1, 0, −1}. There are nine conditional probabilities:

$$P\left(X_{i+1}^{\mathrm{IE}} = b \mid X_i^{\mathrm{IE}} = a\right), \quad a, b \in \{+1, 0, -1\}.$$

According to the above definition, we can obtain these conditional probabilities for a given protein sequence. The conditional probabilities of each of the ND6 proteins are listed in Table 1.

Given two protein sequences, we obtain two nine-component vectors whose elements are the conditional probabilities of each protein sequence. Based on these vectors, we can compare different protein sequences. In general, the similarity of two vectors can be obtained by calculating their Euclidean distance. The smaller the Euclidean distance between two vectors, the more similar the protein sequences. The Euclidean distance between two vectors u and v is

$$d(u, v) = \sqrt{\sum_{i=1}^{k} (u_i - v_i)^2},$$

where $u_i$ and $v_i$ denote the components of vectors u and v, respectively, and k is the dimension of u and v. Yao et al. (2009) proposed another similarity measure for sequences, the coefficient of determination ($r^2$), defined as

$$r^2 = \frac{\left[\sum_{i=1}^{k} (u_i - \bar{u})(v_i - \bar{v})\right]^2}{\sum_{i=1}^{k} (u_i - \bar{u})^2 \sum_{i=1}^{k} (v_i - \bar{v})^2},$$

where $\bar{u}$ and $\bar{v}$ are the means of the components of u and v. $r^2$ ranges from 0 to 1 and represents the proportion of the data that is closest to the line of best fit. The larger the coefficient of determination of two vectors, the more similar the protein sequences. In Tables 2 and 3, we give the similarity/dissimilarity matrices for the eight ND6 sequences based on the Euclidean distance and the coefficient of determination between the nine-component vectors.
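The curves, the nine conditional probabilities, and the two similarity measures described above can be sketched together in Python. This is an illustrative sketch under stated assumptions, not the authors' code. The input is taken to be a string over the letters I, E, and A; the step tables in `STEPS` (first-named group +1, second-named group −1, remaining group 0) are an assumed reading of the curve definitions; a state that is never followed by another symbol contributes three zeros; and `r_squared` is taken to be the squared Pearson correlation of the two vectors. All function names are hypothetical.

```python
import math
from itertools import accumulate

# Assumed step values per curve: first-named group +1, second-named group -1.
STEPS = {
    "IE": {"I": 1, "E": -1, "A": 0},
    "IA": {"I": 1, "A": -1, "E": 0},
    "EA": {"E": 1, "A": -1, "I": 0},
}
STATES = (1, 0, -1)


def curve(letters, kind):
    """Cumulative sums Y_1..Y_N of the step values for one curve."""
    return list(accumulate(STEPS[kind][c] for c in letters))


def transition_vector(letters, kind="IE"):
    """Nine conditional probabilities P(X_{i+1}=b | X_i=a), a and b in STATES."""
    x = [STEPS[kind][c] for c in letters]
    counts = {(a, b): 0 for a in STATES for b in STATES}
    for a, b in zip(x, x[1:]):
        counts[(a, b)] += 1
    vector = []
    for a in STATES:
        total = sum(counts[(a, b)] for b in STATES)
        for b in STATES:
            # Assumption: a state with no outgoing transition yields zeros.
            vector.append(counts[(a, b)] / total if total else 0.0)
    return vector


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def r_squared(u, v):
    """Squared Pearson correlation; undefined (raises) for a constant vector."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv ** 2 / (suu * svv)


reduced = "IIAAIIIIAIAIIIAIIAIA"  # three-letter form of the example sequence
ie_curve = curve(reduced, "IE")
vec = transition_vector(reduced, "IE")
```

To compare two proteins, one would compute `transition_vector` for each reduced sequence and pass the two vectors to `euclidean` or `r_squared`; repeating this over all pairs gives matrices of the kind reported in Tables 2 and 3.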
As shown in Tables 2 and 3, the ND6 proteins of human, gorilla, and chimpanzee are the most similar to each other. In addition, the ND6 proteins within the pairs (G. seal, H. seal) and (mouse, rat) are similar to each other. However, the ND6 protein of wallaroo is very dissimilar to those of the other seven species. These results are consistent with known evolutionary relationships (Yao et al., 2009).

Biological sequence analysis is a fundamental task in computational biology, whose aim is to detect similarity/dissimilarity relationships between molecular sequences. Some alignment-free methods for analyzing the similarity/dissimilarity of DNA sequences have been proposed. However, there are few alignment-free methods for analyzing protein sequences. The amino acid sequence of a protein is the key to understanding its structure and function in the cell, so we present a new method for analyzing protein primary sequences in this paper. The method is based on a graphical representation, with conditional probability taken as the numerical characterization of the protein sequence. The significance of the new method is that it can not only analyze the similarity/dissimilarity of protein sequences, but also provide more biological information about them. The IE curve compares the numbers of amino acids of the internal and external groups at different positions. The IA curve compares the numbers of amino acids of the internal and ambivalent groups at different positions. The EA curve compares the numbers of amino acids of the external and ambivalent groups at different positions. The three curves therefore show the distribution of the three types of amino acids. Furthermore, the conditional probabilities reflect the distribution of pairs of adjacent amino acids.
The new approach was applied to the ND6 protein sequences of several species, and the results show that introducing the hydropathy profile of amino acids into protein sequence analysis is effective and feasible.

A 2-D graphical representation of protein sequences based on nucleotide triplet codons
A representation of DNA primary sequences by random walk
A 3D graphical representation of RNA secondary structures based on chaos game representation
Novel DNA sequence representation
H curves, a novel method of representation of nucleotide series especially suited for long DNA sequences
2-D graphical representation of protein sequences and its application to coronavirus phylogeny
Simplification of protein sequence and alignment-free sequence analysis
Analysis of similarity of DNA sequences based on 3D graphical representation
Application of 2D graphical representation of DNA sequence
Protein-based phylogenetic analysis by using hydropathy profile of amino acids
Enzymes/non-enzymes classification model complexity based on composition, sequence, 3D and topological indices
A new graphical representation and analysis of DNA sequence structure: I. Methodology and application to globin genes
Two-dimensional graphical representation of DNA sequences and intron-exon discrimination in intron-rich sequences
Simple numerical descriptor for quantifying effect of toxic substances on DNA sequences
Mathematical descriptors of DNA sequences: development and applications
Molecular Evolution and Phylogenetics
A probabilistic measure for alignment-free sequence comparison
Condensed representation of DNA primary sequences
2-D Graphical representation of proteins based on physico-chemical properties of amino acids
Characterization of 3-D sequences of proteins
On a four-dimensional representation of DNA primary sequences
On the characterization of DNA primary sequences by triplet of nucleic acid bases
Alignment-free sequence comparison - a review
A 2D graphical representation of protein sequence and its numerical characterization
Analysis of similarity/dissimilarity of protein sequences
Similarity/dissimilarity studies of protein sequences based on a new 2D graphical representation
A protein map and its application
The Z curve database: a graphic representation of genome sequences

We would like to thank Dr. Jian-gang WANG (College of Animal Science and Technology, Northwest A&F University, China), Jian-zhong LUO (Department of Foreign Languages, Northwest A&F University, China), Feng AN and Dr. Jun-li DU (College of Sciences, Northwest A&F University, China) for their helpful suggestions.