key: cord-0894175-o1k0q78x authors: Zhao, Sheng; Zhang, Qin; Liu, Xiaolin; Wang, Xuemin; Zhang, Huilin; Wu, Yan; Jiang, Fei title: Analysis of synonymous codon usage in 11 Human Bocavirus isolates date: 2008-02-21 journal: Biosystems DOI: 10.1016/j.biosystems.2008.01.006 sha: 1f547132360d838d92c6fbb7bc264c92375bb85e doc_id: 894175 cord_uid: o1k0q78x
Human Bocavirus (HBoV) is a novel virus that can cause respiratory tract disease in infants or children. In this study, the codon usage bias and the base composition variations in the 11 available complete HBoV genome sequences have been investigated. Although there is significant variation in codon usage bias among different HBoV genes, codon usage bias in HBoV is slight, as judged by the base composition at the third codon position and the effective number of codons (ENC) values. The results of correspondence analysis (COA) and Spearman's rank correlation analysis reveal that the G + C compositional constraint is the main factor that determines the codon usage bias in HBoV and that gene function also contributes to the codon usage in this virus. Moreover, the hydrophobicity of each protein and the gene length were also found to be critical in affecting these viruses' codon usage, although they were less important than the mutational bias and gene function. Finally, the relative synonymous codon usage (RSCU) of 44 genes from these 11 HBoV isolates was analyzed using a hierarchical cluster method. The result suggests that genes with the same function but from different isolates are classified into the same lineage, independent of geographical location. These conclusions not only offer an insight into the codon usage patterns and gene classification of HBoV, but may also help in increasing the efficiency of gene delivery/expression systems.
Abbreviations: bp, base pair; HBoV, Human Bocavirus; RSCU, relative synonymous codon usage; ENC, effective number of codons; COA, correspondence analysis; GC3S, the frequency of G + C at the synonymous third position of sense codons; A3S, T3S, G3S and C3S, the adenine, thymine, guanine and cytosine content at synonymous third positions; S.D., standard deviation.
* Corresponding author. Tel.: +86 29 87092158; fax: +86 29 87092164. E-mail addresses: harryzs1981@yahoo.com.cn, xiaolinliu2000@sina.com (X. Liu).
In general, synonymous codons are not used equally, and different genomes have their own characteristic patterns of synonymous codon usage (Grantham et al., 1980; Nakamura et al., 1991). In Escherichia coli, Bacillus subtilis, Saccharomyces cerevisiae, Dictyostelium discoideum (Bulmer, 1988; Sharp et al., 1993), Drosophila melanogaster (Shields et al., 1988) and Caenorhabditis elegans (Stenico et al., 1994), compositional constraints and translational selection have been found to be the main factors accounting for codon usage variation among genes. However, in some genomes with extremely high A + T or G + C contents (Karlin and Mrazek, 1996; Sharp et al., 1993; Zhao et al., 2007; Zhong et al., 2007), mutation bias is the major factor accounting for the variation in codon usage. Recently, codon usage was suggested to be related to gene function (Fuglsang, 2003; Liu et al., 2005; Ma et al., 2002b) and protein secondary structure (Griswold et al., 2003; Kahali et al., 2007; Ma et al., 2002a). Codon usage information has also been analyzed in different viruses. For example, a survey of the patterns of synonymous codon preference in human immunodeficiency virus (HIV) revealed that HIV has a marked codon usage bias due to its strong preference for the A nucleotide (Chou and Zhang, 1992). It was also found that codon usage appears to be simply a consequence of uneven base composition in nucleopolyhedroviruses (Levin and Whittome, 2000).
Moreover, in Mimivirus genes, codon usage bias is dictated both by mutational pressure and translational selection, and evidence shows that four factors, namely mean molecular weight (MMW), hydropathy, aromaticity and cysteine content, are mostly responsible for the variation of amino acid usage in Mimivirus proteins (Sau et al., 2006). Published studies are mostly restricted to particular groups of viruses and have usually addressed phylogenetic questions (Berkhout et al., 2002; Gu et al., 2004; Zhou et al., 2005). In 2005, a study of the coding sequences of RNA viruses and their genome polarity showed that positive-stranded RNA viruses have significantly higher GC contents than negative-stranded RNA viruses. Coding sequences of all negative-stranded RNA viruses are biased toward high A in coding strands (high T in genomes) (Auewarakul, 2005). A recent study showed that genome-wide mutational pressure, rather than natural selection for specific coding triplets, is the main determinant of codon usage in vertebrate-infecting DNA viruses (Shackelton et al., 2006). Studies of synonymous codon usage in viruses can reveal information about the molecular evolution of individual genes. Such information is relevant to understanding the regulation of viral gene expression and to vaccine design, where the efficient expression of viral proteins may be required to generate immunity (Hassard and Ward, 1995; Jenkins and Holmes, 2003). In 2005, a novel respiratory virus was discovered by molecular methods in children with respiratory tract infections in Sweden and was subsequently named Human Bocavirus (HBoV) (Allander et al., 2005).
Phylogenetic analyses of the complete genome of HBoV revealed that the virus is most closely related to canine minute virus and bovine parvovirus, which are members of the Bocavirus genus of the Parvoviridae family (Allander et al., 2005). The parvovirus genome consists of two major ORFs encoding a nonstructural protein (NS1) and at least two capsid proteins (VP1 and VP2), respectively. In addition, HBoV has a third, middle ORF encoding a nonstructural protein (NP1) of unknown function (Allander et al., 2005). HBoV is currently being detected in patients with respiratory disease in several countries, suggesting that HBoV may be circulating worldwide (Bastien et al., 2006; Phillips et al., 1987; Simon et al., 2007). The relative importance of HBoV in viral respiratory tract illnesses is still not known, but it has been associated with respiratory illnesses ranging from upper respiratory tract disease (24%) to severe bronchiolitis (11-26%) and pneumonia (17-33%) (Bastien et al., 2006, 2007; Phillips et al., 1987). Preliminary reports have also suggested that the HBoV detection rate in children with respiratory tract infection is approximately 1.5-18.3%, and it seems that this virus is associated with respiratory tract illness, especially in infants and young children (Fryxell and Zuckerkandl, 2000; Ma et al., 2006; Sloots et al., 2006; Weissbrich et al., 2006). Although the genome sequence of HBoV has been published and many studies have been performed on it in recent years (Allander et al., 2005; Lu et al., 2006), few genomic analyses are available for this virus (Chieochansin et al., 2007). In particular, no in-depth genomic analyses have so far been made of codon usage, which may provide more information on the features of the HBoV genome. In this study, we have analyzed and compared the codon usage data of the 11 available complete genome sequences of HBoV.
Such information not only offers an insight into the codon usage patterns of HBoV, but may also help in increasing the efficiency of gene delivery/expression systems.
The 11 available complete genome sequences of HBoV (listed in Table 1) were downloaded from NCBI (http://www.ncbi.nlm.nih.gov/). To minimize sampling error, we took only those genes that are greater than or equal to 300 bp and have no internal termination codons. Finally, 44 genes were selected for analysis (Table 2). Correlation analysis was carried out using Spearman's rank correlation method. To compare the variation of codon usage between different gene groups, a one-tailed t-test was used. Cluster analysis was done using a hierarchical cluster method, and the distances between selected sequences were calculated by the Euclidean distance method.
Relative synonymous codon usage (RSCU) values were calculated by dividing the observed codon usage by that expected when all codons for the same amino acid are used equally, which normalizes codon usage within datasets of different amino acid compositions (Paul and Wen-Hsiung, 1986). The 'effective number of codons' (ENC) is often used to measure the magnitude of codon bias for an individual gene. It yields values ranging from 20, for a gene with extreme bias using only one codon per amino acid, to 61, for a gene with no bias using synonymous codons equally (Wright, 1990). The GC3S value is the frequency of G + C at the third synonymously variable coding position (excluding Met, Trp, and termination codons), which is a good indicator of the extent of base composition bias. Similarly, GC1S and GC2S are the frequencies of G + C at the synonymous first and second positions, respectively. The GRAVY score, which is the mean hydropathy index of the encoded amino acid residues and hence an estimate of overall hydrophobicity (Kyte and Doolittle, 1982), was computed for each gene product.
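As an illustration of the RSCU and GC3S definitions above, the following Python sketch computes both indices for a single coding sequence. It is not the CodonW implementation used in this study: the function names (`count_codons`, `rscu`, `gc3s`) are hypothetical, and the sketch assumes the standard genetic code and a sequence that starts in frame.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons ordered by first, second, third base over T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Synonymous families: the 59 sense codons left after excluding Met, Trp and stops.
FAMILIES = defaultdict(list)
for codon, aa in CODON_TABLE.items():
    if aa not in "MW*":
        FAMILIES[aa].append(codon)


def count_codons(cds):
    """Count complete in-frame codons; codons with ambiguous bases are skipped."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3
    codons = (cds[i:i + 3] for i in range(0, usable, 3))
    return Counter(c for c in codons if c in CODON_TABLE)


def rscu(cds):
    """Observed count of each codon divided by the count expected under equal usage."""
    counts = count_codons(cds)
    values = {}
    for family in FAMILIES.values():
        total = sum(counts[c] for c in family)
        if total == 0:
            continue  # amino acid absent from this gene, RSCU undefined
        expected = total / len(family)
        for c in family:
            values[c] = counts[c] / expected
    return values


def gc3s(cds):
    """Fraction of synonymous codons (Met, Trp, stops excluded) ending in G or C."""
    counts = count_codons(cds)
    synonymous = [c for family in FAMILIES.values() for c in family]
    total = sum(counts[c] for c in synonymous)
    gc_ending = sum(counts[c] for c in synonymous if c[2] in "GC")
    return gc_ending / total if total else float("nan")
```

For the toy sequence `ATGGATGATGATGAC`, Asp is encoded three times by GAT and once by GAC, so `rscu` gives 1.5 for GAT and 0.5 for GAC, and `gc3s` gives 0.25 because the Met codon is excluded.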
Correspondence analysis (COA) is a commonly used method of multivariate statistical analysis, in which all genes are plotted in a 59-dimensional hyperspace according to their usage of the 59 sense codons (excluding Met, Trp, and termination codons). Major trends within the dataset can be determined using measures of relative inertia, and genes can be ordered according to their positions along the axis of major inertia. This method has been successfully used to investigate the variation of RSCU values among genes (Sau et al., 2006; Shackelton et al., 2006; Zhao et al., 2003). The RSCU, GC3S, ENC, G + C, GRAVY, gene length and COA values were calculated using the program CodonW Version 1.4 (http://codonw.sourceforge.net). The correlation analysis and cluster analysis were carried out using the statistical software SPSS Version 13.0 (http://spss.com).
The details of the genes and the overall RSCU values of the 59 codons in the 11 HBoV isolates are shown in Table 2 and Table 3, respectively. Most of the preferentially used codons in HBoV are A- or U-ended codons (Table 3). These HBoV isolates are GC-poor genomes with an average GC content of 42%. Due to compositional constraints, A- and/or U-ended codons are expected to be preferentially used in this genome. It is also interesting to note that UAC is the most used codon among these 44 genes, while GAC is the most used in all NS1 and VP2 genes. To study the codon usage variation among different HBoV genes, the ENC and GC3S values of the genes were calculated (Table 2). The ENC values of different HBoV genes vary from 40.87 to 48.42, with a mean value of 44.45 and S.D. of 2.89. All ENC values are greater than 40. These data suggest that synonymous codon usage is homogeneous among the HBoV genes examined. This is further supported by the GC3S values of the HBoV genes, which range from 29% to 40% with a mean of 33% and S.D. of 0.04.
To investigate the variation of RSCU values among genes, correspondence analysis (COA) was applied to the 44 HBoV genes as a single dataset, based on the RSCU values of each gene. As mentioned, the axes of a correspondence analysis identify the sources of variation among a set of multivariate data points. Four major trends in codon usage among these genes were observed: the first axis accounts for 61.20% of all variation among genes, whereas the next three axes account for 35.43%, 1.62% and 0.55%, respectively.
To investigate whether the evolution of codon usage bias is controlled by mutation pressure or by natural selection, three analyses were performed. Firstly, G + C content at the first and second codon positions (GC12) was compared with that at synonymous third codon positions (GC3S) (Fig. 1). A highly significant correlation was observed (r = −0.837, P < 0.05) using Spearman's rank correlation method, indicating that patterns of base composition are most likely the result of mutation pressure and not natural selection, since the effects are present at all codon positions. Secondly, for each gene, the actual ENC value was plotted against GC3S together with the ENC value expected if codon usage bias were solely due to biased base composition (i.e. G + C content). The actual codon usage indices are close to the values expected from their G + C composition, although all are slightly lower (Fig. 2). Thirdly, we plotted the first and second axis values in COA against the GC3S values of each strain (Fig. 3). The patterns of codon usage in different genes also appear to be closely related to the GC content at the third codon position. Correlation analysis was then applied to each gene to look for correlations between synonymous codon usage and nucleotide composition.
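Two of the calculations described above can be sketched in Python as follows. The sketch assumes that the expected ENC curve is the approximation given by Wright (1990), ENC = 2 + s + 29 / (s^2 + (1 − s)^2), where s is GC3S expressed as a fraction, and it computes Spearman's coefficient as the Pearson correlation of average ranks. The function names are hypothetical; this is not the CodonW or SPSS code used in the study.

```python
import math


def expected_enc(gc3s):
    """ENC expected from GC3S alone (a fraction in [0, 1]), after Wright (1990)."""
    if not 0.0 <= gc3s <= 1.0:
        raise ValueError("gc3s must be a fraction between 0 and 1")
    return 2.0 + gc3s + 29.0 / (gc3s ** 2 + (1.0 - gc3s) ** 2)


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)
```

For example, `expected_enc(0.5)` returns 60.5, and `spearman` returns 1.0 for any two strictly increasing lists of equal length. Genes whose observed ENC lies below `expected_enc` of their GC3S fall under the expected curve in an ENC plot such as Fig. 2.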
We also found that axis 1 coordinates are correlated with GC3S and GC (r = −0.918, P < 0.01; r = −0.585, P < 0.01), while there is a significant correlation between axis 2 values and GC3S (r = 0.366, P < 0.05). Taken together, these analyses indicate that most of the codon usage bias among these HBoV genes is directly related to the nucleotide composition. Furthermore, mutational bias is the major factor responsible for the variation of synonymous codon usage among genes in these virus genomes.
It is clear from Table 2 that functionally homologous genes in different viral genomes tend to have similar values on the first axis in COA. Because the closeness of any two genes on this axis reflects the similarity of their codon usage, synonymous codon usage bias appears to be conserved between genes that are functionally closely related (Zhou et al., 2005). To detect whether gene function was correlated with the observed variation in codon bias, all genes were grouped into classes according to gene function. Because most of these viruses contain genes coding for a nonstructural protein (NS1) and two capsid proteins (VP1 and VP2), as well as a third, middle ORF encoding a nonstructural protein (NP1) of unknown function, these four gene groups were selected to test for a correlation between codon usage and gene function. The average ENC value and its corresponding S.D. were calculated for each group (Fig. 4). The S.D. values of the NS1, VP1, VP2 and NP1 groups were all small. A one-tailed t-test was then performed on the ENC values and on the axis 1 and axis 2 values in COA of these genes, under the null hypothesis that there is no correlation between codon usage bias and gene function (t-test, P < 0.001). The result suggests that gene function also contributes to codon usage in HBoV, although mutational bias mainly drives the codon usage in these genes. Usually, mutational bias and natural selection (i.e.
gene length and the hydrophobicity of each protein) are thought to account for the codon usage variation among genes in different organisms. To test whether any selection pressure contributes to the codon usage variation among these virus genes, and which selection pressure determines this variation, we performed a correlation analysis of the axis 1 and axis 2 coordinates against the hydrophobicity of each protein and the gene length. The results show that axis 1 and axis 2 coordinates are significantly correlated with the hydrophobicity of each protein (r = 0.709, P < 0.01; r = −0.703, P < 0.01, respectively) and with the gene length (r = 0.848, P < 0.01; r = −0.451, P < 0.01, respectively). This indicates that the hydrophobicity of each protein and the gene length are also critical in affecting these viruses' codon usage, although they are less important than the mutational bias and gene function.
Based on the RSCU variation of the 44 HBoV genes examined, a cluster tree was generated using a hierarchical cluster method. As shown in Fig. 5, the 44 genes were divided into four main lineages (I-IV). Lineage I comprised the 11 NP1 genes from the 11 HBoV isolates examined. Lineage II comprised the 11 NS1 genes. The capsid protein genes VP1 and VP2 from these 11 isolates made up lineages III and IV. Thus, genes with the same function but from different isolates are classified into the same lineage, independent of geographical location. Distances between the lineage centers are listed in Table 4. Table 4 shows that the longer the distance between two lineage centers, the bigger the difference between their codon usages. The distances between lineages of similar gene function are smaller than those between lineages of different gene function.
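The distances between lineage centers can be illustrated with a short Python sketch. It assumes that each gene is represented by a vector of RSCU values and that a lineage center is the mean vector of its member genes; the linkage rule of the hierarchical clustering is not stated in the text, so the clustering step itself is not reproduced here. The function names and the toy input are hypothetical.

```python
import math
from itertools import combinations


def euclidean(u, v):
    """Euclidean distance between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def centroid_distances(groups):
    """Map each pair of group names to the distance between the group centers.

    `groups` maps a lineage name to the list of RSCU vectors of its genes.
    """
    centers = {name: centroid(vectors) for name, vectors in groups.items()}
    return {
        (a, b): euclidean(centers[a], centers[b])
        for a, b in combinations(sorted(centers), 2)
    }
```

With the toy input `{"I": [[0.0, 0.0], [2.0, 0.0]], "II": [[1.0, 4.0], [1.0, 4.0]]}`, the centers are (1, 0) and (1, 4), so the distance reported for the pair ("I", "II") is 4.0. A smaller distance between two centers means more similar average codon usage in the two lineages.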
The distance between lineages III and IV, for example, is clearly smaller than their distances to the other lineages, such as lineages I and II. This again supports the conclusion that genes with similar functions also display similar codon usage bias.
It is well established that synonymous codon usage in various organisms often reflects a balance between mutation pressure and translational selection. However, as the genomes of many organisms have been sequenced, many studies have shown that other factors also influence the biased usage of synonymous codons. Knowledge of codon usage patterns in viruses may assist the development of polynucleotide vaccines and improve understanding of the evolution and pathogenesis of certain viruses. In 2003, a comprehensive analysis and comparison of codon usage and A + T content in 79 human papillomavirus (HPV) genotypes from three distinct phylogenetic groups revealed that all eight ORFs across HPV genotypes show a strong codon usage bias, and that a similar pattern of codon usage is observed in human and nonhuman PVs although they originate from different phylogenetic groups (Zhao et al., 2003). In this study, the evidence suggests that synonymous codon usage in HBoV is only weakly biased and that the bias is mainly determined by the base composition at the third codon position. As a case in point, the ENC values vary from 40.87 to 48.42 (S.D. = 2.89) and the GC3S values range from 29% to 40% (S.D. = 0.04). The average ENC value of 44.45 among the 44 genes can be compared to those seen in other viruses such as H5N1 virus, severe acute respiratory syndrome Coronavirus (SARS-CoV), Porcine adenovirus, Orgyia pseudotsugata multicapsid nucleopolyhedrovirus (OpMNPV) and Lymantria dispar multinucleocapsid nuclear polyhedrosis virus (LdMNPV), where mean values of 50.91, 48.99, 38.97, 38.80 and 35.90, respectively, have been reported (Das et al., 2006; Gu et al., 2004; Levin and Whittome, 2000; Zhou et al., 2005).
Therefore, taken together with published data on codon usage bias in other viruses, we conclude that codon usage in HBoV genes is only weakly biased. In human RNA viruses, mutation pressure seems to be the main force shaping codon usage, accounting for 71-85% of the observed bias (Jenkins and Holmes, 2003). To understand the common features and differences among viruses, a set of sequenced vertebrate-infecting DNA viruses has also been analyzed. This research revealed that patterns of codon usage bias are strongly correlated with overall genomic GC content, suggesting that genome-wide mutational pressure, rather than natural selection for specific coding triplets, is the main determinant of codon usage (Shackelton et al., 2006). In the Chlamydomonas reinhardtii genome (Naya et al., 2001), however, which has a high GC content, there was no evidence that genome composition shaped the codon usage of genes. In this study, the general association between codon usage bias and base composition suggests that mutational pressure, rather than natural selection, is the main factor that determines the codon usage bias in HBoV. This is also supported by the highly significant correlation between GC12 and GC3S (r = −0.837, P < 0.05) and by the result of the ENC plot (Fig. 2). A similar pattern of codon usage has been reported in some other viruses (Das et al., 2006; Jenkins and Holmes, 2003; Levin and Whittome, 2000; Shackelton et al., 2006; Zhao et al., 2003). Therefore, mutational bias is the major factor responsible for the variation of synonymous codon usage among genes in these virus genomes. Generally, natural selection, including translational selection, gene length and gene function, is thought to account for the codon usage variation among genes in different organisms (Zhou et al., 2005).
Some published results have shown that functionally homologous genes in different viral genomes tend to cluster together in COA (Das et al., 2006; Gu et al., 2004; Zhou et al., 2005). In this study, it is also clear that gene function, in addition to mutational bias, is a factor accounting for codon usage variation among these virus genes. Longer genes had a higher expression level and higher codon usage bias in the Streptococcus pneumoniae genome (Hou, 2002), but in Drosophila (Comeron et al., 1999) longer genes had lower codon usage bias. In some viruses, gene length has no effect on the variation of synonymous codon usage (Das et al., 2006; Gu et al., 2004; Levin and Whittome, 2000; Zhou et al., 2005). These results indicate that different genomes have gene lengths that suit their own requirements, and that there is no universal rule relating gene length and codon usage across genomes. In this study, gene length played a critical role in affecting HBoV codon usage. The mechanism behind this is not clear and requires a more comprehensive analysis. The hydropathy level of each protein influences codon choice in Chlamydia trachomatis and Thermotoga maritima (Romero et al., 2000; Zavala et al., 2002). Evidence shows that hydropathy, aromaticity and cysteine content are mostly responsible for the variation of amino acid usage in Mimivirus and foot-and-mouth disease virus (FMDV) (Sau et al., 2006; Zhong et al., 2007). In this study, codon usage is significantly correlated with the hydrophobicity of each HBoV gene product. The link between hydropathy and codon usage may arise because the expressed sequences are hydrophilic, as they perform their function in the aqueous media of the cell (Romero et al., 2000).
To date, phylogenetic analyses of HBoV have been performed on the complete coding sequence and on the NS1, NP1, VP1 and VP2 genes (Bastien et al., 2007; Chieochansin et al., 2007; Mandal et al., 2007; Neske et al., 2007; Qu et al., 2007). These analyses indicated that the NS/NP1 genes are the most conserved regions and thus do not reveal differences between HBoV isolates; most nucleotide sequence variation appeared in the VP1/VP2 genes encoding the capsid proteins, and the variation between HBoV isolates does not depend on geographical location. In this study, cluster analyses based on the RSCU values of the 44 HBoV genes examined were carried out using a hierarchical cluster method. The result indicates that gene function is the dominant factor determining the outcome of the cluster analysis and suggests that the cluster pattern of HBoV does not correlate with geographic variation. This conclusion has promising applications in gene classification and the prediction of gene function.
Our analysis revealed that, although there is some variation in codon usage bias among different HBoV isolates, codon usage bias in HBoV is low. Clearly, a more comprehensive analysis is needed to reveal the true extent of codon usage bias variation within and among HBoV isolates and to identify what other factors are responsible, including cell tropism, principal host species, method of transmission, and viral genetic structure. Such information would then allow us to judge more precisely the relative importance of mutation pressure versus natural selection in determining base composition and codon usage in these pathogens (Jenkins et al., 2001). The codon usage patterns and the phylogenetic results proposed here are useful for understanding the processes governing the evolution of HBoV, especially the roles played by mutation pressure and natural selection.
Further, such information not only can offer an insight into the codon usage patterns and gene classification of HBoV, but also may help in increasing the efficiency of gene delivery/expression systems.
Cloning of a human parvovirus by molecular screening of respiratory tract samples
Composition bias and genome polarity of RNA viruses
Human Bocavirus infection
Detection of Human Bocavirus in Canadian children in a 1-year study
Codon and amino acid usage in retroviral genomes is consistent with virus-specific nucleotide pressure
Are codon usage patterns in unicellular organisms determined by selection-mutation balance?
Complete coding sequences and phylogenetic analysis of Human Bocavirus (HBoV)
Diagrammatization of codon usage in 339 human immunodeficiency virus proteins and its biological implication
Natural selection on synonymous sites is correlated with gene length and recombination in Drosophila
Synonymous codon usage in adenoviruses: influence of mutation, selection and protein hydropathy
Cytosine deamination plays a primary role in the evolution of mammalian isochors
Strong associations between gene function and codon usage
Codon frequencies in 119 individual genes confirm consistent choices of degenerate bases according to genome type
Effects of codon usage versus putative 5'-mRNA structure on the expression of Fusarium solani cutinase in the Escherichia coli cytoplasm
Analysis of synonymous codon usage in SARS Coronavirus and other viruses in the Nidovirales
Efficient creation of sequencing libraries from blunt-ended restriction enzyme fragments
Analysis of factors shaping S. pneumoniae codon usage
The extent of codon usage bias in human RNA viruses and its evolutionary origin
Evolution of base composition and codon usage bias in the genus Flavivirus
Reinvestigating the codon and amino acid usage of S. cerevisiae genome: a new insight from protein secondary structure analysis
What drives codon choices in human genes?
A simple method for displaying the hydropathic character of a protein
Codon usage in nucleopolyhedroviruses
Synonymous codon usage and gene function are strongly related in Oryza sativa
Real-time PCR assays for detection of Bocavirus in human specimens
Correlations between Shine-Dalgarno sequences and gene features such as predicted expression levels and operon structures
Cluster analysis of the codon use frequency of MHC genes from different species
Detection of Human Bocavirus in Japanese children with lower respiratory tract infections
Gag processing defect of HIV-1 integrase E246 and G247 mutants is caused by activation of an overlapping 5' splice site
Two types of linkage between codon usage and gene-expression levels
Translational selection shapes codon usage in the GC-rich genome of Chlamydomonas reinhardtii
Real-time PCR for Human Bocavirus infections and phylogenetic analysis
An evolutionary perspective on synonymous codon usage in unicellular organisms
The effect of codon usage on the oligonucleotide composition of the E. coli genome and identification of over- and underrepresented sequences by Markov chain analysis
Human Bocavirus infection, People's Republic of China
Codon usage in Chlamydia trachomatis is the result of strand-specific mutational biases and a complex pattern of selective forces
Factors influencing synonymous codon and amino acid usage biases in Mimivirus
Evolutionary basis of codon usage and nucleotide composition bias in vertebrate DNA viruses
Codon usage: mutational bias, translational selection, or both?
"Silent" sites in Drosophila genes are not neutral: evidence of selection among synonymous codons
Detection of Bocavirus DNA in nasopharyngeal aspirates of a child with bronchiolitis
Evidence of human coronavirus HKU1 and Human Bocavirus in Australian children
Codon usage in Caenorhabditis elegans: delineation of translational selection and mutational biases
Frequent detection of Bocavirus DNA in German children with respiratory tract infections
The 'effective number of codons' used in a gene
Trends in codon and amino acid usage in Thermotoga maritima
Codon usage bias and A+T content variation in human papillomavirus genomes
The factors shaping synonymous codon usage in the genome of Burkholderia mallei
Mutation pressure shapes codon usage in the GC-Rich genome of foot-and-mouth disease virus
Analysis of synonymous codon usage in H5N1 virus and other influenza A viruses
Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.biosystems.2008.01.006.