key: cord-0005049-28mrzvsc authors: Pavesi, Angelo; De Iaco, Bettina; Granero, Maria Ilde; Porati, Alfredo title: On the Informational Content of Overlapping Genes in Prokaryotic and Eukaryotic Viruses date: 1997 journal: J Mol Evol DOI: 10.1007/pl00006185 sha: 5ebac947cdeae4566f2a54feefa7caace61cf756 doc_id: 5049 cord_uid: 28mrzvsc In genetic language a peculiar arrangement of biological information is provided by overlapping genes in which the same region of DNA can code for functionally unrelated messages. In this work, the informational content of overlapping genes belonging to prokaryotic and eukaryotic viruses was analyzed. Using information theory indices, we identified in the regions of overlap a first pattern, exhibiting a more uniform base composition and more severe constraints in base ordering with respect to the nonoverlapping regions. This pattern was found to be peculiar to coliphage, avian hepatitis B virus, human lentivirus, and plant luteovirus families. A second pattern, characterized by the occurrence of similar compositional constraints in both types of coding regions, was found to be limited to plant tymoviruses. At the level of codon usage, a low degree of correlation between overlapping and nonoverlapping coding regions characterized the first pattern, whereas a close link was found in tymoviruses, indicating a fine adaptation of the overlapping frame to the original codon choice of the virus. As a result of codon usage correlation analysis, deductions concerning the origin and evolution of several overlapping frames were also proposed. Comparison of amino acid composition revealed an increased frequency of amino acid residues with a high level of degeneracy (arginine, leucine, and serine) in the proteins encoded by overlapping genes; this peculiar feature of overlapping genes can be viewed as a way with which they may expand their coding ability and gain new, specialized functions. 
A particular issue in the statistical analysis of genomic DNA sequences concerns the characterization of codes and semantic patterns in the genetic language (Trifonov 1989; Smith 1989). In this language, overlapping genes represent an unusual pattern, as two, or exceptionally three, out-of-phase reading frames may lie in a single nucleotide sequence. Such an arrangement, called "overprinting," is frequent in viruses, where it probably evolved to increase the density of genetic information (Lamb and Horvath 1991). The first genes of this type were identified by Barrell and co-workers (1976) in the genome of φX174, a single-stranded DNA phage, and similar overlapping regions were later detected in many other genes belonging to DNA or RNA viruses of both prokaryotes and eukaryotes (Normark et al. 1983; Samuel 1989 and references therein). Translation of the different reading frames has been shown to be mediated by ribosomal frameshifting, which requires an upstream site of ribosomal slippage and a downstream stem-loop structure known as an "RNA pseudoknot" (Jacks et al. 1988; Wilson et al. 1988; Brierley et al. 1989). On the other hand, translation of multiple reading frames can occur simply by internal de novo initiation in an alternative frame and does not require ribosomal frameshifting (Atkins et al. 1979; Chang et al. 1989).

Originally developed to maximize the efficiency of transmission of electronic signals, information theory (Shannon and Weaver 1949) was later utilized to evaluate the complexity of DNA sequences (Gatlin 1968, 1972). In past years, several papers have dealt with the connection between information theory and the analysis of overlapping coding regions (Yockey 1979; Granero-Porati et al. 1980; Smith and Waterman 1981).
Other studies have addressed the problem of the evolution of the overlapping arrangement (Miyata and Yasunaga 1978; Soeda and Maruyama 1982; Keese and Gibbs 1992) and the restrictions imposed on proteins encoded by the overlaid genetic messages (Sander and Schulz 1979; Smith and Waterman 1980). Here, we present an analysis at different levels of complexity (divergence from randomness of mono- and dinucleotide composition, choice of synonymous codons, and frequency of occurrence of amino acid residues) of the informational content of overlapping genes. Using information theory indices and statistical methods of sequence analysis, the constraints acting on overlapping coding regions were quantitatively evaluated and compared to those occurring in the nonoverlapping regions belonging to the same viral genomes. Results obtained from the information theory approach and those derived from codon usage and amino acid composition correlation analyses are discussed in terms of evolution of overlapping genes.

The nucleotide sequences of the complete genomes of three prokaryotic (φX174, G4, and α3 coliphages), five animal (two avian hepatitis B viruses and three different strains of the lentivirus human immunodeficiency virus type 1), and four plant viruses (beet and barley luteoviruses, turnip and eggplant tymoviruses), all containing a high density of overlapping coding regions, were selected from the EMBL database (Rice et al. 1993). The genomic map of the five virus families is reported in Fig. 1. Divergence from randomness at the level of mono- and dinucleotide composition was evaluated, respectively, by the informational indices D1 and D2 (Gatlin 1968):

$$H = -\sum_{i=1}^{n} p_i \log_2 p_i$$

where n is equal to 4 (number of symbols in the genetic language) and p_i is the relative frequency of base "i" in a sequence under examination. Entropy H is measured in bits per symbol and its maximum value, H_max, corresponds to a 25% frequency for each base, equalling 2 bits/symbol.
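As an illustration of the entropy measure just described, the following minimal Python sketch computes H and the resulting divergence from maximum entropy for a nucleotide string. The function names `shannon_entropy` and `d1_index` are hypothetical, and the choice to ignore any character other than A, C, G, or T is an assumption of this sketch, not something stated in the paper.

```python
import math
from collections import Counter

BASES = "ACGT"  # n = 4 symbols in the genetic language


def shannon_entropy(seq):
    """Return H = -sum(p_i * log2(p_i)) in bits per symbol.

    Assumption of this sketch: characters other than A, C, G, T are ignored.
    """
    counts = Counter(b for b in seq.upper() if b in BASES)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def d1_index(seq):
    """Return the divergence from maximum entropy, H_max - H, with H_max = log2(4)."""
    return math.log2(len(BASES)) - shannon_entropy(seq)
```

A sequence with equal base frequencies gives a value of zero, and a homopolymer gives the maximum of 2 bits per symbol.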
The D1 index represents the divergence from maximum entropy due to constraints on mononucleotide composition:

$$D_1 = H_{max} - H$$

The D2 index is given by

$$D_2 = \sum_{i=1}^{n} \sum_{j=1}^{n} p_{ij} \log_2 \frac{p_{ij}}{p_i \, p_j}$$

where p_ij is the relative frequency of dinucleotide "ij" in a sequence under examination. The absolute frequency of dinucleotides is calculated by moving along the sequence with steps corresponding to one nucleotide position. The D2 index measures the divergence from an independent ordering of bases, thus accounting for the constraints acting on dinucleotide composition. Therefore, for a random sequence with no order at any level we would expect values of the D1 and D2 indices nearly equal to zero.

The additional step of our analysis takes into account the comparison, at the level of both codon usage and amino acid composition, between overlapping and nonoverlapping coding regions. Correlation analysis of the codon choice was carried out using both the Relative Synonymous Codon Usage (RSCU) index (Sharp and Li 1987) and the Pearson correlation coefficient r. The RSCU value for each of the 59 degenerate codons was calculated as follows:

$$\mathrm{RSCU} = \frac{N_{codon}}{N_{aminoacid}} \times D$$

where N_codon is the total number of times a given codon is used in a given coding region, N_aminoacid is the absolute frequency of the amino acid specified by that codon and its synonyms, and D is the degeneracy of that amino acid (when all synonyms are used with equal frequencies, an RSCU value of 1 for each codon is expected). A set of RSCU values obtained from a given overlapping gene was then compared with that of the nonoverlapping regions of the same viral genome by means of the Pearson correlation coefficient (r), whose values, ranging from −1 to 1, reflect a completely discordant or concordant degree in the usage of synonymous codons, respectively. At the level of composition in amino acid residues, the degree of similarity between proteins encoded by overlapping and nonoverlapping genes was assessed by the chi-square test (Snedecor and Cochran 1967).
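The dinucleotide index described above can be sketched in Python as follows. This is a minimal illustration that assumes the mutual-information form of Gatlin's D2, that is, the sum over all dinucleotides of p_ij log2(p_ij / (p_i p_j)), with dinucleotides counted in overlapping steps of one position; the exact expression used by the authors may be written differently. The function name `d2_index` is hypothetical.

```python
import math
from collections import Counter

BASES = "ACGT"


def d2_index(seq):
    """Divergence from independent base ordering, in bits.

    Assumed form: sum_ij p_ij * log2(p_ij / (p_i * p_j)), where p_i are the
    mononucleotide frequencies of the whole sequence and p_ij the frequencies
    of dinucleotides counted with a sliding step of one nucleotide.
    """
    seq = seq.upper()
    mono = Counter(b for b in seq if b in BASES)
    # Overlapping dinucleotides: positions (0,1), (1,2), (2,3), ...
    di = Counter(
        seq[k:k + 2]
        for k in range(len(seq) - 1)
        if seq[k] in BASES and seq[k + 1] in BASES
    )
    n_mono = sum(mono.values())
    n_di = sum(di.values())
    if n_di == 0:
        raise ValueError("sequence too short to count dinucleotides")
    total = 0.0
    for pair, count in di.items():
        p_ij = count / n_di
        p_i = mono[pair[0]] / n_mono
        p_j = mono[pair[1]] / n_mono
        total += p_ij * math.log2(p_ij / (p_i * p_j))
    return total
```

Because the mononucleotide frequencies come from the whole sequence while the last base starts no dinucleotide, short sequences can show a small end effect; for genome-sized regions this is negligible.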
Data were arranged in a 2 × 2 contingency table to identify amino acid residues whose frequency of occurrence in proteins encoded by overlapping genes is significantly higher than that observed in the nonoverlapping counterpart.

From each of the 12 viral genomes under examination, two sets of data, including overlapping and nonoverlapping genes, respectively, were obtained and the constraints acting on base composition and base ordering were evaluated, respectively, by the D1 and D2 indices, whose values are reported in Table 1. In eight cases, including α3, G4, and φX174 coliphages (BACALPHA, MIG4XX, PHIX174), duck and heron hepatitis B viruses (HBDGENM, HBHCG), and strains of the HIV-1 lentivirus family (HIVBRUCG, HIVCAM1, HIVNDK), the D1 value of overlapping regions was found to be smaller than the D1 value of the nonoverlapping part. In two cases, corresponding to barley and beet luteoviruses (BYDCG, BWYVFL1), the D1 values appeared to be similar and near to zero. The exception was represented by the family of eggplant and turnip tymoviruses (TYMVCG, MTYRPVP), with a D1 value of the overlapping sequences higher than the D1 value of the nonoverlapping ones. Moreover, the D1 value considerably different from zero obtained from both types of coding regions in plant tymoviruses reflects the highest divergence from a random base composition in the set of sequences considered in our analysis. When analyzed with respect to the divergence from an independent ordering of bases (Table 1), all the overlapping sequences exhibited, with the exception of tymoviruses, a higher D2 index value, as compared with the nonoverlapping counterpart. The graphical representation (Fig.
2) of the average values of the D1 and D2 indices, calculated by grouping the 12 viruses under examination in the five corresponding families (coliphage, hepatitis B virus, HIV-1 lentivirus, luteovirus, and tymovirus), led to the identification of two different informational patterns in the viral coding sequences. The first pattern is characterized by a clear tendency to possess, in the regions of overlap, a more uniform nucleotide composition (a lower D1 value) and more severe constraints in base ordering (a higher D2 value), with respect to the nonoverlapping regions lying in the same genome. It includes four of the five families considered in this study (coliphage, hepatitis B virus, HIV-1 lentivirus, and luteovirus). In the second pattern, which appears to be limited to the family of tymoviruses, both regions show, instead, similar compositional constraints, as evidenced by a slight variation of the corresponding D2 values.

The additional step of our analysis concerned the relationship between the frequencies of synonymous codons in overlapping or nonoverlapping genes. For each of the five virus families, the nonoverlapping regions belonging to the corresponding members were combined into a single entity, while the two frames of each overlapping gene arrangement were considered as two separate [...] (Table 2) were then characterized, thus increasing the statistical relevance of our analysis. For example, the nonoverlapping set of tymoviruses (see the genomic map of tymoviruses in Fig. 1) includes the coat gene and the nonoverlapping fraction of the replicase gene of both eggplant and turnip virus. The overlapping regions of the tymovirus family were considered, instead, as two distinct sets of data, the one including the overlapping region of the replicase gene, the other the 69-kD protein gene.
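The codon usage comparison set up here (RSCU values for the 59 degenerate codons of each data set, followed by a Pearson correlation between sets) can be sketched as below. This is an illustrative Python outline, not the authors' program: the function names are hypothetical, the standard genetic code is assumed, and codons of amino acids absent from either data set are skipped when computing r, which is an assumption the paper does not address.

```python
import math
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(_BASES, repeat=3), _AMINO)}

SYNONYMS = {}
for _codon, _aa in CODON_TABLE.items():
    if _aa != "*":
        SYNONYMS.setdefault(_aa, []).append(_codon)

# 59 codons: all sense codons except those for Met and Trp.
DEGENERATE = sorted(c for cs in SYNONYMS.values() if len(cs) > 1 for c in cs)


def rscu(orf):
    """RSCU = N_codon * D / N_aminoacid for each degenerate codon of an in-frame ORF."""
    orf = orf.upper().replace("U", "T")
    counts = Counter(orf[i:i + 3] for i in range(0, len(orf) - 2, 3))
    values = {}
    for codon in DEGENERATE:
        family = SYNONYMS[CODON_TABLE[codon]]
        n_aa = sum(counts[c] for c in family)
        # NaN marks an amino acid that never occurs (sketch-level convention).
        values[codon] = counts[codon] * len(family) / n_aa if n_aa else float("nan")
    return values


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def rscu_correlation(orf_a, orf_b):
    """Pearson r between the RSCU profiles of two coding regions."""
    ra, rb = rscu(orf_a), rscu(orf_b)
    pairs = [(ra[c], rb[c]) for c in DEGENERATE
             if not (math.isnan(ra[c]) or math.isnan(rb[c]))]
    xs, ys = zip(*pairs)
    return pearson_r(xs, ys)
```

In this sketch a pooled nonoverlapping set would simply be the concatenation of in-frame coding sequences, passed as one string, while each overlapping frame is passed separately in its own reading frame.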
The subsequent correlation analysis (Table 2 ) evidenced that six overlapping genes exhibited a choice of synonymous codons highly different from that occurring in the corresponding nonoverlapping genes. They include the A, C, E, and K genes of coliphage family, the Tat gene of lentivirus, and the Vpg gene of luteovirus, all exhibiting an r value near to zero. The highest degree of relationship was found in the overlapping genes encoding the replicase and the 69-kD protein of tymoviruses, as evidenced by an r value of 0.90 and 0.74, respectively. More generally, when the r mean values of the virus families were considered (Table 2) the overlapping regions related to the first informational pattern (coliphage, hepatitis B virus, HIV-1 lentivirus, and luteovirus families) exhibited a very low correlation with the usage of synonyms in the nonoverlapping counterpart, as documented by a range of variation from 0.24 in coliphage to 0.38 in hepatitis B virus. In contrast, a much higher relationship between overlapping and nonoverlapping regions (an r mean value of 0.82) was found in the family of tymoviruses, representing the alternative informational pattern. The statistical analysis testing a difference in the composition of amino-acid residues between each of the 24 overlapping frame encoded proteins and the corresponding nonoverlapping counterpart was performed by the chi-square contingency-table test. Data were arranged in a 2 × 2 table whose a, b, c, d values correspond to the content of a given amino-acid residue in a given overlapping frame (a), in the nonoverlapping frames (b), and to the total amount of the other amino acid residues in the same overlapping frame (c) and in the nonoverlapping frames (d). The counting of the chi-square values above the 3.8 cutoff (P < 0.05 for 1 degree of freedom), expressing a significantly higher content of amino-acid residues in the overlapping genes, led to the general representation shown in Fig. 3 . 
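The 2 × 2 test described above can be sketched in Python with the standard shortcut formula for a four-cell table. This is a minimal illustration: the function names are hypothetical, no continuity (Yates) correction is applied because the text does not mention one, and the counts used in any call are the reader's own.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic (1 degree of freedom) for the table [[a, b], [c, d]].

    a: count of a given residue in the overlapping frame
    b: count of the same residue in the nonoverlapping frames
    c: count of all other residues in the overlapping frame
    d: count of all other residues in the nonoverlapping frames
    No continuity correction is applied (an assumption of this sketch).
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column total is zero")
    return n * (a * d - b * c) ** 2 / denom


def enriched_in_overlap(a, b, c, d, cutoff=3.8):
    """True when chi-square exceeds the cutoff and the residue is more
    frequent in the overlapping frame than in the nonoverlapping frames."""
    higher = a / (a + c) > b / (b + d)
    return chi_square_2x2(a, b, c, d) > cutoff and higher
```

The direction check is needed because the chi-square statistic alone does not distinguish a higher content from a lower one.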
It appears that the amino acid composition bias within overlapping genes can be mainly ascribed to amino acid residues with the highest level of codon degeneracy (e.g., arginine, serine, and leucine residues are each specified by six synonymous codons and the proline residue by four). This finding was also corroborated when considering the highest compositional differences. As summarized in Table 3, out of a total of ten chi-square values higher than 30.0 (P < 0.00001), four were ascribed to arginine, three to leucine, two to proline, and one to methionine residues.

It has recently been proposed (Keese and Gibbs 1992) that overlapping gene arrangements may arise de novo, thus encoding novel specialized proteins; it has also been hypothesized that a new gene that has arisen in this way will have an unusual codon usage and will encode a protein with biased physicochemical properties. Taking into account these observations, we have analyzed the informational content of viral overlapping genes at different levels of complexity. The use of information theory indices shows that viral sequences, albeit deriving from different sources, can be referred to two distinct patterns. Considering that (Luo et al. 1988) "the smallness of D1 represents the abundance of vocabulary and the largeness of D2 represents the clarity of grammatical rules," the informational measures of overlapping sequences related to the first pattern (a low D1 value, a high D2 value, see Fig. 2) suggest a level of genetic information storage closely resembling natural languages. The occurrence of these constraints in coliphage, hepatitis B virus, HIV-1 lentivirus, and luteovirus also reflects a peculiar pattern in the usage of synonymous codons for most of the corresponding overlapping frames (Table 2). Since the most striking difference in the choice of synonyms concerns the family of coliphages, some speculations on the origin of its overlapping genes (see the genomic map shown in Fig.
1) can be proposed. For example, the codon usage pattern of overlapping frames encoding the structural ''scaffolding'' B and D proteins is well correlated (an r value of 0.55 and 0.68, respectively) with that of the nonoverlapping genes, which encode the similarly structural J, F, G, and H proteins. This relatively high degree of correlation suggests an ancient origin for the B and D frames. On the other hand, the highly peculiar choice of synonyms in the genes E and K, which are entirely embedded within the D and A/C genes and exhibit an r value of 0.02 and 0.03, respectively, supports the idea of a more recent acquisition. Since a low expression of the gene E is necessary and sufficient to induce lysis of the host cell (Blasi et al. 1990 ), its peculiar codon usage pattern could represent a mechanism to regulate the rate of translation, thus preventing premature lysis of the host. A low expression during infectious cycle could also be required for the gene K, whose regulative role consists in increasing the burst size of phage production (Gillam et al. 1985) . The codon usage pattern of the region of gene A which overlaps both B and K genes (an r value of 0.13 with the nonoverlapping frames) contrasts with that found in the nonoverlapping part of the gene A (an r value of 0.91). This observation suggests that a shorter gene A originally terminated in close proximity to a preexisting gene B and that the present overlapping arrangement evolved by a new termination codon of the gene A beyond the gene B. In a similar way, the codon usage of the overlapping fraction of gene C (an r value of 0.01 with the regions of nonoverlap) markedly differs from that occurring in the nonoverlapping region (an r value of 0.41). 
Considering that the gene C of φX174, G4, and α3 coliphages contains a second in-phase ATG codon localized in the nonoverlapping region, we predict an originally shorter gene C which evolved using as initiator codon an upstream ATG localized at the end of the gene A.

Fig. 3. Frequency of occurrence of chi-square values above the 3.8 cutoff value, which reflect a significantly higher content of individual amino acid residues in proteins encoded by overlapping genes.

The alternative pattern revealed by the information [...] is preserved in the overlapping regions (A = 21%, T = 23%, G = 16%, C = 40%), and this excess of C residues tends to be clustered in the third base position of codons. In fact, a 49% content of C residues occurs in the third base position of nonoverlapping regions, a 44% content in the replicase protein (RP) overlapping frame, and a 36% content in the 69-kD protein overlapping frame. The high degree of relationship in the codon usage (an r mean value of 0.82) likely suggests that, at variance with the case of φX174, the infectious cycle of tymoviruses may require, in all coding regions, a more uniform adaptation to the translational machinery of the host. Tymoviruses infect various members of the Cruciferae (e.g., Brassica rapa, Arabidopsis thaliana) and they accumulate in leaves (Bozarth et al. 1992), where the highly expressed genes code for the small subunit of ribulose 1,5-bisphosphate carboxylase and for the chlorophyll a/b-binding protein (Murray et al. 1989). Interestingly, the third base positions of the coding regions of these latter genes show a frequency of C residues (40%) similar to that occurring in all different frames of tymoviruses.
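The third-position base frequencies quoted for the tymovirus frames can be obtained with a short routine such as the one below. It is a minimal Python sketch with a hypothetical function name; it assumes the input string starts at the first base of a codon in the frame of interest, so the two frames of an overlapping region would be analyzed by passing the same DNA with different start offsets.

```python
from collections import Counter


def third_position_composition(orf):
    """Percentage of A, C, G, T at the third codon position of an in-frame ORF.

    Assumption of this sketch: orf[0] is the first base of a codon; trailing
    bases that do not complete a codon are ignored.
    """
    orf = orf.upper().replace("U", "T")
    usable = len(orf) - len(orf) % 3
    thirds = orf[2:usable:3]  # bases at indices 2, 5, 8, ...
    counts = Counter(b for b in thirds if b in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no complete codons found")
    return {b: 100.0 * counts[b] / total for b in "ACGT"}
```

For a region read in an alternative frame, slicing the sequence first (for example `seq[1:]` or `seq[2:]`) shifts the codon boundaries accordingly.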
Since the base frequency at the third degenerate position in nonoverlapping regions (A = 16%, T = 21%, C = 49%, G = 14%) is closely similar to that of the RP overlapping frame (A = 19%, T = 18%, C = 44%, G = 19%) and contrasts with that of the 69-kD protein frame (A = 24%, T = 29%, C = 36%, G = 11%), we can also predict that this latter overlapping gene arose later by superimposition on a preexisting RP gene.

The statistical analysis of the amino acid composition showed that the peculiar amino acid usage occurring in overlapping genes can be mainly ascribed to a significantly higher frequency of amino acid residues having the highest level of codon degeneracy (Fig. 3). For example, the high content of leucine and arginine residues in the overlapping region of the lentivirus Env gene (Table 3) is related to a very peculiar choice of synonyms (Leu/CTA 4.9%; Leu/TTA 16.9%; Leu/CTC 29.0%; Arg/AGA 31.9%; Arg/CGC 17.0%), when compared with that occurring in the lentivirus nonoverlapping regions (Leu/CTA 18.6%; Leu/TTA 34.2%; Leu/CTC 6.5%; Arg/AGA 65.0%; Arg/CGC 0.7%). Therefore, a localized high frequency of both leucine and arginine residues combined with a strongly different strategy of codon usage within the ancestral Env gene frame can be hypothesized as a basic event to originate the Tat, Rev, and Vpu overlapping frames. Some overlapping genes exhibiting a strongly preferred occurrence of leucine or arginine residues (Table 3) have previously been demonstrated to perform a crucial function in the viral life cycle. For example, the high content of leucines in the overlapping E protein of coliphages lies within a transmembrane domain that is required to determine Escherichia coli cell lysis (Buckley and Hayashi 1986). The high frequency of arginines in the overlapping fraction of the core antigen of hepatitis B virus corresponds to a carboxyl-terminal signal that is involved in nuclear targeting of the protein (Eckhardt et al. 1991).
A similar function has been ascribed to the polyarginine motifs of the Rev protein of the HIV-1 lentiviruses (Kubota et al. 1989). It has also been demonstrated (Zapp et al. 1991) that the run of arginines in the center of the Rev protein is involved in the recognition, and nucleocytoplasmic transport, of unspliced viral mRNAs. These data support the notion that the high frequency of amino acid residues with a high level of codon degeneracy, which appears to be a peculiar feature of overlapping genes, can be viewed as a valuable tool with which to achieve a more flexible strategy in the choice of synonymous codons and/or to gain new specialized functions in the viral life cycle.

Binding of mammalian ribosomes to MS2 phage RNA reveals an overlapping gene encoding a lysis function
Overlapping genes in bacteriophage φX174
Translational efficiency of φX174 lysis gene E is unaffected by upstream translation of the overlapping gene D reading frame
Expression of ORF-69 of turnip yellow mosaic virus is necessary for viral spread in plants
Characterization of an efficient coronavirus ribosomal frameshifting signal: requirement for an RNA pseudoknot
Lytic activity localized to membrane-spanning region of φX174 E protein
Biosynthesis of the reverse transcriptase of hepatitis B viruses involves de novo translational initiation not ribosomal frameshifting
Hepatitis B virus core antigen has two nuclear localization sequences in the arginine-rich carboxyl terminus
The information content of DNA
Gene K of bacteriophage φX174 codes for a protein which affects the burst size of phage production
Informational parameters of an exact DNA base sequence
Signals for ribosomal frameshifting in the Rous sarcoma virus gag-pol region
Origins of genes: "Big bang" or continuous creation?
Functional similarity of HIV-I rev and HTLV-I rex proteins: identification of a new nucleolar-targeting signal in rev protein
Diversity of coding strategies in influenza viruses
Informational parameters of nucleic acid and molecular evolution
Evolution of overlapping genes
Codon usage in plant genes
Overlapping genes
The EMBL data library
Polycistronic animal virus mRNAs
Degeneracy of the information contained in amino acid sequences: evidence from overlaid genes
The codon adaptation index, a measure of directional synonymous codon usage bias, and its potential applications
Semantic and syntactic patterns in the genetic language. In: Colwell RR (ed) Biomolecular data. A resource in transition
Protein constraints induced by multiframe encoding
Overlapping genes and information theory
Statistical methods
Molecular evolution in papova viruses and bacteriophages
Searching for codes in the sequences. In: Colwell RR (ed) Biomolecular data. A resource in transition
HIV expression strategies: ribosomal frameshifting is directed by a short sequence in both mammalian and yeast systems
Do overlapping genes violate molecular biology and the theory of evolution?
Oligomerization and RNA binding domains of the type I human immunodeficiency virus Rev protein: a dual function for an arginine-rich binding motif

We are grateful to Professor Franco Conterio for support and encouragement. We also appreciate critical readings of the manuscript by Simone Ottonello and Elena Maestri. This work was supported by the National Research Council of Italy and by the Ministry of University and Scientific and Technological Research.