key: cord-0736125-abgti7jh
authors: Pavesi, Angelo
title: New insights into the evolutionary features of viral overlapping genes by discriminant analysis
date: 2020-04-02
journal: Virology
DOI: 10.1016/j.virol.2020.03.007
sha: 17ac3935bf09ec5ffa96016d5aac648ebcf2c987
doc_id: 736125
cord_uid: abgti7jh

Abstract Overlapping genes originate by a mechanism of overprinting, in which nucleotide substitutions in a pre-existing frame induce the expression of a de novo protein from an alternative frame. In this study, I assembled a dataset of 319 viral overlapping genes, which included 82 overlaps whose expression is experimentally known and the respective 237 homologs. Principal component analysis revealed that overlapping genes have a common pattern of nucleotide and amino acid composition. Discriminant analysis separated overlapping from non-overlapping genes with an accuracy of 97%. When applied to overlapping genes with known genealogy, it separated ancestral from de novo frames with an accuracy close to 100%. This high discriminant power was crucial to computationally design variants of de novo viral proteins known to possess selective anticancer toxicity (apoptin) or protection against neurodegeneration (X protein), as well as to detect two new potential overlapping genes in the genome of the new coronavirus SARS-CoV-2.

The viral genome often contains more coding information than a co-linear relationship between nucleotide and protein sequences would suggest. This increase in coding density depends on overlapping genes, in which two different reading frames on the same strand are translated to yield two proteins (Normark et al., 1983) . Overlapping genes originate by a mechanism of overprinting, in which critical nucleotide substitutions in a pre-existing ancestral frame induce the expression of a de novo protein from an alternative frame (Keese and Gibbs, 1992; Rancurel et al., 2009; Sabath et al., 2012) . The de novo proteins, unlike the ancestral ones, usually lack any remote homologs in protein databases (Gibbs and Keese, 1995) .

Overlapping genes were first detected in the genome of the bacteriophage ΦX174 by Barrell et al. (1976) . For many years they were thought to be limited to viruses, but experimental (Michel et al., 2012; Bergeron et al., 2013; Vanderperre et al., 2013; Fellner et al., 2015) and computational studies (Chung et al., 2007; Delaye et al., 2008; Ribrioux et al., 2008; Vanderperre et al., 2012) indicate that they also occur in prokaryotes and eukaryotes.

Thus, the expression of two proteins from the same mRNA has changed the traditional view that a mature eukaryotic mRNA is a monocistronic molecule with a single translated open reading frame (ORF) (reviewed by Mouilleron et al., 2016; Brunet et al., 2018) . Interestingly, it has been found that some human cancer-specific antigens, silent in normal tissues, are translated from alternative open reading frames (AltORFs) (Wang et al., 1996 (Wang et al., , 1998 Rosenberg et al., 2002; Mandic et al., 2003; Slager et al., 2003) . These neoantigens are promising targets for the development of antitumour immunotherapies with a potentially broader coverage of patients (Smith et al., 2019) .

Researchers have advanced two theories, not mutually exclusive, to explain the abundance of overlapping genes in viruses. The gene-compression theory claims that this feature is a strategy to maximize the information content of small genomes (Lamb and Horvath, 1991; Pavesi et al., 1997) , as a consequence of error-prone polymerases (Belshaw et al., 2007) and biophysical constraints acting on the capsid structure (Chirico et al., 2010) . The gene-novelty theory asserts that the birth of de novo proteins, often overprinting a structural or enzymatic protein, is driven by selection pressures providing the virus with a fitness advantage that lead to their fixation (Keese and Gibbs, 1992; Rancurel et al., 2009; Brandes and Linial, 2016) . Indeed, the de novo proteins originated by overprinting often play a critical role in virus infection, by neutralizing the host interferon response (van Knippenberg et al., 2010; Mc Fadden et al., 2011; Wensman et al., 2013) and the RNA interference pathway (Vargason et al., 2003; Chellappan et al., 2005; Li and Ding, 2006) , by inducing apoptosis in host cells (Noteborn et al., 1994; Chen et al., 2001 , Boehme et al., 2013 or promoting the systemic spread of virus (Taliansky et al., 2003) , A few de novo proteins can also exert functions that are not virus-specific. A well-known example is the apoptin of Chicken anemia virus, which induces cell death in a broad range of human tumour cell lines but not in normal cells (Danen-Van Oorschot et al., 1997; Backendorf et al., 2008 ; reviewed by Castro et al., 2018) . Another example is the X protein of Borna disease virus, which shows protective properties against neurodegeneration in vitro and in vivo (Szelechowski et al., 2014) . To increase their therapeutic effectiveness, researchers have constructed N-terminal deleted forms or short peptides both for apoptin (Shen Ni et al., 2013; Ruiz-Martínez et al., 2017; Zhang et al., 2017) and X protein (Szelechowski et al., 2014; Ferré et al., 2017) .

To identify viral overlapping genes by sequence analysis, Firth (2014) and Sealfon et al. (2015) have developed statistical methods that detect, in a set of homologous protein-coding regions, the atypical pattern of nucleotide substitution induced by the overlap (i.e. a significantly reduced rate of synonymous substitution). Schlub et al. (2018) have recently proposed a new statistical method. It examines single gene or genome viral sequences, and can therefore be applied in many situations where the previous methods could be ineffective.

To provide a benchmark for systematic studies, we assembled a dataset of 80 overlapping genes 180 nt or longer from eukaryotic viruses with a genome shorter than 30 kb, whose expression is supported by reliable experimental evidence (S1 dataset from Pavesi et al., 2018) . A first analysis revealed that overlapping genes differ significantly from non-overlapping genes in their nucleotide and amino acid composition (Pavesi et al., 2018) . A further analysis, carried out on 75 pairs of homologous overlaps, revealed that half of overlapping genes undergo asymmetric evolution, as the protein encoded by one frame is significantly more variable than that encoded by the other frame (Pavesi, 2019) .

In the present study, I enlarged the dataset of overlapping genes by increasing the number of homologs. The dataset I assembled contains 319 viral overlapping genes, instead of 80 or 150 as in the previous studies (Pavesi et al., 2018; 2019) . I first investigated the nucleotide and amino acid composition of overlapping genes with the principal component analysis (PCA) (Hotelling, 1936) , by using as a control the entire complement of non-overlapping genes in the viral genome.

I then analyzed the dataset with the partial least squares-discriminant analysis (PLS-DA) (reviewed by Lee et al., 2016) and the Fisher's linear discriminant analysis (LDA) (Fisher, 1936) , with the aim to separate overlapping from non-overlapping genes with the best accuracy. To test which of the two theories about the abundance of gene overlap in viruses (gene-compression and gene-novelty) is the most plausible one, I also applied LDA to a subset of overlapping genes with known genealogy (which frame is ancestral and which one is de novo).

Finally, I developed a computational algorithm that simulated the birth of new overlapping genes encoding variants of two peculiar proteins: the apoptin of Chicken anemia virus (apoptin is the de novo protein overprinting the ancestral capsid protein VP2) and the X protein of Borna disease virus (X is the de novo protein overprinting the ancestral phosphoprotein). The huge amount of new overlapping genes yielded by simulation was processed using PLS-DA and LDA, with the aim to obtain a set of protein variants whose selective anticancer activity (apoptin) or protection against neurodegeneration (X protein) could be tested experimentally.

I first added to the dataset of 80 overlapping genes (S1 Dataset from Pavesi et al., 2018) two other overlaps experimentally proven: NS1 protein/NS2 protein of Influenza A virus (Lamb, 1980) and large T antigen/ALTO protein of Merkel cell polyomavirus (Carter et al., 2013) . I then extracted from the dataset the amino acid sequence of the two proteins encoded by each overlap. For each protein, I searched for homologs against the non-redundant protein sequences and the reference protein sequences NCBI database using BLASTP with an E cutoff of 10 -6 . (Altschul et al., 1997) . In the rare cases where BLASTP did not detect any homolog I used TBLASTN, which compared the protein query sequence against the nucleotide collection NCBI database translated in all reading frames. I used TBLASTN because the amino acid sequence of the protein encoded by one of the two overlapping frames (usually the one discovered more recently) may not be reported in the NCBI database.

Overall, I collected a total of 319 overlapping genes from 244 viral genomes (the number of genomes is lower than that of overlapping genes because some genomes contain more than one overlap). Each of them was classified as internal or terminal overlap. Internal overlap means that one gene is entirely within a second gene but in a different reading frame. Terminal overlap means that two genes overlap for part of their lengths, one upstream and one downstream. The combined overall length of overlapping genes was 163,050 nt.

For each viral genome, I extracted from the NCBI database the non-overlapping coding regions. They were composed of the sequences of non-overlapping genes, and, in cases where some genes partially overlap, of their non-overlapping regions. For viruses with segmented genomes, all segments were included in the calculations. Non-overlapping coding regions came from the 62 complete genomes that contain the 82 proven overlaps and from the 182 genomes that contain the respective 237 homologs (175 complete genomes and 7 partial genomes). The combined overall length of non-overlapping genes was 1,724,466 nt. All sequence data were collected in Supplementary File S1 (see Results).

I calculated the percent content of nucleotides, dinucleotides, amino acids, and synonymous codons in each overlapping gene and in the entire complement of non-overlapping genes in the viral genome. I also calculated the percent content of amino acids with high codon degeneracy (the 6fold degenerate residues L, R, and S), medium codon degeneracy (the 4-and 3-fold degenerate residues A, G, P, T, V, and I), and low codon degeneracy (the 2-and 1-fold degenerate residues C, D, E, F, H, K, N, Q, Y, M, and W).

Overall, I compared a sample set of 319 overlapping genes with a control set of nonoverlapping genes for 102 composition features (4 from nucleotides, 16 from dinucleotides, 20 from amino acids, 59 from synonymous codons, and 3 from amino acids grouped in accordance to codon degeneracy). I used the Wilcoxon signed-rank test for paired data (Siegel and Castellan, 1988) to detect the composition features showing a statistically significant difference between overlapping and non-overlapping genes (z-value >4.41; two-tailed P = 10 -5 ; 319 degrees of freedom).

To overcome the problem that the dataset of non-overlapping genes was more than 10-fold larger than that of overlapping genes (1,724,466 vs. 163,050 nt), I randomly selected nonoverlapping coding regions of the genomes that are of equal size to the overlapping set (510 nt from each genome for a total of 162,690 nt). By repeating this step 100 times, I obtained 100 resamplings of the non-overlapping set. I compared each resampling to the dataset of overlapping genes, assessing the statistical significance of the 102 composition features. I selected only the features showing a significant difference (z-value >4.41) in more than 95 out of 100 resamplings.

The Wilcoxon test revealed that the nucleotide and amino acid composition of overlapping genes is significantly different from that of non-overlapping genes for 37 composition features (see Results).

I used PCA (Hotelling, 1936; Morrison, 1976; Ringnér, 2008) to evaluate whether the observed differences were homogeneously distributed in individual overlapping genes, or if instead there were outliers with a highly atypical composition. I calculated the difference between the percent content of the 37 composition features in each overlapping gene and the percent content of the same features in the respective complement of nonoverlapping genes in the viral genome. I obtained a matrix of 319 rows (the number of overlaps) and 37 columns. The matrix was subjected to PCA, by using the OriginPro software (OriginLab, Northampton, MA). PCA summarized the information carried by 37 variables into 5 synthetic variables, that is the first (PC1), second (PC2), third (PC3), fourth (PC4), and fifth principal component (PC5). Using the first three PCs, I could represent the 319 overlapping genes of the dataset as a swarm of points in a three-dimensional map (see Results).

The search for overlapping genes with a highly atypical composition was carried out by applying to each of the five PCs the the Rosner's test for detection of multiple outliers (Rosner, 1975 ) (https://contchart.com/outliers.aspx).

After removing the 7 outliers detected by PCA (see Results), I compared a sample set of 312 overlapping genes with a control set of non-overlapping genes taken from 244 viral genomes. The first statistical method I used was the partial least squares-discriminant analysis (PLS-DA) (Brereton and Lloyd, 2013; Lee et al., 2016) . The input data were collected into a matrix of 556 rows (312 overlapping and 244 non-overlapping genes) and 37 columns (the critical composition features I detected). The matrix was enlarged with a "dummy variable" having a value of -1 for overlapping genes and +1 for non-overlapping genes (38 th column). This variable was the "outcome" variable on the Y-axis, while the 37 critical composition features were the "predictor" variables on the X-axes. PLS-DA first calculated a, the intercept on the Y-axis, and b, the regression coefficient for each X-axis. Using this linear regression function, I predicted Y and evaluated the accuracy of prediction. It was given by the number of overlapping genes with a predicted Y value below 0 added to the number of non-overlapping genes with a predicted Y value above 0. The sum was divided by the total number of observations (556) and multiplied by 100.

The accuracy of prediction was assessed with the validation test. The dataset was partitioned into: i) a training set composed of 75% of overlapping genes (234 out of 312) and of 75% of nonoverlapping genes (183 out of 244); ii) a validation set composed of the remaining 25% of overlapping genes (78 out of 312) and of the remaining 25% of non-overlapping genes (61 out of 244). PLS-DA was carried out on the training set and the accuracy of prediction was tested on the validation set. By repeating this step 100 times, I obtained 100 random resamplings of the dataset.

By subjecting each of them to PLS-DA, I could obtain the mean value of the accuracy of prediction.

If the mean value and the accuracy of prediction from PLS-DA of the full dataset were both above 95%, I come to conclusion that this linear regression function has a discriminant power significantly high.

The other statistical method I used was LDA (Fisher, 1936; Lachenbruch and Goldstein, 1979) .

The input data were a matrix of 312 rows and 37 columns (the critical composition features I detected) and a matrix of 244 rows and 37 columns. The aim of LDA was to obtain a combination of a number of composition features yielding the best discrimination between overlapping and nonoverlapping genes. Thus, I randomly selected a number of features ranging from 15 to 25 and I evaluated the accuracy of prediction (the percent of overlapping and non-overlapping genes correctly predicted as such) of the corresponding LDA.

The accuracy of prediction of each LDA was assessed with the validation test. By randomly resemplings the dataset, I obtained 100 training sets composed of 75% of overlapping and nonoverlapping genes and 100 validation sets composed of the remaining 25% of overlapping and nonoverlapping genes. LDA was carried out on each training set and the accuracy of prediction was tested on the respective validation set. If the mean value of the accuracy of prediction from validation sets and the accuracy of prediction from LDA of the full dataset were both above 95%, I come to conclusion that this linear function has a discriminant power significantly high.

The genealogy of overlapping genes can be inferred by examining their phylogenetic distribution, under the assumption that the protein with the most restricted distribution is encoded by the de novo frame (Keese and Gibbs, 1992; Rancurel et al., 2009) . The genealogy of 24 overlaps of the dataset is known from previous phylogenetic studies (see Table 1 in Sabath et al., 2012 and Table 1 in Pavesi et al., 2013) . By extending it to the respective 87 homologs, I obtained a first set of 111 overlapping genes with a known ancestral and de novo frame.

Another approach to infer the genealogy of overlapping genes is the codon-usage method. It is based on the assumption that the ancestral frame, which has co-evolved with the other viral genes over a long period of time, has a distribution of synonymous codons significantly closer to that of the viral genome than the de novo frame (Pavesi et al., 2013) . Analysis of the dataset with an improved version of the method (Pavesi, 2015) predicted the genealogy of 21 additional overlaps.

By extending it to the respective 54 homologs, I obtained an additional set of 75 overlapping genes with a known ancestral and de novo frame.

The overlap polymerase/large envelope protein of Hepatitis B virus was examined separately.

In accordance to the genealogy inferred by codon usage and indirect experimental evidence (Pavesi, 2015; Lauber et al., 2017) , it was subdivided into two regions: i) a 5' region in which the Pre-S domain of envelope (encoded by the ancestral frame) overlaps the spacer domain of polymerase (encoded by the de novo frame); ii) a 3' region in which the reverse transcriptase domain of polymerase (encoded by the ancestral frame) overlaps the S domain of envelope (encoded by the de novo frame). Extension of genealogy to 3 homologs yielded 4 other overlapping genes with a known ancestral and de novo frame.

Thanks to phylogenetic and codon-usage methods, I could predict the genealogy for more than half (46 out of 82) of the overlapping genes in the dataset (Supplementary Table S1 ).

The algorithm I developed first operated on the overlap capsid protein VP2/apoptin from CAV (Ac. number NC_001427). The apoptin (121 aa) is encoded by the de novo frame, which is shifted of one nucleotide 3' (+1) with respect to the ancestral frame. The algorithm randomly permuted the synonymous codons of the ancestral frame (e.g. permutation was between CGC(Arg) at codon position 3 and AGA(Arg) at codon position 7). Thus, permutation did not change the codon usage of the ancestral frame (e.g. CGA, CGC, and AGA for arginine remained the preferred codons over CGT, CGG, and AGG) as well as the amino acid sequence of the encoded capsid protein VP2.

However, permutation of synonyms did not affect 62 codon positions of the ancestral frame.

The reason is that any synonymous change at these positions (e.g. CGA instead of CGC) would always cause an amino acid substitution in a few critical regions of the apoptin, as apoptin is encoded by the +1 frame. The critical regions (see Fig. 1 in Castro et al., 2018) are the following: i) a N-terminal apoptosis-inducing domain, formed by a proline-rich segment (20 aa; residues 9-28) and a (iso)leucine-rich segment (14 aa; residues 33-46); ii) a C-terminal apoptosis-inducing domain, formed by a bi-partite nuclear localization sequence (18 aa; residues 82-88 and 111-121) and a nuclear export sequence (9 aa; residues 97-105); iii) a threonine site at position 108, whose phosphorylation in transformed cells is crucial for nuclear accumulation of apoptin (Rohn et al., 2002) .

Thus, permutation of synonyms in the ancestral frame was limited to 59 out of 121 codon positions. It generated a huge amount of new overlapping genes, half of which were discarded because of a +1 frame interrupted by termination codons. Selecting the remaining ones with the scoring rules given by PLS-DA and LDA, and using also a conservation criterion, yielded a subset of variants of the CAV apoptin (see Results).

The algorithm I developed then operated on the overlap phosphoprotein/X protein from BDV (Ac. number NC_001607). The first 17 aa of the X protein are encoded by a non-overlapping region, while the remaining 70 aa are encoded by a de novo frame, which is shifted of two nucleotides 3' (+2) with respect to the ancestral frame. The algorithm randomly permuted the synonymous codons of the ancestral frame, preserving both its codon usage (e.g. CAG for glutamine remained the preferred codon over CAA) and the amino acid sequence of the encoded phosphoprotein. Permutation of synonyms affected all 70 codon positions of the ancestral frame, because little is known about the presence of functionally critical regions in the X protein.

Permutation generated a huge amount of new overlapping genes, half of them were discarded because of a +2 frame interrupted by termination codons. Selection of the remaining ones with the scoring rules given by PLS-DA and LDA, and using also a conservation criterion, yielded a subset of variants of the X protein from BDV (see Results).

The search for homologs in the NCBI database led to detection of only one homolog for 30 out of 82 overlaps in the dataset. In the remaining cases, it led to detection of a mean number of 2.9 homologs per overlap, with a standard deviation (sd) of 2.6. In detail, I found a large number of homologs in the following cases: movement protein/replicase of Turnip yellow mosaic virus (19 homologs); p22 protein/p19 protein of Tomato bushy stunt virus (9 homologs); polyprotein (core protein)/F protein of Hepatitis C virus (8 homologs); RdRp (2a)/2b protein of Spinach latent virus (7 homologs).

The dataset contains a total of 319 overlapping genes with a mean length of 511 nt (sd = 476 nt). The high standard deviation is due to a wide length distribution, ranging from 126 nt (the homolog of the overlap bel protein/bet protein of Puma feline foamy virus) and 2844 nt (p130 protein/replicase of Providence virus). The combined overall length of overlapping genes was 163,050 nt, while that of non-overlapping genes was more than 10-fold higher (1,724,466 nt).

I collected all nucleotide and amino acid sequence data in Supplementary File S1. Section A of the file contains a list of the 82 overlaps with the respective 237 homologs. Section B reports a list of the 244 virus species (or isolates of a given virus species) that contain the 319 overlaps of the dataset. For each virus, it specifies the number of overlaps, their length, the overall length of nonoverlapping genes, and the length of the genome. Section B also reports the classification of each virus in accordance to the genome type. The 319 overlaps are distributed as follows: 162 overlaps from 125 ssRNA+ viruses, 65 overlaps from 47 ssRNA-viruses, 19 overlaps from 12 ssRNA-RT viruses, 13 overlaps from 13 dsRNA viruses, 47 overlaps from 38 ssDNA viruses, 5 overlaps from 5 dsDNA viruses, and 8 overlaps from 4 dsDNA-RT viruses.

For each overlapping gene, section C contains the following information: I) accession number in the NCBI database; II) name of virus species, family, and genus; III) name of the overlapping gene; IV) nucleotide sequence of the upstream overlapping frame; V) amino acid sequence of the protein encoded by the upstream overlapping frame; VI) nucleotide sequence of the downstream overlapping frame, shifted of one or two nucleotides 3' with respect to the upstream frame; VII) amino acid sequence of the protein encoded by the downstream overlapping frame; VIII) nucleotide sequence of the entire complement of non-overlapping genes in the viral genome; IX) amino acid sequence of the proteins encoded by the entire complement of non-overlapping genes.

Using the data reported in section B, I calculated the density of overlapping genes in each virus species. It was the ratio between the length of overlapping genes and the length of the genome, multiplied by 100. I found that about two-thirds of viruses (150 out of 244) have a density of overlapping genes less than 10% (from 1.1% in ebolaviruses to 9.6% in panicoviruses). The highest density (50%) was found in Hepatitis B virus, followed by Providence virus (44%), Maize chlorotic mottle virus (34%), Hibiscus chlorotic ringspot virus (33%), and tymoviruses (from 21.8% in

I then examined the 244 virus species of the dataset to determine whether there was a relationship between the length of their genomes and of their overlapping genes. Using the Spearman rank correlation coefficient, I found a weak, albeit significant, negative correlation (rho = -0.31; t-Student = 5.36; P = 10 -4 ).

Finally, I found a statistically significant prevalence of the internal overlaps (185 out of 319 overlaps) on the terminal ones (134 out of 319 overlaps) (chi-square = 15.67; P = 10 -5 ). It mainly depends on ssRNA-viruses, in which the number of internal overlaps (53 out of 65) was significantly higher than that of terminal overlaps (12 out of 65) (chi-square = 17.38; P = 10 -5 ). In the other 6 classes of viruses, the difference was statistically not significant (chi-square <3.84).

The Wilcoxon test revealed that the nucleotide and amino acid composition of overlapping genes is significantly different (z-value >4.41; two-tailed P = 10 -5 ; 319 degrees of freedom) from that of non-overlapping genes for one third of the examined composition features (37 out of 102). Indeed, all these features showed a z-value higher than the cutoff of significance (4.41) in more than 95 out of 100 resamplings of the non-overlapping set (see Methods). Table 1 shows a list of 17 features, sorted in accordance to a decreasing z-value, in which overlapping genes exhibited a significant enrichment with respect to non-overlapping genes. For instance, overlapping genes have a mean percent content in amino acids with high codon degeneracy (Arg, Leu, and Ser) of 27.2%, which is 5.2% greater than that in non-overlapping genes (z-value = 14.4). Considering the other z-values >9.0, overlapping genes are also enriched in the nucleotide C, the dinucleotides CG, TC, and CC, the amino acids Arg, Ser, and Gln, and the synonymous codons CGA(Arg), TCG(Ser) and AGC(Ser). These enrichments are clearly linked.

Regarding amino acids that are enriched, Arg is encoded by codons rich in CG (among which CGA, which is also enriched); likewise, Ser is largely encoded by TCG, which is a combination of the enriched dinucleotides TC and CG. Table 2 shows a list of 20 features, sorted in accordance to a decreasing z-value, in which overlapping genes exhibited a significant depletion with respect to non-overlapping genes.

Considering the z-values >10.0, overlapping genes are depleted in the nucleotide T, the dinucleotides TA, AT, TT, and TG, the amino acids Tyr, Val, Phe, and Ile, the synonymous codons TAT(Tyr), TTT(Phe), ATT(Ile), and GTT(Val), and the amino acids with low codon degeneracy.

These depletions are correlated each other. Regarding amino acids that are depleted, Tyr, Phe, and Ile are encoded by codons rich in A and T (among which TAT, TTT, and ATT, which are also depleted). Depletion in TA and TG has a clear biological meaning, as it reduces the probability of occurrence of termination codons (TGA, TAG, and TAA) in overlapping genes.

PCA summarized the information carried by 37 variables (the critical composition features I detected) into 5 synthetic variables, that is the first (PC1), second (PC2), third (PC3), fourth (PC4), and fifth principal component (PC5). They accounted, respectively, for 26.7, 22.1, 13.0, 7.0, and 6.3% of the total variation in the data. Taken together, the first 5 PCs summarized 75% of the total variation, i.e. reduction from 37 to 5 variables resulted in a loss of information of 25%. Using the first 3 PCs, the 319 overlapping genes of the dataset were represented as a swarm of points into a three-dimensional map (Fig. 1) .

Application of the Rosner's test for detection of multiple outliers to the first 3 PCs did not reveal any outliers in PC1 and PC3, but 4 outliers in PC2 (P = 0.02). They were the 4 homologs of the overlapping gene of Hepatitis B virus that encodes the RNase domain of polymerase and the X protein (see black circles in Fig. 1 ). The atypical composition of this overlap mainly depends on: i) an extremely higher content in the nucleotide C (37.3%) and the dinucleotide CG (9.6%), with respect to the non-overlapping genome counterpart (23.1% and 2.1%, respectively); ii) an extremely lower content in amino acids with low codon degeneracy (24.0%), with respect to the nonoverlapping genome counterpart (40.8%).

Application of the Rosner's test to the remaining PCs led to detection of 3 outliers in PC4 (P = 0.01). They were the 3 homologs of the overlapping gene that encodes the capsid protein VP2 and the nucleocapsid protein VP1 in gyroviruses. Their atypical composition mainly depends on an extremely higher content in arginine (22.5%) and the codon CGA(Arg) (6.4%), with respect to the non-overlapping genome counterpart (5.2% and 0.8%, respectively).

Asymmetric evolution means that the protein encoded by one frame (usually the de novo frame) is significantly more variable than that encoded by the other frame. A previous analysis of 65 pairs of homologous overlaps revealed that half of them undergo asymmetric evolution (Pavesi, 2019) .

The increased content in homologs of the present dataset allowed a more exhaustive analysis.

Compared to the previous study, I could assess the pattern of symmetric/asymmetric evolution by examining a number of homologs three-fold greater (208 vs. 65). Using Clustal Omega (Sievers and Higgins, 2014) , I first aligned the two proteins encoded by each overlap with the respective homologs. Whenever present, I then manually removed the protein regions with large gaps.

Consider for example the overlap capsid protein P3/movement protein P4 of Potato leafroll virus, which encodes two proteins having a length of 156 aa. Manual inspection of the alignments of P3 and P4 with the respective 6 homologs led to removal of the largely gapped N-and Cterminal regions. Thus, the occurrence of symmetric or asymmetric evolution was tested on the central un-gapped region of 90 aa. Using the chi-square test (cutoff >3.84; P <0.05; one degree of freedom), the numbers of conserved and non-conserved amino acid positions in the alignment of P3 (38 and 52, respectively) were compared to those in the alignment of P4 (19 and 71, respectively).

A chi-square value of 8.32 (P = 0.004) indicated that this overlap undergoes asymmetric evolution, as the number of amino acid substitutions in P4 is significantly higher than that in P3.

Chi-square test confirmed the pattern of asymmetric evolution found previously in 32 overlaps (see Table 1 in Pavesi, 2019) . Unlike the previous study, it revealed that 5 more overlapping genes evolve in accordance to the asymmetric model (Table 3) . Thus, the number of overlaps evolving asymmetrically increased from 32 to 37 (57% of the total), while that of overlaps evolving symmetrically decreased from 33 to 28 (43% of the total). Although the percent difference (57% vs. 43%) was statistically not significant (chi-square value = 1.97; P = 0.16), asymmetric evolution is now the prevailing pattern in viral overlapping genes.

Having found that PCA has a poor ability to separate overlapping genes from non-overlapping genes (see, respectively, the grey and black circles in Supplementary Figure S1 ), I explored the dataset using PLS-DA and LDA, that is supervised multivariate statistical methods that maximize the variance between groups and minimize the variance within groups.

Analysis with PLS-DA of a matrix of 556 rows (312 overlapping genes after exclusion of 7 outliers and 244 non-overlapping genes) and 38 columns (the 37 critical composition features I detected and a dummy variable assigning -1 to overlapping genes and +1 to non-overlapping genes) yielded a linear regression function consisting of a, the intercept on the Y axis, and 37 values of b, the regression coefficient.

This function assigned a predicted Y value below 0 to 296 out of 312 overlapping genes (error of classification = 5.1%) and a predicted Y value above 0 to 241 out 244 non-overlapping genes (error of classification = 1.2%). The total error was 3.4% (19 misclassifications out of 556) with an accuracy of prediction of 96.6%. The accuracy of prediction of the linear regression function was assessed with the validation test (see Methods). When applied to 100 validation sets, it yielded for overlapping genes a mean error of classification of 6.3% and for non-overlapping genes of 3.1%.

The mean total error of classification was 4.9% with a mean accuracy of prediction of 95.1%.

To obtain a linear regression function with a highest discriminant power, especially for overlapping genes, I randomly selected a number of critical composition features ranging from 15 to 25 and I evaluated the accuracy of prediction of the corresponding PLS-DA. The best performance was given by a PLS-DA accounting for 23 composition features (1 from nucleotides, 6 from dinucleotides, 7 from amino acids, and 9 from synonymous codons). When applied to 100 validation sets, it yielded a mean error of classification of 5.4% for overlapping genes and of 3.2% for non-overlapping genes. The mean total error of classification was 4.5% with a mean accuracy of prediction of 95.5%. When applied to the full dataset, this linear regression function misclassified 5.1% of overlapping genes and 1.6% of non-overlapping genes. The total error was 3.6% with an accuracy of prediction of 96.4%. The strong discriminant power of this linear regression function is evident from the distribution of the predicted Y value in overlapping (grey columns) and nonoverlapping genes (black columns) (Fig. 2) .

The other statistical method I used was LDA. The input data were a matrix of 312 rows (the number of overlaps after exclusion of 7 outliers) and 37 columns (the critical composition features I detected) and a matrix of 244 rows (the number of the genome complements of non-overlapping genes) and 37 columns. I randomly selected a number of features ranging from 15 to 25 and I evaluated the accuracy of prediction of the corresponding LDA. The performance of each LDA was then assessed with the validation test (see Methods).

The best performance was given by a LDA accounting for 21 composition features (2 from nucleotides, 4 from dinucleotides, 8 from amino acids, and 7 from synonymous codons). When applied to 100 validation sets, it yielded a mean error of classification of 3.3% for overlapping genes and of 4.6% for non-overlapping genes. The mean total error of classification was 3.9% with a mean accuracy of prediction of 96.1%. When applied to the full dataset, this linear function misclassified 3.5% of overlapping genes and 2.9% of non-overlapping genes. The total error was 3.2% with an accuracy of prediction of 96.8%. The strong discriminant power of this linear function can be appreciated by examining the distribution of the LDA score in overlapping (grey columns) and non-overlapping genes (black columns) (Fig. 3) .

I summarized the performance of PLS-DA and LDA into a two-dimensional map in which the predicted Y value in overlapping and non-overlapping genes was plotted against the respective LDA score (Fig. 4) . The grey circles into part A of the map are the 294 overlaps correctly classified by both methods (94.2% of the total). The black circles into part C are the 237 non-overlaps correctly classified by both methods (97.1% of the total).

Finally, I found that the performance of both PLS-DA and LDA is not affected by the fact that the dataset includes representatives from all Baltimore classes of viruses (e.g. ssRNA+, ssRNA-, ssDNA, and dsDNA viruses). When the validation test was applied to the 259 overlaps from RNA viruses, the linear regression function accounting for 23 composition features showed a total mean error of classification of 3.2%. When the same test was applied to the 162 overlaps from ssRNA+ viruses, it yielded a mean total error of classification of 2.8%. Similarly, the LDA function accounting for 21 composition features showed a mean total error of classification of 4.8%, when the validation test was applied to the 259 overlaps from RNA viruses, and of 2.4% when the same test was applied to the 162 overlaps from ssRNA+ viruses.

Using the intercept and the 23 regression coefficients given by PLS-DA (see legend of 

The present dataset contains two overlapping genes (3a protein/3b protein and nucleocapsid protein/9b protein) from human SARS-Cov (Ac. number NC_004718) and the 2 respective homologs from bat SARS-Cov (Ac. number DQ412042) (see overlaps from #51 to #54 in Supplementary File S1). Thus, it was important to examine the genome sequences of the new coronavirus SARS-CoV-2, the etiological agent of the current pneumonia outbreak in the world (Zhou et al., 2020) . I analyzed the genome sequence of the isolate Wuhan-Hu-1 of SARS-CoV-2 (NC_045512), which shows a nucleotide identity of 80% with human and bat SARS-Cov and of 96% with the closely related bat coronavirus RaTG13 (Ac. number MN996532). Application of the model to SARS-CoV-2 pointed out a strict conservation of the overlap nucleocapsid protein/9b protein. Indeed, the 9b protein of SARS-CoV-2 (unannotated in NC_045512, but encoded from nt 28,284 to 28,574) showed an identity of 71% with the homologs from human and bat SARS-Cov and of 93% with the homolog from bat coronavirus RaTG13 (unannotated in MN996532, but encoded from nt 28,250 to 28,543). Interestingly, the model pointed out that the overlap 3a protein/3b protein of SARS-CoV-2 is conserved in bat coronavirus RATG13, but it is deeply different from that of human and bat SARS-CoV. Indeed, the 3b protein of SARS-CoV-2 (not annotated in NC_045512, but encoded from nt 25,457 to 25,579) is three-fold shorter (41 aa) than that of human and bat SARS-CoV (154 and 114 aa, respectively). In addition, it shows with them an identity of only 22.5 and 28%, respectively. Application of the model led to identification in SARS-CoV-2 of two new potential overlapping genes. They were predicted as such by both PLS-DA and LDA (see asterisks in Fig. 4 ).

In the first case, a +1 overlapping frame from nt 28,734 to 28955 is entirely within the region that encodes residues 155-229 of nucleocapsid protein. It encodes a hypothetical protein having a length of 73 aa (Fig. 5A ). Using TBLASTN, I found that this hypothetical protein shows an identity of 90% with the homolog from bat coronavirus RaTG13 (nt 28700-28918), and of 77% and 73% with the homologs (3 aa shorter) of human and bat SARS-Cov (nt 25,583-28,792 and 28,544-28,753, respectively).

In the other case, a +2 overlapping frame from nt 25524 to 25697 is entirely within the region that encodes residues 44 to 102 of 3a protein. This frame encodes a hypothetical protein having a length of 57 aa (Fig. 5B ). Using TBLASTN, I found that this overlapping frame is a peculiar feature of SARS-CoV-2. Indeed, both the homologous genome region of bat coronavirus RaTG13 (nt 25,494-25,667) and that of human and SARS-CoV (nt 25,399-25,572 and nt 25,338-25,511, respectively) appear to be interrupted by several termination codons.

Thanks to the phylogenetic and codon-usage methods, I could predict the genealogy for more than half (46 out of 82) of the overlapping genes in the dataset (Supplementary Table S1 Supplementary Files S2 and S3 report for each overlap the following information: I) the accession number in the NCBI database; II) the name of virus species, family, and genus; III) the name of the overlapping gene; IV) the nucleotide sequence of the ancestral frame; V) the amino acid sequence of the protein encoded by the ancestral frame; VI) the nucleotide sequence of the de novo frame; VII) the amino acid sequence of the protein encoded by the de novo frame.

I analyzed the two groups of overlaps separately, with the aim to detect critical composition differences between ancestral and +1 de novo frames in one case, and between ancestral and +2 de novo frames in the other. I calculated the percent content in amino acids, synonymous codons, and amino acids grouped in accordance to codon degeneracy in each ancestral frame and in the respective de novo frame. I used the Wilcoxon test to detect the composition features showing a statistically significant difference (z-value >4.41; two-tailed P = 10 -5 ) between ancestral and de novo frames.

When applied to the first group (126 overlaps), the Wilcoxon test revealed that the +1 de novo frames differ significantly from the ancestral ones for 25 composition features. Section A of Supplementary Table S2 shows a list of 14 features in which the +1 de novo frames exhibited a significant enrichment with respect to the ancestral ones. Section B shows a list of 11 features in which the +1 de novo frames exhibited a significant depletion.

Using these composition features as input data for LDA, I compared a matrix of 126 rows (the +1 de novo frames) and 25 columns with a matrix of 126 rows (the ancestral frames) and 25 columns. LDA separated the +1 de novo frames from the ancestral ones with an accuracy of 97.2% (Fig. 6A) . The error of classification was 2.4% for the de novo frames and 3.2% for the ancestral frames. The strong discriminant power of the linear function was confirmed with the validation test.

When applied to 100 resamplings of the dataset, it yielded a mean accuracy of prediction of 95.8%.

Analysis of the other group (68 overlaps) revealed that the +2 de novo frames differ significantly from the ancestral ones for 23 composition features (13 with a significant enrichment and 10 with a significant depletion) (Supplementary Table S3 ). They were used as input data for LDA, which compared a matrix of 68 rows (+2 de novo frames) and 23 columns with a matrix of 68 rows (ancestral frames) and 23 columns. LDA separated the +2 de novo frames from the ancestral ones with an accuracy of 100% (Fig. 6B) . This result was confirmed by the validation test, whose performance on 100 resamplings of the dataset was a mean accuracy of prediction of 98.0%.

Search for homologs of the overlap capsid protein VP2/apoptin from CAV led to detection of Avian gyrovirus 2 (AGV2) and human Gyrovirus Tu789. These homologs are distantly related each other, as they show a mean nucleotide diversity of 39.2% and a mean amino acid diversity of 47.7% for VP2 and of 61.3% for apoptin. Despite the diversity, the apoptin of Gyrovirus Tu789 is able to trigger apoptosis in cancer cell lines at levels comparable to the CAV apoptin (Bullenkamp et al., 2012; Chaabane et al., 2017) .

PLS-DA of the dataset (Fig. 2) assigned a score of -0.96 to the overlap from CAV, -0.76 to that from AGV2, and -0.94 to that from Gyrovirus Tu789 (PLS-DA score). LDA of the same dataset PLS-DA of the dataset (Fig. 2) assigned to the 7 overlaps a score ranging from -1.39 to -0.78 (PLS-DA score). LDA of the dataset (Fig. 3 ) assigned to them a score ranging from -48.1 to -42.5 (LDA.1 score). LDA of ancestral and +2 de novo frames (Fig. 6B ) assigned to the seven +2 frames encoding X protein a score ranging from -59.7 to -47.6.

The algorithm randomly permuted the synonymous codons of the phosphoprotein frame of BDV in all its 70 codon positions. Over a total of 150,000 rounds of permutation, the algorithm selected a total of 4998 new overlaps with a +2 frame not interrupted by stop codons and meeting the following criteria: i) a PLS-DA score from -1.39 to -0.78; ii) a LDA.1 score from -48.1 to -42.5;

iii) a LDA.2 score from -59.7 to -47.6. Using as additional filter a conservation criterion, that is the presence of the 16 conserved amino acids in the 7 homologous X proteins, I obtained a set of 6 variants with a mean amino acid identity of 65.0% (sd = 5.6%) (Fig. 8) . Szelechowski et al. (2017) found that a peptide covering the C-terminal end of the X protein from BDV (residues 59-87 in the sequence NC_001607, see Fig.8 ) has a neuroprotective activity similar to that of the full-length protein, whereas no activity was found for peptides spanning residues 18-58. This finding suggested a further analysis. Using the conservation criterion, but limited to the region spanning residues 18-58, I selected a total of 30 shorter variants with a mean amino acid identity of 61.6% (sd = 6.4%). The corresponding amino acid sequences are given in Supplementary File S5.

A limitation of our previous study (Pavesi et al., 2018) was that comparative analysis of overlapping and non-overlapping genes did not take into account the contribution to nucleotide diversity given by homologs, as it was carried out on 80 individual overlaps. Analysis of the present dataset (82 overlaps with the respective 237 homologs) led to detection in overlapping genes of a number of critical composition features that is about two-fold higher than that found previously (37 vs. 20). This finding made possible a PCA based on 37 variables. It confirmed that overlapping genes follow a common pattern of sequence composition (2 outliers out of 82 overlaps) more strongly than the previous PCA based on 20 variables (5 outliers out of 80 overlaps).

In addition, the present dataset made possible a more accurate evaluation of the pattern of symmetric/asymmetric evolution in overlapping genes. A previous study demonstrated that half of overlapping genes (32 out of 65) evolve in accordance to the asymmetric model (Pavesi, 2019) . This analysis, however, underestimated the sequence diversity of overlapping genes, as it selected only one homolog per overlap and, moreover, with stringent criteria (an equal length in nucleotides and an alignment of the encoded proteins with virtually no gaps). The present analysis, which examined a number of homologs three-fold higher than that in the previous study (208 vs. 65), changed the trend in the direction of a prevalence of asymmetric (37 out of 65) vs. symmetric evolution (28 out of 65).

The linear regression function given by PLS-DA, accounting for 23 composition features, and the linear function given by LDA, accounting for 21 composition features, separated overlapping from non-overlapping genes with an accuracy of 95.5% and 96.8%, respectively (Figs. 2 and 3) . When considered jointly, the two methods correctly classified 94.2% of overlapping genes and 97.1% of non-overlapping genes (Fig. 4) .

Thus, and as a complement to previous prediction methods (Firth, 2014; Sealfon et al., 2015; Schlub et al., 2018) , the present model could be a powerful tool for detecting new potential overlapping genes in the viral genome sequences deposited in database. A joint application of PLS-DA and LDA to 8 overlapping genes previously classified as putative (Pavesi et al., 2018) revealed that 5 of them are bona fide overlapping genes (see Results). Application of the model to a set of overlapping genes predicted in viruses by Firth (2014) and Schlub et al. (2018) revealed that most of them (25 out of 33) are bona fide overlapping genes (data not shown). Finally, application of the model to a new potential overlapping gene found in Hepatitis G virus (Pavesi, 2000) confirmed that the NS5A region of the viral polyprotein is a dual-coding region, as it likely encodes a de novo gene product (data not shown).

Based on the finding that mammalian overlapping genes follows a composition bias similar to viral ones (Pavesi, 2018) , the present model could also be useful for checking the wide amount of AltORFs detected in mammals by previous bioinformatics genome-wide studies (Chung et al., 2007; Ribrioux et al., 2008; Xu et al., 2010; Vanderperre et al., 2012) .

Sequence analysis of about four thousand cancer exomes has revealed that the enrichment in synonymous nucleotide substitutions found in oncogenes depends on the fact that synonymous mutations frequently act as driver mutations in human cancers (Supek et al., 2014) . By exploring the Supek's dataset, Brunet et al. (2018) suggested that the underlying cause of pathological synonymous mutations could be an amino acid change in a "hidden" protein encoded by an alternative ORF in the same mRNA. Thus, another possible application of the model could be the search for "hidden" gene product in the Cancer Genome Atlas database.

Analysis of the genome sequence of SARS-CoV-2 using the scoring rules given by PLS-DA and LDA pointed out a few interesting features. First at all, SARS-Cov-2 shows an overlap 3a protein/3b protein deeply different from the homologs from human and bat SARS-CoV. It may be that SARS-CoV-2 has lost the ability to encode a functional 3b protein. Indeed, the length of itts predicted 3b protein is about one quarter of that of human SARS-Cov, which acts as inhibitor of the host interferon response (Kopecky-Bromberg et al., 2007) .

However, within the gene region that encodes 3a protein I identified a new potential overlapping gene, shifted of two nucleotide positions (Fig. 5B) . Interestingly, this overlap is a feature unique to all the human SARS-CoV-2 genomes deposited in NCBI to date. If the hypothesis of a non-functional 3b protein in SARS-CoV-2 is correct, this finding suggests a replacement of the +1 frame encoding 3b with a +2 frame of unknown function. A similar mechanism has been described for the overlapping gene capsid protein VP2/apoptin in some human divergent gyroviruses (see Fig. 1 in Gia Phan et al., 2013) .

In addition, within the gene region that encodes the nucleocapsid protein I identified a further new potential overlapping gene, shifted of one nucleotide position (Fig. 6A) . Unlike the previous case, this putative overlap is a common feature of SARS-Cov-2 and SARS-CoV. The location of the two putative overlapping genes in the genome of SARS-CoV-2 suggests that the gene region encoding the 3a and nucleocapsid proteins could be a hotspot for overprinting. A similar feature was found previously in the "gene nursery" of deltaretroviruses (Pavesi et al., 2013) .

As shown in Supplementary Table S1, I could predict the genealogy for 46 out of 82 overlapping genes in the dataset. In addition to 9 overlaps whose genealogy was inferred only by phylogeny, Table S1 reports 37 overlaps in which prediction of ancestral and de novo frame came from codonusage alone or combined with phylogeny. I would highlight two particular cases of gene overlap, as a paradigm of different evolutionary pathways.

The first is the overlap nucleocapsid protein/non-structural protein NSs from orthohantaviruses (6 homologs, see overlap #19 in Table S1 ). In all homologs, the codon-usage method demonstrated that the frame encoding NSs (the predicted de novo frame) has a codon bias significantly more distant to that of the viral genome than that of the frame encoding the nucleocapsid protein (P values ranging from 0.0001 to 0.005).

The other is the overlap envelope glycoprotein (EGP)/secreted glycoprotein (SGP) from ebolaviruses (6 homologs, see overlap#16 in Table S1 ). In 3 homologs, the codon-usage method demonstrated that the codon bias of the frame encoding SGP (the predicted de novo frame) is significantly more distant to that of the viral genome than the codon bias of the frame encoding EGP (P values ranging from 0.0001 to 0.025). In the remaining 3 homologs, the method yielded the same genealogy, but prediction was not supported significantly (P values ranging from 0.15 to 0.25). Sabath et al. (2012) found that young de novo genes have a codon usage highly different from the rest of the genome and that older de novo genes tend to have a codon usage more adapted to that of the rest of the genome. Based on this finding, I can infer that the non-structural protein NSs of orthohantaviruses is encoded by a young de novo gene and that the secreted glycoprotein (SGP) of ebolaviruses by an older de novo gene. Table S1 reports several overlaps with a codon-usage statistics similar to that found in orthohantaviruses (e.g. overlaps #6 and #24) or to that found in ebolaviruses (e.g. overlaps #9 and #44). This information should be useful for inferring the relative age of other overlapping genes.

The finding that the length of viral genomes is negatively correlated with the length of the overlapping genes they contain could be a clue to support the gene-compression theory. This theory explains the abundance of gene overlap in viruses as a consequence of a physical constraint on genome length by the capsid (Chirico et al., 2010) or of a high mutation rate such that occurring in RNA viruses (Belshaw et al., 2007) . As most mutations are deleterious, the high mutation rate will limit the genome size, and thus new genes must come from overprinting (Holmes, 2009 ).

However, the negative correlation I found between gene overlap and genome length, albeit significant, is weak (rho = -0.31). The strong correlation reported in previous studies (Belshaw et al., 2007; Chirico et al., 2010) could be artefactual, because it was a correlation between the length of the genome and the ratio of the length of overlaps to the length of the genome. As noted previously (Pavesi et al., 2018) , using twice the same variable (the genome length) in a correlation test is statistically questionable, since the examined data are not independent.

In this study, LDA separated the ancestral overlapping frames from the de novo + 1 frames and from the +2 de novo frames with an accuracy close to 100% (Fig. 6 ). In detail, the de novo proteins encoded by the +1 frame are enriched in hydrophobic residues (leucine and methionine) and depleted in acidic residues (aspartic acid) (Supplementary Table S2 ). The de novo proteins encoded by the +2 frame are enriched in basic residues (arginine and histidine) and cysteine, and depleted in hydrophobic residues (leucine and methionine) (Supplementary Table S3 ).

These findings should support the gene-novelty theory, which states that the abundance of gene overlap in viruses is driven by selection pressures favouring the expression of new proteins with peculiar sequence properties (Rancurel et al., 2009; Brandes and Linial, 2016) .

In the overlaps with known genealogy, the strong prevalence of the +1 de novo on the +2 de novo frames (36 and 11 cases, respectively, see last column in Table S1 ) can be explained by mutation bias. Indeed, analysis of the control set of non-overlapping genes (1,724,466 nt) revealed that start AUG codons are more frequent in the +1 frame (1 per 30 codons) than the +2 frame (1 per 110) and that stop codons are less frequent in the +1 frame (1 per 16 codons) than the +2 frame (1 per 12 codons). These results are in full accordance with those reported by Willis and Masel (2017) . Belshaw et al. (2007) proposed that prevalence of the +1 on the +2 frameshifts should result in a preponderance of NYR and YRN triplets, respectively. Indeed, we would expect to find more stop codons (TAA, TAG, and TGA) by chance in a YRN-rich (+2 frameshifted) sequence than in a NYR-rich (+1 frameshifted) sequence. In accordance to this proposal, analysis of the control set of non-overlapping genes (1,724,466 nt), shifted of 1 nucleotide position, showed a preponderance of the NYR triplet (31%) on the YNR triplet (26%). The same analysis, with a shift of 2 nucleotide positions, showed a preponderance of the YNR triplet (21%) on the NYR triplet (15%).

Finally, Belshaw et al. (2007) showed that in internal overlaps the +1 frameshifts are significantly more common than the +2 frameshifts (59 and 20 cases, respectively). This finding was fully confirmed by analysis of the 47 overlapping genes with known genealogy and of their homologs. Indeed, I found that the number of internal overlaps with a +1 de novo frame is almost 5fold larger than that of internal overlaps with a +2 de novo frame (98 and 20 cases, respectively).

The selective anticancer toxicity of CAV-apoptin depends on a predominantly nuclear localization in tumor cells, whereas in normal cells it is detected mainly in cytoplasm (Danen-Van Oorschot et al., 2003) . Apoptin has a N-terminal apoptosis-inducing domain, formed by a proline-rich segment (PRS) and a leucine-rich segment (LRS), and a C-terminal apoptosis-inducing domain, formed by a bi-partite nuclear localization sequence (NLS1 and NLS2) and a nuclear export sequence (NES).

Both the N-and C-terminal halves of apoptin can induce cell death on their own, albeit less strongly than the full-length protein (Danen-Van Oorschot et al., 2003) . A truncated apoptin lacking residues 1-43 is a soluble, non-aggregating protein that maintains most of the properties of wildtype apoptin when transfected into human cancer cells (Rui-Martinez et al., 2017) . Two other constructs, the first formed by PRS and NLSs and the other by LRS and NLSs, can induce selective apoptosis in a breast cancer cell line and in glioma cells, respectively (Shen Ni et al. 2013; Zhang et al., 2017) .

The aim of these studies was to obtain deleted forms of CAV-apoptin retaining the selective anticancer activity of the full-length protein, yet with the advantage of an increased solubility (Rui-Fernandez et al., 2017) or a reduced immunogenicity (Zhang et al., 2017) . However, all efforts in this direction have been conducted so far using the wild-type apoptin of CAV. By a computer simulation of the process of overprinting, the aim of the present study is to provide new variants of CAV-apoptin. The X protein of BDV is encoded by a non-overlapping region (residues 1-17, see italic characters in Fig. 8 ) and an overlapping region (residues 18-87). The finding that its mitochondrial localization is mediated by residues 5-16 led to construction of two N-terminal manipulated mutants with an improved mitochondrial targeting and higher neuroprotective potential (Ferré et al., 2017) . Szelechowski et al. (2014) found that a short peptide (residues 59-87) provides a protection against neurodegeneration in a mouse model of Parkinson's disease that is similar to the full-length X protein, yet with the advantage of a minimally invasive method of administration. The same study revealed that two other peptides (residues 1-29 and 30-59 respectively) did not exhibit any protection.

In addition to the scoring rules given by PLS-DA and LDA, selection of variants of the X protein of BDV was favoured by the finding that the overlap phosphoprotein/X protein undergoes strong asymmetric evolution. Indeed, the amino acid identity between the 7 homologous phosphoproteins (46 conserved residues) is three-fold greater than that between the 7 homologous X proteins (16 conserved residues).

Using the scoring filters given by PLS-DA and LDA and a conservation criterion, I could obtain a first set of 6 variants of the X-protein (Fig. 8 ). Using the same scoring filters, but limiting the conservation criterion to the region from residue 18 to 58, I could obtain a further set of 30 shorter variants (Supplementary File S5). Both datasets are a useful benchmark for experimental studies testing the neuroprotective potential, especially the second one due to the poorly invasive method of administration of short peptides in vivo (Szelechowski et al., 2014) .

In the present study, a multivariate statistical analysis of a large dataset revealed new evolutionary features of viral overlapping genes. They show a common, and peculiar, pattern of nucleotide and amino acid composition. Thanks to discriminant analysis, overlapping genes can be separated from non-overlapping genes with high accuracy. Thus, the model I developed is a valuable tool for identifying new potential dual-coding regions in the viral genome sequences deposited in databases and, possibly, also in the eukaryotic genome sequences. A preliminary application of the model to SARS-CoV-2 led to prediction of 2 putative overlaps in the 3' genome region. The finding that the proteins encoded by the de novo frames differ significantly from those encoded by the ancestral frames supports the view that overprinting is a valuable source of genetic novelties. This view is corroborated by the notion that 2 de novo proteins (apoptin from CAV and X protein from BDV) can exert functions that are not virus-specific, such as a selective anticancer toxicity in human tumour cell lines and a protection against neurodegeneration in tissue culture, respectively. The search for variants of these proteins with an enhanced therapeutic effectiveness is an intriguing field of research. The contribution provided by this study could be helpful for future experimental studies. Finally, the wide collection of processed genome sequences given in Supplementary Files can be used by others as reference datasets for subsequent evolutionary studies (e.g. the relative age of overlapping genes or the occurrence of symmetric and asymmetric evolution in different regions of the same overlap). were correctly classified as overlap (their score ranged from -53.91 to -35.36). With the same score, a high percentage (97.1%) of non-overlapping genes were correctly classified as non-overlap (their score ranged from -35.31 to 0.84). Overlap 3a protein/hypothetical protein: the nucleotide sequence (from nt 25522 to 25698) encodes the region of 3a protein spanning residues 44-102. The +2 overlapping frame (from nt 25524 to 25697) encodes a hypothetical protein (underlined characters) with a length of 57 amino acids. and in the respective +1 de novo frames (grey columns). With a discriminant score of 18.02, a high percentage (97.6%) of ancestral frames were correctly classified as ancestral (their score ranged from 18.29 to 36.29). With the same score, a high percentage (98.4%) of the +1 de novo frames were correctly classified as de novo (their score ranged from -3.96 to 17.95). (B) Histogram of the distribution of the LDA score in 68 ancestral frames (black columns) and in the respective +2 de novo frames (grey columns). With a discriminant score of -15.92, all the ancestral frames and all the +2 de novo frames were correctly classified as ancestral and de novo, respectively. Fig. 7 . Alignment of the 7 most divergent apoptins yielded by simulation (from #1 to #7) with the apoptin of Chicken anemia virus (sequence NC_001427). Italic characters indicate the N-terminal apoptosis-inducing domain (Pro-rich region from residue 9 to 28 and Ile/Leu rich region from residue 33 to 46), the C-terminal apoptosis-inducing domain (NLS1 from residue 82 to 88, NES from residue 97 to 105, and NLS2 from residue 111 to 121), and a critical threonine site (residue 108). Due their critical role, these protein regions were excluded from random permutation.

Underlined positions indicate the 14 conserved amino acids in the apoptins of CAV, AGV2, and Gyrovirus Tu789. 

List of 17 composition features, sorted in accordance to a decreasing z-value, showing a significant enrichment in overlapping genes (z-value >4.41; two-tailed P = 10 -5 ; 319 degrees of freedom) with respect to non-overlapping genes. Table 2 List of 20 composition features, sorted in accordance to a decreasing z-value, showing a significant depletion in overlapping genes (z-value >4.41; two-tailed P = 10 -5 ; 319 degrees of freedom) with respect to non-overlapping genes. Table 3 List of the 5 overlapping genes in which analysis of a higher number of homologs revealed that they undergo asymmetric evolution (chi-square value >3.84; 1 degree of freedom). 

a new generation of protein database search programs

Apoptin: therapeutic potential of an early sensor of carcinogenic transformation

A. 3 rd , 1976. Overlapping genes in bacteriophage phiX174

The evolution of genome compression and genomic novelty in RNA viruses

Nonstructural protein σ1s mediates reovirus-induced cell cycle arrest and apoptosis

Gene overlapping and size constraints in the viral world

Partial least squares discriminant analysis: taking the magic away

Recognition of the polycistronic nature of human genes is critical to understanding the genotype-phenotype relationship

Human Gyrovirus Apoptin shows a similar subcellular distribution pattern and apoptosis induction as the chicken anaemia virus derived VP3/Apoptin

Identification of an overprinting gene in Merkel cell polyomavirus provides evolutionary insight into the birth of viral gene

Apoptin, a Versatile Protein with Selective Antitumor Activity

Human gyrovirus-apoptin interferes with the cell cycle and induces G2/M arrest prior to apoptosis

MicroRNA-binding viral protein interferes with Arabidopsis development

A novel influenza A virus mitochondrial protein that induces cell death

Why genes overlap in viruses

A first look at ARFome: dual-coding genes in mammalian genomes

Apoptin induces apoptosis in human transformed and malignant cells but not in normal cells

Importance of nuclear localization of apoptin for tumor-specific induction of apoptosis

The origin of a novel gene through overprinting in Escherichia coli

Evidence for the recent origin of a bacterial protein-coding, overlapping orphan gene by evolutionary overprinting

Manipulation of the N-terminal sequence of the Borna disease virus X protein improves its mitochondrial targeting and neuroprotective potential

Mapping overlapping functional elements embedded within the protein-coding regions of RNA viruses

The use of multiple measurements in taxonomic problems

Divergent gyroviruses in the feces of Tunisian children

Molecular Basis of Virus Evolution

The evolution and emergence of RNA viruses

Relation between two sets of variates

Origin of genes: "big bang" or continuous creation?

Severe acute respiratory syndrome coronavirus open reading frame (ORF) 3b, ORF 6, and nucleocapsid proteins function as interferon antagonists

Stability and evolution of overlapping genes

Discriminant analysis

Mapping of the two overlapping genes for polypeptides NS1 and NS2 on RNA segment 8 of influenza virus genome

Diversity of coding strategies in influenza viruses

Deciphering ther origin and evolution of Hepatitis B virus be means of a family of non-enveloped fish viruses

Partial least squares-discriminant analysis (PLS-DA) for classification of high-dimensional (HD) data: a review of contemporary practice strategy and knowledge gaps

Apoptin induces tumor-specific apoptosis as a globular multimer

Virus counterdefense: diverse strategies for evading the RNA-silencing immunity

The alternative open reading frame of LAGE-1 gives rise to multiple promiscuous HLA-DR-restricted epitopes recognized by T-helper 1-type tumor-reactive CD4+ T cells

Norovirus regulation of the innate immune response and apoptosis occurs via the product of the alternative open reading frame

Observation of dually decoded regions of the human genome using ribosome profiling data

Evolution of overlapping genes

Multivariate Statistical Methods

Death of a dogma: eukaryotic mRNAs can code for more than one protein

Overlapping genes

A single chicken anemia virus protein induces apoptosis

On the informational content of overlapping genes in prokaryotic and eukaryotic viruses

Detection of signature sequences in overlapping genes and prediction of a novel overlapping gene in hepatitis G virus

Viral proteins originated de novo by overprinting can be identified by codon usage: application to the "gene nursery" of deltaretroviruses

Different patterns of codon usage in the overlapping polymerase and surface genes of hepatitis B virus suggest a de novo origin by modular evolution

Overlapping genes and the proteins they encode differ significantly in their sequence composition from non-overlapping genes

Asymmetric evolution in viral overlapping genes is a source of selective protein adaptation

Overlapping messages and survivability

Overlapping genes produce proteins with unusual sequence properties and offer insight into de novo protein creation

Bioinformatics prediction of overlapping frameshifted translation products in mammalian transcripts

What is principal component analysis?

A tumor-specific kinase activity regulates the viral death protein apoptin

Identification of BING-4 cancer antigen from an alternative open reading frame of a gene in the extended MHC class II region using lymphocytes from a patient with a durable complete regression following immunotherapy

On the detection of many outliers

A truncated apoptin protein variant selectively kills cancer cells

Evolution of viral proteins originated de novo by overprinting

A simple method to detect candidate overlapping genes in viruses using single genome sequences

FRESCo: finding regions of excess synonymous constraint in diverse viruses

Selective apoptosis induction in MCF-7 cell line by truncated minimal functional region of Apoptin

Nonparametric Statistics for the Behavioral Sciences

Clustal Omega, accurate alignment of very large numbers of sequences

CD4+ Th2 cell recognition of HLA-DR-restricted epitopes derived from CAMEL: a tumor antigen translated in an alternative open reading frame

Alternative tumour-specific antigens

Synonymous mutations frequently act as driver mutations in human cancers

A viral peptide that targets mitochondria protects against neuronal degeneration in models of Parkinson's disease

An umbraviral protein, involved in long-distance RNA movement, binds viral RNA and forms unique, protective ribonucleoprotein complexes

HAltORF : a database of predicted out-of-frame alternative open reading frames in human

Direct detection of alternative open reading frames translation products in human significantly expands the proteome

The N-terminus of Bunyamwera orthobunyavirus NSs protein is essential for interferon antagonism

Size selective recognition of siRNA by an RNA silencing suppressor

Direct detection of alternative open reading frames translation products in human significantly expands the proteome

Utilization of an

A breast and melanoma-shared tumor antigen: T cell responses to antigenic peptides translated from different open reading frames

The X protein of bornaviruses interfere with type I interferon signalling

Gene birth contributes to structural disorder encoded by overlapping genes

Length of the ORF, position of the first AUG and the Kozak motif are important factors in potential dual-coding transcripts

A peptide derived from apoptin inhibits glioma growth

Black circles indicate the 4 homologs (from human, chimpanzee, woolly monkey, and long-fingered bat) of the overlap polymerase/X protein of Hepatitis B virus

The linear regression function consists of the intercept on the Y-axis a (value =1.617) and of 23 values of b, the regression coefficient for each composition feature. The values of the regression coefficient are as follows: 0.001 for G, -0.035 for AT

164 for CGA(Arg), -0.176 for TCA(Ser), -0.221 for TCG(Ser), -0.158 for AGC(Ser), -0.086 for CCG

A high percentage of overlapping genes (94.9%) were correctly classified as overlap

A high percentage (98.4) of non-overlapping genes were correctly classified as non-overlap, as their

Histogram of the distribution of the LDA score in 312 overlapping genes (grey columns) and in 244 non-overlapping genes (black columns). The linear function, accounting for 21 composition features, consists of 21 coefficients

105 for CG, 0.217 for Arg, -0.171 for Pro, -0.861 for Asn, -1.084 for Gln, 1.320 for Asp, 1.428 for Tyr, -0.884 for amino acids with high codon degeneracy, -0.164 for amino acids with low codon degeneracy

With a discriminant score of -35.31, a high percentage (96.5%) of overlapping genes

The author is grateful to Alessio Peracchi (University of Parma) and Alberto Vianelli (University of Insubria) for helpful suggestions. Special thanks to Dr. Gianmarco Del Vecchio for preparing the figures. This work has benefited from the equipment and framework of the COMP-HUB Initiative, funded by the 'Departments of Excellence' program of the Italian Ministry for Education, University and Research (MIUR, 2018(MIUR, -2022.