key: cord-035157-97tfcgvq
authors: Panchin, Alexander Y.; Panchin, Yuri V.
title: Excessive G–U transversions in novel allele variants in SARS-CoV-2 genomes
date: 2020-07-28
journal: PeerJ
DOI: 10.7717/peerj.9648
sha: 
doc_id: 35157
cord_uid: 97tfcgvq

BACKGROUND: SARS-CoV-2 is a novel coronavirus that causes COVID-19 infection, with a closest known relative found in bats. For this virus, hundreds of genomes have been sequenced. This data provides insights into SARS-CoV-2 adaptations, determinants of pathogenicity and mutation patterns. A comparison between patterns of mutations that occurred before and after SARS-CoV-2 jumped to human hosts may reveal important evolutionary consequences of zoonotic transmission. METHODS: We used publically available complete genomes of SARS-CoV-2 to calculate relative frequencies of single nucleotide variations. These frequencies were compared with relative substitutions frequencies between SARS-CoV-2 and related animal coronaviruses. A similar analysis was performed for human coronaviruses SARS-CoV and HKU1. RESULTS: We found a 9-fold excess of G–U transversions among SARS-CoV-2 mutations over relative substitution frequencies between SARS-CoV-2 and a close relative coronavirus from bats (RaTG13). This suggests that mutation patterns of SARS-CoV-2 have changed after transmission to humans. The excess of G–U transversions was much smaller in a similar analysis for SARS-CoV and non-existent for HKU1. Remarkably, we did not find a similar excess of complementary C–A mutations in SARS-CoV-2. We discuss possible explanations for these observations.

INTRODUCTION SARS-CoV-2 is a novel coronavirus that causes an infectious respiratory disease called COVID-19 (Wu et al., 2020) . SARS-CoV-2 is closely related to the bat coronavirus RaTG13 with around 96% whole genome nucleotide sequence identity . At the complete genome scale it also shares 93.3% identity with the bat-derived coronavirus RmYN02, with 97.2% identity in the 1ab gene . Since the discovery of SARS-CoV-2, its evolution became of particular interest to biologists because of the amount of sequencing data that was produced and the importance of this data to provide insights on the determinants of viral pathogenicity (Gussow et al., 2020) and adaptations to human hosts (Van Dorp et al., 2020) . SARS-CoV-2 is also quite interesting from a genomics prospective, with an extreme deficiency of genomic CpG dinucleotides of debatable origin (Xia, 2020) .

Comparative analysis of genomic data favors the natural origin of SARS-CoV-2, accompanied by natural selection either before or after zoonotic transfer directly from bats or through an intermediate host (Andersen et al., 2020) . In either scenario, SARS-CoV-2 would be exposed to both a novel evolutionary landscape that affects the fitness of its genetic variants and novel cellular conditions that could affect its mutation rates directly. For example, it is known, that bats have evolved a number of adaptations including superior resistance to oxidative stress that allow them to harbor multiple viruses without getting the corresponding diseases (Schountz et al., 2017) . In light of this, we decided to investigate if the relative mutation frequencies in SARS-CoV-2 changed after its transmission to human hosts.

We compared the relative frequencies of single nucleotide variations (which we will refer to as mutations) in SARS-CoV-2 with the relative frequencies of substitutions that it acquired since the divergence with its last common ancestor with a closely related coronavirus from bats RaTG13. In our terminology, substitutions occurred before zoonotic transmission, while mutations were acquired after. A similar analysis was performed for SARS-CoV and HKU1 coronaviruses.

We found that SARS-CoV-2 has an over 9-fold excess of G-U mutations over substitutions. This effect was much weaker in a similar analysis for SARS-CoV and was not present for HKU1. On the other hand, the substitution profile of SARS-CoV-2 turned out to be quite similar to that of the other coronaviruses, lending further support to existing scenarios of its natural origin (Andersen et al., 2020) and suggesting that the changes in SARS-CoV-2 mutation frequencies have accompanied its transition to human hosts.

We obtained 1,271 SARS-CoV-2, 194 SARS-CoV and 38 HKU1 publicly available complete genomes from the NCBI NR database. To ensure similarity in data acquisition, we did this by using the BLASTn (Altschul et al., 1990) program with the three reference human coronavirus genomes as queries (NCBI Reference Sequences: NC_045512.2, NC_004718.3 and NC_006577.2). SARS-CoV-2 hits were filtered using the following strings: "Severe acute respiratory syndrome-related coronavirus" and "complete genome". SARS-CoV hits were filtered using the strings: "SARS coronavirus" and "complete genome". HKU1 genomes were filtered using strings "HKU1" and "complete genome". The final list of obtained accessions is available in Table S1 .

We used Clustal Omega (Sievers et al., 2011) to create three multiple alignments: one for SARS-CoV-2, one for SARS-CoV and one for HKU1 genomes. In each alignment, we established the consensus sequence. The most frequent nucleotide in each position was used as the consensus nucleotide. If any individual sequence contained a nucleotide N2 that is different from the consensus N1, we considered that a N1->N2 mutation has occurred in that position. We assume that these mutations have occurred in viruses after zoonotic transfer. Note that if several sequences contained the same nucleotide N2 that is different from the consensus N1, this would still count as only one mutation. The three final alignments are available in fasta format in Supplemental Materials.

As additional controls, we performed the following (and only the following) subanalysis. For SARS-CoV-2:

1. SARS-CoV-2 genomes that were sequenced with Ion Torrent 2. SARS-CoV-2 genomes that were sequenced with Oxford Nanopore 3. SARS-CoV-2 genomes that were sequenced in USA 4. SARS-CoV-2 genomes that were sequenced in China 5. Exclude mutations that occurred in less than two sequences 6. Substitute the consensus sequences with the reference sequence 7. Mask the first and last 100 nucleotides of the alignment 8. Remove genomes with "N" sequences

For each of the three human coronaviruses we obtained the genome of a closely related animal coronavirus and a more distant (outgroup) coronavirus from NCBI NR. For SARS-CoV-2 we used RaTG13 (GenBank: MN996532.1) as the close relative and pangolin coronavirus PCoV_GX-P5L (GenBank: MT040335.1) as an outgroup. For SARS-CoV we used Bat SARS-like coronavirus isolate Rs4231 (GenBank: KY417146.1) and Bat coronavirus BtCoV/273/2005 (GenBank: DQ648856.1). For HKU1 we used Betacoronavirus sp. strain VZ_BetaCoV_16715_52 (GenBank: MH687968.1) and Camel coronavirus HKU23 Ry123 (GenBank: KT368891.1). Multiple alignments were performed using Clustal Omega with default parameters (Sievers et al., 2011) . See Fig. S1 for a simple Neighbor-joining tree for all nine different complete coronavirus genomes created by Clustal Omega on the basis of the alignment (default parameters). See Jaimes et al. (2020) for a detailed phylogenetic analysis of SARS-CoV-2 and related coronaviruses.

Substitutions that occurred during human coronavirus evolution were identified by maximum parsimony. If a human coronavirus sequence contained a nucleotide N1 and the two animal coronaviruses contained a different nucleotide N2, we considered that an N2->N1 substitution has occurred (Method 1). We assume that these substitutions have occurred in viruses before zoonotic transfer. We did not count positions in which all three coronaviral genomes differed.

We also used a different measure of substitutions based on single nucleotide differences between each reference human coronavirus sequence and RaTG13, Rs4231 and Betacoronavirus sp. strain VZ_BetaCoV_16715_52 for SARS-CoV-2, SARS-CoV and HKU1 respectively (Method 2). This method does not allow us to establish a direction for substitutions, but provides a larger sample size. We used the same alignment as in Method 1.

We identified 1,251, 1,128 and 2,039 single nucleotide variants in available SARS-CoV-2, SARS-CoV and HKU1 genomes. We assumed that these are mutations that occurred after zoonotic transfer to human hosts. Mutations are deviations of individual genomes from the consensus sequences, but it should be noted that our SARS-CoV-2 consensus had only five nucleotide differences from the reference genome.

In the corresponding genomes we also identified 450, 499 and 3,029 (Method 1, with parsimony-reconstructed ancestral states) and 1,141, 1,237 and 6,514 (Method 2) single nucleotide substitutions. We assume that these substitutions occurred before zoonotic transfer. Substitutions are deviations in the reference genome from the predicted ancestral states (Method 1) or deviations in the reference genome from a closely related animal coronavirus genome (Method 2). Figure 1 shows the proportion of each mutation/substitution type (substitutions based on Method 1) among all mutations/substitutions. The relative substitution frequencies between SARS-CoV-2 and RaTG13 appear to be similar to those between other human coronaviruses and their relatives, consistent with the natural emergence of the SARS-CoV-2 virus (Andersen et al., 2020) .

Tables 1 and 2 give a more detailed view of the number of mutations and substitutions in SARS-CoV-2 (Methods 1 and 2 respectively). P-values are based on a two-tailed Fishers exact test.

Out of 1,251 SARS-CoV-2 novel variants, 193 (15.4%) are G-U transversions, which is over 9.5-fold greater comparing to 7/450 (1.56%, Method 1) or 20/1,141 (1.75%, Method 2) G-U substitutions (P < 0.0001/12, two-tailed Fishers exact test, with Bonferroni correction for 12 multiple comparisons). This effect was not found for SARS-CoV (41/1,128 or 3.63% mutations vs. 11/499 or 2.2% substitutions, Method 1 vs. 33/1,237 or 2.67% substitutions, Method 2) or HKU1 (89/2,039 or 4.36% mutations vs. 253/3,029 or 8.35% substitutions, Method 1 vs. 625/6,514 or 9.59% substitutions, Method 2).

Most SARS-CoV-2 genomes are currently sequenced using Illumina platforms. To rule out possible sequencing artifacts we performed several subanalysis. First, we searched for mutations in SARS-CoV-2 genomes that were sequenced using Ion Torrent (30 genomes) or Oxford Nanopore (192 genomes) . Although the mutation sample size was small, the fraction of G-U mutations was similar to the rest of the data: 13/79 or 16.4% (Ion Torrent) and 30/199 or 15.1% (Oxford Nanopore) G-U mutations. Rayko & Komissarov (2020) have reported a lower transition/transversion ratio in singleton SARS-CoV-2 genomic variations (that can only be seen in one genome submission). As a separate control, we looked at mutations that were identified in two or more independent genome assemblies. This yielded a similar high proportion of G-U mutations (49/360 or 13.6%). Masking of the first and last 100 nucleotides of the alignments or removing all sequences with at least one "N" letter resulted in 16% (188/1,173) or 14.87% (132/888) G-U mutations correspondingly. De Maio et al. (2020) reported other sequencing issues of SARS-CoV-2 genomes, such as particular highly-mutable sites that might be recurring artifacts. However, the number of such reported sites is too low to affect our results as we counted several single nucleotide variations of the same type in one site as the result of one mutation.

We wanted to see if the effect is robust to other subsamples, such as SARS-CoV-2 genomes sequenced in USA or China. For USA genomes, we obtained 15.1% G-U mutations (159/1,053). For Chinese genomes, the mutation sample size was very low, however the effect was similar (20.5% or 17 out of 83 mutations are G-U).

Among the 193 G-U mutations in SARS-CoV-2, 21 are outside of coding regions, 21 are synonymous and 151 are nonsynonymous. We found no nonsense G-U mutations. Coordinates of all G-U mutations are available in Table S2 .

There are two remarkable observations regarding the excess of G-U transversions in SARS-CoV-2. One is that it probably reflects a change in SARS-CoV-2 mutation rates after zoonotic transfer to humans, since the proportion of G-U substitutions measured between the SARS-CoV-2 and the bat coronavirus RaTG13 is unremarkable.

The second remarkable feature is that this excess of mutations is asymmetric: there is no similar effect for C-A mutations. SARS-CoV-2 is a positive (+) RNA strand virus. The copying of positive and negative strands of coronavirus RNA is executed by the same enzymes (Sola et al., 2015) . If RNA copying was prone to G-U errors when creating the positive strand, the same mechanism would be expected to introduce G-U errors when copying the negative strand, resulting in additional C-A errors on the positive strand. Note that in SARS-CoV-2 the G (19.6%) and C (18.4%) content are similar, as are A (29.9%) and U (32.1%) content. Recently it was suggested that SARS-CoV-2 mutation rates could be affected by variations in its RNA-dependent RNA polymerase (Pachetti et al., 2020) , but it is unclear how this could explain the asymmetric increase of G-U transversions.

The most known example of mutation bias is the excess of C-T mutations in the CpG context in the genomes of many animals (Cooper & Krawczak, 1989) , although other important mutation contexts exist (Panchin et al., 2011) . Usually in such cases, the excess of complementary mutations (such as G-A in the case of CpG context) is also present and is of the same magnitude.

In theory, strand-specific RNA editing could cause the observed mutation asymmetry. Recently, Nanopore experiments suggested that SARS-CoV-2 has unique RNA-editing sites

These A-G and U-C changes were the most predominant. However, in one of three patients, with the highest read coverage, G-U SNVs were also abundant. In the other two patients, the read coverage was much lower, so it is difficult to conclude if there was a difference in SARS-CoV-2 mutation rates between patients. Interestingly, the data provided by Di Giorgio et al. (2020) also reveals G-U and C-A mutation asymmetry not only in SARS-CoV-2 but also in MERS-CoV. One notable cause of G-T mutations in DNA is due to reactive oxygen species that generate 8-oxoguanine (8-oxoG) (Ohno et al., 2014), which can be paired not only with cytosines, but also with adenines, resulting in nucleotide mis-incorporation during DNA synthesis

which could explain why the excess of G-U substitutions is not observed between SARS-CoV-2 and bat coronavirus RaTG13. It is unclear, however, why SARS-CoV-2 is different from SARS-CoV in this regard. We believe this hypothesis requires further investigation. CONCLUSIONS We report a 9-fold asymmetrical G-U (+) strand or C-A (−) strand mutation bias in SARS-CoV-2. This feature cannot be traced in the substitution data that reflects the virus's evolutionary history before its transmission to our species. The observed effect points to a recently acquired change in SARS-CoV-2 mutation pattern or difference in its

Basic local alignment search tool

The proximal origin of SARS-CoV-2

APOBEC-1-mediated RNA editing

High basal heat-shock protein expression in bats confers resistance to cellular heat/ oxidative stress

Cytosine methylation and the fate of CpG dinucleotides in vertebrate genomes

Transcriptional mutagenesis mediated by 8-oxoG induces translational errors in mammalian cells

Issues with SARS-CoV-2 sequencing data

Oxidative stress as key player in severe acute respiratory syndrome coronavirus (SARS-CoV) infection. Archives of

Evidence for host-dependent RNA editing in the transcriptome of SARS-CoV-2

Genomic determinants of pathogenicity in SARS-CoV-2 and other human coronaviruses

Phylogenetic analysis and structural modeling of SARS-CoV-2 spike protein reveals an evolutionary distinct and proteolytically sensitive activation loop

The architecture of SARS-CoV-2 transcriptome

RNA damage and surveillance under oxidative stress

8-Oxoguanine causes spontaneous de novo germline mutations in mice

Emerging SARS-CoV-2 mutation hot spots include a novel RNA-dependent-RNA polymerase variant

New words in human mutagenesis

Quality control of low-frequency variants in SARS-CoV-2 genomes

Adenosine deaminases acting on RNA (ADARs) are both antiviral and proviral

Methylene blue and rose bengal photoinactivation of RNA bacteriophages: comparative studies of 8-oxoguanine formation in isolated RNA

Immunological control of viral infections in bats and the emergence of viruses highly pathogenic to humans

Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega

Continuous and discontinuous RNA synthesis in coronaviruses

Emergence of genomic diversity and recurrent mutations in SARS-CoV-2. Infection

A new coronavirus associated with human respiratory disease in China

Extreme genomic CpG deficiency in SARS-CoV-2 and evasion of host antiviral defense

A novel bat coronavirus closely related to SARS-CoV-2 contains natural insertions at the S1/S2 cleavage site of the spike protein

We are grateful to Alexander Tyshkovskiy and Evgenia Dueva for helpful comments on our manuscript.

Funding

mechanism by which this mutation bias is introduced and how its asymmetry is maintained.

The authors declare that they have no competing interests.

Alexander Y. Panchin conceived and designed the experiments, performed the experiments, analyzed the data, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final draft. Yuri V. Panchin conceived and designed the experiments, analyzed the data, authored or reviewed drafts of the paper, and approved the final draft.

The following information was supplied regarding data availability:Raw data and code are available in the Supplemental Files.

Supplemental information for this article can be found online at http://dx.doi.org/10.7717/ peerj.9648#supplemental-information.