key: cord-281717-kzd9vvci
authors: Digard, Paul; Lee, Hui Min; Sharp, Colin; Grey, Finn; Gaunt, Eleanor
title: Intra-genome variability in the dinucleotide composition of SARS-CoV-2
date: 2020-05-08
journal: bioRxiv
DOI: 10.1101/2020.05.08.083816
sha:
doc_id: 281717
cord_uid: kzd9vvci

CpG dinucleotides are under-represented in the genomes of single stranded RNA viruses, and coronaviruses, including SARS-CoV-2, are no exception to this. Artificial modification of CpG frequency is a valid approach for live attenuated vaccine development, and if this is to be applied to SARS-CoV-2, we must first understand the role CpG motifs play in regulating SARS-CoV-2 replication. Accordingly, the CpG composition of the newly emerged SARS-CoV-2 genome was characterised in the context of other coronaviruses. CpG suppression amongst coronaviruses does not significantly differ according to genera of virus, but does vary according to host species and primary replication site (a proxy for tissue tropism), supporting the hypothesis that viral CpG content may influence cross-species transmission. Although SARS-CoV-2 exhibits overall strong CpG suppression, this varies considerably across the genome, and the Envelope (E) open reading frame (ORF) and ORF10 demonstrate an absence of CpG suppression. While ORF10 is only present in the genomes of a subset of coronaviruses, E is essential for virus replication. Across the Coronaviridae, E genes display remarkably high variation in CpG composition, with those of SARS and SARS-CoV-2 having much higher CpG content than other coronaviruses isolated from humans. Phylogeny indicates that this is an ancestrally-derived trait reflecting their origin in bats, rather than something selected for after zoonotic transfer. Conservation of CpG motifs in these regions suggests that they have a functionality which over-rides the need to suppress CpG; an observation relevant to future strategies towards a rationally attenuated SARS-CoV-2 vaccine.
CpG dinucleotides are under-represented in the DNA genomes of vertebrates (Cooper and Krawczak 1989; Simmonds, et al. 2013). Cytosines in the CpG conformation may become methylated, and this methylation is used as a mechanism for transcriptional regulation (Medvedeva, et al. 2014). Methylated cytosines have a propensity to undergo spontaneous deamination (and so conversion to a thymine). Over evolutionary time, this has reduced the frequency of CpGs in vertebrate genomes (Cooper and Krawczak 1989). However, loss of CpGs in promoter regions would affect transcriptional regulation, and so CpGs are locally retained, resulting in functionally important 'CpG islands' found in around half of all vertebrate promoter regions (Deaton and Bird 2011).

Single strand RNA (ssRNA) viruses infecting vertebrate hosts reflect the CpG dinucleotide composition of their host in a type of mimicry (Simmonds, et al. 2013). It was hypothesised that this is because vertebrates have evolved a CpG sensor which flags transcripts with aberrant CpG frequencies (Gaunt, et al. 2016). This idea was strengthened by the discovery that the cellular protein Zinc-finger Antiviral Protein (ZAP) binds CpG motifs on viral RNA and directs them for degradation (Takata, et al. 2017), and further supported by observations that CpGs can be synonymously introduced into a viral genome to the detriment of virus replication without negatively impacting transcriptional or translational efficiency (Tulloch, et al. 2014; Gaunt, et al. 2016). Current understanding is therefore that ssRNA viruses mimic the CpG composition of their host at least in part to subvert detection by ZAP. ssRNA viruses also under-represent the UpA dinucleotide, but to a far more modest extent (Simmonds, et al. 2013), and the reasons behind UpA suppression are less well understood.
A consequence of dinucleotide bias is that certain codon pairs are under-represented (Tulloch, et al. 2014; Kunec and Osterrieder 2016) (so, for example, codon pairs of the conformation NNC-GNN are among the most rarely seen codon pairs in vertebrates (Tats, et al. 2008)). Whether the two phenomena of CpG suppression and codon pair bias (CPB) are discrete remains controversial (Futcher, et al. 2015; Kunec and Osterrieder 2016; Groenke, et al. 2020).

The Coronaviridae have a generally low genomic cytosine content (Berkhout and van Hemert 2015), but as with other ssRNA viruses, nonetheless still under-represent CpG dinucleotides to a frequency below that predicted from individual base frequencies of cytosine and guanine (Woo, et al. 2007). The Coronavirus family comprises four genera: the alpha, beta, gamma and delta-coronaviruses. Human-infecting coronaviruses (HCoVs) have been identified belonging to the alpha and beta genera (Hu, et al. 2015). Alphacoronaviruses infecting humans include HCoV-229E and the more recently discovered HCoV-NL63 (van der Hoek, et al. 2004). Betacoronaviruses include HCoV-OC43, HCoV-HKU1 (Woo, et al. 2005), severe acute respiratory syndrome (SARS)-CoV (Rota, et al. 2003), Middle East respiratory syndrome (MERS)-CoV (Zaki, et al. 2012) and the recently emerged SARS-CoV-2. Prior to the emergence of SARS-CoV-2, SARS-CoV had the strongest CpG suppression across human-infecting coronaviruses (Woo, et al. 2007 [...] transcription regulation sequences (TRSs)); this complementarity allows viral polymerase jumping from the 5' leader sequence to directly upstream of ORFs preceded by a TRS (Sawicki and Sawicki 1998). The negative sense sub-genomic RNAs serve as efficient templates for production of mRNAs (Sawicki, et al. 2007).
Generally, only the first ORF of a sub-genomic mRNA is translated (Perlman and Netland 2009), although leaky ribosomal scanning has been reported as a means for accessing alternative ORFs for several coronaviruses including SARS-CoV (Schaecher, et al. 2007).

SARS-CoV-2 was recently reported to have a CpG composition lower than other members of the betacoronavirus genus, comparable to certain canine alphacoronaviruses; an observation used to draw inferences over its origin and/or epizootic potential (Xia 2020 [...] in GC content (from ~0.32 to 0.47) was seen across the Coronaviridae, and as expected, all viruses exhibited some degree of CpG suppression, with CpG O:E ratios ranging from 0.37 to 0.74 (Fig 2A). To investigate the root of this variation, the coronavirus sequence dataset was refined to remove sequences with more than 90% nucleotide identity to reduce sampling biases (so, for example, SARS-CoV sequences of human origin were stripped from over 1000 representative sequences to just one). The CpG compositions of the remaining 215 sequences (Table S1) were compared between coronavirus genera (alpha, beta, gamma and delta). For the 215 representative sequences, a genus could be assigned for 203. No differences in CpG composition between coronavirus genera were apparent, although the gamma genus exhibited a tighter range (Fig 2B). Next, we examined whether differences in CpG composition between viruses isolated from different hosts explained the range in CpG composition across the Coronaviridae. For the 215 representative sequences, a host could be assigned to 210. Coronavirus sequences were divided into host groups, and groups with at least three divergent sequences were compared; this included bat, avian, camelid, canine, feline, human, mustelid, rodent, swine and ungulate viruses.
Variation in CpG composition between coronaviruses detected in different host species was evident across groups (p = 0.0057) and between groups, with coronaviruses detected in canine and human species having lower CpG content and rodent and bat coronaviruses having the highest (Fig. 2C). Significant differences in CpG composition were detected between bat and canine (p = 0.0001), avian and rodent (p = 0.005), canine and mustelid (p = 0.011), canine and rodent (p < 0.0001), human and rodent (p = 0.002), and rodent and ungulate (p = 0.0026) viruses. All frequency ranges overlapped, however, suggesting that viral CpG frequency alone is a poor predictor of virus origin, contradicting the recent suggestion of a canine origin of SARS-CoV-2 (Xia 2020). Where sequences in a host group representative of both alpha and betacoronaviruses were available (which was the case for bat, camelid, canine, human, rodent and swine viruses), these sequences were split by genus and compared to determine whether coronavirus genera influenced coronavirus CpG frequencies in a host species-specific manner. By this method, the lack of difference in CpG composition of coronaviruses of different genera was maintained (Fig. 2D).

To test the hypothesis that coronavirus CpG content varies according to tissue tropism (Xia 2020), we classified the viruses according to their primary site of replication, where this was known or could be inferred from the sampling route. Samples were split into five categories: 'respiratory', 'enteric', 'multiple', 'other', or 'unknown'. Altogether, 206 of the 215 sequences were classifiable (detailed in Table S1), with 9 sequences categorised as 'unknown' and excluded from further analyses. By this admittedly inexact approach, viruses infecting the respiratory tract had a significantly lower mean CpG composition than viruses with enteric tropism (p = 0.032; Fig. 2E).
However, the spread of respiratory virus CpG frequencies was contained entirely within the range exhibited by enteric viruses. Furthermore, 124 sequences were assigned to the enteric group, and only 22 to the respiratory group. Of these 146 sequences, bat viruses accounted for 80, all of which were assigned to the enteric group (despite reasonable sampling of respiratory tract in bats) and this cohort of viruses maintained almost the full spread of CpG frequencies (Fig. 2E, Table S1). Thus, while coronavirus CpG frequency may show some correlation with replication site, the dataset available does not permit strong conclusions to be drawn or predictions about zoonotic potential to be made.

[...] CpG O:E ratios, SARS-CoV-2 has a genomic CpG ratio of 0.408 (representing the mean of 1163 complete genome sequences). This is similar to the value calculated previously for a much smaller sample (n = 5) of SARS-CoV-2 sequences (Xia 2020 [...] the genomic CpG ratio. However, two ORFs in particular, E ORF and ORF10, had CpG ratios higher than 1, indicating an absence of CpG suppression in those regions (Fig. 3A). These two ORFs also did not suppress the UpA dinucleotide, in contrast with other SARS-CoV-2 ORFs (Fig. 3B).

Due to the difficulties in distinguishing between dinucleotide bias and CPB, CPB scores were also calculated for each ORF and plotted against CpG composition (Fig. 3C). CPB scores provide an indication of whether the codon pairs encoded in each ORF are congruous with usage in vertebrate genomes. A score below 0 indicates use of codon pairs that are disfavoured in host ORFs. An approximately linear relationship between CpG O:E ratio and CPB score for each SARS-CoV-2 ORF was apparent (R² = 0.80). E ORF and ORF10 both had negative CPB scores, indicating that they use under-represented codon pairs and in keeping with the observation that both ORFs over-represent CpG and UpA dinucleotides.
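The observed:expected (O:E) ratios reported above compare how often a dinucleotide occurs with how often it would be expected from the frequencies of its two constituent bases. The following is a minimal sketch only, assuming the commonly used definition O:E = f(XpY) / (f(X) × f(Y)); the exact calculation used for the figures is not spelled out in this text, and the function name `dinucleotide_oe` is hypothetical.

```python
from collections import Counter


def dinucleotide_oe(seq: str, dinuc: str = "CG") -> float:
    """Observed:expected ratio of a dinucleotide in one sequence.

    Assumed definition: f(XY) / (f(X) * f(Y)), where f(XY) is the
    dinucleotide count divided by the number of dinucleotide positions
    and f(X), f(Y) are single-base frequencies. T is treated as U, so
    DNA and RNA notation give the same result.
    """
    seq = seq.upper().replace("T", "U")
    dinuc = dinuc.upper().replace("T", "U")
    n = len(seq)
    if n < 2 or len(dinuc) != 2:
        raise ValueError("need a sequence of length >= 2 and a 2-base motif")

    base_counts = Counter(seq)
    # Count every position, including overlapping ones, where the motif starts.
    observed = sum(1 for i in range(n - 1) if seq[i:i + 2] == dinuc)

    observed_freq = observed / (n - 1)
    expected_freq = (base_counts[dinuc[0]] / n) * (base_counts[dinuc[1]] / n)
    if expected_freq == 0:
        raise ValueError("one of the bases in the motif is absent from the sequence")
    return observed_freq / expected_freq


if __name__ == "__main__":
    toy_orf = "AUGUACUCAUUCGUUUCGGAAGAGACAGGUACG"  # illustrative sequence only
    print(round(dinucleotide_oe(toy_orf, "CG"), 3))
    print(round(dinucleotide_oe(toy_orf, "UA"), 3))
```

Under this definition a value below 1 indicates suppression of the dinucleotide and a value above 1 indicates over-representation, which is how the per-ORF CpG and UpA ratios are read in the text.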
To examine the precise location of the CpG hotspots, a sliding window analysis of CpG content across the 3' end of the SARS-CoV-2 genome (averaged over 1163 complete genome sequences) as well as the closely related bat and pangolin sequences was performed. As expected, marked increases in CpG O:E ratio were observed concomitant with the genomic regions associated with E ORF and ORF10 (Fig. 3D). The E ORF and ORF10 regions associated with high CpG composition were maintained across the bat, pangolin and human sequences, indicating that since the bat sample was collected in 2013, the higher CpG frequency in this region has not been negatively selected. While the increase in CpG presentation was apparent across the entire E ORF, starting at the 3' end of ORF3 and ending at the beginning of the M gene, the CpG spike in ORF10 was more narrowly associated with the putative coding region. Additionally, a CpG spike between the 3'-end of ORF8 and the 5'-end of the N gene was evident. The 5'-end of the N ORF also contains the overlapping ORF9b gene, which when considered alone, has a CpG O:E ratio approaching 1 (Fig. 3A) [...] CoV-2, ratios for E ORF: genomic CpG O:E were calculated (Fig. 4B). In non-bat non-avian host genomes, E ORF usually displayed CpG suppression in line with or stronger than that seen at the genome level, whereas SARS-CoV and SARS-CoV-2 starkly contrasted with this, displaying far less CpG suppression in this region.

To investigate the evolutionary history of E ORF CpG composition in the human-infecting coronaviruses, a phylogenetic reconstruction of all 7 human coronavirus and 96 bat coronavirus E genes was performed to determine whether CpG ratios in this region were ancestrally derived. As expected (Cotten, et al. 2013; [...]), the human viruses were interspersed among the bat viruses, reflective of their independent emergence events (Fig. 4C) [...] Coronaviridae is striking.
If coronaviruses also produce a protein with anti-ZAP activity, it is possible that this has variable efficacy between strains, explaining the ability of coronaviruses to fluctuate CpG composition considerably. Alternatively (or in addition), this may be host driven; we show that average CpG suppression varies with host species (Fig 2C) and, as previously suggested (Xia 2020), this may be linked with ZAP expression levels. We have demonstrated that CpG variation is not related to viral taxonomic grouping (Fig. 2B) but we did find an association between viral CpG composition and primary replication site, with respiratory coronaviruses having a lower CpG composition than enteric ones (Fig. 2E). This is the opposite of what has been previously suggested (Xia 2020), though this proposal was not supported by any comprehensive investigation. Nevertheless, our meta-analysis was subject to the sampling preferences of many labs that have performed surveillance for coronaviruses, and many of the tissue tropism assignments we made have not been verified by experimental infections. Another limitation of this analysis is that only sequences of greater than 10% divergence were included. Tissue tropism can be defined by much smaller differences; for example, a deletion in the spike protein of transmissible gastroenteritis virus (a porcine coronavirus) altered the tropism of the virus from enteric to respiratory, while nucleotide identity was preserved at 96% (Cox, et al. 1990; Rasschaert, et al. 1990 [...] 2017) and speculatively this may indicate that SARS-CoV-2 was genetically predisposed to make a host switch into humans. Similarly, the genomic CPB score of 0.048 indicates that SARS-CoV-2 uses codon pairs which are preferentially utilised in the human ORFeome, which may mean that the virus was well suited for translational efficiency in humans at its time of emergence.
In coding regions which do not have overlapping ORFs, there is no requirement at the coding level for CpG motifs to be retained (Kanaya, et al. 2001). E ORF and ORF10 are not known to be in overlapping reading frames; conversely, ORF9b overlaps with the ORF for nucleocapsid (N). Some CpG retention in this region is therefore inevitable and may explain the high CpG composition of ORF9b. This nevertheless leaves open the question of why CpG motifs are retained in the E ORF and ORF10 regions (if this is not an ancestrally derived evolutionary hangover; as CpGs have not been lost from these regions between 2013 and now (Fig. 3D) [...] lower abundance than most other transcripts (Kim, et al. 2020). It is therefore possible that E ORF is of sufficiently low abundance for a high CpG frequency to be physiologically inconsequential. Similar logic can be applied to ORF10, which is just 117 nucleotides in length.

Synonymous addition of CpGs into a virus genome has been suggested as a potential novel approach [...] context of their CpG composition and find that SARS-CoV-2 has a low CpG composition in comparison with other coronaviruses, but with CpG 'hotspots' in genomically disparate regions. This highlights the potential for large scale recoding of the SARS-CoV-2 genome by introduction of CpGs into multiple regions of the virus genome as a mechanism for generation of an attenuated live vaccine. Introduction of CpG into multiple sites could also be used to subvert the potential of the virus to revert to virulence through recombination. A challenge of live attenuated vaccine manufacture is to enable sufficient production of a vaccine virus that has a replication defect. Strategic introduction of CpGs into specific regions of the virus genome has the potential to negate a replication defect in ZAP [...] coronaviruses were downloaded from NCBI on 16 April 2020 (3407 sequences in total).
Sequences were then aligned, and sequences less than 10% divergent at the nucleotide level, identified using the 'identify similar/identical sequences' function in SSE v1.4, were removed from the dataset. Sequences were annotated into animal groups and genera based on their description in the NCBI database. The trimmed dataset (Table S1) included 215 complete genome coronavirus sequences. Individual groups were made for sequences originating from the following hosts: bat (n = 108), avian (35), camelid (3), canine (7), feline (9), human (7), mustelids (5), rodents (8), swine (15), ungulates (9) and 'other' (which included bottle-nosed dolphin (2), hedgehog (2), rabbit (2), beluga whale (1), civet (1) and pangolin (1)). Groups were loosely defined based on taxonomic orders, with some exceptions made to examine our specific research questions. Bats are of the order Chiroptera; multiple avian orders were grouped together (Galliformes, Anseriformes, Passeriformes, Gruiformes, Columbiformes and Pelecaniformes); even toed (Artiodactyla) and odd toed (Perissodactyla) ungulate orders were grouped, with camelids analysed separately due to their association with MERS-CoV (Azhar, et al. 2014); Canidae (canine) and Pantherinae (feline) sequences of the Carnivora order were analysed separately, as canines have previously been suggested as an intermediate host species for SARS-CoV-2 (Xia 2020) and cat infections with SARS-CoV-2 have been reported; humans were the only representatives from the Primate order; all remaining Carnivora, with the exception of a single civet sequence, belonged to the Mustelidae (mustelids); rodents belong to the Rodentia order; and swine belong to the Artiodactyla order; whales are also Artiodactyla but swine were considered separately due to considerable interest in porcine coronaviruses (Vlasova, et al. 2020).
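The removal of sequences less than 10% divergent, described above, can be illustrated with a simple greedy filter over an alignment. This is a sketch under stated assumptions, not the SSE implementation: the distance measure (uncorrected proportion of differing sites, ignoring gap columns), the greedy first-seen order, and the names `p_distance` and `pick_representatives` are all hypothetical choices made for illustration.

```python
def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap ('-') are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    compared = 0
    differences = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x != y:
            differences += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differences / compared


def pick_representatives(aligned: dict, min_divergence: float = 0.10) -> list:
    """Keep a sequence only if it is at least `min_divergence` away from
    every sequence already kept (greedy, in input order)."""
    kept = []
    for name, seq in aligned.items():
        if all(p_distance(seq, aligned[k]) >= min_divergence for k in kept):
            kept.append(name)
    return kept


if __name__ == "__main__":
    # Toy alignment: 'b' differs from 'a' at 1/20 sites, 'c' at 4/20 sites.
    toy = {
        "a": "A" * 20,
        "b": "A" * 19 + "C",
        "c": "A" * 16 + "CCCC",
    }
    print(pick_representatives(toy))
```

A greedy pass depends on input order, so which member of a cluster survives as the representative is arbitrary in this sketch; only the number and spread of representatives is meaningful.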
Sequences were also annotated for genus by reference to the NCBI description (203 of the 215 sequences were assigned to a genus), and for primary replication site by literature reference (refer to Table S1). Replication site annotations were based on the sample type from which a coronavirus sequence was obtained: 'enteric' for faecal/gastrointestinal samples; 'respiratory' for nasal, oropharyngeal and other respiratory samples; 'multiple' if samples from multiple systems tested positive; 'other' if the sample was collected from a site not falling into the enteric or respiratory categories (e.g. brain); or 'unknown' if a sample type could not be determined. If only one sampling route was tested and returned a positive result, the sequence was categorised in accordance with the sole sampling route. The sequence datasets used in this paper are summarised in Fig. 1 [...] These were then categorised by genera, host, and tissue tropism. The subset of 215 sequences was also aligned over the E ORF and grouped by host (blue shaded boxes). Each box firstly describes each dataset used, the number of sequences in that dataset is then indicated in italicized font, and the figure to which the dataset corresponds is indicated in bold font.

[...] include only one representative from sequences with less than 10% nucleotide diversity to overcome epidemiologic biases (215 representative sequences), which were analysed in the subsequent sub-figures. B. Coronavirus genus against genomic CpG content. Other human-infecting coronaviruses (HCoV-229E, HCoV-NL63 (alphacoronaviruses) and HCoV-HKU1 and HCoV-OC43 (betacoronaviruses)) are represented using orange circles. C. Vertebrate host of coronavirus against genomic CpG content.
Statistically significant differences between CpG compositions of viruses from different hosts are indicated above the x axis line, with 'C' denoting a statistically significant difference from canine coronaviruses and 'R' denoting a statistically significant difference from rodent coronaviruses. Tukey's multiple comparisons test was used to identify differences in CpG composition between viruses infecting different hosts. A p value < 0.05 is indicated with *, p < 0.01 = **, p < 0.001 = *** and p < 0.0001 = ****. D. Vertebrate host of coronavirus, with further sub-division into coronavirus genus, against genomic CpG content. Alphacoronaviruses are denoted with filled circles and betacoronaviruses with open circles. E. Primary replication site against genomic CpG content by host. Tukey's multiple comparisons test was used to identify differences in CpG composition between viruses infecting different tissues. For a full breakdown of how these were assigned, please refer to Table S1.
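The asterisk convention stated in the legend above maps an adjusted p value onto a significance label. As a small illustrative sketch (the function name `significance_stars` is hypothetical, and the thresholds are taken directly from the legend):

```python
def significance_stars(p: float) -> str:
    """Map a p value to the asterisk label used in the figure legend.

    Thresholds follow the legend: * for p < 0.05, ** for p < 0.01,
    *** for p < 0.001 and **** for p < 0.0001. An empty string means
    the difference is not flagged as significant.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p value must lie between 0 and 1")
    # Check the strictest threshold first so the most stars win.
    for threshold, label in ((0.0001, "****"), (0.001, "***"), (0.01, "**"), (0.05, "*")):
        if p < threshold:
            return label
    return ""


if __name__ == "__main__":
    # p values quoted in the text for selected host-group comparisons.
    for p in (0.0057, 0.032, 0.0001, 0.2):
        print(p, significance_stars(p) or "ns")
```

Because the comparisons are strict inequalities, a value exactly equal to a threshold (for example p = 0.0001) receives the label of the next, less stringent band.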
The influence of CpG and UpA dinucleotide frequencies on RNA virus replication and characterization of the innate cellular pathways underlying virus attenuation and enhanced replication
Evidence for Camel-to-Human Transmission of MERS Coronavirus
Bats and Coronaviruses
On the biased nucleotide composition of the human coronavirus RNA genome
Genetic Inactivation of Poliovirus Infectivity by Increasing the Frequencies of CpG and UpA Dinucleotides within and across Synonymous Capsid Region Codons
Translation initiation at alternate in-frame AUG codons in the rabies virus phosphoprotein mRNA is mediated by a ribosomal leaky scanning mechanism
Cytosine methylation and the fate of CpG dinucleotides in vertebrate genomes
Full-genome deep sequencing and phylogenetic analysis of novel human betacoronavirus
Sites of replication of a porcine respiratory coronavirus related to transmissible gastroenteritis virus
Characterisation of the transcriptome and proteome of SARS-CoV-2 using direct RNA sequencing and tandem mass spectrometry reveals evidence for a cell passage induced in-frame deletion in the spike glycoprotein that removes the furin-like cleavage site
CpG islands and the regulation of transcription
Cytosine methylation by DNMT2 facilitates stability and survival of HIV-1 RNA in the host cell during infection
KHNYN is essential for the zinc finger antiviral protein (ZAP) to restrict HIV-1 containing clustered CpG dinucleotides
Candidates in Astroviruses, Seadornaviruses, Cytorhabdoviruses and Coronaviruses for +1 frame overlapping genes accessed by leaky scanning
Reply to Simmonds et al.: Codon pair and dinucleotide bias have not been functionally distinguished
Elevation of CpG frequencies in influenza A genome attenuates pathogenicity but enhances host response to infection
Patterns of evolution and host gene mimicry in influenza and other RNA viruses
Mechanism of Virus Attenuation by Codon Pair Deoptimization
The zinc-finger antiviral protein recruits the RNA processing exosome to degrade the target mRNA
Nonrandom utilization of codon pairs in Escherichia coli
Hostile takeovers: viral appropriation of the NF-kB pathway
Bat origin of human coronaviruses
High-Resolution Analysis of Coronavirus Gene Expression by RNA Sequencing and Ribosome Profiling
Codon Usage and tRNA Genes in Eukaryotes: Correlation of Codon Usage Diversity with Translation Efficiency and with CG-Dinucleotide Usage as Assessed by Multivariate Analysis
Identification of direct targets and modified bases of RNA cytosine methyltransferases
The architecture of SARS-CoV-2 transcriptome
Point mutations define a sequence flanking the AUG initiator codon that modulates translation by eukaryotic ribosomes
MEGA X: Molecular Evolutionary Genetics Analysis across Computing Platforms
Codon Pair Bias Is a Direct Consequence of Dinucleotide Bias
Requirement of the 5'-end genomic sequence as an upstream cis-acting element for coronavirus subgenomic mRNA transcription
Evidence for involvement of a ribosomal leaky scanning mechanism in the translation of the hepatitis B virus Pol gene from the viral pregenome RNA
Human cytomegalovirus evades ZAP detection by suppressing CpG dinucleotides in the major immediate early genes
Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding
The Genome Sequence of the SARS-Associated Coronavirus
Asymmetrical distribution of CpG in an 'average' mammalian gene
Effects of cytosine methylation on transcription factor binding sites
Attenuation of RNA viruses by redirecting their evolution in sequence space
Downstream Ribosomal Entry for Translation of Coronavirus TGEV Gene 3b
The role of ZAP and OAS3/RNAseL pathways in the attenuation of an RNA virus with elevated frequencies of CpG and UpA dinucleotides
Coronaviruses post-SARS: update on replication and pathogenesis
Porcine respiratory coronavirus differs from transmissible gastroenteritis virus by a few genomic deletions
Dinucleotide and stop codon frequencies in single-stranded RNA viruses
Characterization of a Novel Coronavirus Associated with Severe [...]
Translation reinitiation and leaky scanning in plant viruses
A New Model for Coronavirus Transcription. Coronaviruses and Arteriviruses
A Contemporary View of Coronavirus Transcription
The ORF7b Protein of Severe Acute Respiratory Syndrome Coronavirus (SARS-CoV) Is Expressed in Virus-Infected Cells and Incorporated into SARS-CoV Particles
Evidence for translation of the Borna disease virus G protein by leaky ribosomal scanning and ribosomal reinitiation
Bovine coronavirus I protein synthesis follows ribosomal scanning on the bicistronic N mRNA
SSE: a nucleotide and amino acid sequence analysis platform
Modelling mutational and selection pressures on dinucleotides in eukaryotic phyla - selection against CpG and UpA in cytoplasmically expressed RNA and in RNA viruses
Widespread occurrence of 5-methylcytosine in human coding and non-coding RNA
The expected equilibrium of the CpG dinucleotide in vertebrate genomes under a mutation model
CG dinucleotide suppression enables antiviral defence targeting non-self RNA
The Short Form of the Zinc Finger Antiviral Protein Inhibits Influenza A Virus Protein Expression and Is Antagonized by the Virus-Encoded NS1
Preferred and avoided codon pairs in three domains of life
Sequence Context at Human Single Nucleotide Polymorphisms: Overrepresentation of CpG Dinucleotide at Polymorphic Sites and Suppression of Variation in CpG Islands
RNA virus attenuation by codon pair deoptimisation is an artefact of increases in CpG/UpA dinucleotide frequencies. eLife 3:e04531.
van der Hoek L. Identification of a new human coronavirus
Porcine Coronaviruses. Emerging and Transboundary Animal Viruses
Overlapping signals for translational regulation and packaging of influenza A virus segment 2
Characterization and Complete Genome Sequence of a Novel Coronavirus
Cytosine deamination and selection of CpG suppressed clones are the two major independent biological forces that shape codon usage bias in coronaviruses
Extreme genomic CpG deficiency in SARS-CoV-2 and evasion of host antiviral defense
The 3C protease of enterovirus A71 counteracts the activity of host zinc-finger antiviral protein (ZAP)
Isolation of a Novel Coronavirus from a Man with Pneumonia in Saudi Arabia
A Novel Coronavirus from Patients with Pneumonia in China