key: cord-0710099-fil707t3
authors: Di Gioacchino, Andrea; Šulc, Petr; Komarova, Anastassia V; Greenbaum, Benjamin D; Monasson, Rémi; Cocco, Simona
title: The heterogeneous landscape and early evolution of pathogen-associated CpG dinucleotides in SARS-CoV-2
date: 2021-02-08
journal: Mol Biol Evol
DOI: 10.1093/molbev/msab036
sha: 8b37b9705630f2b9bf50281e108de4cdece67b50
doc_id: 710099
cord_uid: fil707t3

COVID-19 can lead to acute respiratory syndrome, which can be due to dysregulated immune signaling. We analyze the distribution of CpG dinucleotides, a pathogen-associated molecular pattern, in the SARS-CoV-2 genome. We characterize CpG content by a CpG force that accounts for statistical constraints acting on the genome at the nucleotidic and amino-acid levels. The CpG force, as the CpG content, is overall low compared to other pathogenic betacoronaviruses; however it widely fluctuates along the genome, with a particularly low value, comparable to the circulating seasonal HKU1, in the spike coding region and a greater value, comparable to SARS and MERS, in the highly expressed nucleocapside coding region (N ORF), whose transcripts are relatively abundant in the cytoplasm of infected cells and present in the 3’UTRs of all subgenomic RNA. This dual nature of CpG content could confer to SARS-CoV-2 the ability to avoid triggering pattern recognition receptors upon entry, while eliciting a stronger response during replication. We then investigate the evolution of synonymous mutations since the outbreak of the COVID-19 pandemic, finding a signature of CpG loss in regions with a greater CpG force. Sequence motifs preceding the CpG-loss-associated loci in the N ORF match recently identified binding patterns of the Zinc finger Anti-viral Protein. Using a model of the viral gene evolution under human host pressure, we find that synonymous mutations seem driven in the SARS-CoV-2 genome, and particularly in the N ORF, by the viral codon bias, the transition-transversion bias and the pressure to lower CpG content.

When a virus enters a new host, it can present pathogen-associated molecular patterns (PAMPs) that are rarely seen in circulating strains that have adapted to that host's immune environment over evolutionary timescales. The emergence of SARS-CoV-2, therefore, provides a rare window into innate immune signaling that may be relevant for understanding immune-mediated pathologies of SARS-CoV-2, anti-viral treatment strategies, and the evolutionary dynamics of the virus, where evidence for selective pressures on viral features can reflect what defines "self" in its new host.

As a case in point, the 1918 influenza pandemic was likely caused by a strain that originated in water fowl and entered the human population after possible evolution in an intermediate host.

That viral genome presented CpG dinucleotides within a context and level of density rarely found in the human genome where they are severely underrepresented (Cheng et al., 2013; Karlin and Mrázek, 1997; Karlin et al., 1994) , particularly in a set of genes coding for the proteins associated with antiviral innate immunity (Greenbaum et al., 2008 (Greenbaum et al., , 2014 . Over the past century the 1918 H1N1 lineage evolved in a directed manner to lower these motifs and gain UpA motifs, in a way that could not be explained by its usage of aminoacid codon bias (Greenbaum et al., 2008 (Greenbaum et al., , 2014 .

It has since been found that these motifs can engage the pattern recognition receptors (PRRs) of the innate immune system (Jimenez-Baranda et al., 2011; Vabret et al., 2017) , and directly bind the Zinc finger Anti-viral Protein (ZAP) in a CpG-dependent manner (Luo et al., 2020; Meagher et al., 2019; Takata et al., 2017; Zhu et al., 2011) . Hence, the interrogation of emergent viruses from this perspective can predict novel host virus interactions.

COVID-19 presents, thus far, a different pathology than that associated with the 1918 H1N1, which was disproportionately fatal in healthy young adults. It has been characterized by a large heterogeneity in the immune response to the virus (Hirano and Masaaki, 2020; Mehta et al., 2020; Zhou et al., 2020) and likely dysregulated type-I interferon (IFN) signaling (Hadjadj et al., 2020; Kindler and Thiel, 2016; Ragab et al., 2020; Vabret et al., 2020) . Various treatments to attenuate inflammatory responses have been proposed and are currently under analysis or being clinically tested (RECOVERYCollaborati-veGroup, 2020; Xu et al., 2020a) . It is therefore essential to quantify pathogen-associated patterns in the SARS-CoV-2 genome for multiple reasons.

The first is to better understand the pathways engaged by innate immune system and the specific agonists to help build better antiviral therapies.

Another is to better predict the evolution of motif content in synonymous mutations in SARS-CoV-2, as it will help understand the process and timescales of attenuation in humans. Third is to offer a principled approach for optimizing 2 · doi:10.1093/molbev/mst000 MBE vaccine strategy for designed strains (Amanat and Krammer, 2020; Kames et al., 2020) to better reflect human-genome features.

In this work we will use the computational framework developed in Greenbaum et al. (2014) to carry out a study of non-self associated dinucleotide usage in SARS-CoV-2 genomes. The statistical physics framework is based on the idea of identifying the abundance or scarcity of dinucleotides given their expected usage based on host features.

It generalizes the standard dinucleotide relative abundance introduced in Karlin and Mrázek (1997) , as it can easily incorporate constraints in coding regions coming from amino-acid content and codon usage. The outcome of the approach are selective forces (Greenbaum et al., 2014) that characterize the deviations with respect to the number of dinucleotides which is statistically expected under a set of various constraints. Such formalism has further been applied to identify non-coding RNA from repetitive elements in the human genome expressed in cancer that engage PRRs (Tanne et al., 2015) , to characterize the CpG evolution through synonymous mutations in H1N1 (Greenbaum et al., 2014) , and to characterize local and non-local forces on dinucleotides across RNA viruses (Vabret et al., 2017) .

We perform an analysis of the landscape of CpG 

To characterize CpG dinucleotide usage on SARS-CoV-2 genome we have computed the CpG forces following the approach introduced in Greenbaum et al. (2014) and described in Methods. CpG forces are intensive parameters that characterize the abundance or scarcity of CpG dinucleotides in a DNA or RNA sequence with respect to their expected usage relative to a reference nucleotide distribution. We propose two frameworks to define these reference distribution, schematically represented in Fig. 1 . In the non-coding framework, MBE nucleotides are randomly drawn according to a fixed nucleotide bias, while in the coding framework the amino-acid sequence is fixed, and the codon bias defines the reference distribution.

Non-coding forces. The CpG non-coding force relative to the sequence nucleotide bias essentially captures the same information as the relative abundance index, I = q(CG) q(C)q(G) , where q(CG), q(C), q(G) are, respectively, the frequencies of CpG dinucleotides and of C, G nucleotides in the sequence (Karlin and Mrázek, 1997 (Greenbaum et al., 2014) and non-coding RNA (Tanne et al., 2015) (relative to the human nucleotide bias) are negative, and particularly low for type-I IFN transcripts involved in the innate immune system (Greenbaum et al., 2014) , confirming that CpG motifs are overall scarce in the human genome (Cheng et al., 2013; Greenbaum et al., 2014; Karlin and Mrázek, 1997; Karlin et al., 1994) . As shown in Table 1 strongly pathogenic viruses in humans, such as Ebola, the Spanish flu H1N1 (1918) and the avian flu H5N1 (2005) , are characterized by large CpG forces compared to the average force on human RNAs. The CpG force value can then be used as an indicator for the propensity of a viral sequence to be sensed by PRRs as non-self and engage the human innate immune response (Greenbaum et al., 2014; Tanne et al., 2015; Vabret et al., 2017) . A comparative analysis for non-coding force in the Coronaviridae family will be discussed in the following Sections.

Coding forces. The CpG coding force is based on the comparison of CpG occurrences in a coding RNA (or DNA) sequence and random synonymous sequences (associated to the same amino acids) drawn according to prescribed codon usage, cf. Fig. 1 . The computation of coding forces relative to the human codon usage for SARS-CoV-2 will be discussed in the following Sections and it will be used to characterize the evolution of SARS-CoV-2 sequences through synonymous mutations, under the pressure of the human host.

To allow for easy comparison and later access we list in Table 2 all CpG coding forces for the Coronaviridae family, as well as their non-coding counterparts.

We first computed the global non-coding force on CpG dinucleotides for SARS-CoV-2, a variety of other ssRNA viruses, and other viruses from Coronaviridae family affecting humans or other mammals (Fig. 2a) .

The value −1.1 of the global non-coding force for SARS-CoV-2 is comparable to the one for 4 · doi:10.1093/molbev/mst000 MBE human non-coding RNA and lower than other strongly pathogenic viruses in humans, such as H1N1, H5N1, Ebola (see Table 1 ). Among the coronaviruses (see Fig. 2a and Table 2 Lai and Cavanagh, 1997 and TNF-α) have been observed in, respectively, SARS and MERS and, at times, SARS-CoV-2 infection (Fajgenbaum and June, 2020; Hirano and Masaaki, 2020; Vabret et al., 2020; Zhou et al., 2014) . These results are qualitatively corroborated by the simpler analysis of CpG motif density (Suppl. Fig. SI .2).

We now restrict our analysis to the coding regions of SARS-CoV-2 and, in particular, on two structural proteins, N (nucleocapside) and S (spike) (Amanat and Krammer, 2020; Hoffmann et al., 2020; Xu et al., 2020b) . As stressed before, To better explain this CpG mutational trend along the sequence we define a putative equilibrium CpG force of the SARS-CoV-2 genome in human host, as the average CpG force of hCoVs in Table 2 : hCoVs have long been circulated in humans and are, therefore, supposed to be close to equilibrium with their host. Other choices for equilibrium force will be discussed later.

Regions with a CpG force much larger than the equilibrium one are predicted to be under strong evolutionary pressure to decrease their CpG content. This prediction is confirmed by the fact that CpG-decreasing syn-SNV are much more frequent than CpG-increasing ones, see above suggest to introduce, as will be done in the following, the CpG drive defined as the difference between the CpG local force and the equilibrium CpG force. Table 3 In N ORF, the ORF with largest density of CpG decreasing syn-SNV as shown in Table 3 Table 4 ). Notice that 43% (3 out of the 7) of the CpG-suppression related motifs in SARS-CoV-2 correspond to n = 7. Other motifs of the type CnxGcCG are also present in SARS-CoV-2, but their CpG is not lost in sequence data, see last 5 lines of Table 4 ; the dissociation constants associated to their spacer lengths are on average larger than the ones of the motifs showing CpG loss.

From the spacer-length dependent binding affinity given in Luo et al. (2020) (see Table 4) we have computed a score, which we call ZAP affinity score (ZAS), which is related to the probability of having at least one ZAP bound to such motif (see Methods for technical details).

The ZAS computed in sliding windows across the genome, is presented in Fig. 4a (top plot): N ORF is the richest region in motifs of the form CnxGxCG, with the largest ZAS. Our analysis is confirmed in Table 5 , which reports all syn-SNV removing CpG following an extended sequence motif. Even if N ORF represents only 4% of the total sequence length, 18% of extended motifs CnxGxCG and 26% (58% with counts) of syn-SNV removing a CpG on an extended motif are on this region. In contrast, only 2 extended motifs of type CnxGxCG were found in 5'UTR even if many repeated CpG at short interspace were present, see Suppl. Table 3 ), but have a large difference in CnxGxCG-like motif content, as shown in Table 5 .

Remarkably, when considering the counts, the number of CpG-decreasing mutations occurring in N ORF, out of which 71% are on CnxGxCGlike motifs, is 10-fold more than that occurring in M ORF. These results support the existence of early selection pressure to lower CnxGxCGlike motifs in N ORF, where they are particularly concentrated. correcting for sampling bias, will lie in between the two limit-case SMS discussed above.

The present work reports analysis of dinucleotide motif usage, particularly CpG, in the early evo- (2020). The degree to which this heterogeneity in any way reflects zoonotic origins should be further worked out using phylogenetic analysis.

In particular, the segment coding for the Spike protein has a much lower CpG force. The S protein has to bind ACE2 human receptors and TMPRSS2 (Hoffmann et al., 2020; Vabret et al., 2020) . A fascinating reason that could explain the low CpG force on this coding region is that it may come (at least in part) from other coronaviruses that better bind human entry receptors (Andersen 

Throughout this work we used CpG forces to quantify the CpG content of a given sequence.

Here we want to compare this approach with the simple count of CpG motifs in the sequence. In 

The technique at the core of many of the analyses made here is taken from Greenbaum et al. (2014) .

Here we briefly review this technique, starting from its non-coding version which takes as reference bias the nucleotide bias and then describing the coding version which takes as reference bias the codon bias at fixed amino-acid sequence. 

Here, q(s i ) is the nucleotide bias, that is the probability of the i-th nucleotide being s i (for example, we always used in this work the human frequency of nucleotides as q(s i )), f nc is the force we want to compute (the subscript nc stands for non-coding), and N m is the number of times the motif m appears in the sequence. Z is the normalization constant, that is

Therefore the force f nc is a parameter which quantifies the lack (if negative) or abundance 

Notice that this is equivalent to the request that f nc is so that probability of having observed s 0 is maximum.

Let us focus now on the specific case of a dinucleotide motif, that is our motif m consists of the pair ab, where a and b are two consecutive nucleotides (for example, a = C and b = G for the CpG motif). In this case, within an approximation discussed in the SI, Suppl. Sec. SI.3, the force computed as above turns out to be the logarithm of the relative abundance index, that is Table 1 .

Force computation in the coding case: Our technique can be generalized to coding sequences at fixed amino acid sequence and codon bias. In this case, we write each sequence s as a series of codons, and its probability is defined as 

We use the ideas discussed above to introduce a model in order to assign a score, which we 

The equilibrium force can be computed on a viral strain which is supposed to be at equilibrium with the human innate immune system, because it has evolved in the human host since a long time. Eq. (6) was used in Greenbaum et al. (2014) 5 Here we drop the subscripts nc and c used in the previous section to identify non-coding and coding forces, since the SMS is defined for a generic force.

to describe the evolution of the CpG numpber in H1N1, taking as the equilibrium force the one of the Influenza B strain. In analogy with this approach we take here as f eq the average force calculated on coding regions of the seasonal hCoVs (that is hCoV-229E, hCoV-NL63, hCoV-HKU1, hCoV-OC43). Other possible choices are discussed below (see also Suppl. Fig. SI.7) . τ is a parameter determining the characteristic time scale for synonymous mutations. Based on Eq. (6) we define the transition operator for CpG number

where ∆N CG = N CG −N CG . Notice that for all the synonymous mutations leaving unchanged the CpG number the above operator is one. The codon mutational operator is

where q(c i ) is the frequency of codon c i from the chosen codon usage bias. Putting together these two terms allows us to estimate how likely a specific synonymous mutation is to happen. The synonymous mutation score (SMS) accompanying a mutation is defined as the logarithm of this quantity,

To conclude, we remark that different models can be used in the SMS computation, where a model is specified by giving the choice of including or not the force term, the choice of the equilibrium force to be used, the choice of including or not the MBE codon bias term, and choice of the reference codon bias to be used.

It is well known that transversions (i. e. mutations of purines in pyrimidines and vice-versa) are suppressed with respect to transitions (i. e. mutations of purines in purines or pyrimidines in pyrimidines).

We introduce here a simple way to account for transition-transversion bias in the model used to assign the SMS. We suppose that a mutation with n transversions happens 4 times less than a mutation with n−1 transversions. This probability ratio, which is a standard value in the literature (Jiang and Zhao, 2006) , has been recently shown to be close to the observed value for SARS-CoV-2 (Roy et al., 2020) . To include that in our model, 

We introduce the ZAP Affinity Score (ZAS) to roughly quantify a priori the likelihood of ZAP 

where we also used that n is sufficiently small so However, we hypothesize that these requirements are fulfilled in cells, and that our interpretation in terms of binding probability is acceptable.

We discuss here how force values and SMS scores change by changing model parameters.

Parameters affecting the force values: In Fig. 3a the SARS-CoV-2 sequence has been processed to ensure the correct reading frame.

Therefore the ORF1ab gene is read in the standard frame up to the ribosomal shifting site, and it is read in the shifted frame from that site up to the end of the polyprotein. Moreover, the small non-coding parts between successive proteins have been cut, resulting in a loss of 634 nucleotides (including the 5'UTR and 3'UTR).

A Gaussian smoothing has been performed to obtain the plotted CpG forces (as in Fig. 2c ).

To produce the bar plots in Figs Table 3 .

This length is chose so that a large number of uploaded sequences (about 50000) have a UTR of the same length or longer. In the UTR analysis, all observed mutations are considered "synonymous".

The code used to compute coding and non-coding forces is publicly available at https://github. com/adigioacchino/dinucleotide_forces. can be generated in a coding framework as follows.

For each amino acid, a licit codon is randomly chosen according to prescribed codon usage (here, computed from the coding regions in the human genome). Notice that the above computational frameworks are not restricted to CpG, and can be applied to other dinucleotidic motifs. The maximum value of the local CpG force hints at the similarity of SARS-CoV-2 with the most pathogenic viruses, see Table 2 . Data from VIPR (Pickett et al., 2012) and GISAID (Elbe and Buckland-Merrett, 2017) , see Methods and

Suppl. Sec. SI.1 for details on data analysis. Position along the coding sequence (a) Table 6 . Model performance in predicting SNV on SARS-CoV-2 UTRs and ORFs.

Understanding human coronavirus hcov-nl63. The open virology journal

Sars-cov-2 vaccines: status report. Immunity

The proximal origin of sars-cov-2

Cpg usage in rna viruses: data and hypotheses

Intra-genome variability in the dinucleotide composition of SARS-CoV-2

Data, disease and diplomacy: Gisaid's innovative contribution to global health

Cytokine storm

Systematic discovery and functional interrogation of sars-cov-2 viral rna-host protein interactions during infection

Introductions and early spread of sars-cov-2 in the new york city area

Patterns of evolution and host gene mimicry in influenza and other rna viruses

Quantitative theory of entropic forces acting on constrained nucleotide sequences applied to viruses

Nextstrain: real-time tracking of pathogen evolution

Impaired type i interferon activity and inflammatory responses in severe covid-19 patients

Covid-19: a new virus, but an old cytokine release syndrome

Sars-cov-2 cell entry depends on ace2 and tmprss2 and is blocked by a clinically proven protease inhibitor

Categorical spectral analysis of periodicity in human and viral genomes

Mutational spectrum in the recent human genome inferred by single nucleotide polymorphisms

Oligonucleotide motifs that disappear during the evolution of influenza virus in humans increase alpha interferon secretion by plasmacytoid dendritic cells

Sequence analysis of sars-cov-2 genome reveals features important for vaccine design

Compositional differences within and between eukaryotic genomes

Why is cpg suppressed in the genomes of virtually all small eukaryotic viruses but not in those of large eukaryotic viruses

The architecture of sars-cov-2 transcriptome

Sars-cov and ifn: too little, too late

The molecular biology of coronaviruses

Longitudinal analyses reveal immunological misfiring in severe covid-19

A predictive fitness model for influenza

Molecular mechanism of rna recognition by zinc-finger antiviral protein

Natural selection in the evolution of sars-cov-2 in bats, not humans

Structure of the zinc-finger antiviral protein in complex with rna reveals a mechanism for selective targeting of cg-rich viral sequences

Covid-19: consider cytokine storm syndromes and immunosuppression. The Lancet

Sars-cov-2 is restricted by zinc finger antiviral protein despite preadaptation to the low-cpg environment in humans

Vipr: an open bioinformatics database and analysis resource for virology research

The covid-19 cytokine storm; what we know so far

Trends of mutation accumulation across global sars-cov-2 genomes: Implications for the evolution of the novel coronavirus

Short sequence motif dynamics in the sars-cov-2 genome suggest a role for cytosine deamination in cpg reduction

From sars to mers, thrusting coronaviruses into the spotlight

Cg dinucleotide suppression enables antiviral defence targeting non-self rna

Distinguishing the immunostimulatory properties of noncoding rnas expressed in cancer cells

Sequence-specific sensing of nucleic acids

Immunology of covid-19: Current state of the science

Coronavirus genomes carry the signatures of their habitats

Database resources of the national center for biotechnology information

Effective treatment of severe covid-19 patients with tocilizumab

Evolution of the novel coronavirus from the ongoing wuhan outbreak and modeling of its spike protein for risk of human transmission

Active replication of middle east respiratory syndrome coronavirus and aberrant induction of inflammatory cytokines and chemokines in human macrophages: implications for pathogenesis

A pneumonia outbreak associated with a new coronavirus of probable bat origin

Zinc-finger antiviral protein inhibits hiv-1 infection by selectively targeting multiply spliced viral mrnas for degradation