key: cord-262844-qeheeqe3 authors: Xia, Xuhua title: Extreme genomic CpG deficiency in SARS-CoV-2 and evasion of host antiviral defense date: 2020-04-14 journal: Mol Biol Evol DOI: 10.1093/molbev/msaa094 sha: doc_id: 262844 cord_uid: qeheeqe3 Wild mammalian species, including bats, constitute the natural reservoir of Betacoronavirus (including SARS, MERS, and the deadly SARS-CoV-2). Different hosts or host tissues provide different cellular environments, especially different antiviral and RNA modification activities that can alter RNA modification signatures observed in the viral RNA genome. The zinc finger antiviral protein (ZAP) binds specifically to CpG dinucleotides and recruits other proteins to degrade a variety of viral RNA genomes. Many mammalian RNA viruses have evolved CpG deficiency. Increasing CpG dinucleotides in these low-CpG viral genomes in the presence of ZAP consistently leads to decreased viral replication and virulence. Because ZAP exhibits tissue-specific expression, viruses infecting different tissues are expected to have different CpG signatures, suggesting a means to identify viral tissue-switching events. I show that SARS-CoV-2 has the most extreme CpG deficiency in all known Betacoronavirus genomes. This suggests that SARS-CoV-2 may have evolved in a new host (or new host tissue) with high ZAP expression. A survey of CpG deficiency in viral genomes identified a virulent canine coronavirus (Alphacoronavirus) as possessing the most extreme CpG deficiency, comparable to that observed in SARS-CoV-2. This suggests that the canine tissue infected by the canine coronavirus may provide a cellular environment strongly selecting against CpG. Thus, viral surveys focused on decreasing CpG in viral RNA genomes may provide important clues about the selective environments and viral defenses in the original hosts. 
Coronaviruses (CoV) evolve in mammalian hosts and carry genomic signatures of their host-specific environment, especially the host-specific antiviral and RNA modification activities. Many pathogenic single-stranded RNA viruses, including coronaviruses, exhibit strong CpG deficiency (Greenbaum et al., 2008; Greenbaum et al., 2009; Takata et al., 2017; Yap et al., 2003). Two mammalian enzymes are inferred to contribute to the observed CpG deficiency. The zinc finger antiviral protein (ZAP, known as ZC3HAV1 in mammals or hZAP in humans), a key component of the mammalian interferon-mediated immune response, binds specifically to CpG dinucleotides in viral RNA genomes via its RNA-binding domain (Meagher et al., 2019). ZAP inhibits viral replication and mediates viral genome degradation (Ficarelli et al., 2020; Ficarelli et al., 2019; Meagher et al., 2019; Takata et al., 2017). ZAP has two isoforms (ZAP-L and ZAP-S); both participate in initiating antiviral activities but only ZAP-S mediates the return to homeostasis after the antiviral response (Schwerk et al., 2019). ZAP acts against not only retroviruses such as HIV-1 (Ficarelli et al., 2020; Ficarelli et al., 2019), but also Echovirus 7 (Odon et al., 2019) and Zika virus (Trus et al., 2019), both being positive-sense single-stranded RNA viruses like coronaviruses. In particular, selection against CpG in viral RNA disappears in ZAP-deficient cells (Takata et al., 2017), suggesting that ZAP may be the only cellular agent targeting CpG in viral RNA genomes. Experimental evidence is consistent with the interpretation that CpG deficiency in RNA viruses has evolved in response to these cytoplasmic CpG-specific antiviral activities. During natural evolution of HIV-1 within individual patients, viral fitness decreased with increasing CpG dinucleotides (Theys et al., 2018).
Experimental increase of CpG dinucleotides in CpG-deficient viral genomes consistently leads to a strong decrease in viral replication and virulence (Antzin-Anduetza et al., 2017; Burns et al., 2009; Fros et al., 2017; Trus et al., 2019; Tulloch et al., 2014; Wasson et al., 2017), prompting the proposal of vaccine-development strategies involving increasing CpG to attenuate pathogenic RNA viruses (Burns et al., 2009; Ficarelli et al., 2020; Trus et al., 2019; Tulloch et al., 2014). Another antiviral enzyme is APOBEC3G, found in innate immune cells. APOBEC3G was originally thought to be specific to single-stranded DNA such as reverse-transcribed HIV-1, but is now known to modify a variety of RNA viruses, deaminating C to U (Sharma et al., 2016; Sharma et al., 2015; Sharma et al., 2019). This would be effective against RNA viruses if the deaminated sites are functionally important. APOBEC3G co-purifies with highly edited mRNA substrates (Sharma et al., 2016) and therefore could act on coronavirus genomes, which are positive-sense single-stranded RNA. While APOBEC3G is not strongly CpG-specific, it could contribute to CpG deficiency when coupled with ZAP-mediated antiviral activities targeting CpG. Modification of CpG to UpG in non-functional regions could reduce viral susceptibility to CpG-mediated attack by ZAP relative to viruses with unmodified CpG dinucleotides. Both ZAP and APOBEC3G exhibit tissue-specific expression patterns in humans (Fagerberg et al., 2014). Both are expressed in lungs, but ZAP is most highly expressed where lymphocytes are most abundant (bone marrow, lymph node, appendix, and spleen), whereas APOBEC3G is most highly expressed in lymph node, spleen, and testis (Fagerberg et al., 2014). A severely CpG-deficient virus may indicate an evolutionary history in ZAP-abundant tissues, such as strongly CpG-deficient HIV-1 infecting host T cells in lymph organs where ZAP is abundant (Fagerberg et al., 2014).
The presence of such viruses indicates that they have found ways to evade ZAP-mediated cellular antiviral defense. The differential expression of ZAP and APOBEC3G in different hosts or host tissues is expected to leave different genomic signatures on viral RNA genomes. We may use the conventional index of CpG deficiency (Cardon et al., 1994; Karlin et al., 1997) implemented in DAMBE (Xia, 2018): $I_{CpG} = \frac{P_{CpG}}{P_C P_G}$, where $P_{CpG}$, $P_C$, and $P_G$ are the proportions of CpG dinucleotides, nucleotide C, and nucleotide G, respectively. The index is expected to be 1 with no deficiency or excess, smaller than 1 if deficient and greater than 1 if in excess. The 1252 Betacoronavirus (BetaCoV) full-length genomes deposited in GenBank (of which 1127 are unique) have a mean±SE value of 0.516±0.0017 for ICpG, which is significantly (p < 0.0001) smaller than the null expectation of 1. If a coronavirus infects a different host tissue with different ZAP abundance, then its RNA genome will experience different selection pressure against its CpG. This difference in cellular antiviral activity would result in differences in ICpG during viral genomic evolution. In contrast, a coronavirus infecting a specific host tissue for a long time would experience the same cellular antiviral and RNA modification environment and is consequently expected to have similar and stable ICpG. Group 2 includes Betacoronavirus 1 genomes with two types of hosts: 1) ungulates (with bovine and equine coronavirus as well as porcine hemagglutinating encephalomyelitis virus), and 2) human, with CoV-OC43 being a recent derivative of bovine coronaviruses (Hulswit et al., 2019). Group 3 are all SARS-related coronaviruses from three types of hosts: 1) Rhinolophus bats, which serve as a natural reservoir of SARS-related coronaviruses (Li et al., 2005; Wu et al., 2016a; Wu et al., 2016b) and the new SARS-CoV-2 (Zhou et al., 2020), 2) civets (from which coronavirus genomes with 99.6% identity to SARS virus genomes were identified) (Shi and Hu, 2008), and 3) human patients infected by SARS-CoV-2. Fig.
1 shows that genomic GC% and ICpG can differ among different viral lineages in the same host, or among different hosts for the same viral lineage. The most striking pattern in Fig. 1 is an isolated but dramatic shift in the lineage leading to BatCoV RaTG13 which was reported (Zhou et al., 2020) (Theys et al., 2018), but also in experimentally CpG dinucleotide-enriched viral genomes (Antzin-Anduetza et al., 2017; Burns et al., 2009; Fros et al., 2017; Trus et al., 2019; Tulloch et al., 2014; Wasson et al., 2017). The association between decreased CpG and increased virulence in RNA viruses is mainly due to the interferon-induced ZAP protein, which binds to CpG dinucleotides in viral RNA genomes via its RNA-binding domain (Meagher et al., 2019), inhibits viral replication and facilitates viral genome degradation (Ficarelli et al., 2020; Ficarelli et al., 2019; Meagher et al., 2019; Takata et al., 2017). Thus, a decreased ICpG in a viral pathogen suggests an increased threat to public health, but an increased ICpG decreases the threat because such viral pathogens, with increased ICpG and reduced virulence, would be akin to natural vaccines. Many viral researchers have in fact proposed vaccine development by increasing CpG in viral RNA genomes (Burns et al., 2009; Ficarelli et al., 2020; Trus et al., 2019; Tulloch et al., 2014). Surprisingly, no available Betacoronavirus genome from diverse natural hosts has a genomic ICpG and GC% combination close to that observed in SARS-CoV-2 and BatCoV RaTG13 (Fig. 2). BetaCoV lineages parasitizing Rhinolophus bats overall have relatively low ICpG values (Fig. 2). BetaCoV infecting dromedary camels offers a weak hint that the camel digestive system may select more strongly against CpG in viral genomes than the camel respiratory system. Camel coronaviruses form two clusters. One cluster overlaps completely with MERS viruses (Fig. 2) that infect the mammalian respiratory system (Fehr and Perlman, 2015; Li, 2016).
The other cluster includes camel coronavirus HKU23 strains positioned close to bovine CoV (grouped under "Ungulate_CoV" in Figs. 1 and 2), both belonging to Embecovirus and infecting mainly the mammalian digestive system but also the respiratory system (Athanassious et al., 1994; Chae et al., 2019; Fulton et al., 2015; Ribeiro et al., 2016; Symes et al., 2018). Those viruses infecting the camel digestive system have lower genomic ICpG and GC% than those infecting the camel respiratory system (Fig. 2). To search for a mammalian host with the potential to select viral lineages with low Poder, 2011; Pratelli, 2006), have genomic ICpG and GC% values similar to those observed in SARS-CoV-2 and BatCoV RaTG13 (Fig. 3A). The genome (accession KP981644) is from the most virulent pantropic CCoV invading multiple canine organs (Buonavoglia et al., 2006; Decaro et al., 2007; Zappulli et al., 2008). It belongs to a clade with the lowest observed ICpG values (Fig. 3B). Second, canids, like camels, also have coronaviruses infecting their respiratory system (canine respiratory coronavirus or CRCoV, belonging to BetaCoV). There are two genomes sequenced for CRCoV (accessions JX860640 and KX432213). Their genomic ICpG values are 0.4756 and 0.4684, respectively, substantially higher than those for CCoVs infecting the digestive system (Fig. 3A). Thus, similar to the pattern observed in coronaviruses infecting camels, CCoVs infecting the canine digestive system have ICpG much lower than CRCoVs infecting the canine respiratory system. Third, none of the available AlphaCoV genomes from bats or other mammalian host species possess genomic ICpG and GC% values similar to those observed in SARS-CoV-2 and BatCoV RaTG13 (Fig. 3). Thus, although AlphaCoV infects a diverse array of bat lineages, these bat tissues do not seem to generate AlphaCoV strains with low ICpG values comparable to SARS-CoV-2 and BatCoV RaTG13.
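The genomic ICpG and GC% values compared above can be illustrated with a minimal Python sketch. It assumes the conventional odds-ratio form of the index (proportion of CpG dinucleotides divided by the product of the proportions of C and G); the function names `i_cpg` and `gc_percent`, and the choices to skip ambiguous bases and to treat T as U, are illustrative assumptions and not a description of how DAMBE computes the index.

```python
from collections import Counter

VALID = set("ACGU")  # unambiguous RNA bases; T is mapped to U below


def _clean(seq: str) -> str:
    """Upper-case the sequence and treat DNA-style T as U."""
    return seq.upper().replace("T", "U")


def gc_percent(seq: str) -> float:
    """Genomic GC% over unambiguous bases (assumption: N and others skipped)."""
    s = _clean(seq)
    counts = Counter(b for b in s if b in VALID)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence has no unambiguous nucleotides")
    return 100.0 * (counts["C"] + counts["G"]) / total


def i_cpg(seq: str) -> float:
    """Odds-ratio CpG index: P_CpG / (P_C * P_G).

    Values near 1 mean neither deficiency nor excess; values below 1
    mean CpG deficiency.
    """
    s = _clean(seq)
    counts = Counter(b for b in s if b in VALID)
    total = sum(counts.values())
    # Overlapping dinucleotides in which both positions are unambiguous.
    pairs = [(a, b) for a, b in zip(s, s[1:]) if a in VALID and b in VALID]
    if not pairs or counts["C"] == 0 or counts["G"] == 0:
        raise ValueError("index undefined: need dinucleotides and both C and G")
    p_cpg = sum(1 for pair in pairs if pair == ("C", "G")) / len(pairs)
    p_c = counts["C"] / total
    p_g = counts["G"] / total
    return p_cpg / (p_c * p_g)


if __name__ == "__main__":
    toy = "AUGCAGCUUGCAUCGAAGCUU"  # made-up toy sequence, not a real genome
    print(f"GC% = {gc_percent(toy):.2f}, ICpG = {i_cpg(toy):.4f}")
```

Applied to a full-length genome read from a FASTA file, the two functions give one (GC%, ICpG) point of the kind plotted in Figs. 2 and 3A; small numerical differences from published values are possible because of the assumptions noted above.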
Fourth, I want to highlight one data point involving a CCoV genome represented as a green dot in Fig. 3 (highlighted by a green arrow in Fig. 3A, genome accession KC175339). The CCoV has a genomic GC% of 38.17% and ICpG of 0.4986, much higher than the rest. The virus was originally isolated from a dog but had been propagated extensively in cell culture before being sequenced (Dr. Gary R. Whittaker, pers. comm.). Viruses are propagated in cells that express the right cellular receptor for viral entry, but do not mount an immune response to kill the virus or get killed by the virus (Banerjee et al., 2019; Benfield and Saif, 1990). The consequent relaxation of selection against the virus (and against CpG in the CCoV genome) in cell culture would allow CpG in the viral RNA genome to rebound through mutation, which would explain the increased ICpG (KC175339 in the phylogeny in Fig. 3B). This process of regaining CpG is reminiscent of CpG-specific methylation in Mycoplasma species, where CpG was regained when some lineages lost CpG-specific methyltransferases, with a fast-evolving lineage (M. pneumoniae) regaining CpG faster than a slow-evolving lineage (M. genitalium) (Xia, 2003). This rapid change in ICpG with environmental change as shown in Fig. 3B has two important implications. First, it suggests the feasibility of tracking certain host-switching or tissue-switching events (which would be impossible if it took hundreds of years for a virus to change ICpG). Second, many experimental studies (Burns et al., 2009; Ficarelli et al., 2020; Odon et al., 2019; Trus et al., 2019; Tulloch et al., 2014) Fifth, the cellular receptor for SARS-CoV-2 entry into the cell is ACE2 (angiotensin I converting enzyme 2) (Zhou et al., 2020). ACE2 is pervasively expressed in the human digestive system, at the highest levels in small intestine and duodenum (Fig. 3C), with relatively low expression in lung (Fagerberg et al., 2014).
This suggests that the mammalian digestive system is likely to be infected by coronaviruses. This is consistent with the interpretation that the low ICpG in SARS-CoV-2 was acquired by the ancestor of SARS-CoV-2 evolving in the mammalian digestive system. The interpretation is further corroborated by a recent report that a high proportion of COVID-19 patients also suffer from digestive discomfort (Pan et al., 2020). In fact, 48.5% presented with digestive symptoms as their chief complaint. Humans are the only other host species observed to produce coronavirus genomes with low genomic ICpG values, as shown by the cluster of human Alphacoronavirus NL63 genomes (Fig. 3). This virus mainly infects the respiratory system, but also causes digestive problems in 33% of the patients reporting respiratory problems (Vabret et al., 2005). In a comprehensive study of the first 12 COVID-19 patients in the US (Midgley and The COVID-19 Investigation Team, 2020), one patient reported diarrhea as the initial symptom before developing fever and cough. Stool samples from 7 out of 10 patients tested positive for SARS-CoV-2, including 3 patients with diarrhea (Midgley and The COVID-19 Investigation Team, 2020), corroborating a previous report of SARS-CoV-2 detection in stool (Holshue et al., 2020). In particular, live SARS-CoV-2 virus was isolated from the stool of a COVID-19 patient. In this context, it is significant that BatCoV RaTG13, as documented in its genomic sequence in GenBank (MN996532), was isolated from a fecal swab. These observations are consistent with the hypothesis that SARS-CoV-2 has evolved in mammalian intestine or tissues associated with intestine.
that had been propagated extensively in cell culture before sequencing. (B) Phylogeny from the alignment of all sequenced CCoV genomes, with leaf name in the format of (ACCN: ICpG).
Genomes were aligned with MAFFT (Katoh et al., 2009) with the FFT-NS-2 option (more accurate than default). PhyML (Guindon and Gascuel, 2003) with the GTR+ substitution model and "best" option were used to search for the best tree. (C) Tissue-specific gene expression of ACE2, with data from Fagerberg et al. (2014).
Increasing the CpG dinucleotide abundance in the HIV-1 genomic RNA inhibits viral replication
Detection of bovine coronavirus and type A rotavirus in neonatal calf diarrhea and winter dysentery of cattle in Quebec: evaluation of three diagnostic methods
The influence of CpG and UpA dinucleotide frequencies on RNA virus replication and characterization of the innate cellular pathways underlying virus attenuation and enhanced replication
Interferon Regulatory Factor 3-Mediated Signaling Limits Middle-East Respiratory Syndrome (MERS) Coronavirus Propagation in Cells from an Insectivorous Bat
Cell culture propagation of a coronavirus isolated from cows with winter dysentery
Canine coronavirus highly pathogenic for dogs
Genetic inactivation of poliovirus infectivity by increasing the frequencies of CpG and UpA dinucleotides within and across synonymous capsid region codons
Pervasive CpG suppression in animal mitochondrial genomes
Acute phase response in bovine coronavirus positive post-weaned calves with diarrhea
Molecular characterisation of the virulent canine coronavirus CB/05 strain
Genomic analysis of 16 Colorado human NL63 coronaviruses identifies a new genotype, high sequence diversity in the N-terminal domain of the spike gene and evidence of recombination
Analysis of the human tissue-specific expression by genome-wide integration of transcriptomics and antibody-based proteomics
Coronaviruses: an overview of their replication and pathogenesis
Replication through Zinc Finger Antiviral Protein (ZAP)-Dependent and -Independent Mechanisms
KHNYN is essential for the zinc finger antiviral protein (ZAP) to restrict HIV-1 containing clustered CpG dinucleotides
CpG and UpA dinucleotides in both coding and non-coding regions of echovirus 7 inhibit replication initiation post-entry
Enteric disease in postweaned beef calves associated with Bovine coronavirus clade 2
Patterns of evolution and host gene mimicry in influenza and other RNA viruses
Patterns of oligonucleotide sequences in viral and host cell RNA identify mediators of the host innate immune system
A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood
First Case of 2019 Novel Coronavirus in the United States
Human coronaviruses OC43 and HKU1 bind to 9-O-acetylated sialic acids via a conserved receptor-binding site in spike protein domain A
Mechanism and biological role of Dnmt2 in Nucleic Acid Methylation
Compositional biases of bacterial genomes and evolutionary implications
Multiple alignment of DNA sequences with MAFFT
Feline and canine coronaviruses: common genetic and pathobiological features
Structure, Function, and Evolution of Coronavirus Spike Proteins
Bats are natural reservoirs of SARS-like coronaviruses
Structure of the zinc-finger antiviral protein in complex with RNA reveals a mechanism for selective targeting of CG-rich viral sequences
The COVID-19 Investigation Team. 2020. First 12 patients with coronavirus disease 2019 (COVID-19) in the United States
The role of ZAP and OAS3/RNAseL pathways in the attenuation of an RNA virus with elevated frequencies of CpG and UpA dinucleotides
Clinical characteristics of COVID-19 patients with digestive symptoms in Hubei, China: a descriptive, cross-sectional, multicenter study
Genetic evolution of canine coronavirus and recent advances in prophylaxis
Molecular detection of bovine coronavirus in a diarrhea outbreak in pasture-feeding Nellore steers in southern Brazil
RNA-binding protein isoforms ZAP-S and ZAP-L have distinct antiviral and immune resolution functions
The double-domain cytidine deaminase APOBEC3G is a cellular site-specific RNA editing enzyme
APOBEC3A cytidine deaminase induces RNA editing in monocytes and macrophages
Mitochondrial hypoxic stress induces widespread RNA editing by APOBEC3G in natural killer cells
A review of studies on animal reservoirs of the SARS coronavirus
First detection of bovine noroviruses and detection of bovine coronavirus in Australian dairy cattle
CG dinucleotide suppression enables antiviral defence targeting nonself RNA
Within-patient mutation frequencies reveal fitness costs of CpG dinucleotides and drastic amino acid changes in HIV
CpG-Recoding in Zika Virus Genome Causes Host-Age-Dependent Attenuation of Infection With Protection Against Lethal Heterologous Challenge in Mice
RNA virus attenuation by codon pair deoptimisation is an artefact of increases in CpG/UpA dinucleotide frequencies
Human coronavirus NL63
The CpG dinucleotide content of the HIV-1 envelope gene may predict disease progression
Deciphering the bat virome catalog to better understand the ecological diversity of bat viruses and the bat origin of emerging infectious diseases
ORF8-Related Genetic Evidence for Chinese Horseshoe Bats as the Source of Human Severe Acute Respiratory Syndrome Coronavirus
DNA methylation and mycoplasma genomes
DAMBE7: New and improved tools for data analysis in molecular biology and evolution
Relationship of SARS-CoV to other pathogenic RNA viruses explored by tetranucleotide usage profiling
Systemic fatal type II coronavirus infection in a dog: pathological findings and immunohistochemistry
Isolation of 2019-nCoV from a Stool Specimen
A pneumonia outbreak associated with a new coronavirus of probable bat origin
Supporting Online Material: Betacoronavirus_CpG.xlsx with Fig. 1 and Fig. 2 (and the compiled data for generating them) xlsx with Fig. 3 (and the compiled data for generating them)
Research Council (NSERC, RGPIN/2018-03878) of Canada. I am particularly grateful to K. Katoh and his colleagues for excellent comments and critical references, and to G. R. Whittaker for information on canine coronavirus (ACCN KC175339). I have also benefitted from discussion with D. Gray, X. Jiang, J. Mennigen, G. Wang, C.-I. Wu, J. Yang and W. Zhai. Five anonymous reviewers contributed significantly to the improvement of the manuscript. Heather Rowe corrected many grammatical errors.