key: cord-0956026-2yntik9n authors: Chen, Zigui; Boon, Siaw S.; Wang, Maggie H.; Chan, Renee W.Y.; Chan, Paul K.S. title: Genomic and evolutionary comparison between SARS-CoV-2 and other human coronaviruses date: 2020-12-05 journal: J Virol Methods DOI: 10.1016/j.jviromet.2020.114032 sha: 826e67eaaf295448fd1015ae041d05f1db5b06e7 doc_id: 956026 cord_uid: 2yntik9n Three highly pathogenic human coronaviruses can cause severe acute respiratory syndrome (SARS-CoV, SARS-CoV-2 and MERS-CoV). Although phylogenetic analyses have indicated ancient origin of human coronaviruses from animal relatives, their evolutionary history remains to be established. Using phylogenetics and “high order genomic structures” including trimer spectrums, codon usage and dinucleotide suppression, we observed distinct clustering of all human coronaviruses that formed phylogenetic clades with their closest animal relatives, indicating they have encompassed long evolutionary histories within specific ecological niches before jumping species barrier to infect humans. The close relationships between SARS-CoV and SARS-CoV-2 imply similar evolutionary origin. However, a lower Effective Codon Number (ENC) pattern and CpG dinucleotide suppression in SARS-CoV-2 genomes compared to SARS-CoV and MERS-CoV may imply a better host fitness, and thus their success in sustaining a pandemic. Characterization of coronavirus heterogeneity via complementary approaches enriches our understanding on the evolution and virus-host interaction of these emerging human pathogens while the underlying mechanistic basis in pathogenicity warrants further investigation. The novel coronavirus disease pandemic emerged in December 2019 is dramatically threatening global public health and economy, resulting in over 7.6 million confirmed cases and >420,000 deaths in at least 216 countries/cities as of June 15, 2020 (https://www.who.int/emergencies/diseases/novel-coronavirus-2019). This disease is caused by a novel coronavirus, SARS-CoV-2, belonging to the genus Betacoronavirus in the family Coronaviridae, with major clinical symptoms of cough and fever but also acute respiratory distress syndrome or multiorgan failure Lu et al., 2020) . The family Coronaviridae consists of four genera: Alpha-, Beta-, Gamma-and Deltacoronavirus, with the former two genera infecting mammals only (Cui, Li, and Shi, 2019) . Before COVID-19, six human coronaviruses (HCoVs) have been identified, including two (SARS-CoV, MERS-CoV) highly pathogenic clusters that cause severe respiratory pneumonia, and the other four (HCoV-OC43, HCoV-229E, HCoV-NL63 and HCoV-HKU1) that mainly cause common cold. In late 2002, the first severe acute respiratory syndrome coronavirus (SARS-CoV) emerged in Guangdong, China and spread to more than 30 countries. The overall fatality was ca. 10% with more than 8,000 infections (www.who.int) (Zhong et al., 2003) . SARS-CoV was not found in humans since 2004. In 2012, another highly pathogenic coronavirus, Middle East respiratory syndrome coronavirus (MERS-CoV) emerged in Middle Eastern countries (Zaki et al., 2012) , causing severe pneumonia and renal failure in humans, with a mortality of ~30%. Phylogenetic analyses indicate that all human coronaviruses have an ancient origin from animals: SARS-CoV, MERS-CoV, HCoV-NL63, HCoV-229E, and perhaps SARS-CoV-2 are related to coronaviruses detected in bats; whereas HCoV-OC43 and HCoV-J o u r n a l P r e -p r o o f Currently, there is no vaccine or therapeutics for effective prevention or treatment of HCoV infection while the research and development of neutralizing antibodies, mainly targeting S1 and/or S2 regions have been vigorously undertaken (Jiang, Hillyer, and Du, 2020) . With advances in DNA sequencing and bioinformatics analysis, multiple sequence alignments have been used to interrogate genomic diversity and evolution. The highly divergent homologous sequences among HCoV genomes, however, may lead to ambiguous alignments that degrade resolution and bias phylogenetic inference. Alternatively, nonparametric agnostic approaches based on the distribution of exact sub-sequences (k-mer spectrum, the DNA 'word' with defined length) and additional genetic metrics (for example, codon usage preference, dimer nucleotide composition) provide complementary information on the complexity and relationship of homologous sequences (Chan and Ragan, 2013; Chor et al., 2009; Shackelton, Parrish, and Holmes, 2006; Vinga and Almeida, 2003) . These agnostic methods also avoid complex computation or model selection while capturing signals otherwise lost to indel, recombination or shuffling (Chan and Ragan, 2013) . In an effort to interrogate the evolutionary process driving the divergence of coronaviruses, we have primarily focused on human coronaviruses because of the association with severe respiratory pneumonia in human beings. Multiple parametric and nonparametric algorithms were applied. Using the rich resource of a large number of genomes, we sought to uncover hidden biological signals not easily discovered with homology-based methods alone. To better understand the evolutionary position of SARS-CoV-2, we first examined the phylogenetic relationship of the subgenus Sarbecoviruses within the genus Betacoronaviruses using the concatenated nucleotide sequences of 12 ORFs/genes (ORF1a, ORF1b, S, ORF3a, E, M, ORF6, ORF7a, ORF7b, ORF8, N and ORF10) ( Figure 2A , Table S1 ). In consistent with previous reports, SARS-CoV-2 shares sequence similarities of 96.13 ± 0.06% with a bat coronavirus isolate (RaTG13, NCBI accession number MN996532, GISAID accession number EPI_ISL_402131), similar to the genomic differences between SARS-CoV and bat SARS-like CoVs (e.g., KY417150, KY417146) (similarities of 96.06 ± 0.25%), implying potential zoonotic transmission of SARS-CoV-2 from bats as their natural reservoirs to humans through unknown intermediate hosts. SARS-CoV-2 also shares close genomic similarities with coronavirus isolates from pangolin animals (EPI_ISL_410984, 90.22 ± 0.03%; EPI_ISL_410538 -EPI_ISL_410542, 85.27 ± 0.03%), whereas the complete genome sequence similarities between SARS-CoV and SARS-CoV-2 are 79.4 ± 0.17%. The subgenus Sarbecovirus has average sequence similarities of ≤ 48.4% with other coronaviruses (data not shown). The k-mer and other nonparametric approaches, such as codon usage and dinucleotide composition, constitute a "higher order genomic structure" that may reflect the influence of evolutionary processes extending beyond common measures of Darwinian selection. In order to explore an agnostic evolutionary model, we used trimer (k=3) frequency distribution to construct the phylogeny of the subgenus Sarbecovirus (Figures 2B). Both parametric (alignment phylogeny) and non-parametric (trimer spectrum phylogeny) approaches showed distinct J o u r n a l P r e -p r o o f separation between SARS-CoV and SARS-CoV-2, with their animal relatives closely clustering to each other. The trimer spectrums between human bCoV clusters may imply differential codon usages between viruses and host cells. We then measured the ENC values of coronaviruses across 6 genes (ORF1a, ORF1b, S, E, M, and N) shared by all members, with a main focus on the difference between seven human coronavirus clusters (Table S2 ). The ENC statistic is a way of analyzing how biased a gene is in terms of its codon usage, with values ranging from 20 when a gene is effectively using only a single codon for each amino acid (strongest bias) to 61 when a gene tends to use all codons with equal frequency (no bias). The ENC values of the surveyed coronavirus genomes ranged between 35.6 and 54.1 (a mean value of 45.4 ± 4.7), with a significant correlation with GC contents (adjusted R 2 = 0.953, p < 0.001) ( Figure S1 ). Strong codon usage bias was observed in HCoV-HKU1 (35.7 ± 0.09) and HCoV-NL63 (37.3 ± 0.08) genomes when compared to other HCoV clusters (45.9 ± 4.2, p < 0.001) ( Figure 3A ). Interestingly, SARS-CoV-2, HCoV-OC43 and HCoV-229E genomes shared similar ENC values, but demonstrated stronger codon usage bias than SARS-CoV and MERS-CoV (p = 0.008). In order to obtain a better understanding on the relationship between codon usage bias and gene composition, a plot of ENC values against GC contents at the synonymous third codon position (GC3s) was constructed ( Figure 3B ). This method was used to estimate the factors shaping the codon usage pattern. If codon usage pattern was affected by GC3s alone, the observed ENC values should be on or just below the expected ENC* curve indicating the synonymous codon J o u r n a l P r e -p r o o f usage bias may be subject to GC-biased mutational pressure. As shown, all surveyed HCoV genomes lay slightly under the expected ENC* curve (the red curve in Figure 3B ), implying an important role of uneven base composition and hence, of mutational pressure that affected the formation of codon usage bias. However, codon usage bias tended to be less dependent on variation of GC3s when GC contents decreased, with measures of (ENC* -ENC) / ENC* suggesting that other factors, such as translational or natural selection, may act as forces affecting stronger codon usage bias and higher genomic diversity of HCoV-HKU1 and HCoV-NL63 ( Figure 3C ). Differential codon usage patterns within distinct ORFs were observed between HCoV clusters ( Figure S2 ). SARS-CoV and MERS-CoV, for example, had the least codon usage biases across ORFs while HCoV-HKU1 and HCoV-NL63 were the strongest ones. The S and N genes, however, showed no significant difference in ENC values between SARS-CoV-2 and SARS-CoV. The M gene may encompass more diversified selection forces amongst HCoV clusters, as lower values of (ENC* -ENC) / ENC* were observed, probably due to the relatively small protein encoded. It is noted that the E gene was not included for comparison because of the small protein size (< 89 aa), for which interpretation may not be applied. ENC value measures the codon usage bias of an entire genome/gene but not of individual codon. We further calculated the Relative Synonymous Codon Usage (RSCU) values across 6 ORFs of HCoV genomes to better estimate the differential usage of each synonymous codon. Among 59 codons encoding for 18 amino acids (except for Met, Trp, and stop codons), 6 and 9 were J o u r n a l P r e -p r o o f defined as preferred (RSCU values > 1.6) and suppressed codons (< 0.6) within all HCoV genomes, respectively ( Figure 3D , Table 1 ). For example, HCoV genomes prefer to GGT (RSCU ≥ 1.91) rather than GGG (≤ 0.26) to encode Glycine. Interestingly, the majority of human-preferred codons were less commonly found in HCoV genomes, such as CTG encoding for Leucine (Human, 2.37; HCoV, ≤ 0.58) and GCC encoding for Alanine (Human, 1.60; HCoV, ≤ 0.59); in contrast, CGT, the optimized codon encoding for Arginine in HCoV genomes (mean of 2.03) was rarely found in human genomes (0.48). The animal CoV genomes share similar RSCU patterns as the HCoV ones. In contrast to HCoV genomes that intended to use A/T-ending codons, as observed in 96% (25/26) of codons with average values of RSCU less than 1.0, human host tends to use G/Cending codons (74%, 20/27). However, both HCoV and human displayed a strong tendency to avoid using CG-ending codons (HCoV ≤ 0.28, human ≤ 0.46), suggesting a potential role of CpG dinucleotide depletion in shaping the codon usages. To further investigate the variation of synonymous codon usage bias in modulating HCoV genomic diversity, correspondence analysis based on the RSCU patterns was performed ( Figure 3E ). Scatter plots of the first two axes well supported the distinct separation of HCoV clusters, with SARS-CoV and SARS-CoV-2 sharing relatively similar RSCU patterns, such as preferred codon usage of ACA (Threonine), CCA (Proline) and TCA (Serine) when compared to other HCoV clusters, but relatively lower values of TTG (Leucine) and CGT (Arginine) ( Table 1 ). The four HCoV clusters associated with common cold (HCoV-OC43, HCoV-HKU1, HCoV-229E and HCoV-NL63) may form a separate group based on the codon usage patterns, with higher J o u r n a l P r e -p r o o f RSCU values of GT-ending codons when compared to the SARS-related HCoV genomes (SARS-CoV, SARS-CoV-2 and MERS-CoV). When individual ORFs were accessed, we observed similar codon usage patterns amongst HCoV clusters when compared to the concatenated 6 ORFs/genes ( Figure 4 ). Interestingly, the N gene of SARS-CoV and SARS-CoV-2 shared nearly identical codon usage patterns, implying similar biological property of nucleoprotein between these two viral clusters in packaging viral particles. Within individual HCoV cluster, ORF1a, ORF1b and S gene usually shared a more similar codon usage patterns that were different to that of M and N genes ( Figure S3 ). Since CG-ending codons may be less frequent in coronavirus genomes (Table 1) , we then measured the relative abundance of dinucleotide across 6 ORFs/genes to determine the influence of dinucleotide suppression on codon usage bias. As expected, CpG dinucleotide was mostly depleted (mean value of observed/expected ratio of 0.48 ± 0.06) ( Figure 5A , Table 2 ), with significant association with overall GC content (pearson correlation of 0.709, p<0.001) and GC content at the third position (GC3) (cor. 0.736, p<0.001) ( Figure S4A ). TpC dinucleotides represented the second most suppressed dinucleotide (0.78 ± 0.06), consistent with the strong scarcity of nTC codon usage in the surveyed coronavirus genomes (Tables 1 and 2 ). Since TC dinucleotide is one of the preferred target sequences of host restriction factor APOBEC3 proteins, the suppression of TC dinucleotide could be a result of evolutionary selection allowing viruses to evade restriction from host immune protection. TpA dinucleotides were also stringently excluded (0.88 ± 0.05), probably due to in part usage of universal stop (TAA, TAG) and the increased susceptibility of TpA to ribonuclease digestion (Beutler et al., 1989) . Interestingly, the loss of TpA seems to be significantly associated with the gain of TpG (Pearson correlation of 0.940, p < 0.001) ( Figure S4B , Table S3 ). Deamination of 5-methylcytosine (5mC) within the CpG island may lead to a cytosine (C) to thymine (T) transition, potentially resulting in a loss of CpG and a gain of TpG. However, the correlation between the loss of CpG and the gain of TpG was not strong (cor=0.138, p = 0.006) in the surveyed coronavirus genomes. The gain of TpG was also associated with the loss of ApT (cor=0.730, p < 0.001), but with gain of CpA (cor=0.772, p < 0.001) and ApC (cor=0.605, p < 0.001). Discriminative dinucleotide suppression patterns were observed between HCoV clusters. For example, two Alphacoronavirus HCoV cluters (HCoV-229E, NCoV-NL63) had a significant gain of ApA but a loss of ApG when compared to the Betacoronavirus HCoV clusters (p < 0.001) ( Figures 5B and 5C ). We also observed higher O/E ratio of CpT, TpC and GpG within MERS-CoV, SARS-CoV and SARS-CoV-2 when compared to the common cold-related HCoV clusters. Interestingly, MERS-CoV genomes had the least loss of CpG dinucleotide, followed by SARS-CoV while SARS-CoV-2 had the most, which might imply differential viral gene expression between these three HCoV clusters. Different levels of dinucleotide suppression were found amongst ORFs/genes ( Figure S5 ). For example, the S gene of SARS-CoV-2 had a significant loss of CpG dinucleotide compared to other genes or other HCoV clusters; the N gene had an overall gain of ApG but a loss of ApC. These differences probably imply an evolutionary apomorphy between HCoV genes that warrants further investigation on their biological properties and clinical relevancies. Analyses of the origin and evolutionary tree of life have long been based on sequence-aligned dissimilarity to present the homology of organisms in association with genotype, phenotype and phylotype. However, the genetic heterogeneity, such as recombination, duplication, insertion/deletion, genetic fusion and shuffling, and potential sequencing errors, challenges the accuracy of sequence alignment and comparison that strongly relies on heuristic solution and data quality. The relevance of alignment scores to homology may be deducted for coronavirus genomes because of complex histories evolving from animal hosts; hence, complementary approaches containing as much homological signals as possible will provide an opportunity to explore the underlying evolution (Chor et al., 2009) . In this study, we applied multiple parametric and nonparametric algorithms (e.g., evolutionary phylogeny, trimer spectrums, codon usage bias and dinucleotide suppression) to compare the differential genomic characterisation between the main coronavirus clusters that infect humans (SARS-CoV, SARS-CoV-2, MERS-CoV, HCoV-OC43, HCoV-HUK1, HCoV-229E and HCoV-NL63). Using the features of a large number of viral genomes we provide evidences with different evolutionary constraints driving the heterogeneity of HCoV genomes. The sharp patterns of codon usage and dinucleotide suppression amongst HCOV clusters (e.g., stronger codon usage bias of HCoV-HKU1/HCoV-NL63 probably associated with low viral expression) and genes (e.g., distinct patterns between It has been reported that codon usage preferences among viruses and bacteria reflect a balance between mutational biases, genetic drift and natural selection for translational optimization (Bulmer, 1987; Hershberg and Petrov, 2008) . Among seven identified human coronavirus clusters, SARS-CoV, SARS-CoV-2 and MERS-CoV are highly pathogenic and have been linked to the development of acute respiratory distress syndrome while increasing evidences indicate that they were also different in transmission, mortality, susceptible population, and early clinical manifestations. Since synonymous mutation is often thought to be selectively neutral, the observed variation in codon usage between different HCoV clusters and their genes may suggest the presence of mutational bias and/or selective pressure that may impact translational efficiency. In RNA viruses, constraints on RNA structures necessary for replication and packaging have also been invoked as another selection pressure for codon bias (Goodfellow et al., 2000) . Lastly, the difference in codon usage bias between coronavirus genes could also be explained in part by mutational pressure and selection on genes with different lengths, since selection may be acting to maximize translational efficiency of energetically costly longer genes but reduce the size of highly expressed proteins (Moriyama and Powell, 1998; Zhao and Chen, 2011) . These observations raise the complexity of the mechanistic basis that may contribute to niche adaptation, immune exposure, expression and evasion, virulence potential, and probably pathogenicity of human coronaviruses that warrants further investigation. Viruses rely on host cellular machinery for transcription and replication. However, the virus may have significantly different codon usage to the host genomes for replication and duplication, an evolutionary adaptation to strong host defences and replicative/repair mechanism. Firstly, deoptimized codon usage in coronavirus genomes with respect to that of their hosts may facilitate viral fitness by limiting viral gene expression and eliciting host immune response (Lauring et al., 2012) . Suppression by means of codon usage maladaptation may allow viruses to better escape immune surveillance for persistent infection. On the other hand, overexpression of critical viral genes using host cellular machinery may leave the virus more vulnerable; for example, avian-derived influenza A virus M2 overexpression in mammalian model systems was associated with intracellular accumulation of autophagosomes, which is a critical aspect of influenza A virus host adaptation (Calderon et al., 2019) . Secondly, codon usage in HCoV genomes might be subject to host innate immune pressure, such as from ABOPEC3, a family of cellular cytidine deaminases that introduce directional C>T substitutions. It has been reported that APOBEC3-mediated cytidine deaminase activity could inhibit replication of HCoV-NL63 (Milewska et al., 2018) . Although hypermutation in progeny viruses was not observed, the dramatically underrepresented TpC dinucleotide and TC-ending codons, the preferred dinucleotide target site of many APOBEC3 members, may be responsible for the long-term accumulation of genomic changes that affect the success of niche adaptation or function fitness. Additionally, in HIV, editing cDNA by APOBEC3 could introduce additional genomic composition bias (Liddament et al., 2004) . Various phagocytic cells, cytokines, interferons (IFNs), and IFN-stimulated genes have also been reported to play critical roles as defence against DNA/RNA virus infection, but the codon usage optimization of human coronaviruses, particularly for the highly pathogenic clusters in facilitating the immune evasion warrants further J o u r n a l P r e -p r o o f investigation. Thirdly, codon usage and dinucleotide composition in HCoV genomes could be due in part to host driven CpG elimination pressure. In mammalian host genomes the CpG dinucleotide is underrepresented, because in this context C is methylated and then deaminated, producing C-T transition (Lister et al., 2009 ). This creates an obvious mutational pressure on codon usage and results in underrepresentation of nCG codons. DNA methylation functions as a host defence mechanism by regulating gene expression (Shackelton et al., 2006; Upadhyay et al., 2013) . When CpG residues of foreign DNA were methylated, pathogen activity can be repressed due to alterations in pathogen transcriptional profiles. CpG repression in RNA viruses might be associated with viral base composition, synonymous codon usage and host selection (Upadhyay et al., 2013) . Similar to a number of +ssRNA viruses, HCoV genomes mimic host's CpG usage, which could be tied to evolutionary selection in regulating gene expression, assisting immune evasion, and avoiding C to T mutation elicited by host CpG elimination pressure. However, the extent and dynamics of CpG methylation in HCoV genome and the role in vegetative viral replication and progression to disease remains uncertain. Codon usage bias of viruses may facilitate persistent infections and/or reinfection in hosts. In contrast, optimization to hosts' codon usage dramatically increases the expression levels of pathogenic genes, may provide an important consideration in developing effective vaccines. Vaccination with codon-optimization for specific viral genes, such as HPV16 E6/E7 (Lin et al., 2006; Steinberg et al., 2005) , avian influenza virus H5N1 HA (Stachyra et al., 2016) , HIV-1 reverse transcriptase (Latanova et al., 2018) , may promot host immune response in vitro that may inform vaccine development. Although SARS-CoV-2-specific neutralizing monoclonal antibodies are currently not available, the developed anti-SARS-CoV neutralizing antibodies J o u r n a l P r e -p r o o f may have potential cross-reactivity against SARS-CoV-2 infection since they share a most recent common ancestor, and both use S1-RBD-ACE2 as a binding-receptor pathway for attachment Wrapp et al., 2020) . For example, a SARS-CoV specific human monoclonal antibody, CR3022, has been reported to be able to bind potently with SARS-CoV-2 receptorbinding domain, providing an alternative candidate for SARS-CoV-2 prevention (Tian et al., 2020) . Hence, codon optimization may provide a promising strategy for the development of more efficient vaccine against human coronaviruses. In summary, we applied "high order genomic structures" including trimer spectrums, codon usage and dinucleotide suppression to present a comprehensive analysis of coronaviruses by comparing the diversity and genetic features amongst seven HCoV clusters. Distinct codon usage patterns were observed, which is mainly consistent with the phylogenetic relationships revealed by parametric algorithms. We also observed sharp codon usage patterns amongst genes, mainly between long and short genes, and between structural and non-structural genes. The close relationships between SARS-CoV and SARS-CoV-2 imply similar evolutionary origin, while the lower ENC values in SARS-CoV-2 genome may indicate stronger codon usage bias demonstrating its lower gene expression and/or better host adaptation. The greater variability in synonymous codon usage within coronavirus genomes indicates more complex evolutionary histories in which ancient viral divergence coupled to niche and/or host adaptation has fixed a number of conserved properties, whereas virus divergence and other evolutionary considerations have led to variability within certain limits. The observations raise the complexity of the mechanistic basis that may contribute to niche adaptation, immune exposure, expression and J o u r n a l P r e -p r o o f evasion, virulence potential, and probably pathogenicity of human coronaviruses. The forces affecting synonymous variation in coronaviruses cannot be definitively identified by computational means alone; given the short period of SARS-CoV-2 pandemic and potential mutation of RNA viruses through transmission. However, characterization of coronavirus heterogeneity via complementary approaches provides an opportunity to further explore viral genome evolution, regulation and pathogenesis, and the fundamental mechanism of virus-host interaction. Data availability. A total of 3,557 complete genome sequences assigned to the genera Alphacoronavirus and Betacoronavirus available on the Global Initiative on Sharing All Influenza Data (GISAID) and USA National Center for Biotechnology Information (NCBI) were clustered to 2,522 genomes using a similarity threshold of 0.1%. These sequences were globally aligned to check potential errors, and 5'-and 3'-UTR regions were trimmed. Among them, all sequences assigned to the subgenus Sarbecovirus (n=114) and belonging to HCoV clusters (n=355) were retained for further analysis. In addition, fifty-two reference sequences representing each coronavirus species within the family Coronaviridae were downloaded from NCBI public domain (Tables S1 and S2 ). Phylogenetic analysis. The nucleotide sequences of each ORF were aligned using translation algorithm based on the aligned amino acid sequence matrix using MAFFT within Geneious Primer package. Maximal likelihood (ML) trees were constructed using RAxML MPI v8.2.12 (Stamatakis, 2006) with optimized parameters via CIPRES Science Gateway (Miller, Pfeiffer, J o u r n a l P r e -p r o o f and Schwartz, 2010). Phylogenetic trees were constructed using MAFFT v7.402 (Katoh and Toh, 2010) inferred from the concatenated nucleotide sequence alignments of 6 open reading frames (ORFs) shared by all coronaviruses. K-mer spectrum clustering. We chose k = 3 (a total of 4 3 = 64 trimers) in consideration of the relative short size of viral genome. The trimer frequencies in percentage were normalized and a Kullback-Leibler (KL) distance matrix (Kullback, 1987) was calculated, based on which a hierarchical cluster analysis was performed for phylogenetic topology. We used count function in R's package 'seqinr' (Charif and Lobry, 2007) to count the number of each trimer; and hclust function in R's package 'stats' (Team, 2014) to construct the hierarchical tree. biased a gene is in terms of its codon usage (Wright, 1990) . The ENC values range from 20 when a gene is effectively using only a single codon for each amino acid (strongest bias) to 61 when a gene trends to use all codons with equal frequency (no bias). The codonW package (http://codonw.sourceforge.net/) was used to calculate the ENC values. In order to examine the influence of GC content on codon usage, we plotted the relationship between ENC and GC3s (GC content at the synonymous third codon position). This was compared to the expected ENC* if GC content were solely responsible for the codon biases, calculated as (Shackelton et al., 2006; Wright, 1990 ): * = 2 + 3 + 29 3 2 + (1 − 3 ) 2 values for 59 codons (except for Met, Trp, and stop codons) were calculated by using the ratio of the observed frequency of codons relative to the expected frequency in the absence of usage bias to measure the extent of non-random usage of synonymous codons (Sharp, Tuohy, and Mosurski, 1986) . Given that all the synonymous codons for the same amino acids are used equally, the RSCU value would be 1. The value would be much less than 1 if a codon were used less frequently than expected and vice versa. The website tool CAIcal (http://genomes.urv.cat/CAIcal/) was used to calculate the RSCU values. The RSCU values in the host genomes served as the references, as retrieved from the Kazusa codon usage database (http://www.kazusa.or.jp/codon/) (Nakamura, Gojobori, and Ikemura, 2000) . It is a well-known phenomenon that CpG dinucleotide is uncommon in most mammalian (including human) genomes, termed CpG suppression. To further understand the potential interplay between viruses and their hosts, we measured the genetic metrics including GC content, dinucleotide suppression (ρXY) and CpG methylation of betacoronaviruses surveyed in this work. The GC content of a string of DNA/RNA is simply the fraction of the letters that are C plus those that are G. A measure of dinucleotide suppression was calculated as the observed frequency of the dinucleotide relative to the product of the frequencies of the individual nucleotides (Karlin and Mrazek, 1997) . For example, the ratio = * where fCG represents the frequency of a dimer CG, fC and fG denote the probabilities of its constituent monomers, respectively. The dinucleotide O/E ratio would be 1 if the occurrences of J o u r n a l P r e -p r o o f its individual nucleotide were independent, and the genome exhibits suppression if it has ρXY much less than 1. We used correspondence analysis to study the correlation between codon usage and coronavirus genomic composition. This analysis positions each coronavirus genome (row) and relative synonymous codon usage or dinucleotide suppression (column) to create a series of orthogonal axes to identify variation affecting codon usage bias, with each subsequent axis explaining a decreasing amount of the variation. Multidimensional scaling of the redundancy analysis (RDA), or optionally principal coordinate analysis (PCoA) was applied using rda function in R's package 'vegan' to generate two-dimensional representations for matrix 1 and 2, and visualized using the biplot. By definition, the axes are ordered according to the amount of variance in the data, with samples sharing similar variations clustering together. Differences in codon usage bias were assessed with permutational multivariate analysis of variance (PERMANOVA) using the adonis2 function in R's package 'vegan'. Statistical analysis. The significance of differences in the genetic metric measures was tested using phylogenetic generalized linear models (pgls function in R's package 'caper'), nonparameter Wilcoxon and Mann-Whitney U-test test (implemented in R's package 'stats'), or a two-way analysis of variance (ANOVA). The Pearson's correlation coefficient was used to test the association between paired observations (cor.test in R's package 'stats'). All plotting and statistical comparisons were performed in R v3.0.2 (Team, 2014) using scripts developed inhouse (available upon request). The complete genome sequences of coronaviruses analysed in this study were downloaded from the Global Initiative on Sharing All Influenza Data (GISAID) and USA National Center for Biotechnology Information (NCBI) (see list in Table S1 ). The authors declare that they have no competing interests. PC is not involved in the review of this manuscript. ZC, data curation, formal analysis, writing-original draft and editing; SSB, formal analysis, writing-review and editing; MHW, formal analysis, writing-review and editing; RWYC, formal analysis, writing-review and editing; PKSC, supervision, writing-review and editing. J o u r n a l P r e -p r o o f ^ the preferred codons (RSCU > 1.6) and the suppressed codons (RSCU < 0.6) are highlighted in red and green, respectively. Amino acid SARS-CoV-2 HCoV-OC43 HCoV-HKU1 HCoV-NL63 Evolution of the genome and the genetic code: selection at the dinucleotide level by methylation and polyribonucleotide cleavage Coevolution of codon usage and transfer RNA abundance Dysregulation of M segment gene expression contributes to influenza A virus host restriction SeqinR 1.0-2: a contributed package to the R project for statistical computing devoted to biological sequences retrieval and analysis, Structural approaches to sequence evolution Genomic DNA k-mer spectra: models and modalities Origin and evolution of pathogenic coronaviruses The spike protein of SARS-CoV--a target for vaccine and therapeutic development MERS-CoV spike protein: a key target for antivirals Molecular Evolution of Human Coronavirus Genomes Identification of a cis-acting replication element within the poliovirus coding region Selection on codon bias Neutralizing Antibodies against SARS-CoV-2 and Other Human Coronaviruses Compositional differences within and between eukaryotic genomes Parallelization of the MAFFT multiple sequence alignment program The kullback-leibler distance Codon optimization and improved delivery/immunization regimen enhance the immune response against wild-type and drug-resistant HIV-1 reverse transcriptase Codon usage determines the mutational robustness, evolutionary capacity, and virulence of an RNA virus APOBEC3F properties and hypermutation preferences indicate activity against HIV-1 in vivo A DNA Vaccine Encoding a Codon-Optimized Human Papillomavirus Type 16 E6 Gene Enhances CTL Response and Anti-tumor Activity Human DNA methylomes at base resolution show widespread epigenomic differences Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding APOBEC3-mediated restriction of RNA virus replication Creating the CIPRES Science Gateway for inference of large phylogenetic trees Gene length and codon usage bias in Drosophila melanogaster, Saccharomyces cerevisiae and Escherichia coli Codon usage tabulated from international DNA sequence databases: status for the year 2000 Middle East respiratory syndrome coronavirus neutralising serum antibodies in dromedary camels: a comparative serological study Evolutionary basis of codon usage and nucleotide composition bias in vertebrate DNA viruses Codon usage in yeast: cluster analysis clearly differentiates highly and lowly expressed genes Codon optimization of antigen coding sequences improves the immune potential of DNA vaccines against avian influenza virus H5N1 in mice and chickens RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models Modification of HPV 16 E7 genes: correlation between the level of protein expression and CTL response after immunization of C57BL/6 mice R: A language and environment for statistical computing. R Foundation for Statistical Computing Potent binding of 2019 novel coronavirus spike protein by a SARS coronavirusspecific human monoclonal antibody CpG dinucleotide frequencies reveal the role of host methylation capabilities in parvovirus evolution Alignment-free sequence comparison-a review SARS-CoV infection in a restaurant from palm civet The 'effective number of codons' used in a gene Isolation of a novel coronavirus from a man with pneumonia in Saudi Arabia Codon usage roles in human papillomavirus Epidemiology and cause of severe acute respiratory syndrome (SARS) in Guangdong, People's Republic of China We gratefully acknowledge the authors, originating and submitting laboratories of the sequences from GISAID's EpiFlu™ Database (https://www.epicov.org/) on which this research is based.J o u r n a l P r e -p r o o f