key: cord-0787056-7ddukj9b
authors: Huang, Qingsheng; Gao, Huan; Zheng, Lingling; Chen, Xiujuan; Huang, Shuai; Liang, Huiying
title: Synonymous sites in SARS-CoV-2 genes display trends affecting translational efficiency
date: 2020-05-31
journal: bioRxiv
DOI: 10.1101/2020.05.30.125740
sha: ac30b1bd2440aef2cb7d126921f6a712341e92ee
doc_id: 787056
cord_uid: 7ddukj9b

A novel coronavirus, SARS-CoV-2, has caused a pandemic of COVID-19. The evolutionary trend of the virus genome may have implications for infection control policy but remains obscure. We introduce an estimate of the fold change of translational efficiency, based on synonymous variant sites, to characterize the adaptation of the virus to its hosts. The increased translational efficiency of the M and N genes suggests that the population of SARS-CoV-2 benefits from mutations toward favored codons, while the translational efficiency of the ORF1ab gene has slightly decreased. In the coding region of the ORF1ab gene upstream of the −1 frameshift site, the decrease in translational efficiency has weakened in parallel with the growth of the epidemic, indicating inhibition of synthesis of RNA-dependent RNA polymerase and promotion of replication of the genome. Such an evolutionary trend suggests that multiple infections increased virulence in the absence of social distancing.

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), a previously unknown coronavirus, has caused the pandemic of coronavirus disease 2019 (Fig. 1B). Compared with the average mutation density, ORF1ab:A, which is the coding region of the ORF1ab gene upstream of the −1 frameshift site, contains slightly more synonymous sites, while ORF1ab:B, which is downstream of the −1 frameshift site, contains fewer synonymous sites.

FCTE (fold change of translational efficiency) measures the effect of a mutation on the translation of the residing gene.
We estimated FCTE by the fold change of codon usage frequencies of the codon pair before and after a synonymous mutation (Table S1). The codon usage frequency was calibrated by the codon frequency averaged over a repertoire of genes weighted by their expression levels in the type II alveolar (AT2) cells of lung tissue, which are probably the target cells of SARS-CoV-2 (15). Two topologies of phylogenetic trees of the SARS-CoV-2 genomes were examined. The first topology is star-like, in which the central ancestor is the consensus sequence (Fig. 2A). The second topology is a maximum likelihood tree rooted at the earliest collected genome, in which a mutation is defined as a pair of codon states in the parent and child nodes (Fig. S1, Fig. 2B).

ORF1ab:B fragments and the full-length ORF1ab gene (Fig. 3). The evolutionary trend of the ORF1ab:A fragment is parallel to the growth of the epidemic (Fig. 3A).

We estimated FCTE by codon usage frequencies calibrated for average lung tissue (Table S1). The values of codon usage frequencies are similar to those for the AT2 cells.

The codon usage frequency was calibrated by the averaged frequency of codons in a

We downloaded 623 sequences from GISAID's EpiCoV™ (http://www.gisaid.org, last visited on 13 March 2020, External Data S1), and obtained the sequence and annotations of the reference genome of SARS-CoV-2 from NCBI (accession: NC_045512) (1). A multiple sequence alignment (MSA) of 516 genome sequences (sequences of more than 29000 nt) was built using MUSCLE version 3.8.31 (32) with default settings. The MSA was trimmed by removing the nucleotides preceding the first ORF (ORF1ab) and following the last ORF (ORF10). For quality control, we discarded genomes whose collection date was ambiguous or whose sequence contained any gaps or more than one unresolved nucleotide (symbol "N"); the discarded genomes also included those collected from pangolins and bats.
The refined MSA consists of 317 sequences.

A maximum likelihood phylogenetic tree of the 317 sequences was built using NG version 0.9.0 (33) with the GTR+G substitution model and bootstrapping with based convergence criterion up to 1000 replicates. The tree was essentially unrooted, and we rooted it at the earliest collected sample (EPI_ISL_402123) (Fig. S1).

The second topology of the phylogenetic tree of the virus genomes was a maximum likelihood tree rooted at the earliest collected genome (Fig. S1). For a synonymous triplet site, the maximum likelihood ancestral states at the internal nodes of the tree were reconstructed using the R package phangorn version 2.5.5 (34). Pairs of states of a synonymous site before and after a mutation were defined for edges whose two ends had different codons, i.e., edges for which the sum of absolute differences between the reconstructed codon state probabilities at the two ends was greater than 1. The ancestral and mutational states of the edge were the codons of the parent and the child nodes, respectively. The number of descendants of the child node indicates the prosperity of the mutation.

FCTE characterizes the tendency of codon usage bias in coding regions. In this study, we estimated FCTE by the fold change of codon usage frequency before and after a synonymous mutation (Fig. 2, A and B). Synonymous sites were grouped by the coding regions in which they reside, and we performed a one-sample exact Wilcoxon test to see whether the translational efficiency of each coding region was evolutionarily stable (Fig. 2, C and D).

We plotted log2 FCTE of mutations in the ORF1ab gene against the collection times of the virus genomes. To display the parallel development of the FCTE of mutations and the epidemic (Fig. 3), we extracted from the World Health Organization's report on COVID-19 the epidemic curve of clinically diagnosed cases, which include laboratory-confirmed cases, in Wuhan (4).
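The FCTE estimate and the one-sample test described above can be illustrated with a minimal Python sketch. It is not the code used in the study. The function names, the frequency table passed as a dictionary, the dropping of zero values, and the use of midranks for tied values are all assumptions made for illustration; the text only states that FCTE is a fold change of codon usage frequency and that an exact one-sample Wilcoxon test was applied to each coding region.

```python
import math


def log2_fcte(codon_freq, ancestral, derived):
    """log2 fold change of codon usage frequency for one synonymous mutation.

    codon_freq is a hypothetical dict mapping codon -> calibrated usage
    frequency (for example, expression-weighted frequencies for AT2 cells).
    """
    return math.log2(codon_freq[derived] / codon_freq[ancestral])


def exact_signed_rank_test(values):
    """Two-sided exact one-sample Wilcoxon signed-rank test against 0.

    Assumptions of this sketch: zeros are dropped, tied absolute values
    receive midranks, and the null distribution is enumerated exactly
    over all sign assignments. Returns (W_plus, p_value).
    """
    def magnitude(v):
        # Rounding makes values that differ only by float noise count as ties.
        return round(abs(v), 12)

    ordered = sorted((v for v in values if v != 0), key=magnitude)
    n = len(ordered)
    if n == 0:
        raise ValueError("need at least one non-zero value")

    # Doubled midranks are integers, which keeps the enumeration exact.
    doubled = [0] * n
    i = 0
    while i < n:
        j = i
        while j < n and magnitude(ordered[j]) == magnitude(ordered[i]):
            j += 1
        for k in range(i, j):
            doubled[k] = i + 1 + j  # twice the mean of ranks i+1 .. j
        i = j

    observed = sum(r for r, v in zip(doubled, ordered) if v > 0)

    # counts[s] = number of sign assignments whose positive ranks sum to s.
    max_sum = sum(doubled)
    counts = [0] * (max_sum + 1)
    counts[0] = 1
    for r in doubled:
        for s in range(max_sum, r - 1, -1):
            counts[s] += counts[s - r]

    total = 2 ** n
    lower = sum(counts[: observed + 1])
    upper = sum(counts[observed:])
    p_value = min(1.0, 2 * min(lower, upper) / total)
    return observed / 2, p_value
```

With three positive values the sketch returns W+ = 6 and a two-sided p-value of 0.25, the smallest value attainable with three observations. In practice, one list of log2 FCTE values would be built per coding region and tested separately.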
In addition to the FCTE estimated by the codon usage frequency of AT2 cells, we also calculated the FCTE by the codon usage frequency for average lung tissue and for all human genes without weighting (Fig. S2 and S3).

Sequences from GISAID's EpiCoV™ on which this study is based. The last column "317 genomes" indicates whether a genome was used in constructing the maximum likelihood phylogenetic tree and examining the synonymous triplet sites.
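The quality-control rule from the methods (unambiguous collection date, no gaps, at most one unresolved nucleotide), which decides whether a genome is among the 317, can be sketched as follows. This is an illustration only, not the pipeline used in the study. The `Genome` record, the `passes_qc` function, and the reading of "ambiguous date" as anything other than a full `YYYY-MM-DD` date are assumptions. The length filter (more than 29000 nt) is assumed to have been applied before alignment and is not repeated here.

```python
import re
from dataclasses import dataclass


@dataclass
class Genome:
    """Hypothetical record for one row of the trimmed alignment."""
    accession: str
    collection_date: str  # assumed to be "YYYY-MM-DD" when fully specified
    aligned_seq: str      # row of the MSA after trimming to ORF1ab..ORF10


# Assumption: a date counts as unambiguous only if year, month and day are given.
FULL_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def passes_qc(genome, max_unresolved=1):
    """Return True if the genome satisfies the three criteria in the text."""
    if not FULL_DATE.match(genome.collection_date):
        return False  # ambiguous collection date
    seq = genome.aligned_seq.upper()
    if "-" in seq:
        return False  # any gap in the trimmed alignment
    # At most one unresolved nucleotide (symbol "N") is tolerated.
    return seq.count("N") <= max_unresolved


def refine(genomes):
    """Keep only the genomes that pass quality control."""
    return [g for g in genomes if passes_qc(g)]
```

Calling `refine` on the rows of the trimmed alignment would yield the subset used for tree construction, under the stated assumptions about date format and gap symbol.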