key: cord-353777-t8q99tlq
authors: Jia, Yong; Shen, Gangxu; Zhang, Yujuan; Huang, Keng-Shiang; Ho, Hsing-Ying; Hor, Wei-Shio; Yang, Chih-Hui; Li, Chengdao; Wang, Wei-Lung
title: Analysis of the mutation dynamics of SARS-CoV-2 reveals the spread history and emergence of RBD mutant with lower ACE2 binding affinity
date: 2020-04-11
journal: bioRxiv
DOI: 10.1101/2020.04.09.034942
sha:
doc_id: 353777
cord_uid: t8q99tlq

Monitoring the mutation dynamics of SARS-CoV-2 is critical for the development of effective approaches to contain the pathogen. By analyzing 106 SARS-CoV-2 and 39 SARS genome sequences, we provided direct genetic evidence that SARS-CoV-2 has a much lower mutation rate than SARS. Minimum Evolution phylogeny analysis revealed the putative original status of SARS-CoV-2 and the early-stage spread history. The discrepant phylogenies for the spike protein and its receptor binding domain supported a previously reported structural rearrangement prior to the emergence of SARS-CoV-2. Although we found that the spike glycoprotein of SARS-CoV-2 is particularly conserved, we identified a mutation that leads to weaker receptor binding capability in a SARS-CoV-2 sample collected on 27 January 2020 from India. This represents the first report of a significant SARS-CoV-2 mutant, and raises the alarm that the ongoing vaccine development may become futile in future epidemics if more mutations are identified.

Highlights

Based on the currently available genome sequence data, we showed that the SARS-CoV-2 genome has a much lower mutation rate and genetic diversity than SARS had during the 2002-2003 outbreak.
The spike (S) protein-encoding gene of SARS-CoV-2 was found to be more conserved than other protein-encoding genes, which is a good indication for the ongoing antiviral drug and vaccine development.
Minimum Evolution phylogeny analysis revealed the putative original status of SARS-CoV-2 and the early-stage spread history.
We confirmed a previously reported rearrangement in the S protein of SARS-CoV-2, and propose that this rearrangement occurred between human SARS-CoV and a bat SARS-CoV, well before the transmission of SARS-CoV-2 to humans.
We provided the first evidence that a mutated SARS-CoV-2 with reduced human ACE2 receptor binding affinity has emerged in India, based on a sample collected on 27 January 2020.

Evolutionary rate assessment

The ratio of nonsynonymous mutations (dN) to synonymous mutations (dS) was calculated using codeml in the PAML (version 4.7) package (18). CDS sequences for each protein-encoding gene were filtered to remove redundant identical sequences. Codon-based CDS sequence alignment was then performed using the MUSCLE program, and an individual NJ tree was generated using MEGA7.0 (17) with the p-distance model.
The obtained sequence alignment and phylogenetic tree files were used as PAML inputs for dN and dS calculations.

Protein structural analyses

The 3D structure of the SARS-CoV-2 spike glycoprotein in complex with human ACE2 (PDB: 6VW1) has been determined recently (5, 9) [...]

Genetic diversity analyses identified a single amino acid mutation in RBD of the spike protein in SARS-CoV-2

As of 24 March 2020, there were a total of 174 nucleotide sequences for SARS-CoV-2 in the NCBI database. By restricting to complete or near-complete genomes, 106 sequences from 12 countries were obtained and used for further analyses. These comprise 54 records from the USA, 35 from China, and the rest from other countries: Australia (1), Brazil (2), Finland (1), India (2), Italy (1), Nepal (1), Spain (3), South Korea (1), and Sweden (1).

Based on the gene model of the reference SARS-CoV-2 genome (GenBank: NC_045512.2), a total of 12 protein-encoding open reading frames (ORFs), plus the 5'UTR and 3'UTR, were annotated (Figure 1A). Overall, the gene sequences from different samples are highly homologous, sharing > 99.1% identity, with the exception of the 5'UTR (96.7%) and 3'UTR (98%) (Table 1), which are relatively more divergent. Sequence alignment showed that there is no mutation in ORF6, ORF7a, and ORF7b. The genetic diversity profile across the 106 genomes is displayed in Figure 1A. A few nucleotide sites within ORF1a, ORF1b, ORF3a, and ORF8 exhibiting high genetic diversity were identified (Figure 1A).

The S protein is critical for virus infection and vaccine development. As shown in Figure [...]

To assess the mutation rate and genetic diversity of SARS-CoV-2, the ratio of nonsynonymous mutations (dN) to synonymous mutations (dS) was calculated for each protein-encoding ORF based on the 106 SARS-CoV-2 and 39 SARS genomes.
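The dN and dS values in this study were obtained with codeml, which fits a maximum-likelihood codon model. As a much simpler illustration of the distinction the two quantities rest on, the following minimal sketch counts synonymous and nonsynonymous codon differences between two aligned coding sequences using the standard genetic code. The function name `count_codon_changes` and the toy sequences are hypothetical, and the raw counts it returns are not normalised by the number of synonymous and nonsynonymous sites, so they are not dN or dS estimates.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered with
# bases T, C, A, G and the first codon position varying slowest.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def count_codon_changes(ref_cds, alt_cds):
    """Count codons that differ between two aligned, in-frame CDS strings.

    Returns (synonymous, nonsynonymous). Codons containing gaps or
    ambiguous bases are skipped. This is a raw difference count, not the
    site-normalised dN/dS estimated by codeml.
    """
    if len(ref_cds) != len(alt_cds) or len(ref_cds) % 3 != 0:
        raise ValueError("sequences must be aligned and a multiple of 3 long")
    syn = nonsyn = 0
    for i in range(0, len(ref_cds), 3):
        a = ref_cds[i:i + 3].upper()
        b = alt_cds[i:i + 3].upper()
        if a == b or a not in CODON_TABLE or b not in CODON_TABLE:
            continue
        if CODON_TABLE[a] == CODON_TABLE[b]:
            syn += 1      # same amino acid: synonymous difference
        else:
            nonsyn += 1   # different amino acid: nonsynonymous difference
    return syn, nonsyn


# Toy example (not real SARS-CoV-2 sequence): AGA->ATA changes Arg to Ile,
# CTT->CTC leaves Leu unchanged.
print(count_codon_changes("ATGAGACTT", "ATGATACTC"))
```

A high nonsynonymous count relative to the synonymous count for a gene is the kind of signal that a full dN/dS analysis then quantifies per site.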
For SARS-CoV-2, the highest dN was observed for ORF8 (0.0111), followed by ORF1a (0.0081), ORF9 (0.0079), and ORF4 (0.0063) (Table 1), indicating that these genes may be more likely to accumulate nonsynonymous mutations. In contrast, ORF1b (0.0029), the S gene (0.0040) encoding the spike protein, and ORF5 (0.0023) are relatively more conserved in terms of nonsynonymous mutation. Notably, ORF6, ORF7ab and ORF10 are strictly conserved, with no nonsynonymous mutation. Compared to SARS-CoV-2, SARS displayed higher mutation rates for all of the ORFs in the virus genome (Table 1), suggesting overall higher levels of genetic diversity and mutation rate. In particular, the dN and dS values for the S gene in SARS-CoV are around 12 and 7 times higher, respectively, than those for SARS-CoV-2. The mutation rate differences for ORF1a and ORF1b between SARS-CoV-2 and SARS are milder, ranging from 1.5 to 4.8 times. In contrast to SARS-CoV-2, which has strictly conserved ORF6, ORF7a, and ORF7b, SARS displayed mutation rates at different levels for these ORFs. Notably, the dS values for ORF10 are comparable between the two genomes, at 0.0326 and 0.0341, respectively.

To trace the potential spread history of SARS-CoV-2 across the world, an unrooted Minimum Evolution (ME) tree of the 106 genomes was developed based on whole-genome sequence alignment. The clustering pattern of the ME phylogeny [...]

The spike glycoprotein is critical for the virus infection. A recent study suggested that the S protein in SARS-CoV-2 may have undergone a structural rearrangement (13). To investigate this hypothesis, two separate phylogenies were developed based on the full-S and RBD sequences, respectively. Overall, the two phylogenies displayed similar clustering patterns, separating into three major clades (Figure 3).
SARS-CoV-2 was identified in the same major clade, and clustered most closely with two bat SARS CoVs (highlighted in purple and green, Figure 3) and the human SARS-CoV (orange, Figure 3). In both phylogenies, SARS-CoV-2 is most closely related to bat_CoV_RaTG13, suggesting that SARS-CoV-2 may have originated from bat. However, the evolutionary positions of human SARS-CoV and bat-SL-CoVZ45 were swapped between the full-S and RBD-only phylogenies. In the full-S phylogeny, bat-SL-CoVZ45 is more similar to SARS-CoV-2, whilst in the RBD-only phylogeny human SARS-CoV is closer to SARS-CoV-2 than bat-SL-CoVZ45 is. Taken together, these results suggest that the RBD of SARS-CoV-2 more likely originated from human SARS-CoV, whilst the rest of the S protein in SARS-CoV-2 may have originated from bat-SL-CoVZ45, supporting the potential structural rearrangement of the S protein in SARS-CoV-2. bat_CoV_RaTG13 is similar to SARS-CoV-2, indicating that the proposed structural rearrangement may have occurred in bat first, before its transmission to humans.

The RBD of the virus S protein binds to a receptor in host cells, and is responsible for the first step of CoV infection (3). Thus, amino acid mutations in the RBD may have a significant impact on receptor binding and vaccine development. The 3D structure of the spike protein RBD of SARS-CoV-2 (PDB: 6VW1) has recently been determined in complex with the human ACE2 receptor (6). One of the 12 amino acid mutations in the RBD of S protein (R408I) was identified among the 106 SARS-CoV-2 genomes. Sequence alignment showed that 408R is strictly conserved in SARS-CoV-2, SARS-CoV and bat SARS-like CoV (Figure 4A). Based on the determined CoV2_RBD-ACE2 complex structure, 408R is located at the interface between RBD and ACE2, but is positioned relatively far away from ACE2, and thus does not interact directly with ACE2 (Figure 4B).
However, the determined RBD-ACE2 structure showed that 408R forms a hydrogen bond (3.3 Å in length) with the glycan attached to 90N of ACE2 (Figure 4C) (6). This hydrogen bond may have contributed to the exceptionally high ACE2 binding affinity. In contrast, although this arginine residue is also conserved in human SARS-CoV (corresponding to 395R in PDB: 2AJF), it is positioned relatively distant (6.1 Å) from the glycan bound to 90N of ACE2 (Figure S1). Interestingly, the 408R-glycan hydrogen bond seems to be disrupted by the R408I mutation in one SARS-CoV-2 [...]. In contrast to the arginine residue, which is electrically charged and highly hydrophilic, the mutated isoleucine residue has a highly hydrophobic side chain with no hydrogen-bond potential (Figure 4E). To sum up, the R408I mutation identified from the SARS-CoV-2 strain in India represents a SARS-CoV-2 mutant with potentially reduced ACE2 binding affinity.

References

1. Potential interventions for novel coronavirus in China: A systematic review
2. Timely development of vaccines against SARS-CoV-2. Emerg Microbes Infec 9
3. Structure of SARS coronavirus spike receptor-binding domain complexed with receptor
4. Evidence for a Common Evolutionary Origin of Coronavirus Spike Protein Receptor-Binding Subunits
5. Cryo-EM structure of the 2019-nCoV spike in the prefusion conformation
6. Structural basis of receptor recognition by SARS-CoV-2
7. Angiotensin receptor blockers as tentative SARS-CoV-2 therapeutics. Drug development research
8. The spike protein of SARS-CoV - a target for vaccine and therapeutic development
9. Structural basis for the recognition of SARS-CoV-2 by full-length human ACE2
10. Proof of principle for epitope-focused vaccine design
11. Influenza Virosomes in Vaccine [...]
12. Coronaviruses: An RNA proofreading machine regulates replication fidelity and diversity
13. Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding
14. Evolution of the novel coronavirus from the ongoing Wuhan outbreak and modeling of its spike protein for risk of human transmission
15. AnnotationSketch: a genome annotation drawing library
16. MUSCLE: multiple sequence alignment with high accuracy and high throughput
17. MEGA7: molecular evolutionary genetics analysis version 7.0 for bigger datasets
18. Phylogenetic analysis by maximum likelihood
19. Genomic variance of the 2019-nCoV coronavirus
20. On the origin and continuing evolution of SARS-CoV-2
21. A pneumonia outbreak associated with a new coronavirus of probable bat origin
22. Identifying SARS-CoV-2 related coronaviruses in Malayan pangolins
23. Identifying SARS-CoV-2 related coronaviruses in Malayan pangolins
24. Preliminary identification of potential vaccine targets for the COVID-19 coronavirus (SARS-CoV-2) based on SARS-CoV immunological studies

Hydrophobic profile changes due to the R408I mutation, with red and white colours representing the highest and the lowest hydrophobicity, respectively. All amino acid numbering is according to the S protein of SARS-CoV-2 (NC_045512.2) and human ACE2, respectively.
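The contrast between the hydrophilic arginine and the hydrophobic isoleucine described for the R408I mutation can be made concrete with a published hydropathy scale. The following minimal sketch uses the Kyte-Doolittle scale, which is an assumption made for illustration; the scale used for the hydrophobic profile in the figure is not stated here. The helper name `hydropathy_shift` is hypothetical.

```python
import re

# Kyte-Doolittle hydropathy index (higher = more hydrophobic).
# Assumption: chosen for illustration; not necessarily the scale used
# for the hydrophobic profile in the paper's figure.
KYTE_DOOLITTLE = {
    "I": 4.5, "V": 4.2, "L": 3.8, "F": 2.8, "C": 2.5,
    "M": 1.9, "A": 1.8, "G": -0.4, "T": -0.7, "S": -0.8,
    "W": -0.9, "Y": -1.3, "P": -1.6, "H": -3.2, "E": -3.5,
    "Q": -3.5, "D": -3.5, "N": -3.5, "K": -3.9, "R": -4.5,
}


def hydropathy_shift(mutation):
    """Return the hydropathy change for a substitution such as 'R408I'.

    The result is the index of the new residue minus the index of the
    original residue; a positive value means a shift towards hydrophobic.
    """
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        raise ValueError(f"unrecognised mutation string: {mutation!r}")
    ref, alt = match.group(1), match.group(3)
    return KYTE_DOOLITTLE[alt] - KYTE_DOOLITTLE[ref]


print(hydropathy_shift("R408I"))  # 4.5 - (-4.5) = 9.0
```

On this scale arginine and isoleucine sit at opposite extremes, so R-to-I is the largest possible single-residue shift towards hydrophobicity. The index says nothing about hydrogen bonding or binding affinity directly; it only quantifies the side-chain polarity change noted in the text.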
Based on the currently available genome sequence data, our results showed that the mutation rate of SARS-CoV-2 is much lower than that of SARS, which caused the 2002-2003 outbreak. Our study is the first to provide a direct quantitative comparison between SARS-CoV-2 and SARS. A relatively stable SARS-CoV-2 genome is a good sign for epidemic control, as fewer mutations raise the hope of rapid development of valid vaccines and antiviral drugs. Our results are consistent with several recent genetic variance analyses on SARS-CoV-2 (20), which suggested that SARS-CoV-2 genomes are highly homogeneous. Molecular geneticists closely monitoring the virus development also suggested that the mutation rate of SARS-CoV-2 remains at a low level. Whilst it is generally safe to say that SARS-CoV-2 tends to mutate at a low rate, all current analyses are based only on data collected at the early stage of this pandemic. As the virus continues to spread rapidly around the world and more genomic data accumulate, the evolution and mutation dynamics of SARS-CoV-2 still need to be monitored closely.

One critical aim of our study was to identify the original status of SARS-CoV-2 before its wide transmission across different countries. Given the short time span of sample collection and the relatively low mutation rate of SARS-CoV-2, we believe that a Minimum Evolution phylogeny may outperform other phylogenetic methods for this aim. As expected, the earliest few reported SARS-CoV-2 accessions collected from Wuhan, China were identified at the center of the phylogenetic tree with the shortest branches. Interestingly, a number of virus genomes from the USA were found to be almost identical to these putative original versions of the virus from Wuhan. However, according to public media, the outbreak of SARS-CoV-2 in the USA occurred relatively later than in other countries.
One possible explanation for this observation is that the spread of SARS-CoV-2 in the USA might have started much earlier than previously thought or reported. Because a dominant proportion of the samples in this study were collected from China and the USA, we observed a significantly higher level of genetic diversity in these two countries. Most SARS-CoV-2 accessions from the other countries have closely related sisters from either China or the USA. This data bias, on the other hand, may give us an advantage in tracing the spread history of SARS-CoV-2 in different countries. We consider this inference reliable because all of the samples analysed in this study were collected at the early stage of the pandemic, which may avoid the potential noise introduced by more recently published genomes with complex spread backgrounds. One notable finding in our phylogenetic tree is that the singleton SARS-CoV-2 accessions collected from Australia, Brazil, South Korea, Italy and Sweden clustered together with two USA samples but without a Chinese accession, suggesting that these infection cases may be related. In addition, one of the three samples collected from the cruise ship stranded in Japan was found to be closely related to a sample collected from Guangzhou, China, whilst the other two grouped with several cases from the USA. Notably, our phylogeny seems to support the presence of two major types of SARS-CoV-2 in the target samples, suggesting the potential existence of two spread sources. Interestingly, this speculation is corroborated by an independent clustering analysis using a different phylogeny method (20).

Until now, the origin of SARS-CoV-2 and how it was transmitted to humans remain largely a mystery. Early genomic data showed that human SARS-CoV-2 is an enveloped, positive-sense, single-stranded RNA virus in the subgenus Sarbecovirus of the genus Betacoronavirus (13, 14).
Evolutionarily, SARS-CoV-2 is most closely related to bat SARS-like CoV (88% genome sequence identity) and human SARS-CoV (79%), the latter of which caused a worldwide outbreak in 2003 (13). Based on the high genome sequence identity between SARS-CoV-2 and bat SARS-like CoVs, it was initially speculated that SARS-CoV-2 may have originated from bats (14, 21). However, a more recent study proposed that pangolins may be the most likely reservoir hosts, owing to the identification of closely related SARS-CoVs from this species as well (22). Both of these animals can harbor coronaviruses related to SARS-CoV-2. However, direct evidence of the transmission of SARS-CoV-2 from either bats or pangolins to humans is still missing.

Prior to this study, several publications have suggested that SARS-CoV-2 may have originated from genome recombination of SARS-like CoVs from different animal hosts, as evidenced by the discrepant clustering patterns of phylogenies built from different genetic regions. Lu (13) first observed that the RBD of the S protein in SARS-CoV-2 is more closely related to human SARS-CoV, whilst the rest of its genome is more similar to bat SARS-CoV. Later, Peng (23) identified a bat CoV_RaTG13 and several pangolin SARS-CoVs that are consistently closer to SARS-CoV-2 than human SARS-CoV in either the full S protein or the RBD. By combining the data from these two studies, our study confirmed the observations reported in both, and further determined that the S protein recombination happened between human SARS-CoV and a bat SARS-CoV, well before its transmission to humans, with the newly identified bat SARS-CoV-RaTG13 as an intermediate.

Another notable finding in this study is the identification of an amino acid mutation in the RBD of the S protein in SARS-CoV-2.
Most importantly, we showed that this amino acid mutation is very likely to cause a reduced binding affinity to the human ACE2 receptor. The RBD of the S protein binds to a receptor in host cells and is responsible for the first step of CoV infection. The receptor binding affinity of the RBD directly affects the virus transmission rate. Thus, it has been the major target for antiviral vaccine and therapeutic development, as for SARS (8). Although the S protein gene seems to be more conserved than the other protein-encoding genes in the SARS-CoV-2 genome, our study provides direct evidence that a mutated version of the SARS-CoV-2 S protein with an altered transmission rate may have already emerged. Based on the close relationship of SARS-CoV-2 to SARS, current vaccine and drug development for SARS-CoV-2 has also focused on the S protein and its human binding receptor ACE2 (7, 24). Thus, the observation in this study raises the alarm that SARS-CoV-2 mutations with altered epitope profiles could arise at any time, which means that current vaccine development against SARS-CoV-2 is at great risk of becoming futile. Because the receptor recognition mechanism seems to be highly conserved between SARS-CoV-2 and SARS-CoV, which have been shown to share the common human cell receptor ACE2, one suggestion for the next step of therapeutic development is to focus on the identification of potential human ACE2 receptor blockers, as suggested in a recent commentary (7). This approach will avoid the above-