key: cord-0767634-zu46bdpu authors: Isabel, Sandra; Graña-Miraglia, Lucía; Gutierrez, Jahir M.; Bundalovic-Torma, Cedoljub; Groves, Helen E.; Isabel, Marc R.; Eshaghi, AliReza; Patel, Samir N.; Gubbay, Jonathan B.; Poutanen, Tomi; Guttman, David S.; Poutanen, Susan M. title: Evolutionary and structural analyses of SARS-CoV-2 D614G spike protein mutation now documented worldwide date: 2020-06-08 journal: bioRxiv DOI: 10.1101/2020.06.08.140459 sha: 7a969204fe0031c95f43288d3f88ce33597bf828 doc_id: 767634 cord_uid: zu46bdpu

The COVID-19 pandemic, caused by the Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), was declared on March 11, 2020 by the World Health Organization. As of May 31, 2020, there have been more than 6 million COVID-19 cases diagnosed worldwide and over 370,000 deaths, according to Johns Hopkins. Thousands of SARS-CoV-2 strains have been sequenced to date, providing a valuable opportunity to investigate the evolution of the virus on a global scale. We performed a phylogenetic analysis of over 1,225 SARS-CoV-2 genomes spanning from late December 2019 to mid-March 2020. We identified a missense mutation, D614G, in the spike protein of SARS-CoV-2; the clade it defines has become predominant in Europe (954 of 1,449 (66%) sequences) and is spreading worldwide (1,237 of 2,795 (44%) sequences). Molecular dating analysis estimated the emergence of this clade around mid-to-late January (10-25 January) 2020. We also applied structural bioinformatics to assess the potential impact of D614G on the virulence and epidemiology of SARS-CoV-2. In silico analyses of the spike protein structure suggest that the mutation is most likely neutral to protein function as it relates to its interaction with the human ACE2 receptor. The lack of available clinical metadata prevented us from investigating the association between viral clade and disease severity phenotype.
Future work that can leverage clinical outcome data with both viral and human genomic diversity is needed to monitor the pandemic.

In late December 2019, a cluster of atypical pneumonia cases was reported and epidemiologically linked to a wholesale seafood market in Wuhan, Hubei Province, China [1].
The causative agent was identified as a novel RNA virus of the family Coronaviridae and was subsequently designated SARS-CoV-2 owing to its high overall nucleotide similarity to SARS-CoV, which was responsible for previous outbreaks of severe acute respiratory syndrome in humans between 2002 and 2004 [2,3]. Previous studies of SARS-CoV-2 genomes sequenced during the early months of the epidemic (late December 2019 up to early February 2020) estimated the time of its emergence at the end of November (18th and 25th) [4-6], approximately a month before the first confirmed cases. It has been hypothesized that SARS-CoV-2 may have undergone a period of cryptic transmission in asymptomatic or mildly symptomatic individuals, or in unidentified pneumonia cases, prior to the cluster reported in Wuhan in late December 2019 [3]. Based on the high nucleotide identity of SARS-CoV-2 to a bat coronavirus isolate (96%) [7], a possible scenario is that SARS-CoV-2 underwent a period of adaptation in an as-yet-unidentified animal host, facilitating its capacity to jump species boundaries and infect humans [3]. The present rapid spread of the virus worldwide, coupled with its associated mortality, raises an important concern about its potential to adapt further into more highly transmissible or virulent forms.

The availability of SARS-CoV-2 genomic sequences concurrent with the present outbreak provides a valuable resource for improving our understanding of viral evolution across location and time. We performed an initial phylogenetic analysis of 749 SARS-CoV-2 genome sequences from late December 2019 to March 13, 2020 (based on publicly available genome sequences on GISAID) and noted 152 SARS-CoV-2 sequences, initially isolated in Europe beginning in February 2020, which appear to have emerged as a distinct phylogenetic clade.
Upon further investigation, we found that these strains are distinguished by a derived missense mutation in the gene encoding the spike protein (S protein), resulting in an amino acid change from an aspartate to a glycine residue at position 614 (D614G). At the time of our study, the D614G mutation was at particularly high frequency (20/23, 87%) among sequenced SARS-CoV-2 specimens from Italy, which was then emerging as the most severely affected country outside of China, with an overall case fatality rate of 7.2% [8].

During the course of our analysis, the number of available SARS-CoV-2 genomes increased substantially. We subsequently analyzed a total of 2,795 genome sequences of SARS-CoV-2 (Supplementary data Table 1). For those sequences with demographics (65% of sequenced specimens), the male to female ratio was 0.56:0.44, with a median age of 49 years and a range from less than 1 to 99 years. As of 30 March 2020, the D614G clade included 954 of 1,449 (66%) European specimens and 1,237 of 2,795 (44%) specimens sequenced worldwide. A comparison against the previous set of genomes collected for our phylogenetic and molecular dating analysis revealed that, for samples submitted during the period from March 17-30, 2020, the D614G clade became increasingly prevalent worldwide, expanding from 22 to 42 countries (Figure 1), as also reported previously [9,10]. The demographic distribution for this mutation, when known (male to female ratio, 0.56:0.44; median age, 48 years; age range, less than 1 to 99 years), was not significantly different from the reported demographics for all sequenced SARS-CoV-2 specimens.

We employed molecular dating to estimate the time of emergence of the D614G clade.
Based on a curated set of 442 genomes representing the sequence diversity of SARS-CoV-2 samples available at the time of analysis (30 March 2020), the mean time to most recent common ancestor (tMRCA) was estimated at 18 January 2020 (95% highest posterior density (HPD) interval: 10-25 January), indicating its relatively recent emergence (Figure 2). Although the mutation appears to be clade-specific, we noted that D614G also arose independently in a single isolate belonging to a distinct lineage, Wuhan/HBCDC-HB-06/2020 (EPI 412982), collected on 7 February 2020. Given the high degree of nucleotide identity within the D614G clade (~99.6%), we expect that future tMRCA estimates will not differ substantially. The mean tMRCA for the root of the tree was estimated to be 13 November [...] information, which are subject to change as more information becomes available and is compared with other epidemiological information. For instance, as different countries review and test archived specimens from cases of severe pneumonia or influenza-like illness for SARS-CoV-2, it is expected that additional cases may be identified, such as in France, where a patient without travel history to China was found to have had COVID-19 in late December, concurrent with the initial reported cases from Wuhan [11]. These retrospective analyses will provide crucial insights into the early transmission dynamics and evolution of SARS-CoV-2 and its rapid global spread.

Given the recent emergence of the D614G clade and the increasing number of specimens harboring the mutation identified worldwide, we sought to investigate the potential significance of the mutation for clinical disease severity phenotypes.
Unfortunately, only limited clinical disease severity data were available for patients whose sequenced samples were included in our analysis (asymptomatic 64, symptomatic 8, mild symptoms 2, pneumonia 4, hospitalized 163, released 48, recovered 26, nursing home 3, live 24, ICU 2, deceased 1), preventing us from meaningfully correlating disease severity with genotype from these data. However, using country-wide crude case fatality data for countries from which sequencing data were available, there was no significant correlation between the proportion of D614G clade sequences and the crude case fatality rate as of 30 March 2020 (Spearman's rank correlation coefficient, r = 0.22, 95% CI -0.12 to 0.51). In addition, on analysis of crude case fatality rate by age group (available for China, Italy, South Korea, Spain, and Canada), there was no significant correlation with the proportion of D614G clade sequences among those analysed for these countries (Supplementary Table 2) [...]

The structure of the SARS-CoV-2 spike (S) protein is shown in Figure 3. The S protein is a heavily glycosylated trimeric protein that mediates entry into host cells by binding the ACE2 receptor and driving membrane fusion. Recently, Wrapp and colleagues used cryo-EM to determine the structure of the S protein and analyze its conformational changes during infection [19]. Using their three-dimensional model of the S protein structure, we set out to investigate the effects that a mutation at position 614 might have. First, we analyzed changes in inter-atomic contacts within a radius of 6 Å from position 614 before and after substitution of aspartate by glycine. Notably, four destabilizing (i.e., hydrophobic-hydrophilic) inter-chain contacts with residues of an adjacent chain are lost upon D614G mutation (see Table 1 and Figure 3C). This suggests that a small repelling interaction between adjacent chains is removed upon this aspartate substitution (see Table 1).
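The distance-cutoff step of a contact analysis like the one described above can be sketched in Python. This is only an illustrative sketch, not the tool used in the study: the `Atom` class, the `contacts_within` function, and all coordinates are hypothetical, and the residue names and numbers other than 614 and 616 are placeholders. It lists residues with at least one atom within 6 Å of the target residue and separates inter-chain from intra-chain neighbours; it does not classify contacts as hydrophobic or hydrophilic.

```python
from dataclasses import dataclass
from math import dist


@dataclass(frozen=True)
class Atom:
    chain: str    # chain identifier, e.g. "A"
    resnum: int   # residue number within the chain
    resname: str  # three-letter residue name
    xyz: tuple    # (x, y, z) coordinates in angstroms


def contacts_within(atoms, chain, resnum, cutoff=6.0):
    """Map each neighbouring residue to its shortest atom-atom distance
    from the target residue, keeping only distances <= cutoff."""
    target = [a for a in atoms if (a.chain, a.resnum) == (chain, resnum)]
    if not target:
        raise ValueError("target residue not found")
    found = {}
    for a in atoms:
        if (a.chain, a.resnum) == (chain, resnum):
            continue  # skip atoms of the target residue itself
        d = min(dist(a.xyz, t.xyz) for t in target)
        if d <= cutoff:
            key = (a.chain, a.resnum, a.resname)
            found[key] = min(d, found.get(key, d))
    return found


# Hypothetical toy coordinates, NOT taken from the cryo-EM structure.
toy_atoms = [
    Atom("A", 614, "ASP", (0.0, 0.0, 0.0)),
    Atom("A", 614, "ASP", (1.5, 0.0, 0.0)),
    Atom("A", 616, "ASN", (0.0, 5.0, 0.0)),
    Atom("B", 100, "ALA", (4.0, 3.0, 0.0)),   # placeholder residue
    Atom("B", 200, "MET", (20.0, 0.0, 0.0)),  # placeholder, beyond cutoff
]

near = contacts_within(toy_atoms, "A", 614)
# Neighbours on a chain other than the target's chain are inter-chain contacts.
inter_chain = sorted(key for key in near if key[0] != "A")
```

Running the same search on the wild-type and mutant models and comparing the two result sets would show which contacts are gained or lost, which is the kind of before-and-after comparison the text describes.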
However, it is unlikely that this would have a significant effect on recognition of and binding to ACE2, given the relatively distal position of this mutation with respect to the receptor-binding domain (RBD) (see Figure 3B). Further analyses would be required to assess whether the D614G mutation affects the way the S protein changes its conformation after interaction with ACE2. Lan and colleagues also showed that residues in the RBD act as epitopes for SARS-CoV-2 and that mutations can influence antibody binding [20]. Given the important role that glycosylation plays in regulating the function of spike proteins in coronaviruses [21], we decided to search for potential changes in a glycosylated residue (asparagine) at position 616. As shown in Figure 3D [...] (Figure 4). We do note, however, that numerous factors limit the reliability of Ct value measurements, including sample type, quality of swabs, quality of collection method, and time from onset to sample collection [23]. In contrast, another study found a statistically significant decrease in PCR Ct values associated with the G variant [10], but its clinical significance remains unclear. In spite of the lack of evidence for phenotypic differences, it is interesting that, in the short period of time since its emergence, the D614G clade has become widespread worldwide. It is possible that even if the observed mutation does not impact the protein's interaction with ACE2, it may not be completely neutral with respect to viral fitness. For example, given that the molecular weight of glycine is significantly smaller than that of aspartate, the mutation could be advantageous from a cost minimization point of view [24] [...]

[...] in a reference spike monomer (blue) and four residues (pink) in its adjacent spike protein monomer chain (white).
These four contacts are destabilizing and create a hydrophilic-hydrophobic repelling effect that is lost upon replacement of aspartate by glycine in the D614G mutation (see Table 1). (D) Spatial distribution of aspartate 614 residue (green) and an adjacent glycosylated asparagine residue in position 616 (orange). The two residues point in opposite directions and thus it is unlikely they share a meaningful interaction.

A novel coronavirus from patients with pneumonia in China
Genetic diversity and evolution of SARS-CoV-2
A Genomic Perspective on the Origin and Emergence of SARS-
The proximal origin of SARS-CoV-2
The global spread of 2019-nCoV: a molecular evolutionary analysis
Early phylogenetic estimate of the effective reproduction number of SARS-CoV-2
A pneumonia outbreak associated with a new coronavirus of probable bat origin
Case-Fatality Rate and Characteristics of Patients Dying in Relation to COVID-19 in Italy
Bioinformatic prediction of potential T cell epitopes for SARS-Cov-2
Spike mutation pipeline reveals the emergence of a more transmissible form of SARS-CoV-2
SARS-CoV-2 was already spreading in France in late
COVID-19 National Emergency Response Center Korea Centers for Disease Control Coronavirus Disease-19: The First 7,755 Cases in the Republic of
Government of Canada. Coronavirus Disease 2019 (COVID-19) Daily epidemiology update
On the origin and continuing evolution of SARS-CoV-2". (2020)
Receptor and viral determinants of SARS-coronavirus adaptation to human ACE2
Molecular Evolution of the SARS Coronavirus, during the Course of the SARS Epidemic in China. Science
Cross-host evolution of severe acute respiratory syndrome coronavirus in palm civet and human
Cryo-EM structure of the 2019-nCoV spike in the prefusion conformation. Science
Structure of the SARS-CoV-2 spike receptor-binding domain bound to the ACE2 receptor
Expression, glycosylation, and modification of the spike (S) glycoprotein of SARS CoV
Structure, Function, and Antigenicity of the SARS-CoV-2 Spike Glycoprotein
Improved Molecular Diagnosis of COVID-19 by the Novel, Highly Sensitive and Specific COVID-19-RdRp/Hel Real-Time Reverse Transcription-PCR Assay Validated In Vitro and with Clinical Specimens
Cost-minimization of amino acid usage
Multiple nucleic acid binding sites and intrinsic disorder of severe acute respiratory syndrome coronavirus nucleocapsid protein: implications for ribonucleocapsid protein packaging
Emerging SARS-CoV-2 mutation hot spots include a novel RNA-dependent-RNA polymerase variant
Global Spread of SARS-CoV-2 Subtype with Spike Protein Mutation D614G is Shaped by Human Genomic Variations that Regulate Expression of TMPRSS2 and MX1 Genes
MAFFT multiple sequence alignment software version 7: Improvements in performance and usability
IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies
A new method for inferring timetrees from temporally sampled molecular sequences
Issues with SARS-CoV-2 sequencing data
ClonalFrameML: efficient inference of recombination in whole bacterial genomes
BEAST 2: a software platform for Bayesian evolutionary analysis
Posterior summarization in Bayesian phylogenetics using Tracer 1.7
BEAST 2.5: An advanced software platform for Bayesian evolutionary analysis
Ggtree: an R Package for Visualization and Annotation of Phylogenetic Trees With Their Covariates and Other Associated Data
Automated analysis of interatomic contacts in proteins