key: cord-260054-iihgc5nr authors: Cavallo, Luigi; Oliva, Romina title: D936Y and Other Mutations in the Fusion Core of the SARS-Cov-2 Spike Protein Heptad Repeat 1 Undermine the Post-Fusion Assembly date: 2020-06-08 journal: bioRxiv DOI: 10.1101/2020.06.08.140152 sha: doc_id: 260054 cord_uid: iihgc5nr The iconic “red crown” of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is made of its spike (S) glycoprotein. The S protein is the Trojan horse of coronaviruses, mediating their entry into the host cells. While SARS-CoV-2 was becoming a global threat, scientists have been accumulating data on the virus at an impressive pace, both in terms of genomic sequences and of three-dimensional structures. On April 21st, the GISAID resource had collected 10,823 SARS-CoV-2 genomic sequences. We extracted from them all the complete S protein sequences and identified point mutations thereof. Six mutations were located on a 14-residue segment (929-943) in the “fusion core” of the heptad repeat 1 (HR1). Our modeling in the pre- and post-fusion S protein conformations revealed, for three of them, the loss of interactions stabilizing the post-fusion assembly. On May 29th, the SARS-CoV-2 genomic sequences in GISAID were 34,805. An analysis of the occurrences of the HR1 mutations in this updated dataset revealed a significant increase for the S929I and S939F mutations and a dramatic increase for the D936Y mutation, which was particularly widespread in Sweden and Wales/England. We notice that this is also the mutation causing the loss of a strong inter-monomer interaction, the D936-R1185 salt bridge, thus clearly weakening the post-fusion assembly. Coronavirus Disease 2019 is caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). SARS-CoV-2 is a novel virus belonging to the β genus coronaviruses, which also include two highly pathogenic human viruses identified in the last two decades, the severe acute respiratory syndrome coronavirus (SARS-CoV) and the Middle East respiratory syndrome coronavirus (MERS-CoV) (1) (2) (3) . Coronaviruses are named after the protruding spike (S) glycoproteins on their envelope, giving a crown (corona in latin) shape to the virions (4) . Of the four structural proteins of coronavirues, S, envelope (E), membrane (M), and nucleocapsid (N), the S protein is the one playing a key role in mediating the viral entry into the host cells (5) (6) (7) , making it one of the main targets for the development of therapeutic drugs and vaccines (8) (9) (10) (11) (12) (13) (14) . Comprised of two functional subunits, S1 and S2, it first binds to a host receptor through the receptor-binding domain (RBD) in the S1 subunit and then fuses the viral and host membranes through the S2 subunit (7, 15) . In the prefusion conformation, the SARS-CoV-2 S protein forms homotrimers protruding from the viral surface, where its RBD binds to the angiotensin-converting enzyme 2 (ACE2) receptor on the host cell surface (1) (like the SARS-CoV homolog (16) , and differently from MERS-CoV S, which recognizes a different receptor, the dipeptidyl peptidase 4 (17) ). Receptor binding and proteolytic processing by cellular proteases then cause S1 to dissociate and S2 to undergo large-scale conformational changes towards a stable structure, bringing viral and cellular membranes into close proximity for fusion and infection (7, 15, 18) . While the outbreak of COVID-19 was rapidly spreading all over the world, affecting millions of people and becoming a global threat, laboratories worldwide promptly started to sequence a large number of SARS-CoV-2 genomes. All the available genomic data is accessible through the Global Initiative on Sharing All Influenza Data (GISAID) website, an invaluable open access resource (19, 20) . Simultaneously, crucial structural knowledge has been achieved on SARS-CoV-2, especially regarding the S protein. 3D structures are now available from the Protein Data Bank (PDB) (21) for the SARS-CoV-2 S protein in the pre-fusion conformation, also bound to the ACE2 receptor (22) (23) (24) (25) (26) (27) (28) , and for the post-fusion core of its S2 subunit in the postfusion conformation (29) . On April 21 st 2020, 4 months after the first sequencing (30), 10,823 genomic sequences of SARS-CoV-2 were available from GISAID. Therefore, we considered the time ripe for an assessment of the mutational spectrum of the SARS-CoV-2 spike protein. To this aim, we extracted all the complete S protein sequences from the GISAID 21 st April dataset and identified all the mutations occurring in at least 2 identical sequences (see Table S1 ). From this analysis, a 14-amino acid segment in the fusion core of the heptad repeat 1 (HR1) emerged as a hotspot for mutations. While the mutations we identified corresponded to a 1 mutation every 12 positions along the protein sequence, as many as 6 amino acids were found to be mutated in the above 14-amino acid segment: S929, D936, L938, S939, S940 and S943. After the proteolytic processing, in the post-fusion conformation, the S protein HR1 and HR2 motifs interact with each other to form a six-helix bundle (6-HB), which promotes initiation of the viral and cellular membranes fusion. The HR1 "fusion core" is named after its role in giving many interactions with HR2 in the post-fusion conformation, thus playing a key role in the virus infectivity (31) . Based on the structural location of the above highly concentrated mutations and on their nonconservative nature, we considered them of particular interest and decided to further investigate their structural basis, both in the pre-and post-fusion conformation, as well as their sequencing dates and geographical distribution. As we show in the following, as many as three of them are responsible for the loss of inter-monomer Hbonds in the post-fusion conformation, while one of them, S943P, would introduce unexpected structural strain in the pre-fusion conformation. A search in the GISAID resource updated to May 29 th showed a significant increase in occurrences especially for one mutant, D936Y, unreported to date, which has become a common variant in some European countries, especially Sweden. It is also the mutant having the most significant structural role, causing the loss of an intermonomer salt bridge in the post-fusion assembly. We downloaded the 10,823 genomic sequences available from GISAID on April 21 st 2020. From these sequences, we extracted the nucleotide sequences of the spike protein and translated them to protein sequences with in-house scripts. Nucleotides sequences featuring an internal stop codon, having at least one undefined ("N") nucleotide or resulting in spike proteins of length different from 1,273 amino acids were discarded. Sequences annotated as pangolin, bat or canine were also discarded. The remaining 7,692 protein sequences were further analysed. First, we clustered them in sets of identical sequences with CD-HIT (32), obtaining 120 clusters of at least 2 sequences and 245 unique sequences. As a reference system for further analyses, we used the first dated (on December 24 th 2019) genomic sequence in GISAID, isolated and sequenced in Wuhan (Hubei, China) (30) . Then, upon alignment to the reference sequence, we identified point mutations in all the sets of at least two sequences. We downloaded again the 34,805 genomic sequences available from GISAID on May 29 th 2020 (gisaid_hcov-19_2020_05_29_14) and followed the above pipeline to extract 23,332 complete 1273-residue long S protein sequences. We then recorded the presence and frequency in them of any mutation occurring in the fusion core of the HR1 (residues 929-949) with in-house scripts. Mutants 3D models were built using the mutate_model module of the Modeller 9v11 program (33) . This is an automated method for modelling point mutations in protein structures, which includes an optimisation procedure of the mutated residue in its environment, beginning with a conjugate gradients minimisation, continuing with molecular dynamics with simulated annealing and finishing again by conjugate gradients. The used force field is CHARM-22, for details see Reference (34) . Models for mutants in the pre-fusion conformation were built starting from the EM structure of the pre-fusion trimeric conformation (PDB ID: 6VSB, resolution 3.46 Å, (22)). Models for mutants in the post-fusion conformation were built starting from the X-ray structure of the S2 subunit fusion core (PDB ID: 6LXT, resolution 2.90 Å, (29)). Molecular models were analysed and visually inspected with Pymol (35) . The COCOMAPS web server (36) was used to analyse the inter-chain contacts and Hbonds as well as the residues accessibility to the solvent. We downloaded all the SARS-CoV-2 genomic sequences from the GISAID resource on April 21 st 2020, extracted from them 7,692 complete S protein sequences and identified all the point mutations occurring in at least two identical sequences (see Methods). The 111 mutations we identified, occurring at 105 different positions spread all over the protein sequence, are reported in Table S1 , with the relative number of occurrences. While the mutations we identified were spaced on average 12 positions along the protein sequence, a segment of 14 amino acids harboured 6 mutations, at positions 929, 936, 938, 939, 940 and 943, proposing itself as a mutational hotspot. This sequence segment is part of the "fusion core" of the HR1, in the protein S2 subunit. The HR1 of coronaviruses S proteins undergoes one of the most notable rearrangements within the protein between the pre-and post-fusion conformations. In the post-fusion conformation, in fact, it experiences a refolding of the pre-fusion multiple helices and intervening regions into a single continuous helix ( Figure 1 ). As already mentioned, three of these long helices then form a 6HB with three HR2 helical motifs (18, 29, 31) . The HR1 and its "fusion core" in particular thus play a crucial role in the virus infectivity. The following 6 mutations were identified in the fusion core of the HR1 on April 21 st 2020: S929I, D936Y, L938F, S939F, S940F, S943P. Two of these mutations, D936Y and S943P, were among the most frequent in the ensemble of mutations we identified. Besides the widespread D614G, now dominant over the original D614 variant (37, 38) , only 5 other mutations (two of them being very peripheral, L5F and P1263L) recurred indeed in ≥ 20 sequences (see Table S1 ). S943P was also reported in (38) , where it was hypothesized to be spreading via recombination. The L938F mutation was a particularly late one; it was found in 2 sequences, associated to the D614G mutation, both from England and dated March 29 th . The S929I mutation was found in 2 sequences from USA (Washington), dated March 12 th and 27 th , associated to the D614G mutation. Finally, the S940F mutation had a unique geographical distribution, as it was found in 2 sequences from Australia (New South Wales) dated February 28 th and March 4 th , not associated to the D614G mutation. In addition, it was found in 1 single sequence from France, dated March 20 th , where it was associated to the D614G mutation. In conclusion, with the exception of S940F, which was found in Australia, all the mutations in the HR1 core fusion were spread in two continents, Europe and/or North America. Furthermore, most of them originated from the D614G variant. This is in agreement with them seeming to be quite late mutations, sequenced starting from the end of February/March 2020, i.e. over two months after the first Wuhan variant dated December 24 th 2019 (30) ( Table 1) The ≈3-fold increment in frequency of the S929I and S939F mutants was in line with the increment of the sequences in the dataset. The three additional occurrences of the S929I mutation were from USA (Washington), Wales and England. A novel S929T mutation was also reported twice from Scotland. Additional occurrences of the S939F mutation were instead from USA, 7, England, 2, Kazakhstan, 1, and UAE, 1. As for the L938F and S940F mutants, their increment was significantly lower than the increment of the sequences in the dataset. A positive selection thus clearly hasn't emerged to date for these mutations. The only additional occurrence of L938F was from Denmark, while the 2 additional occurrences of S940F were from France and USA (Washington). The S943P mutation represented a special case. Most of the sequences harbouring such a mutation were indeed modified between the April 21 st and the May 29 th datasets, so that they do not feature anymore the mutation to proline. However, 3 novel occurrences of the same mutation, S943P, emerged from China (Beijing). In addition, 3 sequences from Scotland presented the novel S943I mutation. As for the remaining positions of the HR1 fusion core, to May 29 th , either they were fully conserved (S937, K933, A942, A944, L945, G946, K947 and Q949), or they hosted one single occurrence of mutation (to valine for A930, to aspartate for I931 and G932, to histidine for Q934 and D935, to alanine for T941 and to arginine for L948). Because of the rarity of such mutations, we will not discuss them here. However, we will continue to monitor them over time. All the amino acids undergoing mutations in the SARS-CoV-2 S protein are conserved in the bat coronavirus RaTG13 S protein (sharing an overall sequence identity of 97% with SARS-CoV-2 S protein), while as many as five of them are mutated in the SARS-CoV-2 S protein (overall 76% sequence identical to the SARS-CoV-2 homolog) (see Figure 1 ). Four of these mutations are however conservative (aspartate to glutamate, serine to threonine), except S929, which is a lysine in SARS-CoV. It has been proposed that such mutations in the SARS-CoV-2 HR1 may be associated with enhanced interactions with the HR2, further stabilizing the 6-HB structure and maybe leading to increased infectivity of the virus (29) . It is noteworthy that no one of the point mutations we identified restored the corresponding SARS-CoV amino acid. In the pre-fusion conformation, all the mutated positions, but S943, are located on the second of four non-coaxial helical segments composing the HR1 (Figure 1 ). Four of them, S929, D936, S939 and S940, are exposed to the solvent (Table 2) , and can be modelled as larger (mostly aromatic) residues without causing any structural strain (see Figure 2 ). These mutations are not expected to cause relevant changes in the prefusion structure, although they could have a destabilizing effect as a consequence of posing large aromatic residues in direct contact with the solvent instead of smaller apolar (leucine), polar (serine in 2 cases) or even charged (aspartate) residues. In addition, S940 involves its side-chain in a H-bond with the main-chain of D936, 4 residues upstream. The loss of this H-bond in the S940F mutant also points to a slight destabilization of the pre-fusion conformation. As for L938, it is buried in the prefusion conformation, pointing towards a three-stranded anti-parallel β-sheet made of the S711-P728 segment from the S1 subunit and of the Y1047-P1053 and G1059-A1078 segments from the S2 subunit, without directly contacting it (distances above 5 Å). It can also be modelled as a large phenylalanine without causing sterical strain. Upon mutation, it seems to optimize the hydrophobic interactions with the neighboring residues, especially I726 and A944. Finally, S943 is located on a turn immediately downstream the helical segment hosting the above five mutations, between the second and third helical segments. The wild-type residue S943 features φ and ψ dihedral angles of 58.5° and 24.5°, respectively, which fall in an unfavourable region for prolines. In the S943P model we generated, the P943 φ and ψ dihedrals assume the values of 3.0° and 68.2°, placing the residue in an outlier region (39). The favoured φ angle for prolines is indeed restricted to the value of -63 ± 15°, (40) characteristic of α-helices. A proline at such a position would therefore introduce an anomaly in the pre-fusion conformation, while strongly promoting the transition to the post-fusion single continuous helical conformation. It is also worth noticing that this would be the only mutation among those we identified so far, introducing a proline residue in the SARS-CoV-2 S protein (Table S1 ). In light of the analysis of the GISAID May 29 th updated, we also modelled the S943I mutation. Being isoleucine compatible with the S943 dihedral values, this mutation does not result in any structural strain. When looking at the post-fusion conformation of the SARS-CoV-2 spike protein S2 subunit, these mutations appear more revealing. Three of the wild-type residues, S929, D936 and S943, are indeed engaged in side-chain to side-chain H-bonds with the HR2 segment of an adjacent monomer. In particular, S929, D936 and S943 (HR1 on Chain A) are H-bonded to S1196, R1185 and E1182, respectively (HR2 on Chain C, Figure 3 ). These are all strong H-bonds, especially the one between S943 and E1182, involving a negatively charged residue, and the one between D936 and R1185, being actually a salt bridge (estimated to contribute an additional 3-5 kcal/mol to the free energy of protein stability as compare to a neutral H-bond (41)). All these three H-bonds are lost upon mutation, which points to a weakening of the post-fusion assembly. Of the remaining three mutations, S939F is completely exposed to the solvent and therefore, like in the pre-fusion conformation, expected to act unfavourable on the protein solvation energy. On the contrary, in case of L938F and S940F, which are substantially buried within the structure, mutation to a large aromatic phenylalanine seems even to optimize the network of the hydrophobic interactions; in case of F940, with the aliphatic parts of the side-chains of E1182 and R1185 on an adjacent monomer, and, in case of F938, with V1189 and A1190 on the same monomer and with other F938 residues on both the adjacent monomers. When comparing the effect of the mutations on the pre-and post-fusion structures, it emerges that the S929I, D936Y and S943I mutations strongly destabilize the postfusion conformation, while having a marginal impact on the stability of the pre-fusion one. On the contrary, S940F seems to favour the post-fusion conformation over the pre-fusion one. As for S938F and S939F, they seem to have a comparable effect on both the conformations, slightly stabilizing and destabilizing, respectively. Finally, the S943P mutation would strongly destabilize both the pre-and post-fusion conformations. Based on a thorough analysis of the S protein sequences, that we extracted from the genomic sequences of SARS-CoV-2 reported in GISAID on April 21 st , we identified the fusion core of the HR1 as a mutational hotspot. The D936Y and S943P mutations were the most numerous, being among the most frequently occurring mutations overall at the time. Other, less frequent, mutations were S939F and then S929I, L938F and S940F. Overall, such mutations appeared to be late ones, emerging starting from the end of February or even mid March 2020, and were mainly localized in Europe and USA. Based on their frequency, on their location in a protein region playing a key role in the post-fusion conformation and also on the non-conservative nature of the mutations themselves, we decided to further investigate the structural basis of such mutations, finding out that they all can play a role in tuning the stability of the preand/or post-fusion S protein conformation. Other potentially interesting mutations are S929I and S939F, whose number of occurrences underwent a ≈2/3-fold increase. On the other hand, the increment in the occurrence of L938F and S940F was marginal, posing less emphasis on such mutations, which will be nonetheless useful to continue monitoring. Finally, the S943P mutation, although still reported in few cases, underwent a dramatic reduction of occurrences, due to modification of the original sequences where they were first reported. At the same time, a S943I mutation emerged, that will also be worth continuing to monitor. We remind here that a proline at position 943 would cause a significant destabilization on the S protein pre-fusion conformation. It is also worth noticing that the 2 mutations significantly increasing their frequency over time, D936Y and S929I, were also those that, together with S943P/I, caused the loss of a inter-monomer H-bond in the post-fusion conformation of the protein. Interestingly, the now emerging S943I mutation gets the same effect without destabilizing the pre-fusion conformation. The most frequently occurring mutation in the HR1 "fusion core", common in Sweden and UK on May 29 th , is also the one causing the loss of a strong inter-monomer salt bridge. Our structural analyses provide a rationale for such mutations, pointing to a weakening of the post-fusion assembly. However, only experiments on cellular systems will clarify whether this may be a virus strategy for reducing its membrane fusion capacity, thus lowering its virulence. We gratefully acknowledge all the Authors from the Originating laboratories responsible for obtaining the specimens and the Submitting laboratories where genetic sequence data were generated and shared via the GISAID Initiative, on which this research is based. Table 2 . Solvent accessibility of mutated residues in the pre-and post-fusion conformations. Amino acid Pre-fusion Post-fusion I929 exposed partly buried (18.6%) a Y936 exposed partly buried (19.0%) F938 buried (95.8%) buried (95.3 %) F939 exposed exposed F940 exposed buried (62.3 %) Table S1 . List of mutations identified in GISAID in at least 2 identical sequences on April 21 st 2020, in sequential order. A pneumonia outbreak associated with a new coronavirus of probable bat origin A new coronavirus associated with human respiratory disease in China Clinical features of patients infected with 2019 novel coronavirus in Wuhan Receptor recognition mechanisms of coronaviruses: a decade of structural studies Activation of the SARS coronavirus spike protein via sequential proteolytic cleavage at two distinct sites Host cell entry of Middle East respiratory syndrome coronavirus after two-step, furin-mediated activation of the spike protein Cell entry mechanisms of SARS-CoV-2 SARS-CoV-2 SPIKE PROTEIN: an optimal immunological target for vaccines The spike protein of SARS-CoV--a target for vaccine and therapeutic development Key residues of the receptor binding motif in the spike protein of SARS-CoV-2 that interact with ACE2 and neutralizing antibodies Potent binding of 2019 novel coronavirus spike protein by a SARS coronavirus-specific human monoclonal antibody 2020) A human monoclonal antibody blocking SARS-CoV-2 infection Candidate drugs against SARS-CoV-2 and COVID-19 SARS-CoV-2 Vaccines: Status Report. Immunity Ready, set, fuse! The coronavirus spike protein and acquisition of fusion competence Structure of SARS coronavirus spike receptor-binding domain complexed with receptor Structure of MERS-CoV spike receptor-binding domain complexed with human receptor DPP4 Tectonic conformational changes of a coronavirus spike glycoprotein promote membrane fusion 2017) Data, disease and diplomacy: GISAID's innovative contribution to global health GISAID: Global initiative on sharing all influenza data -from vision to reality The Protein Data Bank Cryo-EM structure of the 2019-nCoV spike in the prefusion conformation 2020) Structure, Function, and Antigenicity of the SARS-CoV-2 Spike Glycoprotein Structure of the SARS-CoV-2 spike receptorbinding domain bound to the ACE2 receptor Structural basis for the recognition of SARS-CoV-2 by full-length human ACE2 Structural and Functional Basis of SARS-CoV-2 Entry by Using Human ACE2 2020) Structural basis of receptor recognition by SARS-CoV-2 A highly conserved cryptic epitope in the receptor binding domains of SARS-CoV-2 and SARS-CoV Inhibition of SARS-CoV-2 (previously 2019-nCoV) infection by a highly potent pan-coronavirus fusion inhibitor targeting its spike protein that harbors a high capacity to mediate membrane fusion Identification of a novel coronavirus causing severe pneumonia in human: a descriptive study Fusion mechanism of 2019-nCoV and fusion inhibitors targeting HR1 domain in spike protein CD-HIT: accelerated for clustering the next-generation sequencing data Comparative protein modelling by satisfaction of spatial restraints Modeling mutations in protein structures COCOMAPS: a web application to analyze and visualize contacts at the interface of biomolecular complexes Could the D614 G substitution in the SARS-CoV-2 spike (S) protein be associated with higher COVID-19 mortality? Spike mutation pipeline reveals the emergence of a more transmissible form of SARS-CoV-2. bioRxiv Structure validation by Calpha geometry: phi,psi and Cbeta deviation Influence of proline residues on protein conformation pH-induced denaturation of proteins: a single salt bridge contributes 3-5 kcal/mol to the free energy of folding of T4 lysozyme Authors declare no competing interests.