key: cord-1007053-sry6zhil authors: Teruel, Natália; Mailhot, Olivier; Najmanovich, Rafael Josef title: Modelling conformational state dynamics and its role on infection for SARS-CoV-2 Spike protein variants date: 2021-01-11 journal: bioRxiv DOI: 10.1101/2020.12.16.423118 sha: 2839b3ac70eee8c4df302556f832eb6b1772c751 doc_id: 1007053 cord_uid: sry6zhil The SARS-CoV-2 Spike protein needs to be in an open-state conformation to interact with ACE2 as part of the viral entry mechanism. We utilise coarse-grained normal-mode analyses to model the dynamics of Spike and calculate transition probabilities between states for 17081 Spike variants. Our results correctly model an increase in open-state occupancy for the more infectious D614G via an increase in flexibility of the closed-state and decrease of flexibility of the open-state. We predict the same effect for several mutations on Glycine residues (404, 416, 504, 252) as well as residues K417, D467 and N501, including the N501Y mutation, explaining the higher infectivity of the B.1.1.7 and 501.V2 strains. This is, to our knowledge, the first use of normal-mode analysis to model conformational state transitions and the effect of mutations thereon. The specific mutations of Spike identified here may guide future studies to increase our understanding of SARS-CoV-2 infection mechanisms and guide public health in their surveillance efforts. The coronavirus pandemic has emerged as a major and urgent issue affecting individuals, families and societies as a whole. Among all outbreaks of aerosol transmissible diseases in the 21st century, the COVID-19 pandemic, caused by the Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) virus [1, 2] , has the highest infection and death cumulative numbers -83 million infections and over 1.8 million deaths, according to the World Health Organization (WHO) epidemiological report of January 5, 2021 [3] . Recent WHO reports also show significant weekly increases in the number of infections and deaths as countries start to face upcoming waves of the disease. In 2003 the SARS coronavirus (SARS-CoV) pandemic caused 8,098 infections and 774 deaths before it was brought under control [4, 5] . In 2012, the Middle East respiratory syndrome-related coronavirus (MERS-CoV) outbreak caused 2499 infections and 858 deaths, presenting the highest fatality rate [6] . SARS-CoV-2, SARS-CoV and MERS-CoV, as coronaviruses in general, present considerable mutation rates, which may contribute to future outbreaks. For instance, SARS-CoV-2 is estimated to have a mutation rate close to the ones presented by MERS-CoV [7] and by SARS-CoV [8] , as well as other RNA viruses, showing a median of 1.12 × 10 −3 mutations per site per year [9] . The high mutation rate may in part be responsible for the zoonotic nature of these viruses and points to a clear risk of still-undetected additional members of the coronavirideae family of viruses making the jump from their traditional hosts to humans in the future. The SARS-CoV-2 Spike protein (Uniprot ID P0DTC2) is responsible for anchoring the virus to the host cell. The entry receptor for SARS-CoV-2 and other lineages of human coronaviruses is the human cell-surface protein angiotensin converting enzyme 2 (ACE2) (Uniprot ID Q9BYF1) [10] . Therefore, studying the Spike protein family is essential to understand the evolution of coronaviruses. SARS-CoV-2 Spike is a homo-trimeric glycoprotein, with each chain built by subunits S1 and S2, delimited by a furin cleavage site at residues 682-685. The S1 subunit comprises the N-terminal Domain [11] . The study of the binding properties between Spike and ACE2, although important, cannot explain all the nuances of the infection mechanism. An example of this limitation is the comparison between SARS-CoV and SARS-CoV-2, which have different rates of infection even though they share similar Spike-ACE2 affinities [12] . These facts lead us to consider the contribution of the dynamics of the Spike protein to the infection process. Computational structural biology methods have grown in both accuracy and usability over the years and are increasingly accepted as part of an integrated approach to tackle problems in molecular biology. Such integration permits to speed up research, decrease needs in infrastructure, reagents, and human resources and allows us to evaluate increasingly larger data sets. Computational approaches are being extensively used in the study of SARS-CoV-2 and its mechanisms of infection [13] [14] [15] . Among these, we highlight the study of dynamic properties of the Spike protein as well as in antibody recognition and the search for therapeutic interventions [16] [17] [18] . Several aspects of the dynamics of the Spike protein are being currently studied, with a range of particular goals: to evaluate the docking of small molecules to the RBD domain [19] , to search for alternative target binding-sites for vaccine development [20] , to understand residue-residue interactions and their effects on conformational plasticity [21] and to investigate the flexibility of different domains in particular conformational states [22] . Normal Mode Analysis (NMA) methods are being employed in the study of different conformational states [23] and of different coronavirus variants [24] . These methods, however, are limited with respect to their ability to study the effects of mutations on dynamics due to the fact that such methods are either extremely taxing on computational resources (e.g., molecular dynamics) or agnostic to the nature of amino acids (e.g., traditional coarse-grained NMA methods). In the past, our group developed a coarsegrained NMA method called ENCoM (for Elastic Network Contact Model) that is more accurate than alternative coarse-grained NMA methods due to the explicit consideration of the chemical nature of amino acids and their interactions and consequently their effect on dynamics [25] . ENCoM performs better than other NMA methods on traditional applications and is the only coarse-grained NMA method capable of predicting the effect of mutations on protein stability and function as a result of dynamic properties [26] [27] [28] . In this study, we use the ENCoM method to study the dynamics of the Spike protein, considering different conformational states and several sequence variants observed during the current pandemic, as well as through large-scale analysis of in silico mutations. Experimental analysis of the effect of the SARS-CoV-2 Spike mutation D614G and the comparison between SARS-CoV and SARS-CoV-2 Spike proteins show unique dynamic characteristics that correlate with epidemiological and experimental data on infection. The present work shows that we can replicate such results computationally, suggesting that rigidity or flexibility of different Spike conformational states affects infectivity. We present a high throughput analysis of simulated single amino acid mutations on dynamic properties to seek potential hotspots and individual Spike variants that may be more infectious and therefore may guide public health decisions if such variants were to appear in the population. We also introduce a Markov model of occupancy of molecular states with transition probabilities derived from our analysis of dynamics that recapitulates experimental data on conformational state occupancies. This is the first application of an NMA method that derives transition probabilities from normal modes and employs them in a dynamic system to predict the occupancy of different conformational states. We model the occupancy of several variants and highlight those that may be useful in studying future epidemiological trends that can be responsible for new outbreaks. We performed our analyses using the crystallographic models of the SARS-CoV-2 Spike protein in the open (PDB ID 6VYB) and closed (PDB ID 6VXX) states. The open (prefusion) state was designed with an abrogated Furin S1/S2 cleavage site and two consecutive proline mutations that improve expression [29] . Despite the mutations, the engineered structures correctly represent the conformational states of Spike, as confirmed by independently solved structures [23, 30] . The PDB structures used for the SARS-CoV comparison were 5X58 and 5X5B for closed state and one RBD open state, respectively [31] . We removed heteroatoms, water molecules, and hydrogen atoms from the PDB structures. Missing residues were reconstructed using template-based loop reconstruction and refinement with Modeller [32] . Single amino acid mutants were generated using FoldX4 [33] . ΔΔSvib and occupancy calculations were We analysed dynamic properties of the Spike protein with ENCoM [25] . ENCoM employs a potential energy function that includes a pairwise atom-type non-bonded interaction term and thus makes it possible to consider the effect of the specific nature of amino-acids on dynamics. NMA explores protein vibrations around an equilibrium conformation by calculating the eigenvectors and eigenvalues associated with different normal modes [35] [36] [37] . Representing each protein residue as a single point, for a given conformation of a protein with N amino acids, we obtain 3N -6 nontrivial eigenvectors. Each eigenvector represents a linear, harmonic motion of the entire protein in which each amino acid moves along a unique 3-dimensional Euclidean vector. The associated eigenvalues rank the eigenvectors in terms of energetic accessibility, lower values corresponding to global, more easily accessible motions. NMA calculations allow us to computationally estimate b-factors associated with the protein structure, as shown in Equation 1 for the i th residue, which in turn are related to local flexibility. Higher predicted b-factors denote more flexible positions. Individually calculated b-factors are combined in a vector for a protein sequence or part thereof and called Dynamic Signature. The eigenvectors and associated eigenvalues can also be used to obtain a variable called vibrational entropy that can be used to compare the relative stability of two states. For example, by measuring the individual mutations according to a single score. Vibrational Entropy calculations are dependent on the thermodynamic b factor, that for pseudo-physical models such as ENCoM serves as a scaling factor. This term was optimized to fit experimental Gibbs free energy differences [38] and established as b = 1. The vibrational contribution of the entropic components of the free energy is calculated as described in Eq. 2 [39] in units of J.K -1 , where N is the total number of amino acids in the protein, vi is the vibrational frequency and KB is the Boltzmann constant. Equation 3 shows the association between eigenvalues and vibrational frequency. The Najmanovich Research Group Toolkit for Elastic Networks (NRGTEN) [38] , with the latest implementation of ENCoM, also includes a function to evaluate state occupancies by calculating transition probabilities between different states. A probability Pj of moving along each eigenvector j can be obtained using a Boltzmann distribution given its associated eigenvalue and a scaling factor γ. Let's consider two conformations A and B of the same protein and the vector EA®B, which represents the conformational change going from conformation A to conformation B. The overlap between each normal mode Mj computed from conformation A and the EA®B vector is a value between 0 and 1 describing how well that normal mode recapitulates the conformational change required to go from one state to the other [40] . We can then calculate the transition probability of going from conformation A to conformation B as the weighted sum of the Boltzmann probability Pj of each normal mode Mj times the overlap between that normal mode and the conformational change EA®B. The reverse probability PB®A can be computed in the same fashion, giving an indication of which conformation is favored between the two. A simple way of computing the occupancies of these conformations from the transition probabilities is to use a Markov model. Each conformation is represented by a state, and the transition probabilities between states are computed as described above. We add a constant k to all states as the probability of staying in that state. Since all states must have outgoing transition probabilities that sum to 1, we normalize these values after the addition of k. For a two-state Markov chain representing the open and closed states of the Spike protein, we obtain the diagram shown in Figure 2 . All transition probabilities are computed using ENCoM and Eq. 6. The parameters k and γ need to be optimized for the system being studied as they are not directly coupled to physical quantities because of the pseudo-physical, coarsegrained nature of the ENCoM model. Once the parameters are set, there is a unique equilibrium solution that gives the occupancies of the two states. This approach could be easily generalized to a Markov model with more than two states, where the transition between any two states is computed exactly as described above if that transition is deemed possible. Studies revealed that the mutation has indeed greater infectivity, triggering higher viral loads [43, 44] . Several hypotheses have emerged to explain the mechanisms behind this higher infectivity primarily focused on possible effects on the Furin cleavage site [30, 45, 46] , but recently also considering possible important dynamic differences [44, 47, 48] . In order to test if Dynamic Signatures reveal differences between Spike variants, we analysed the 13741 sequences of the protein available on May 08 in the COVID-19 Viral Genome Analysis Pipeline, enabled by data from GISAID [49, 50] . The mutant Spike proteins harboring mutations (Table S1) (Table S1 ). Analysis of the effect of mutations on the Dynamic Signature show that the D614G mutation produces similar dynamic patterns largely independent of the other mutations accumulated, and dynamic patterns that are distinct from that of the wild type and other mutants on both the open and closed states. The dynamic characteristics of D614G are very specific and cannot be obtained with random mutations ( Figure S1 , Table S2 ). Performing the clustering using segments of the Dynamic Signature representing in modelling or the fact that they are not expected to have a pronounced effect on dynamic [23] . It should be noted that Spike cannot accommodate the vast majority of such single mutations, particularly in its core as these would lead to unstable or misfolded conformations. However, those that occur near the surface are more likely to represent single residue variations of the Spike protein that lead to a stable, correctly folded protein. Therefore, the stability of specific mutations highlighted in this work, unless otherwise stated (such as those already observed experimentally or within the RBD domain as stated below), needs to be validated experimentally. The heatmap in Figure 6A shows In Figure 7 we map ΔΔSvib values ( Figure 6B ) on the structure of Spike, colored according to the median value for each position with the same color scheme as the heatmap. From the 17081 single mutations considered, we show the top 64 mutants with with ΔΔSvib>0.3 (Tables 1 and S3 ) as well as the bottom 20 in terms of ΔΔSvib values (Table S3 ). The mutants with predicted open state occupancy higher than that of the wild type are presented in Table 1 . The Dynamic Signature comparison for 3 of those most infectious candidates ( Figure 8A ) and 3 of the least infectious candidates ( Figure 8B) shows some of the patterns that could lead to a greater or lesser effect on infectivity. For instance, in Figure 8A we can see that high scores can come from a large flexibility of the closed state, a very large rigidity of the open state, or have the contribution of both. We can also observe that these effects can be different in each chain and can affect more the NTD, the RBD, or both. Finally, these single mutants also show how a point mutation can have widespread impacts across the protein. [44] observed an occupancy for states with two or three RBD domains in the open conformation, but these were not observed by Gobeil et al. [30] and Xiong et al. [34] or taken into consideration in several other structural studies [20] [21] [22] [23] [24] . As such, we employ the two-state model shown in Figure 2 Table 2 We utilized this data to calculate occupancy differences for each variant (Figure 9 ). The range of variation of our predicted occupancies is small compared to that of experimental values. We believe that given the limitations of our coarse-grained model as well as additional phenomena that ultimately affect occupancy, our predictions reflect only a fraction of the myriad of factors contributing to the occupancy. Nonetheless, our predictions correctly capture the pattern of relative variations of occupancy observed in the experimental data. To ensure that the calculated correlation is not due to chance, we simulated random sets of occupancies for the 6 sequence variants and calculated simulated correlations for the 110 different combinations of k and g to determine if the observed correlations represent an actual signal in the data or could be randomly obtained with different values for the parameters k and g. We observed a marked shift with higher correlations for the data representing our predicted occupancies when compared to the gaussian noise data ( Figure S3 ), suggesting that the predicted occupancies are not due to chance. Figure 10A mutants predicted to shift the equilibrium towards the closed state (p-value=2.04x10 -6 ). Figure 10B shows the location in the structure of the mutants in Table 1 . We can see that the least infectious candidates (blue) are positioned in the interfaces between NTD and RBD domains, while the most infectious candidates, especially the ones validated by the occupancy prediction (dark red), are more concentrated in the interfaces between different RBD domains. Therefore, the N501Y mutant shows a marked increase of the occupancy of the open state relative to other mutations. Additionally, this mutation was shown to also increase binding affinity to the ACE2 receptor relative to the wild type with a Δlog10(KD,app) of 0.24 [52] . Therefore, we predict that N501Y has a strong potential to contribute to increased transmission. Notice that N501Y has a ΔΔSvib value of 2.53x10 -1 J.K -1 that is slightly below the 3.00x10 -1 J.K -1 threshold, suggesting that there may be many other mutations with ΔΔSvib values below our set threshold that turn out to have augmented occupancies for the open state relative to the wild type. D614G shows that changes in the occupancy of conformational states can impact infectivity despite no changes or even weaker binding affinities [44] . A recent study [52] on binding and expression of Spike mutations within the RBD domain (positions 331 to 531) shows that several (but not all, see below) of the mutations that we predicted to have increased occupancy of the open state are associated with a decrease of binding affinity with ACE2. Incidentally, the data also shows that the mutations in Table 1 within the RBD produce stable and properly folded Spike proteins. As shown for D614G, infection does not rely on binding affinity alone, and even a strain with higher dissociation rates from ACE2 can bring about fitness advantages. The mutation N501W is predicted to have the largest effect in augmenting the occupancy of the open state relative to the wild type. This mutation is associated with stronger binding to ACE2 (Δlog10(KD,app)=0.11) [52] relative to the wild type Spike (but lower than N501Y). Furthermore, N501W appears to have increased expression relative to the wild type with a Δlog(MFI) of 0.1 compared to decrease in relative expression of -0.14 for N501Y [52] . The authors note that changes in expression correlate with folding stability [52] . However, even with a Δlog(MFI) of -0.14, N501Y is viable and spreading. Therefore, N501W might be even more stable and infective. We consider all mutations with increased predicted occupancy of the open state in Table 1 ΔΔSvib values, one of them being zero (corresponding to the amino acid found in the wild type). The option max will show the top ΔΔSvib score for each position. Therefore, it shows which mutation for that specific position represents the candidate with the highest predicted infectivity as defined here in terms of a propensity to higher occupancy of the open state. The option min will show the lowest score for each position and the mutation associated with the least predicted infectious candidate. The option median returns the median score, presenting a general trend for any given position, and var shows the variance between the results for each position, highlighting sites in which mutations to different residues lead to a broader range of ΔΔSvib values. Furthermore, for the mutations for which occupancy was calculated, the data can be accessed through the same menu. As new occupancy data is calculated, it will be added SARS-CoV-2 mutations are still arising and spreading around the world. The A222V mutation, reportedly responsible for many infections, emerged in Spain during the Summer of 2020 and since then has spread to neighbor countries [53] ; In Denmark, new strains related to SARS-CoV-2 transmission in mink farms were confirmed in early October by the WHO and shown to be caused by specific mutations not previously observed with the novelty of back-and-forth transmission between minks and humans [58] . A new strain containing N501Y first appeared in the UK and is now on the rise worldwide at the time of writing. Such occurrences point to the possibility that new mutations in SARS-CoV-2 may bring about more infectious strains. Using the methods described in this paper, it is possible to predict potential variants that might have an advantage over the wild type virus insofar as these are the result of changes in occupancy of states and with the limitations of the simplified coarse-grained model employed here. In our analyses, flexibility properties and conformational state occupancy probabilities contribute to the infectivity of a SARS-CoV-2. Our results explain the behaviour of the D614G strain, the increased infectivity of SARS-CoV-2 relative to SARS-CoV as well as offers a possible explanation for the rise of new strains such as those harboring the N501Y mutation. The results we present on SARS-CoV-2 Spike mutations have several limitations. First and foremost, some of the in silico mutation discussed may not be thermodynamically stable, may affect expression, cleavage, or binding to ACE2, and our approach does not consider that Spike is, in fact, a glycoprotein and the sugar molecules may have an effect on dynamics. However, the remarkable agreement between our model and experimental observations shows that the simplified model of Spike and the coarsegrained methods used here allow us to calculate dynamic properties of Spike that are relevant to understand infection and epidemiological behavior. It is important to keep in mind that all of the mutations that we discuss in Table 1 that lay within positions 331 and 531 within the RBD domain were already experimentally validated and are viable [52] . However, we highlight the need for experimental validation of our predictions particularly for those candidates that we believe would help elucidate the extent of the effect of the conformational dynamics of Spike on infectivity. Beyond in vitro biophysical studies, experimental alternatives exist such as using pseudo-type viruses or virus-like-particles that would not require studying gain-of-function mutations using intact viruses. Alternatively, loss-offunction mutations can be created with intact viruses and compared to the wild type SARS-CoV-2 virus to validate the role of dynamics on infectivity. To the best of our knowledge, this is the first time that a Normal Mode Analysis method is used to Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding A pneumonia outbreak associated with a new coronavirus of probable bat origin Review of bats and SARS Middle East respiratory syndrome Middle East respiratory syndrome coronavirus in dromedary camels: an outbreak investigation Moderate mutation rate in the SARS coronavirus genome and its implications Variant analysis of SARS-CoV-2 genomes Functional assessment of cell entry and receptor usage for SARS-CoV-2 and other lineage B betacoronaviruses. Nat Microbiol Structural basis for the recognition of the SARS-CoV-2 by full-length human ACE2 Cell entry mechanisms of SARS-CoV-2 Structure-based virtual screening and molecular dynamics simulation of SARS-CoV-2 Guanine-N7 methyltransferase (nsp14) for identifying antiviral inhibitors against COVID-19 Dynamics of the ACE2-SARS-CoV-2/SARS-CoV spike protein interface reveal unique mechanisms. Sci Rep SARS-CoV-2 Main Protease: A Molecular Dynamics Study Cross-neutralization of SARS-CoV-2 by a human monoclonal SARS-CoV antibody Isolation of potent SARS-CoV-2 neutralizing antibodies and protection from disease in a small animal model Potent Neutralizing Antibodies against SARS-CoV-2 Identified by High-Throughput Single-Cell Sequencing of Convalescent Patients' B Cells Supervised molecular dynamics for exploring the druggability of the SARS-CoV-2 spike protein Fighting COVID-19 Using Molecular Dynamics Simulations. ACS central science A graph-based approach identifies dynamic H-bond communication networks in spike protein S of SARS-CoV-2 Continuous flexibility analysis of SARS-CoV-2 spike prefusion structures Molecular Simulations and Network Modeling Reveal an Allosteric Signaling in the SARS-CoV-2 Spike Proteins Exploring the intrinsic dynamics of SARS-CoV-2, SARS-CoV and MERS-CoV spike glycoprotein through normal mode analysis using anisotropic network model A coarse-grained elastic network atom contact model and its use in the simulation of protein dynamics and the prediction of the effect of mutations ENCoM server: exploring protein conformational space and the effect of mutations on protein function and stability Applications of Normal Mode Analysis Methods in Computational Protein Design Vibrational entropy differences between mesophile and thermophile proteins and their use in protein engineering Structure, Function, and Antigenicity of the SARS-CoV-2 Spike Glycoprotein D614G mutation alters SARS-CoV-2 spike conformational dynamics and protease cleavage susceptibility at the S1/S2 junction. bioRxiv. Cold Spring Harbor Laboratory Cryo-EM structures of MERS-CoV and SARS-CoV spike glycoproteins reveal the dynamic receptor binding domains Comparative protein modelling by satisfaction of spatial restraints The FoldX web server: an online force field A thermostable, closed SARS-CoV-2 spike protein trimer Normal mode analysis as a method to derive protein dynamics information from the Protein Data Bank Long-range correlation in protein dynamics: Confirmation by structural data and normal mode analysis Normal Mode Analysis The NRGTEN Python package: an extensible toolkit for coarse-grained normal mode analysis of proteins, nucleic acids, small molecules and their complexes Fast and accurate computation schemes for evaluating vibrational entropy of proteins Hinge-bending motion in citrate synthase arising from normal mode calculations Evaluating the Effects of SARS-CoV-2 Spike Mutation D614G on Transmissibility and Pathogenicity The Impact of Mutations in SARS-CoV-2 Spike on Viral Infectivity and Antigenicity Tracking Changes in SARS-CoV-2 Spike: Evidence that D614G Increases Infectivity of the COVID-19 Virus Structural and Functional Analysis of the D614G SARS-CoV-2 Spike Protein Variant Higher binding affinity of Furin to SARS-CoV-2 spike (S) protein D614G could be associated with higher SARS-CoV-2 infectivity The SARS-CoV-2 Spike Protein D614G Mutation Shows Increasing Dominance and May Confer a Structural Advantage to the Furin Cleavage Domain The D614G mutation in the SARS-CoV-2 spike protein reduces S1 shedding and increases infectivity. bioRxiv. Cold Spring Harbor Laboratory The SARS-CoV-2 spike protein: balancing stability and infectivity Global initiative on sharing all influenza data -from vision to reality disease and diplomacy: GISAID's innovative contribution to global health Stabilized coronavirus spikes are resistant to conformational changes induced by receptor recognition or proteolysis. Sci Rep Deep Mutational Scanning of SARS-CoV-2 Receptor Binding Domain Reveals Constraints on Folding and ACE2 Binding Emergence and spread of a SARS-CoV-2 variant through Europe in the summer of 2020. medRxiv Case Study: Prolonged Infectious SARS-CoV-2 Shedding from an Asymptomatic Immunocompromised Individual with Cancer Emergence and rapid spread of a new severe acute respiratory syndrome-related coronavirus 2 (SARS-CoV-2) lineage with multiple spike mutations in South Africa. medRxiv Comprehensive mapping of mutations to the SARS-CoV-2 receptor-binding domain that affect recognition by polyclonal human serum antibodies. bioRxiv. Cold Spring Harbor Laboratory dms-view: Interactive visualization tool for deep mutational scanning data Transmission of SARS-CoV-2 on mink farms between humans and mink and back to humans