key: cord-103709-86hv27vh
authors: Zhang, Dong Yan; Wang, Jian; Dokholyan, Nikolay V.
title: Prefusion spike protein stabilization through computational mutagenesis
date: 2020-06-19
journal: bioRxiv
DOI: 10.1101/2020.06.17.157081
sha: 
doc_id: 103709
cord_uid: 86hv27vh

A novel severe acute respiratory syndrome (SARS)-like coronavirus (SARS-CoV-2) has emerged as a human pathogen, causing global pandemic and resulting in over 400,000 deaths worldwide. The surface spike protein of SARS-CoV-2 mediates the process of coronavirus entry into human cells by binding angiotensin-converting enzyme 2 (ACE2). Due to the critical role in viral-host interaction and the exposure of spike protein, it has been a focus of most vaccines’ developments. However, the structural and biochemical studies of the spike protein are challenging because it is thermodynamically metastable1. Here, we develop a new pipeline that automatically identifies mutants that thermodynamically stabilize the spike protein. Our pipeline integrates bioinformatics analysis of conserved residues, motion dynamics from molecular dynamics simulations, and other structural analysis to identify residues that significantly contribute to the thermodynamic stability of the spike protein. We then utilize our previously developed protein design tool, Eris, to predict thermodynamically stabilizing mutations in proteins. We validate the ability of our pipeline to identify protein stabilization mutants through known prefusion spike protein mutants. We finally utilize the pipeline to identify new prefusion spike protein stabilization mutants.

The ongoing outbreak of the novel coronavirus [2] [3] [4] , which causes fever, severe respiratory illness, and pneumonia, poses a major public health and governance challenges. The emerging pathogen has been characterized as a new member of the betacoronavirus genus (SARS-CoV-2) 5,6 , closely related to several bat coronaviruses and to severe acute respiratory syndrome coronavirus (SARS-CoV) 7, 8 . Compared with SARS-COV, SARS-CoV-2 appears to be more readily transmitted from human to human, spreading to multiple continents and leading to the World Health Organization (WHO)'s declaration of a pandemic on March 11, 2020 9 . According to the WHO, as of June 9, 2020, there had been >7,000,000 confirmed cases globally, leading to >400,000 deaths.

In the initial step of the infection, the coronavirus enters the host cell by binding to a cellular receptor and fusing the viral membrane with the target cell membrane 10, 11 . For SARS-CoV-2, this process is mediated by spike glycoprotein which is a homo-trimer 12 . In the process of viruses fusing to host cells, the spike protein undergoes structural rearrangement and transits from a metastable prefusion conformational state to a highly stable post-fusion conformational state 13, 14 . The spike protein comprises of two functional units, S1 and S2 subunits; when fused to the host cell, the two subunits are cleaved. The S1 subunit is responsible for binding to the angiotensin-converting enzyme 2 (ACE2) [15] [16] [17] receptor on the host cell membrane and it contains the N-terminal domain (NTD), the receptor-binding domain (RBD) and the C-terminal domain (CTD). NTD in the S1 subunit assists recognize sugar receptors. RBD in the S1 subunit is critical for the binding of coronavirus to the ACE-2 receptor [18] [19] [20] [21] . CTD in the S1 subunit could recognize other receptors 22 . The binding of RBD to ACE2 facilitates the cleavage of the spike protein and promotes the dissociation of the S1 subunit from the S2 subunit 23 . S2 contains two heptad repeats (HR1 and HR2), a fusion peptide, and a protease cleavage site (S2'). The dissociation of S1 induces S2 to undergo a dramatic structural change to fuse the host and viral membranes. Thus, the spike protein serves as a target for development of antibodies, entry inhibitors and vaccines 24 . Coronavirus transits from a metastable prefusion state to a highly stable post-fusion state as part of the spike protein's role in membrane fusion. The instability of the prefusion state presents a significant challenge for the production of protein antigens for antigenic presentation of the prefusion antibody epitopes that are most likely to lead to neutralizing responses. Thus, since the prefusion spike protein exists in a thermodynamically metastable state 1 , a stabilized mutant conformation is critical for the development of vaccines and drugs.

Computational mutagenesis is an effective approach to finding mutations that are able to stabilize proteins. We have previously developed a protein design platform, Eris 25, 26 , which utilizes a physical force field 27 for modeling inter-atomic interactions, as well as fast side-chain packing and backbone relaxation algorithms to enable efficient and transferrable protein molecular design. Originally, Eris has been validated on 595 mutants from five proteins, corroborating the unbiased force field, side-chain packing and backbone relaxation algorithms. In many later studies, Eris has been validated through prediction of thermodynamically stabilizing or destabilizing mutations [28] [29] [30] [31] [32] [33] , and direct protein design efforts [34] [35] [36] [37] .

In this work, we propose a pipeline to automatically stabilize spike proteins through computational mutagenesis. Within the pipeline, we first analyze the conservation score and solvent accessible surface area (SASA) of residues in the protein. We then perform discrete molecular dynamics (DMD) [38] [39] [40] [41] simulations to calculate the root mean square fluctuation (RMSF) of residues to analyze their flexibility. Based on this information, we select appropriate residues (2 < Conservation Score < 5; SASA > 0.4; RMSF > 3.5 Å; see Discussion) as mutation sites. We subject the selected residues to computational redesign using Eris to find the stabilizing mutations by calculating the change in free energy ∆∆ = ∆ − ∆ , where ∆ and ∆ are the free energies of the mutant protein and wild type proteins correspondingly. We utilize this pipeline to identify stabilization mutants of the spike protein. Next, we describe our methods in detail and provide a list of stabilizing mutations for spike protein. 

We propose a pipeline to automatically find stabilized mutants of proteins ( Figure   1 ). The pipeline can be divided into two stages. In the first stage, users designate the protein of interest, and then the pipeline will analyze the 3D structure of the protein by using different metrics. In the second stage, users designate the mutation sites according to the analysis of the 3D structure of the protein, and finally the pipeline will determine the stabilizing mutations of the protein. Users can either upload the 3D structure of the protein or input the PDB ID of the protein to designate the protein of interest. In the first stage, the first step is to remodel the 3D structure of the protein of interest to complete the missing atoms and residues.

We integrate MODELLER into our pipeline to remodel proteins. Next, the pipeline utilizes ConSurf 48 to calculate the conservation score of each residue in the protein of interest. The conservation score indicates the importance of the residue in maintaining protein structure and/or function. Subsequently, the pipeline utilizes DMD to analyze the flexibility of each residue in the spike protein through RMSF.

The technique has already been used to efficiently study the protein folding thermodynamics and protein oligomerization and allows for a good equilibration of the structures. Then, the pipeline will calculate SASA of residues in the protein.

In the second stage, users designate the mutation sites according to the conservation score, RMSF, and SASA. A high conservation score (≥ 7) indicates the residue may play important roles in the function or the stability of the structure of the protein; residues of high RMSF (> 3.5 Å) are likely the culprit to undermine the stability of the structure of the protein, hence we select residues that have a low conservation score or high RMSF; residues with SASA < 0.5 are considered buried and residues with SASA ≥ 0.5 are considered exposed to solvent. After the designation of the mutation sites, the pipeline utilizes Eris to determine the changes in free energies of the mutants. For each residue in the mutation sites, 

We utilize MODELLER 42 to re-model the 3D structure of the spike protein by using a structure deposited to Protein DataBank (PDB), PDBID:6VSB 12 , the cryo-EM structure of the prefusion state of the spike protein, as the template structure to complete the missing atoms and residues ( Figure 2A&B ). Next, we use Chiron 44 All following computational mutagenesis study are performed using the structure with the 3 RBDs in down conformation.

To validate the ability of the pipeline to identify stabilization mutants, we use the pipeline to calculate the free energy changes of several known prefusion spike protein stabilization mutants. The 2P mutation strategy (K986P and V987P) has been proved effective for the stabilization of spike protein of SARS-COV-2 and other betacoronavirus 12,20,50 . Hsieh and coworkers 51 

The spike protein is mainly composed of S1 and S2 subunits ( Figure S1 ). We select residues for mutation from the NTD and RBD domains in S1, and we also select residues from the HR1 (heptad repeat 1) and CH (central helix) domains in S2. We don't select residues from HR2 domain in S2 because the structure of the HR2 domain has not been solved in the cryo-EM structure 6VSB. At the outset, we calculate the conservation score ( Figure 3A&D ) of all residues by using ConSurf.

Based on the conservation score, most residues in HR1/CH are conservative, while residues in NTD and RBD are prone to mutation in evolution. Next, we use Pymol 47 to calculate SASA of all residues in the spike protein ( Figure 3B&E) . SASA indicates the level of residues exposed to the solvent in a protein and usually most of the functional residues are located on the protein structure's surface 52 . All four domains have both low SASA residues and high SASA residues. Then, we perform 1,000,000 steps DMD simulation for the spike protein to calculate the RMSF of each residue ( Figure 3C&F ). The residues in HR1/CH have extremely low RMSF, while the residues in NTD and RBD domains have moderate to high RMSF.

We select residues that have different conservation scores in these four domains for mutagenesis. In the NTD, we select 5 residues (Table S1 ) with the conservation score ranging from 1 (highly variable) to 9 (highly conservative). The SASA of these residues range from 0.01 (buried) to 0.75 (exposed). The RMSF range from 1.61 Å (frozen) to 4.88 Å (flexible). Likewise, in the other three domains (Table S2- 3), we select 5 to 7 residues, respectively. These residues also have diverse conservation scores, SASA, and RMSF. Of note, to avoid affecting the function of the spike protein, these residues are all not chosen from the functional sites of the spike protein, such as the ACE2 binding site in RBD.

We utilize Eris to calculate the free energy changes of mutants relative to the wild type ( Figure 4A , Figure S2&3 , and Table S4-6). In the NTD, 5 residues, E169, K113, I203, R246, and L270 are selected for mutagenesis. Among them, the free energy changes of nearly all mutations of residues I203, R246, and L270 are positive, indicating that they are destabilizing the structure. In contrast, most mutations on residues E169 and K113 have negative free energy changes, suggesting that they are stabilizing the structure. The mutant that has the most negative free energy change is R246C. However, cysteine is prone to forming disulfide bond with other cysteine, which may affect the correct folding of the protein structure, so we recommend K113M to be a better choice as the stabilization mutant.

In the RBD, 5 residues (A411, T415, Y505, N439, and D428) (Table S2) are selected for mutagenesis. The free energy changes calculated by Eris ( Figure S2 , In the HR1/CH domain, 7 residues (L948, I1018, A1026, Y1007, S1003, T961, and V976) are selected for mutagenesis as shown in Table S3 . In stark contrast to NTD and RBD, most residues have extremely high free energy changes (> 30 kcal/mol),

suggesting that these residues are not very good choices for mutagenesis. The high free energy changes also implicate that they may play important roles in stabilizing the structure so that they are irreplaceable to some extent. This finding is also in concert with the high conservation scores of these residues. That said, we can still find stabilization mutants for these residues, such as T961A and S1003M ( Figure S3 ).

Compared to experimental mutagenesis, such as random mutagenesis 53, 54 and site-directed mutagenesis 55, 56 , computational mutagenesis 25 is an efficient alternative that lays the foundation of large-scale mutation screening. However, performing computational mutation screening for all residues in the spike protein trimeric structure, which consists of 1288 residues in each monomer, is still timeconsuming and inefficient, so we seek to select critical residues to perform mutagenesis. In this work, to interrogate how to select residues as mutation sites, we select residues that have different conservation scores, SASA, and RMSF. We find that the probability of stabilizing mutations for specific residues is correlated with the conservation score, SASA, and RMSF of these residues ( Figure 4B -D).

We calculate the average and the minimum free energy change of all 19 mutations of each residue. We find that the average free energy change of mutations of residues with either high (>5) or low (<2) conservation score are typically larger than 0. Only mutations in residues that have moderate (2~5) conservation score have negative average free energy change, thus indicating a possibility to find stabilizing mutations. We posit that conservative residues are typically playing critical roles in structural stability or protein functioning 57 , making them a likely target for finding stabilizing mutations. However, if the conservation score of the residue is too high, the residue may be irreplaceable and any mutation will destabilize the structure. On the other hand, residues with low conservation scores may be less critical to the structural stability, reducing the chances for finding stabilizing mutations at these positions. Thus, we select residues that have moderate conservation score (2~5) for mutagenesis. Similarly, we find that residues with large SASA or large RMSF have more stabilization mutants than residues with low SASA or low RMSF, respectively. Overall, we select residues that have moderate conservation score (2~5), high SASA (>0.4), and high RMSF (>3.5 Å) for mutagenesis. In addition, although we only select residues in NTD, RBD, and HR1/CH domains to perform mutagenesis, residues in other regions can also be used as mutation sites. For example, the known 2P mutation strategy (K986P and V987P) has been proved effective for the stabilization of spike protein of SARS-COV-2 and other betacoronavirus 12, 20, 50 .

In this work, we propose a pipeline to automatically stabilize proteins through computational mutagenesis. We analyze the conservation score, RMSF, and SASA of residues in the spike protein through the pipeline. We propose criteria based on the conservation score, RMSF, and SASA to identify residues for mutation. Finally, we utilize Eris to calculate the free energy change and find stabilizing mutants.

All source codes are deposited in: https://bitbucket.org/dokhlab/proteinstabilization. The 3D structure of the spike protein colored SASA. The red/blue colors indicate exposed/buried residues. (F) The 3D structure of the spike protein colored by RMSF. Red means flexible and blue means frozen.

Characterization of spike glycoprotein of SARS-CoV-2 on virus entry and its immune cross-reactivity with SARS-CoV

A novel coronavirus from patients with pneumonia in China

Features, evaluation and treatment coronavirus (COVID-19)

Coronavirus (COVID-19) outbreak: what the department of endoscopy should know

The proximal origin of SARS-CoV-2

A novel coronavirus genome identified in a cluster of pneumonia cases-Wuhan

Severe acute respiratory syndrome coronavirus-like virus in Chinese horseshoe bats

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and corona virus disease-2019 (COVID-19): the epidemic and the challenges

World Health Organization declares global emergency: A review of the 2019 novel coronavirus (COVID-19)

Fusion mechanism of 2019-nCoV and fusion inhibitors targeting HR1 domain in spike protein

SARS-CoV-2 infects T lymphocytes through its spike protein-mediated membrane fusion

Cryo-EM structure of the 2019-nCoV spike in the prefusion conformation. Science (80-. )

Structure, Function, and Evolution of Coronavirus Spike Proteins

The coronavirus spike protein is a class I virus fusion protein: structural and functional characterization of the fusion core complex

A novel angiotensin-converting enzyme-related carboxypeptidase (ACE2) converts angiotensin I to angiotensin 1-9

SARS-CoV-2 cell entry depends on ACE2 and TMPRSS2 and is blocked by a clinically proven protease inhibitor

Structural basis for the recognition of SARS-CoV-2 by fulllength human ACE2. Science (80-. )

Cryo-EM structures of MERS-CoV and SARS-CoV spike glycoproteins reveal the dynamic receptor binding domains

Cryo-electron microscopy structures of the SARS-CoV spike glycoprotein reveal a prerequisite conformational state for receptor binding

Immunogenicity and structures of a rationally designed prefusion MERS-CoV spike antigen

Unexpected Receptor Functional Mimicry Elucidates Activation of Coronavirus Fusion

Structure of mouse coronavirus spike protein complexed with receptor reveals mechanism for viral entry

Tectonic conformational changes of a coronavirus spike glycoprotein promote membrane fusion

COVID-19, an emerging coronavirus infection: advances and prospects in designing and developing vaccines, immunotherapeutics, and therapeutics

Eris: an automated estimator of protein stability

Modeling backbone flexibility improves protein stability estimation

Emergence of Protein Fold Families through Rational Design

G protein mono-ubiquitination by the Rsp5 ubiquitin ligase

Tyrosine phosphorylation switching of a G protein

Stabilization of μ-opioid receptor facilitates its cellular translocation and signaling

Large SOD1 aggregates, unlike trimeric SOD1, do not impact cell viability in a model of amyotrophic lateral sclerosis

A Phosphomimetic Mutation Stabilizes SOD1 and Rescues Cell Viability in the Context of an ALS-Associated Mutation

Nonnative SOD1 trimer is toxic to motor neurons in a model of amyotrophic lateral sclerosis

Computational design of chemogenetic and optogenetic split proteins

Engineering extrinsic disorder to control protein activity in living cells

Rational design of a ligand-controlled protein conformational switch

Rationally designed carbohydrate-occluded epitopes elicit HIV-1 Env-specific antibodies

Discrete molecular dynamics studies of the folding of a protein-like model

Discrete molecular dynamics

Applications of Discrete Molecular Dynamics in biology and medicine

Ab Initio Folding of Proteins with All-Atom Discrete Molecular Dynamics

Comparative protein modelling by satisfaction of spatial restraints

Statistical potential for assessment and prediction of protein structures

Computational Modeling of Small Molecule Ligand Binding Interactions and Affinities

Solving protein structures using short-distance cross-linking constraints as a guide for discrete molecular dynamics simulations

Ab initio RNA folding by discrete molecular dynamics: from structure prediction to folding mechanisms

Convergent solutions to binding at a protein-protein interface

ConSurf: identification of functional regions in proteins by surface-mapping of phylogenetic information

MDAnalysis: a toolkit for the analysis of molecular dynamics simulations

Stabilized coronavirus spikes are resistant to conformational changes induced by receptor recognition or proteolysis

Structure-based Design of Prefusion-stabilized SARS-CoV-2 Spikes

Improving prediction of secondary structure, local backbone angles and solvent accessible surface area of proteins by iterative deep learning

Sequence saturation mutagenesis (SeSaM): a novel method for directed evolution

Evolving strategies for enzyme engineering

Directed mutagenesis

Site-directed mutagenesis: effect of an extracistronic mutation on the in vitro propagation of bacteriophage Qbeta RNA

Understanding hierarchical protein evolution from first principles

We acknowledge support from the National Institutes for Health 1R35 GM134864,The Huck Institutes of the Life Sciences, and the Passan Foundation. The project described was also supported by the National Center for Advancing Translational Sciences, National Institutes of Health, through Grant UL1 TR002014. The content

is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.

The author declares no potential conflict of interest.