key: cord-0860208-srgi9jc6 authors: Gonçalves, Ricardo Lemes; Rodrigues Leite, Túlio César; de Paula Dias, Bruna; da Silva Caetano, Camila Carla; de Souza, Ana Clara Gomes; da Silva Batista, Ubiratan; Barbosa, Camila Cavadas; Reyes-Sandoval, Arturo; Leomil Coelho, Luiz Felipe; de Mello Silva, Breno title: SARS-CoV-2 mutations and where to find them: An in silico perspective of structural changes and antigenicity of the Spike protein date: 2020-05-28 journal: bioRxiv DOI: 10.1101/2020.05.21.108563 sha: bf16fdf0903a8937652757416abbb3ba4b75a7ac doc_id: 860208 cord_uid: srgi9jc6 The recent emergence of a novel coronavirus (SARS-CoV-2) is causing a severe global health threat characterized by severe acute respiratory syndrome (Covid-19). At the moment, there is no specific treatment for this disease, and vaccines are still under development. The structural protein Spike is essential for virus infection and has been used as the main target for vaccine and serological diagnosis test development. We analysed 2363 sequences of the Spike protein from SARS-CoV-2 isolates and identified variability in 44 amino acid residues and their worldwide distribution in all continents. We used the three-dimensional structure of the homo-trimer model to predict conformational epitopes of B-cell, and sequence of Spike protein Wuhan-Hu-1 to predict linear epitopes of T-Cytotoxic and T-Helper cells. We identified 45 epitopes with amino acid variations. Finally, we showed the distribution of mutations within the epitopes. Our findings can help researches to identify more efficient strategies for the development of vaccines, therapies, and serological diagnostic tests based on the Spike protein of Sars-Cov-2. The SARS-CoV-2, a novel coronavirus, is highly transmissible, leading to high 2 infection rates and human mortality around the world, turning the disease caused by this 3 virus (COVID-2019) a huge public health concern . Up to now, SARS-4 CoV-2 has been spreading in several continents and causing more than 4,629,000 5 confirmed cases, with mortality rates of 6,74% (World Health Organization, 2020 -6 access at 18/05/2020). 7 Coronaviruses have the largest genome among RNA viruses (26 to 32 kilobases in 8 length) with 14 ORFs encoding 27 proteins. At 5' end of the genome, there are the 1ab 9 and 1a ORFs, which encode 16 mature non-structural proteins (nsp1 to nsp16). These 10 proteins play crucial functions during viral RNA replication and transcription 1 . At 3' 11 end of the genome, there are genes encoding four structural proteins: Spike (S), 12 Envelope (E), Membrane (M), and Nucleoprotein (N) as well as genes for 8 accessory 13 proteins named 3a, 3b, p6, 7a, 8b, 9b, and orf14 2 . 14 Although recent findings suggest that SARS-CoV-2 has stronger transmissibility when 15 compared to SARS-CoV, the molecular mechanisms responsible for this difference 16 remain unclear 3 . However, among the structural proteins, the Spike glycoprotein that is 17 present as a homo-trimer on the coronaviruses surface, has been pointed as the most 18 important factor responsible for this stronger transmissibility since this protein is able to 19 bind cell receptors. The S protein has two subunits named S1 and S2. The S1 subunit is 20 responsible for binding to the host cell receptor. It has a signal peptide, an N-terminal 21 domain (NTD) and a receptor-binding domain (RBD). The S2 subunit is responsible for 22 fusion of the viral and cellular membranes and consists of a conserved fusion peptide 23 (FP), two hepta-repeats (HR1 and HR2), a transmembrane (TM) and a cytoplasmic (CP) 24 domains 4 . Structural analysis of the receptor-binding domain (RBD) of SARS-CoV-2 25 7 Threshold for epitope identification (0.75). NetMHCpan 4.0 27 1 (http://www.cbs.dtu.dk/services/NetMHCpan/),was also used to predict MHC class I 2 epitopes for cytotoxic T cells. Peptides with mers of 8-11 was pointed through artificial 3 neural networks, with the threshold of <2% better in the binding score rank. The 4 NetMHCII 2.3 27 (http://www.cbs.dtu.dk/services/NetMHCII/), was used to prediction 5 epitopes for T Helper cells with 15 mers. The lociHLA-DR was used, with standard 6 threshold (<2% better in the binding rank affinity) to identify the peptides that best 7 bound to MHCII. 8 Linear epitopes for B cells of different sizes were predicted using BepiPred-2.0 9 (http://www.cbs.dtu.dk/services/BepiPred/) 28 . The standard threshold of 0.5 was used to 10 ensure the better Specificity/Sensitivity ratio of the epitope. Finally, the conformational 11 epitopes for B cells were predicted using the validated model of the three-dimensional 12 structure of the spike protein homo-trimer through two web servers (DiscoTope and 13 hroughElliPro). DiscoTope (http://tools.iedb.org/discotope/) 29 was used with the 14 threshold for specify epitope identification (-3.7). The prediction of conformational 15 epitopes using ThroughElliPro 30 (http://tools.iedb.org/ellipro/) was made with the 16 maximum score threshold of 0.5 and the maximum distance for ligation of 6 angstroms. 17 Only those epitopes located in the "extracellular" region of the protein (13-1213) were 18 Results 1 The monomeric Spike protein model was represented according to their respective 3 domains, disulfide bridges, and glycosylation points indicated by Expasy's P0DTC2 4 annotations reports (https://viralzone.expasy.org/resources/Coronav/P0DTC2.txt) 5 (figure 2 and video 1 in supplemental material). The FP region, reported from Expasy's, 6 has a little difference than that is reported for Sars-Cov. The three-dimensional Spike 7 homo-trimer model showed a clash score of 1.13 and a molprobity 8 (http://molprobity.biochem.duke.edu/) score of 1.05 corresponding to the 99th and 9 100th percentiles, respectively, when compared to high-resolution 3D structures. The 10 model revealed 99.9% of its residues in acceptable/favorable regions on the 11 Ramachandram Plot (http://molprobity.biochem.duke.edu/), where only 59PHE residues 12 from chains A, B and C continue as an outlier (supplementary figure 1). After ensuring 13 the model quality, the homo-trimer 3D structure was used to the structural/antigenic 14 SARS-CoV-2 Spike protein variability mapping. 15 After Spike protein sequences retrieval and comparison, 856 amino acid variations were 17 identified in 32.8% of sequences. Europe and South America showed the highest 18 variation rates among worldwide sequences, showing variation on 47.4% and 44.1% of 19 sequences, respectively. (data not showed). Sequence comparison analysis identified 42 20 amino acid residues with variation that occurred at least twice on SARS-Cov-2 21 sequences. Twenty three variations was mapped in the S1 and 19 in S2 domain (table 22 1). Some of these variations (28) are represented by residues with non-homologous 23 physical-chemical characteristics in their respective continents of occurrence (figure 3). 24 25 In the present analysis, it was predicted 282 epitopes: 95 for T-cytotoxic cells, 135 for 2 T-Helper cells and 52 for B cells (30 linear and 22 conformational epitopes). All 3 epitopes are shown in supplementary table 1. Only those variations whose showed 4 amino acid changes to residues with non-homologous physical-chemical characteristics 5 was considered and represented in table 2. As a result, there are 11 predicted epitopes 6 for T-Cytotoxic cells, 16 for T-Helper cells, 18 for B cells (10 linear and 8 structural 7 epitopes). Forty-five epitopes with mapped variations were represented in the 3D model 8 of the SARS-CoV-2 protein Spike structure (figure 4). The S1 domain gathers the 9 majority counts (33) while the NTD region showed 20 epitopes, of which 7 were 10 predicted for B cells (table 2) . 11 Our findings suggest that the majority of residue variations arose punctually on their 14 respective continents, with the exception of position 614, which occurs in all continents. 15 The 614-residue variation has the most expressive frequency and has been reported by 16 recent published/preprinted studies 33-35 . The H49Y variation has been identified in 17 Asia, Europe, and North America. In contrast, L5F, V367F, R765L, and S940F 18 variations, with the exception the last, were observed in at least two continents with 19 <1% frequency. (table 1) . 20 Overall, the largest number of variants was found in S1 protein subunit, with 10 variant 21 positions found in Asia, 9 in Europe, 9 in North America, and 2 in Oceania (table 1) . 22 South America and Africa did not show any other variation in the S1 domain besides 23 that identified in residue 614, probably due to the few numbers of analyzed genomes 24 from these regions. The sequences from Europe showed a higher number of variant 25 residues in the S2 domain (12) in comparison to the average from the other continents 1 (1.8). However, the S1 domain is the region more propense to appear new variations 2 due its most prominent variability between coronaviruses species 36 . 3 Due to the short period since its dissemination, it is difficult to find a specific variation 4 pattern in Spike protein of SARS-Cov-2 on the continents. But even so, the variant 5 position D1259H in S2 domain was seen in North America sequences but not in Asia 6 and Europe (table 1) . Future works with a higher number of high-quality sequences are 7 needed to better understand the possible heterogeneous distribution of South/Central 8 Americas, Africa and Oceania amino acid variations on these regions. 9 Our analysis showed that the NTD domain has many variations (figure 3), where 12 changes on teen residues were identified. However, only two residues variations showed 13 frequency above 1% (S50L and S247R). The S50L residue variation was found in 14 2.37% of sequences from Asia and is internally located in the homo-trimer structure. A 15 recent study demonstrated that the L50 residue found in SARS-Cov / SARS-like CoV's 16 was replaced by S50 in SARS-CoV-2 2 . However, some SARS-Cov-2 sequences remain 17 with the L50 residue. Another particular residue variation (S247) showed 5.26% 18 frequency on Oceania. This residue is located at the peripheral end of the NTD portion 19 and was reported in a recent preprint study that five variations of Spike protein may be 20 related to variation of replication of SARS-CoV-2 in Vero-E6 cells 37 . 21 In the RBD region, four residues variations were identified (table 1) .A recent in silico 22 preprint study shows the variation V367F as a probable factor to the high increase of 23 binding affinity with the ACE2 receptor 40 . The V483A variation, found only in North 24 American sequences, are co-located in the binding region of the main residues involved 1 in the interaction of the SARS-Cov Spike protein with the ACE2 receptor 38,39 .The 2 R408I variation was identified only in the Asian continent and it was reported in a 3 recent in silico preprint study as a probable factor of decreased binding affinity with the 4 ACE2 receptor 40 ( figure 3-front & table 1) . Additionally, the arginine at position 408 in 5 RBD core region has already been cited as an important point of interaction with N-6 glycan, for both SARS and SARS-CoV-2 41,42 . Thus, future investigations should be 7 done in order to evaluate if these changes on Spike protein residues could be correlated 8 to functional changes. 9 A second G476S variation observed in the North American genomic sequences was 10 found in peripheral and accessible region of the structure (figure1-front) and is also co-11 located with the region of the main interaction residues between the SARS-CoV-2 12 protein Spike and the ACE2 receptor 43 . This was also observed inSARS-CoV 39 . As 13 mentioned before, the highest frequency mutation rate was identified at position 614 in 14 almost all continents (except Central America). This variation could be used to group 15 the continents into either D or G predominantly variations. The group D includes Asia, 16 North America, and Oceania while group G includes Europe, South America, and 17 Africa (table 1) . However, although residue D614 in SARS and SARS-CoV-2 reference 18 sequences indicates a positive selection of G614 mutation in those continents, there is 19 no evidence of how this mutation frequency could impact virus functionality. 20 Coronavirus SARS-CoV-2 sequencing analysis from diverse continents demonstrates 21 that the genes encoding the Spike proteins undergo diverse and frequent mutations 44 . 22 First, it was identified a mutation (241C>T), which was developed gradually, and 23 reported with three other co-mutations (3037C>T, 14408C>T, and 23403A>G) 33 . These 24 mutations culminated on amino acid variations in Spike protein, nsp3, and RNA 25 primase, which are responsible for RNA replication 33 . Moreover, all associated-1 mutations observed were prevalent in Europe isolates, giving insights for SARS-CoV-2 2 severity in this region. Therefore, the SNP mutation (23403A>G/D614G) in Spike 3 protein D614G, was pointed out by Yin, C. 33 . This residue is near to this furin 4 recognition site (S1/S2 region) of SARS and could affect enzyme activity. However, the 5 distance between alpha carbons of residues 614 and 685 turns the cleavage site for 6 SARS-CoV-2 (S1/S2) greater than 35Å (supplementary figure 1), which makes it 7 difficult to infer that this mutation might have a direct action on cleavage process on 8 S1/S2 domains. Furthermore, a recent study showed possible changes in pathogenicity 9 derived from mutations 37 . However, the D614G variation was not observed in the 10 sequences analyzed by this study. 11 Five amino acid variations (T791I, D839Y, V1176F, C1254F, and P1263L) were found 13 in the S2 domain with frequency >1%. These variations do not happen more than one 14 continent. Located on the HR1 region periphery (figure 3 -side in), S940F is present in 15 Europe (0.69% of sequences) and Oceania (3.51% of sequences). T791I that have 16 1,09% of frequency is occurs on FP region and it is present only in Asian sequences. 17 The V1176 variation has been found in South America with considerable frequency 18 (6.98% of sequences). Furthermore, C1254F and P1263L variations in the IC region at 19 the cytoplasmic end structure (figure 3 -side in) appear in Oceania (3.51% of 20 sequences) and Europe (1.97% of sequences), respectively. These latter verified 21 variations in IC region are residues with non-homologous physical-chemical 22 characteristics and they might carry changes on intra-cellular portion interaction with 23 the cellular components as well as lipid bilayer. Thus, further studies must be performed 24 to verify the influence of these changes on SARS-CoV-2 replication. 25 Our antigenic predictions show that the ELIPRO server identify the entire NTD domain 2 as an antigen (table 2 and figure 4) . In this domain, seven variations of amino acid 3 residues were identified with non-homologous physical-chemical characteristics. The 4 remaining 19 epitopes in the NTD had one or two variations (table 2). In contrast, the 5 RBD domain showed only five epitopes (table 2) . Recently the cryo-electron 6 microscopy structure of the SARS-CoV-2 protein Spike trimer was determined 8,16 , 7 showing that RBD can undergo movements between its "up" or "down" conformations. 8 This suggest that the target epitope of the neutralizing antibody (CR3022) is accessible 9 when the RBD is in the "up" conformation. Additionally, the ACE2 host can interact 10 with the RBD when protein Spike is in the "up" conformation 45 . Consequently, the 11 RBD domain has been identified as the most promising target for vaccine prototypes 12 development 14 . However, up to now, little information about the natural variability of 13 SARS-CoV-2 residues involved in ACE2 binding has been elucidated in the literature. 14 Finally, the S2 domain presented 14 epitopes (three are shared with the S1 domain), two 15 of these S2's epitopes where found on the fusion peptide (FP) region, and another two 16 where located on the transmembrane (TM) region. 17 Among all epitopes identified, only S 610-620 and S 612-620 epitopes have amino acid 19 variation at high frequencies and worldwide prevalence (residue 614 in table1). These 20 two epitopes have been identified by two prediction methods used in this study for T-21 cytotoxic cells. However, future studies are needed to find out if there is the importance 22 of this variation in the response to the SARS-CoV-2 infection. 23 Three more predicted epitopes for T-cytotoxic cells with variations (>1% frequency) 24 were identified in our analyzes, two at NTD (S 50-58 in Asia and S 240-249 in Oceania) and a 25 14 third epitope S 786-794 in S2 domain on Asia (Highlighted in Tab 2). The prediction for T-1 Helper cells epitopes identified seven epitopes with variations that haven >1%frequency 2 (Highlighted in Tab 2). The first T-Helper epitope S 50-64 is found in Asia. The last six 3 epitopes are co-located among S 233-253 region in Oceania sequences ( figure 4) . This 4 analysis identified the overlapping of antigenic regions for T cells in the NTD domain, 5 but not in the RBD domain ( figure 4) . 6 Conversely, most of epitope predictions with variations >1% frequency was observed 7 on a single continent. This suggests the existence of a random Spike protein 8 diversification on these regions. However, the limited available literature of serological 9 responses diversity to SARS-CoV-2 does not allow to better clarify this context at this 10 time. 11 B cell epitopes 12 The total of six epitopes predicted for B cells was identified with variations at 13 frequencies> 1%. Three linear and three conformational (highlighted in table 2). Of 14 these, four epitopes are present in the RBD domain (only in North America), one linear 15 (S 455-478 ), and three conformational (S 434-511 , S 434-580 , and S 434-584 ). The RBD polypeptide 16 chain is classically recognized for its potential to generate neutralizing antibodies to 17 SARS 39,46 and recently observed also being the target of a neutralizing response for 18 SARS-CoV-2 38,47 . A recent in silico study highlighted the region 524-598, which is 19 partially present in the RBD, as one of the dominant epitopes for SARS and SARS-20 CoV-2 sharing an 80% identity 15 . Finally, the last two linear epitopes identified by our 21 analysis are in domain S2 (S 786-800 , S 828-843 ) with variations in Asia and Europe, 22 respectively. Furthermore, the fusion peptide structure has been shown as a target for 23 neutralizing immune response to SARS and SARS-CoV-2 16 . Variations in the S 786-24 800 epitope of FP region described in this study may also cause changes in its 25 antigenicity, but, as already mentioned, new studies should be done to verify this 1 hypothesis. 2 The potential variations in physicochemical characteristics brought by a amino acid 3 residue exchange can generate an eventual condition for the viral particle escaping from 4 the immune system via a non-neutralizing cross-reactive response from previous 5 infections. As example, previous mutational assays with the polypeptide chain of the 6 SARS-Cov Spike protein have shown that changes in the physical-chemical 7 characteristics of various residues have resulted in impaired functions 5,7,48 . 8 Thus, there is a clear need to monitor the antigenic variation by the newly emerged 9 coronavirus, as variations in the RBD domain may change over time, due to selective 10 pressure on the virus through future therapies given to host populations. Although we 11 did not identify variations in potential epitopes capable of inducing humoral response 12 against SARS-CoV-2 with high frequencies and/or wide distribution on the continents, 13 these slight variations covered by four months of SARS-CoV-2 spread just strengths the 14 necessity to understand the potential of antigenic variation that the Spike protein might 15 present. However, given the small variation in the epitopes pointed out in this work, we 16 can suggest that vaccine approaches using the spike protein structure are unlikely to be 17 impaired by the variability presented by theSARS-CoV-2. Still, it would be interesting 18 to formulate therapies that focus on the S2 domain since it presented the lower number 19 of variants. Moreover, among the S1 domain, RBD is the less affected by the variants, 20 and therefore, more promising than NTD. 21 The great global engagement in tackling this pandemic has generated hundreds of 22 vaccine prototypes and potential drugs in the short/medium term. However, prototype 23 generation needs to rely on and be developed based on antigenic and structural 24 variability studies, especially protein S, which is essential for virus/host interactions.. Middle East Respiratory Syndrome Coronavirus nsp1 Inhibits Host Gene Expression by Selectively Targeting mRNAs Transcribed in the Nucleus while Sparing 4 mRNAs of Cytoplasmic Origin Genome Composition and Divergence of the Novel Coronavirus (2019-6 nCoV) Originating in China Clinical Characteristics of Coronavirus Disease 2019 in China Architecture of the SARS coronavirus prefusion spike CD209L (L-SIGN) is a receptor for severe acute respiratory 12 syndrome coronavirus Cryo-electron microscopy structures of the SARS-CoV spike glycoprotein 14 reveal a prerequisite conformational state for receptor binding Activation of the SARS coronavirus spike protein via sequential 16 proteolytic cleavage at two distinct sites Genomic characterization of the 2019 novel human-pathogenic 18 coronavirus isolated from a patient with atypical pneumonia after visiting Wuhan Pre-fusion structure of a human coronavirus spike protein. 21 Evolution of the novel coronavirus from the ongoing Wuhan outbreak and 12 modeling of its spike protein for risk of human transmission Molecular Evolutionary Genetics Analysis 14 across Computing Platforms MUSCLE: multiple sequence alignment with high accuracy and high throughput Protein structure prediction and analysis using the Robetta server Structure, Function, and Antigenicity 20 of the SARS-CoV-2 Spike Glycoprotein Discovery Studio Modeling Environment A new coronavirus associated with human respiratory disease in China Large-scale validation of methods 26 for cytotoxic T-lymphocyte epitope prediction Improved methods for predicting peptide binding affinity to 28 MHC class II molecules BepiPred-2.0: Improving sequence-based B-cell epitope 30 prediction using conformational epitopes Reliable B Cell Epitope Predictions: Impacts of Method 32 A new structure-based tool for the prediction of antibody VMD: visual molecular dynamics Genotyping coronavirus SARS-CoV-2: methods and implications Exploring the genomic and proteomic 7 variations of SARS-CoV-2 spike glycoprotein: a computational biology approach Genetic diversity and evolution of SARS-CoV-2 Characterization of a Highly Conserved Domain within the 12 Severe Acute Respiratory Syndrome Coronavirus Spike Protein S2 Domain with Characteristics of a Viral 13 Receptor Recognition by the Novel Coronavirus from Wuhan: 17 an Analysis Based on Decade-Long Structural Studies of SARS Coronavirus Bats are natural reservoirs of SARS-like coronaviruses RBD mutations from circulating SARS-CoV-2 strains enhance the structure 22 stability and infectivity of the spike protein Potent binding of 2019 novel coronavirus spike protein by a SARS 24 coronavirus-specific human monoclonal antibody Structural basis of receptor recognition by SARS-CoV-2 Unrevealing sequence and structural features of novel 28 coronavirus using in silico approaches: The main protease as molecular target Early transmission dynamics in Wuhan, China, of novel coronavirus-infected 30 pneumonia A highly conserved cryptic epitope in the receptor-binding domains of 32 SARS-CoV-2 and SARS-CoV Receptor-binding domain of SARS-CoV spike protein induces highly potent