key: cord-343517-vf32wxkx
authors: Lokman, Syed Mohammad; Rasheduzzaman, Md.; Salauddin, Asma; Barua, Rocktim; Tanzina, Afsana Yeasmin; Rumi, Meheadi Hasan; Hossain, Md. Imran; Siddiki, Amam Zonaed; Mannan, Adnan; Hasan, Md. Mahbub
title: Exploring the genomic and proteomic variations of SARS-CoV-2 spike glycoprotein: a computational biology approach
date: 2020-04-11
journal: bioRxiv
DOI: 10.1101/2020.04.07.030924
sha:
doc_id: 343517
cord_uid: vf32wxkx

The newly identified SARS-CoV-2 has now been reported from around 183 countries, with more than a million confirmed human cases including more than 68,000 deaths. The genomes of SARS-CoV-2 strains isolated from different parts of the world are now available, and the unique features of their constituent genes and proteins have received substantial attention recently. The spike glycoprotein is widely considered a possible target to be explored because of its role during the entry of coronaviruses into host cells. We analyzed 320 whole-genome sequences and 320 spike protein sequences of SARS-CoV-2 using multiple sequence alignment tools. In this study, 483 unique variations were identified among the genomes, including 25 non-synonymous mutations and one deletion in the spike protein of SARS-CoV-2. Among the 26 variations detected, 12 were located in the N-terminal domain and 6 in the receptor-binding domain (RBD), which might alter the interaction with receptor molecules. In addition, 22 amino acid insertions were identified in the spike protein of SARS-CoV-2 in comparison with that of SARS-CoV. Phylogenetic analyses of the spike protein revealed that bat coronaviruses have a close evolutionary relationship with circulating SARS-CoV-2. The genetic variation data presented in this study can contribute to a better understanding of SARS-CoV-2 pathogenesis. Based on our findings, potential inhibitors targeting these proposed sites of variation can be designed and tested.
Wuhan, Hubei province of China in December 2019. The death toll rose to more than 68,000 among 1,250,000 confirmed cases around the globe (until April 4, 2020) [1]. The virus causing COVID-19 is named severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Based on phylogenetic studies, SARS-CoV-2 is categorized as a member of the genus Betacoronavirus, the same lineage that includes SARS coronavirus (SARS-CoV) [2], which caused SARS (severe acute respiratory syndrome) in China during 2002 [3]. Recent studies showed that SARS-CoV-2 has a close relationship with bat SARS-like CoVs [4, 5] [...] [7]. Interestingly, the S glycoprotein, which consists of two functional subunits, S1 and S2, is characterized as the critical determinant for viral entry into host cells. The S1 subunit recognizes and binds to the host receptor through the receptor-binding domain (RBD), whereas S2 is responsible for fusion with the host cell membrane [8, 9, 10]. MERS-CoV uses dipeptidyl peptidase-4 (DPP4) as its entry receptor [11], whereas SARS-CoV and SARS-CoV-2 utilize ACE-2 (angiotensin converting enzyme-2) [12], which is abundant in lung alveolar epithelial cells and enterocytes, suggesting the S glycoprotein as a potential drug target to halt the entry of SARS-CoV-2 [13]. According to recent reports, neutralizing antibodies are generated in response to the entry and fusion of the surface-exposed S protein (mainly the RBD), which is predicted to be an important target for vaccine candidates [10, 14, 15]. However, SARS-CoV-2 has emerged with remarkable properties: a glutamine-rich, 42 aa long exclusive molecular signature (DSQQTVGQQDGSEDNQTTTIQTIVEVQPQLEMELTPVVQTIE) at position 983-1024 of polyprotein 1ab (pp1ab) [16], a diversified receptor-binding domain (RBD), and a unique furin cleavage site (PRRAR↓SV) at the S1/S2 boundary of the S glycoprotein, which could play roles in viral pathogenesis, diagnosis and treatment [17].
To date, few genomic variations of SARS-CoV-2 have been reported [18, 19]. There is growing evidence that the spike protein, a 1273 amino acid long glycoprotein with multiple domains, possibly plays a major role in SARS-CoV-2 pathogenesis. Viral entry into the host cell is initiated by the receptor-binding domain (RBD) of the S1 head. Upon receptor binding, proteolytic cleavage occurs at the S1/S2 cleavage site, and the two heptad repeats (HR) of the S2 stalk form a six-helix bundle structure, triggering the release of the fusion peptide. As it comes into close proximity to the transmembrane anchor (TM), the TM domain facilitates the membrane destabilization required for fusion between the viral and host membranes [20, 21]. Insights into the sequence variations of the S glycoprotein among available genomes are key to understanding the biology of SARS-CoV-2 infection and to developing antiviral treatments and vaccines. In this study, we analyzed 320 genomic sequences of SARS-CoV-2 to identify mutations among the available genomes, followed by the amino acid variations in the S glycoprotein, to foresee their impact on viral entry into the host cell from a structural biology viewpoint.

All available sequences (320 whole-genome and surface glycoprotein sequences of SARS-CoV-2) related to the COVID-19 pandemic were retrieved from the NCBI Virus Variation Resource repository (https://www.ncbi.nlm.nih.gov/labs/virus/) [22]. In addition, 40 S glycoprotein sequences from different coronavirus families were retrieved for phylogenetic analysis. The NCBI reference sequence of the SARS-CoV-2 S glycoprotein, accession number YP_009724390, was used as the canonical sequence for the analyses of spike protein variants. Variant analyses of SARS-CoV-2 genomes were performed in the Genome Detective Coronavirus Typing Tool Version 1.13, which is specially designed for this virus (https://www.genomedetective.com/app/typingtool/cov/) [23].
For multiple sequence alignment (MSA), the Genome Detective Coronavirus Typing Tool uses a reference dataset of 431 whole genome sequences (WGS), of which 386 were from nine known coronavirus species. The dataset was then aligned with MUSCLE [24]. An entropy (H(x)) plot of nucleotide variations in the SARS-CoV-2 genome was constructed using BioEdit [25]. MEGA X (version 10.1.7) was used to construct the MSAs and the phylogenetic tree, using pairwise alignment in ClustalW and the neighbor-joining method [26, 27]. The tree structure was validated with 1000 bootstrap replications [28], and the evolutionary distances were calculated using the Poisson correction method [29]. Variant sequences of SARS-CoV-2 were modeled in Swiss-Model [30] using the cryo-EM spike protein structure of SARS-CoV-2 (PDB ID 6VSB) as a template. The overall quality of the models was assessed in the RAMPAGE server [31] by generating Ramachandran plots (Supplementary Table 1). PyMol and BIOVIA Discovery Studio were used for structure visualization and superposition [32, 33].

Multiple sequence alignment of the available 320 genomes of SARS-CoV-2 was performed, and 483 variations were found throughout the 29,903 bp long SARS-CoV-2 genome: in total, 115 variations in the UTR regions, 130 synonymous variations that cause no amino acid alteration, 228 non-synonymous variations causing a change in amino acid residue, 16 INDELs, and 2 variations in non-coding regions (Supplementary Table 2). Among the 483 variations, 40 (14 synonymous mutations, 25 non-synonymous mutations and one deletion) were observed in ORF S, which encodes the S glycoprotein responsible for viral fusion and entry into the host cell [34]. Notably, most of the SARS-CoV-2 genome sequences were deposited from the USA (250) and China (50) (Supplementary Fig. 1).
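The per-position entropy H(x) used as the variability measure can be illustrated with a minimal Python sketch. This is not the BioEdit implementation: the function names are hypothetical, the natural logarithm and the exclusion of gaps and ambiguous bases are assumptions made for illustration, and the four toy sequences are invented rather than taken from the 320 genomes.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy of one alignment column.

    Assumptions: natural logarithm; only A, C, G, T are counted,
    so gaps and ambiguity codes are ignored.
    """
    counts = Counter(base for base in column.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log(n / total) for n in counts.values())


def entropy_profile(aligned_sequences):
    """Entropy at every column of an alignment of equal-length strings."""
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("all aligned sequences must have the same length")
    return [
        column_entropy("".join(seq[i] for seq in aligned_sequences))
        for i in range(length)
    ]


# Invented toy alignment (not real SARS-CoV-2 data).
toy_alignment = ["ACGT", "ACGT", "ATGT", "ATGA"]
profile = entropy_profile(toy_alignment)
```

A fully conserved column gives an entropy of zero, and the value grows as alternative bases become more evenly represented, which is why frequently mutated positions stand out as peaks in an entropy plot.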
Positional variability of the SARS-CoV-2 genome was calculated from the MSA of 320 SARS-CoV-2 whole genomes as an entropy value (H(x)) [35]. Excluding the 5′ and 3′ UTRs, ten hypervariable hotspot positions were identified, of which seven were located in ORF1ab (1059C>T, 3037C>T, 8782C>T, 14408C>T, 17747C>T, 17858A>G, 18060C>T) and one each in ORF S (23403A>G), ORF3a (25563G>T), and ORF8 (28144T>C). The variability at positions 8782 and 28144 was the highest among the hotspots (Fig. 1).

A phylogenetic analysis of a total of 66 sequences (26 unique SARS-CoV-2 and 40 different coronavirus S glycoprotein sequences) was performed. The evolutionary distances showed that all the SARS-CoV-2 spike proteins cluster in the same node of the phylogenetic tree, confirming that the sequences are similar to RefSeq YP_009724390 (Fig. 2). Bat coronaviruses have a close evolutionary relationship with SARS-CoV-2, as different strains were found in the nearest outgroups and clades (Bat coronavirus BM48-31, Bat hp-beta coronavirus, Bat coronavirus HKU9), suggesting that coronaviruses have a vast geographical spread and that bats are the most prevalent host (Fig. 2). In other clades, the clusters were separated by host, which may reflect the evolutionary changes of the surface glycoprotein due to cross-species transmission. The reporting of viral hosts from different locations at different times is indicative of possible recombination.

The S glycoprotein sequences of SARS-CoV-2 were retrieved from the NCBI Virus [...] Fig. 2). Alterations of amino acid residue charge from positive to neutral (H49Y, R408I, H519Q), negative to neutral (D111N, D614G, D936Y), negative to positive (D1168H, D1259H), and neutral to positive (N74K, S247R) were seen in variants QHW06059, QHS34546, QIS61422, QIS61338, QIK50427, QIS30615, QIS60978, QIS60582, QIO04367, and QHR84449, respectively, due to substitutions by amino acids that differ in charge.
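The charge classification of the substitutions listed above can be reproduced with a short Python sketch. The helper names are hypothetical, and the charge assignment is an assumption chosen to match the grouping in the text: arginine, lysine, and histidine are treated as positive, aspartate and glutamate as negative, and all other residues as neutral.

```python
import re

# Assumed charge classes; histidine is counted as positive, as in the text.
POSITIVE = set("RKH")
NEGATIVE = set("DE")
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def residue_charge(residue):
    """Return 'positive', 'negative', or 'neutral' for a one-letter residue."""
    if residue not in AMINO_ACIDS:
        raise ValueError(f"unknown amino acid: {residue!r}")
    if residue in POSITIVE:
        return "positive"
    if residue in NEGATIVE:
        return "negative"
    return "neutral"


def charge_change(substitution):
    """Classify a substitution written like 'D614G' by its change in charge."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", substitution)
    if match is None:
        raise ValueError(f"not a substitution: {substitution!r}")
    before = residue_charge(match.group(1))
    after = residue_charge(match.group(3))
    return "unchanged" if before == after else f"{before} to {after}"


# Substitutions named in the text, grouped by the resulting class.
reported = ["H49Y", "R408I", "H519Q", "D111N", "D614G",
            "D936Y", "D1168H", "D1259H", "N74K", "S247R"]
classes = {name: charge_change(name) for name in reported}
```

Under these assumptions, a substitution such as A930V or E96D is classified as "unchanged", because both residues fall into the same charge class.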
In the remaining 15 variants, the substituted amino acids were similar in charge to the original residues (Fig. 4 A). The SARS-CoV-2 spike protein variants were superposed with the cryo-electron microscopic structure of the SARS-CoV-2 spike protein [8]. The L5F, N74K, E96D, F157L, G181V, S247R, G476S, V483A, D1168H, and D1259H variants were excluded from superposition because the respective residues are absent from the 3D structure of the template (PDB: 6VSB). The superposition showed that most of the residue changes introduced bulkier amino acid residues (T29I, H49Y, L54F, S221W, A348T, H519Q, A520S, A930V, D936Y, and A1078V) in place of smaller residues, except Y28N, D111N, R408I, D614G, and F797C (Fig. 4 B-P). [...] Fig. 3). The S2 subunit of the spike protein, especially the heptad repeat region 2, fusion peptide domain, transmembrane domain, and cytoplasmic tail, was found to be highly conserved between SARS-CoV and the SARS-CoV-2 variants, while the S1 subunit was more diverse, specifically the N-terminal domain (NTD) and the receptor-binding domain (RBD).

COVID-19 is one of the most contagious pandemics the world has ever had, with 1,250,000 confirmed cases to date (April 4, 2020), and cases have increased as much as fivefold in less than a month [1]. Phylogenetic analysis showed that SARS-CoV-2 is a unique coronavirus presumably related to bat coronaviruses (BM48-31, Hp-betacoronavirus). During this study, we [...] [15, 38, 39, 40]. Likewise, a number of studies targeting the SARS-CoV-2 spike protein have been undertaken for therapeutic measures [41], but the unique structural and functional details of the SARS-CoV-2 spike protein are still under scrutiny. We also found a variant (R408I) in the receptor-binding domain (RBD) in which a positively charged arginine residue is replaced by a neutral and smaller isoleucine residue (Fig. 4 I).
This change might alter the interaction of the viral RBD with the host receptor, because the R408 residue of SARS-CoV-2 is known to interact with the ACE2 receptor for viral entry [42]. Similarly, the other alterations in the RBD (G476S, V483A, H519Q, and A520S) could also affect the interaction of the SARS-CoV-2 spike protein with other molecules, which requires further investigation. The QIA98583 and QIS30615 variants were found to have an alteration of alanine to valine (A930V) and of aspartic acid to tyrosine (D936Y), respectively, in the alpha helix of the HR1 domain. Previous reports have indicated that the HR1 domain plays a significant role in viral fusion and entry by forming helical bundles with HR2, and that mutations in the HR1 region, including an alanine-to-valine substitution (A1168V), are predominantly responsible for the resistance of mouse hepatitis coronaviruses to HR2-derived peptide entry inhibitors [43]. We hypothesize that the A930V mutation found in the HR1 domain of SARS-CoV-2 might also have a role in the emergence of drug-resistant virus strains. The D1168H mutation found in heptad repeat 2 (HR2) of SARS-CoV-2 could also play a vital role in viral pathogenesis. The SARS-CoV-2 S protein contains an additional furin protease cleavage site, PRRARS, in the S1/S2 domain, which is conserved among all 320 sequences as revealed during this study (Supplementary Fig. 3). This unique signature is thought to make SARS-CoV-2 more virulent than SARS-CoV and is regarded as a novel feature of its pathogenesis (ref 11). According to previous reports, more efficient processing of the coronavirus S protein by host cell proteases can broaden viral tropism, as has also been described for influenza virus [9, 44, 45, 46]. In addition, this could help the virus escape antiviral therapies targeting the transmembrane protease TMPRSS2 (ClinicalTrials.gov, NCT04321096), a well-reported protease that cleaves at the S1/S2 site of the S glycoprotein [47].
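A conservation check of the kind described above for the furin cleavage motif can be sketched in a few lines of Python. This is only an illustration of the idea: the function names are hypothetical, the motif is matched as an exact substring rather than through an alignment, and the sequence fragments below are invented toy strings, not the 320 spike sequences analyzed in the study.

```python
def motif_positions(sequences, motif="PRRARS"):
    """Map each sequence name to the 1-based start of the motif, or None."""
    positions = {}
    for name, sequence in sequences.items():
        index = sequence.upper().find(motif)
        positions[name] = index + 1 if index >= 0 else None
    return positions


def is_conserved(sequences, motif="PRRARS"):
    """True only if every sequence contains the motif."""
    return all(
        position is not None
        for position in motif_positions(sequences, motif).values()
    )


# Invented toy fragments for illustration only.
with_site = {
    "fragment_a": "QTNSPRRARSVASQ",
    "fragment_b": "qtnsprrarsvasq",
}
mixed = dict(with_site, fragment_c="QTNSLLRSTSQ")

positions = motif_positions(mixed)
```

An exact substring match treats any single substitution inside the motif as a loss of the site, so an alignment-based comparison would be the more forgiving choice when partial conservation matters.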
Comparative analyses between SARS-CoV and SARS-CoV-2 spike glycoprotein showed 77% similarity between them where the most diverse region was [...].

Coronavirus disease (COVID-2019) situation reports
Severe acute respiratory syndrome-related coronavirus: The species and its viruses, a statement of the Coronavirus Study Group
Lim, others, A novel coronavirus associated with severe acute respiratory syndrome
Bats are natural reservoirs of SARS-like coronaviruses, Science
Huang, others, A pneumonia outbreak associated with a new coronavirus of probable bat origin
Pei, others, A new coronavirus associated with human respiratory disease in China
Genome composition and divergence of the novel coronavirus (2019-nCoV) originating in China
Cryo-EM structure of the 2019-nCoV spike in the prefusion conformation
Structure, function, and antigenicity of the SARS-CoV-2 spike glycoprotein
Structure analysis of the receptor binding of 2019-nCoV
Fouchier, others, Dipeptidyl peptidase 4 is a functional receptor for the emerging human coronavirus-EMC
Greenough, others, Angiotensin-converting enzyme 2 is a functional receptor for the SARS coronavirus
Functional assessment of cell entry and receptor usage for SARS-CoV-2 and other lineage B betacoronaviruses
A. Nitsche, others, SARS-CoV-2 cell entry depends on ACE2 and TMPRSS2 and is blocked by a clinically proven protease inhibitor
Wu, others, Potent binding of 2019 novel coronavirus spike protein by a SARS coronavirus-specific human monoclonal antibody
An exclusive 42 amino acid signature in pp1ab protein provides insights into the evolutive history of the 2019 novel human-pathogenic coronavirus (SARS-CoV2)
The spike glycoprotein of the new coronavirus 2019-nCoV contains a furin-like cleavage site absent in CoV of the same clade
Genomic variance of the 2019-nCoV coronavirus
Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding
The coronavirus spike protein is a class I virus fusion protein: structural and functional characterization of the fusion core complex
Interaction between heptad repeat 1 and 2 regions in spike protein of SARS-associated coronavirus: implications for virus fusogenic mechanism and identification of fusion inhibitors
Virus Variation Resource: improved response to emergent viral outbreaks
Genome Detective Coronavirus Typing Tool for rapid identification and characterization of novel coronavirus genomes
MUSCLE: multiple sequence alignment with improved accuracy and speed, Proceedings. 2004 IEEE Comput. Syst. Bioinforma. Conf. 2004. CSB
BioEdit: a user-friendly biological sequence alignment editor and analysis program for Windows 95/98/NT
MEGA X: molecular evolutionary genetics analysis across computing platforms
The neighbor-joining method: a new method for reconstructing phylogenetic trees
Bootstrap confidence levels for phylogenetic trees
Evolutionary divergence and convergence in proteins
SWISS-MODEL: homology modelling of protein structures and complexes
Structure validation by Calpha geometry: Phi, psi and Cbeta deviation
Pymol: An open-source molecular graphics tool
Receptor recognition mechanisms of coronaviruses: a decade of structural studies
A Parvovirus B19 synthetic genome: Sequence features and functional competence
Cryo-EM structures of MERS-CoV and SARS-CoV spike glycoproteins reveal the dynamic receptor binding domains
Cryo-electron microscopy structures of the SARS-CoV spike glycoprotein reveal a prerequisite conformational state for receptor binding
Long-term protection from SARS coronavirus infection conferred by a single immunization with an attenuated VSV-based vaccine
Human monoclonal antibodies against highly conserved HR1 and HR2 domains of the SARS-CoV spike protein are more broadly neutralizing
A truncated receptor-binding domain of MERS-CoV spike protein potently inhibits MERS-CoV infection and induces strong neutralizing antibody responses: Implication for developing therapeutics and vaccines
Fusion mechanism of 2019-nCoV and fusion inhibitors targeting HR1 domain in spike protein
Role of changes in SARS-CoV-2 spike protein in the interaction with the human ACE2 receptor: An in silico analysis
Coronavirus escape from heptad repeat 2 (HR2)-derived peptide entry inhibition as a result of mutations in the HR1 domain of the spike fusion protein
Host cell proteases controlling virus pathogenicity
Role of hemagglutinin cleavage for the pathogenicity of influenza virus
Host cell proteases: Critical determinants of coronavirus tropism and pathogenesis
Coronaviruses: an overview of their replication and pathogenesis
Receptor for mouse hepatitis virus is a member of the carcinoembryonic antigen family of glycoproteins
Laude, others, Aminopeptidase N is a major receptor for the enteropathogenic coronavirus TGEV