key: cord-0950630-wh3sd3rb
authors: Rao, R Shyama Prasad; Ahsan, Nagib; Xu, Chunhui; Su, Lingtao; Verburgt, Jacob; Fornelli, Luca; Kihara, Daisuke; Xu, Dong
title: Evolutionary Dynamics of Indels in SARS-CoV-2 Spike Glycoprotein
date: 2021-12-06
journal: Evol Bioinform Online
DOI: 10.1177/11769343211064616
sha: 94818073981cbc2080c0e56cf797e608d79552b1
doc_id: 950630
cord_uid: wh3sd3rb

SARS-CoV-2, responsible for the current COVID-19 pandemic that claimed over 5.0 million lives, belongs to a class of enveloped viruses that undergo quick evolutionary adjustments under selection pressure. Numerous variants have emerged in SARS-CoV-2, posing a serious challenge to the global vaccination effort and COVID-19 management. The evolutionary dynamics of this virus are only beginning to be explored. In this work, we have analysed 1.79 million spike glycoprotein sequences of SARS-CoV-2 and found that the virus is fine-tuning the spike with numerous amino acid insertions and deletions (indels). Indels seem to have a selective advantage as the proportions of sequences with indels steadily increased over time, currently at over 89%, with similar trends across countries/variants. There were as many as 420 unique indel positions and 447 unique combinations of indels. Despite their high frequency, indels resulted in only minimal alteration of N-glycosylation sites, including both gain and loss. As indels and point mutations are positively correlated and sequences with indels have significantly more point mutations, they have implications in the evolutionary dynamics of the SARS-CoV-2 spike glycoprotein.

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), responsible for the currently ongoing pandemic of coronavirus disease 2019 (COVID-19), 1 has infected more than 246 million people and killed over 5.0 million. 2 Related coronaviruses, SARS-CoV-1 and Middle East respiratory syndrome coronavirus (MERS-CoV), have also caused pandemics in the recent past.
SARS-CoV-2 belongs to the general class of enveloped viruses (which include influenza and human immunodeficiency viruses, among others) that show great plasticity and immune evasiveness due to a protective lipid bilayer and embedded glycoproteins that are heavily N-glycosylated and used as a 'glycan shield'. [3] [4] [5] The lipid bilayer envelope of these viruses is particularly sensitive to desiccation, heat and detergents. 6 However, the envelope glycoproteins of many enveloped viruses are known to be particularly variable and found to evolve quickly under selection pressure. As a result, the patterns and drivers of envelope glycoprotein variations in these viruses have been studied keenly. 4, [7] [8] [9] Yet, as 'enveloped viruses' is a broad category, the key features that lead to differences in infectivity and antigenicity among different members, for example HIV-1 versus SARS-CoV-2, are of particular interest. 5

Given the nature of the pandemic, the genomic architecture and its evolutionary dynamics are being keenly explored in coronaviruses 10, 11 and SARS-CoV-2 in particular. [12] [13] [14] [15] Pachetti et al 16 have shown the emergence of mutations in the SARS-CoV-2 genome and RNA-dependent-RNA polymerase as a mutational hotspot. Mercatelli and Giorgi 17 have analysed 48 635 SARS-CoV-2 complete genomes and found 7.23 mutations per sample on average. Given its importance as a key interactor of angiotensin-converting enzyme 2 (ACE2) for host cell entry and as a target for neutralising antibodies, spike glycoprotein in the envelope of SARS-CoV-2 is special and therefore its variants are keenly watched (https://www.gisaid.org/hcov19-variants/). D614G was found to be a prevalent spike mutation. However, its precise effect is unclear as it was known to increase infectivity 18 as well as increase susceptibility to neutralisation. 
19 Numerous other mutations in the spike glycoprotein have also been documented. 20 Studies on variants of SARS-CoV-2 mainly focus on point mutations. This is because there is a massive prevalence of single-nucleotide polymorphisms (SNPs) compared to short indels, which only account for 0.8% of mutations. 17 A key reason for indels being less common is that they are more deleterious due to frame-shifting than SNPs. [21] [22] [23] As SARS-CoV-2 has been shown to accumulate indels, 5 we are only beginning to explore them and appreciate their myriad roles. For example, Chrisman et al 24 have looked at indels in the SARS-CoV-2 genome and mapped it to regions of discontinuous transcription breakpoints. Lee et al 25 have shown a novel indel in nucleocapsid (N) gene leading to negative results for N gene-based RT-PCR that was approved by US/FDA and EU/CE-IVD. Their work also emphasised the genetic variability and rapid evolution of SARS-CoV-2. 25 While indels have been explored predominantly at the genomic level, 12,17 they were less emphatically examined at the proteomic level and even less so in the spike glycoprotein. Despite their rarity, indels accelerate protein evolution, 26 and could be especially interesting and important in the spike glycoprotein. For example, indels can be beneficial as recurrent deletions in SARS-CoV-2 spike glycoprotein are shown to drive antibody escape and accelerate antigenic evolution. 27 Garry and Gallaher 28 have explored naturally occurring indels in multiple coronavirus spike proteins and provided evidence against a laboratory origin of SARS-CoV-2. Garry et al 29 have also shown that the mutations in 'variants of concern' (VOC) commonly occur near indels. Despite these revelations, given an accumulating wealth of SARS-CoV-2 genomic data, large-scale patterns of indels in spike glycoproteins have not been fully explored and appreciated. 
With this background, we sought to answer some of the open questions: (1) What is the prevalence and pattern of indels in the spike protein? (2) What are the evolutionary dynamics of sequences with indels? (3) Is there any relationship with point mutations? and (4) What is the effect of indels on N-glycosylation sites? We analysed a large set of 1.79 million SARS-CoV-2 spike protein sequences and showed that over 50% contained 1 or more indels. The proportion of sequences with indels has risen sharply, and currently over 76% of unique sequence variants and 89% of total sequences have indels. Indels and point mutations are positively correlated and sequences with indels seem to have more point mutations overall. Further, indels had minimal effect on N-glycosylation sites. We discuss these findings in the context of the evolutionary dynamics of the viral protein.

The SARS-CoV-2 spike protein sequences and related metadata were obtained from the GISAID website (https://www.gisaid.org/; accessed on June 3, 2021). 30 The spike protein sequences were based on the translation of the genome after alignment to the reference hCoV-19/Wuhan/WIV04/2019 (EPI_ISL_402124) and were in the fasta format. The associated tsv metadata included date of sample collection, location/country of origin, and clade/lineage information of the virus, among other details. There were a total of 1 790 224 sequences in the database at the time of access. However, as there were numerous quality issues with the data, 31 many sequences were filtered out. For example, 465 419 sequences containing X (an average of 82.4 X per sequence) that arose from the translation of low-quality regions and/or ambiguous bases in the genome were excluded. Incomplete sequences based on missing N-terminal and/or C-terminal codons were ignored. 
As our interest was to look for the pattern of short indels, 22 disproportionately short sequences (eg, 3744 sequences were very short, less than 1000 residues in length) that were missing internal parts possibly due to sequencing/annotation issues were ignored. Finally, sequences with incomplete metadata on the date of sample collection were also excluded. In the final set of 1 311 545 spike protein sequences, there were a total of 49 118 unique sequences based on 100% identity cut-off 32 that included all possible variants. Average identity (based on CD-HIT, http://weizhong-lab.ucsd.edu/cd-hit/) of sequences was 99.47% (median = 99.53%).

For each sequence, a pair-wise alignment with the reference (EPI_ISL_402124) was done using Biopython. 33 BLOSUM62 was used as the substitution matrix, and gap open and extension penalties were set at 11 and 1, respectively. 34 Any gap in the query sequence was considered a deletion and a gap in the reference was considered an insertion. 21 Finally, a mismatched residue was considered a point mutation (or substitution). To see the temporal dynamics, the proportions of sequences with indels were plotted against the date of sample collection (month-wise). Based on the sequence metadata, country and clade/lineage-specific dynamics of indels were also plotted. For each alignment, the number of indels was enumerated, and the positions of indels were mapped to the reference sequence. Further, potential N-glycosylation sites were identified based on the regular pattern of tripeptide sequons NXS or NXT where N is asparagine, S is serine, T is threonine and X is any amino acid residue, and compared with potential N-glycosylation sites found in the reference sequence (Table 1 and Supplemental Table S1). 4, 9, 35 Multiple sequence alignments, where required, were done using Clustal Omega (https://www.ebi.ac.uk/Tools/msa/clustalo/). 
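The alignment convention described above (a gap in the query is a deletion, a gap in the reference is an insertion, a mismatch is a point mutation) can be illustrated with a minimal sketch in plain Python. It assumes the two sequences have already been aligned into equal-length gapped strings by a pairwise aligner; the function name `classify_alignment` and the toy sequences are hypothetical and are not taken from the original analysis.

```python
def classify_alignment(ref_aln, qry_aln, gap="-"):
    """Classify the columns of a gapped pairwise alignment.

    Positions are 1-based coordinates on the ungapped reference.
    Returns (deletions, insertions, substitutions).
    """
    if len(ref_aln) != len(qry_aln):
        raise ValueError("aligned sequences must have equal length")

    deletions = []      # (ref_pos, ref_residue): gap in the query
    insertions = []     # (preceding_ref_pos, inserted_residue): gap in the reference
    substitutions = []  # (ref_pos, ref_residue, query_residue): mismatch

    ref_pos = 0  # reference residues consumed so far
    for r, q in zip(ref_aln, qry_aln):
        if r == gap and q == gap:
            continue  # column carries no information
        if r == gap:
            insertions.append((ref_pos, q))
            continue
        ref_pos += 1
        if q == gap:
            deletions.append((ref_pos, r))
        elif q != r:
            substitutions.append((ref_pos, r, q))
    return deletions, insertions, substitutions


# Toy example (hypothetical fragment, not a real spike alignment)
dels, ins, subs = classify_alignment("ACHVSG-TN", "AC--SGKTD")
print(dels)  # [(3, 'H'), (4, 'V')]
print(ins)   # [(6, 'K')]
print(subs)  # [(8, 'N', 'D')]
```

Reporting insertions by the reference position that precedes them keeps every event on reference coordinates, which is what mapping indel positions to the reference sequence requires.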
All sequence analysis and data handling, unless otherwise specified, were performed in Python; and visualisation/graphs were created in Microsoft Excel. The positions of indels and N-glycosylation sites were visualised on the 3D structure of SARS-CoV-2 spike glycoprotein (PDB ID: 6VXX or 6XR8) using the Visual Molecular Dynamics (VMD) programme (https://www.ks.uiuc.edu/Research/vmd/). To see if indels have any preference for sequence/structural/functional features such as surface-exposed regions, solvent accessible surface area (SASA) information was obtained using the DSSP programme, which calculates an accessibility score (ranged from 0 to 277) from the 3D structure (http://swift.cmbi.ru.nl/gv/dssp/). 36 Protein disorder was calculated (values ranged from 0 to 0.41) using DISpro (http://scratch.proteomics.ics.uci.edu/), and shorter disorder regions known as molecular recognition features (MoRF) were quantified (values ranged from 0.21 to 0.8) using MoRFCHiBi_Web Table S2 ). 38

Where required, a 1-proportion Z-test was used to check if the observed proportion was significantly different from the expected. A chi-square test for independence was performed to check whether (multiple) sample proportions were significantly different. 39 Correlations between indels and other variables (such as point mutations, accessibility scores, etc.) were measured using the more robust Kendall τ coefficient. The significance of the correlation coefficient was tested using cor.test(), which is based on t-distribution or approximation. A t-test was used to compare the means of 2 groups (eg, mutations in sequences with or without indels). Where required, the P-values were corrected for multiple comparisons using the Benjamini-Hochberg (BH) method. 40 All statistical tests were done using R.

The SARS-CoV-2 spike glycoprotein reference sequence (EPI_ISL_402124) contains 1273 amino acid residues. 
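The Benjamini-Hochberg correction named in the statistical methods can be written out in a few lines. The authors state that their tests were run in R; the sketch below is a plain-Python illustration of the standard step-up procedure, and the function name `benjamini_hochberg` and the example P-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted P-values in the same order as the input."""
    m = len(p_values)
    # Indices of the P-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, scaling by m / rank and
    # enforcing that adjusted values never increase as rank decreases
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical P-values from four comparisons
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

A comparison is then called significant when its adjusted value falls below the chosen false discovery rate, for example 0.05.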
The distribution of sequences with short indels was plotted based on the number of sequences in each length category and shown in Figure 1A as a bar diagram. Overall, the distribution is similar for all sequences (n = 1 311 545, filled bars) and unique sequences (n = 49 118, open bars). Over 50% of all sequences (inset pie chart) had at least 1 indel. Sequences with deletions (50.5%) were far more common than sequences with insertions (0.07%). Further, sequences containing 3-residue deletions were very frequent. A small number of sequences (0.16%) had deletions of more than 3 residues. However, these proportions (36.4%, 0.33% and 1.35%) are significantly different (χ 2 = 6991.5, P ≈ 0) if only unique sequences were considered.

The proportions of sequences with indels over time (month-wise) are given in Figure 1B, which shows an increasing trend. For example, in August 2020 the proportion of sequences with 1 to 3 deletions was about 1.4%, which increased to 89.3% in May 2021. The proportions of sequences with insertions, and deletions of more than 3 residues also increased (Supplemental Figure S1A). A similar increasing trend is seen even if only unique sequences were considered (Supplemental Figure S1B). In May 2021, 76.3% of unique sequences had 1 to 3 deletions, and 1.8% of unique sequences had deletions of more than 3 residues. This increasing trend of sequences with indels holds true across countries (Supplemental Figure S1C). Almost all sequence variants currently present in the United Kingdom and South Africa have indels. On the other hand, only a small proportion of sequences from Brazil currently have indels. It should be noted that some patterns were a bit noisy due to small sample sizes (Supplemental Figure S2A), for example, in early months and/or country-wise trends. 
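The month-wise proportions reported above amount to grouping sequences by collection month and dividing the count with indels by the total. A minimal sketch, assuming each record is a pair of an ISO-formatted collection date and a flag for whether the sequence carries any indel; the function name `monthly_indel_proportion` and the records are hypothetical, not GISAID data.

```python
from collections import defaultdict


def monthly_indel_proportion(records):
    """records: iterable of (collection_date 'YYYY-MM-DD', has_indel bool).

    Returns {'YYYY-MM': proportion of sequences with indels}, sorted by month.
    """
    totals = defaultdict(int)
    with_indel = defaultdict(int)
    for date, has_indel in records:
        month = date[:7]  # keep only 'YYYY-MM'
        totals[month] += 1
        if has_indel:
            with_indel[month] += 1
    return {month: with_indel[month] / totals[month] for month in sorted(totals)}


# Hypothetical records for illustration only
records = [
    ("2020-08-03", False),
    ("2020-08-15", True),
    ("2021-05-01", True),
    ("2021-05-20", True),
]
print(monthly_indel_proportion(records))  # {'2020-08': 0.5, '2021-05': 1.0}
```

Months with very few records give unstable proportions, which is consistent with the noise noted for early months and country-wise trends.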
The proportions of sequence contributions from many countries that have reported variants of concern such as India, Brazil, South Africa and Nigeria were very small (Supplemental Figure S2B). In particular, just 0.74% of sequences were from India. However, this rises to 2.7% if only unique sequences are considered (Supplemental Figure S2C). The distribution of proportions of sequences with indels over time (Figure 1B and Supplemental Figure S1B) is heavily influenced by the dominant variant Alpha (Supplemental Figure S1D). However, the proportions of sequences with indels are also increasing in all other variants (Supplemental Figure S1D, lower panel). It is interesting to note that a far higher proportion of variants of concern/interest (that have been sharply increasing in the past 6 months) come from sequences with indels compared to sequences without indels (0.793 vs 0.093, P ≈ 0, 2-proportion Z-test). A month-wise trend is shown in Figure 1C. Overall, 84.1% of unique sequences from VOC/I have 1 or more indels.

Figure 1. (C) Month-wise proportions of variants of concern/interest coming from sequences with indels compared to those without indels.

Table S2 ). However, it may be noted that indels were far less common (odds = 0.47) in the C-terminal half of the spike protein sequence. Further, indels were frequent only in a few residue-positions. Figure S3D) , there is variability in the position of 3-residue deletion leading to the emergence of new deletion variants. Figure 2B shows the multiple sequence alignment of Delta (B.1.617.2) and Kappa (B.1.617.1) variants that contain as many as 17 combinations of deletions (deletion of residues 156 (E) and 157 (F) being the most common). It may be noted that 79% of sequences from Delta and Kappa variants of concern/interest (n = 9133, currently representing 0.7% of the total) contain 1 or more deletions. 
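The 2-proportion Z-test used above, which compares the share of VOC/I among sequences with indels against the share among sequences without indels, can be sketched with the pooled-variance form of the statistic. The text reports only the two proportions (0.793 vs 0.093), so the counts below are hypothetical values chosen to reproduce those proportions, and the function name `two_proportion_z_test` is likewise an assumption.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided Z-test for equality of two proportions (pooled variance).

    x1, x2: number of 'successes'; n1, n2: group sizes.
    Returns (z, p_value).
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts giving proportions of 0.793 and 0.093
z, p = two_proportion_z_test(793, 1000, 93, 1000)
print(round(z, 1), p)
```

With proportions this far apart the P-value is vanishingly small even at modest group sizes, in line with the reported P ≈ 0.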
Based on the correlation, indels showed a low (but significant) preference for surface-exposed regions (τ = 0.054 and P = .027 for insertions, and τ = 0.104 and P = 1.2E-5 for deletions). Correlation with overall protein disorder was not significant (τ = 0.155 and P = .56 for insertions, and τ = 0.091 and P = .66 for deletions), possibly because long disordered regions were very few and far apart. Correlation with shorter disordered regions (MoRF) was low but significant (τ = 0.122 and P = 6.7E-8 for insertions, and τ = 0.107 and P = 7.1E-7 for deletions). On the 3D structure of SARS-CoV-2 spike glycoprotein (Supplemental Figure S4 ), indels were prevalent in much of the outer side of the N-terminal domain (NTD). This was reflected in the domain analysis wherein NTD and terminal regions showed a high overlap coefficient (Supplemental Table S2 ). Deletions, in particular, were also more frequent at the flanks of the receptor-binding domain (RBD), but were far less common in the S2 subunit region and were almost absent at the inner side of the subunits (Supplemental Figure S4 ). The spike sequences with indels had more (over 1.81 times; 7.9 ± 2.1 vs 4.3 ± 2.5, mean ± standard deviation) point mutations than sequences without indels. Similar patterns were seen even when different GISAID clades or lineages were taken separately ( Figure 3) . However, they were not significant (t-test with BH correction) in some groups when the proportion of sequences with indels was very small or due to a small sample size. Overall, VOC/I had more point mutations compared to non-VOC/I, but in both groups sequences with indels had significantly more point mutations. The distribution of point mutations along the sequence is shown in Supplemental Figure S5 . Sequences with indels had, apart from D614G, 6 more prevalent mutations. 
There were 96 positions in sequences with indels (n = 18 813) and 101 in sequences without indels (n = 30 305, scaled to the number of sequences with indels) that had 100 or more instances of point mutations. Overall, the N-terminal region had longer stretches of residues with more than 100 occurrences of point mutations. There was a small but significant positive correlation (τ = 0.224, P = 4.0E-25) between the distribution of indels and point mutations along the primary sequence (τ = 0.136, P = 1.9E-9 for insertions; τ = 0.22, P = 5.5E-24 for deletions). Many point mutations in VOC Delta were differentially abundant (2-proportion Z-test with BH correction) in sequences with indels compared to sequences without indels (Supplemental Figure S5C) and were more common in the N-terminal half where indels are also present (Figure 2B).

Figure 3. Bar plot shows the average number of point mutations (mean ± SD) in sequences with indels compared to sequences without indels. Point mutations are significantly more numerous (t-test with BH correction) in sequences with indels across clades and lineages of variants of concern/interest. The differences are not significant (ns, or opposite as in clade V) when the sample size and/or proportion of sequences with indels are too small (lower panel of bar plot and sample size).

Based on the occurrence of sequons, there are 22 potential N-glycosylation sites (7 NXS and 15 NXT) in the SARS-CoV-2 spike glycoprotein reference sequence (EPI_ISL_402124). Despite indels being present in over 50% of the total 1 311 545 sequences, there was remarkably minimal effect on N-glycosylation sites due to indels. The list of 67 instances of N-glycosylation sites that have been altered by the indels (in 65 unique sequences) is given in Table 1 (Supplemental Table S1), and their positions are shown in Figure 4A to C. 
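Sequon-based site detection of the kind used here can be sketched with a regular expression. The pattern follows the definition given in the methods (NXS or NXT with X being any residue); excluding proline at the X position is a common refinement that is not applied here. The function name `find_sequons` is hypothetical, and the short fragments are the ungapped forms of the gain and loss examples quoted in this section.

```python
import re

# Lookahead so that overlapping sequons (eg, NNSS) are all reported
SEQUON = re.compile(r"(?=(N[A-Z][ST]))")


def find_sequons(seq):
    """Return (1-based position of N, tripeptide) for every NXS/NXT sequon."""
    return [(m.start() + 1, m.group(1)) for m in SEQUON.finditer(seq)]


# Gain by insertion: A---YT (ie, AYT) becomes AYTNYT
print(find_sequons("AYT"))     # []
print(find_sequons("AYTNYT"))  # [(4, 'NYT')]

# Loss by deletion: VNLT becomes V--T (ie, VT)
print(find_sequons("VNLT"))    # [(2, 'NLT')]
print(find_sequons("VT"))      # []

# Loss by insertion: N-VT (ie, NVT) becomes NSVT
print(find_sequons("NVT"))     # [(1, 'NVT')]
print(find_sequons("NSVT"))    # []
```

Because indels shift residue numbering, comparing the sites of a variant against the reference requires mapping positions through the alignment rather than comparing raw indices.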
There were 7 instances of gain of sites (Table 1, green), due to insertions (eg, near position 27, A---YT to AYTNYT) or deletions with substitution (eg, near position 246, RSYLT to N---LT). There were also many more instances of loss of sites (Table 1, blue), mostly due to deletions (eg, at position 17, VNLT to V--T), but also due to insertions (eg, at position 61, N-VT to NSVT). While the loss of sites occurred mostly at the N-terminal part of the spike, the gain of sites was a bit more scattered. It is important to note that all gains of sites were of NXT types. However, 3 sites (at around 290, 437 and 871) were completely buried in the 3D structure (Figure 4B and C). There were also a few alterations of existing sites due to insertions or deletions (Table 1, orange). It may be noted that these indel-based N-glycosylation site alterations occurred in sequences belonging to many clades/lineages, many of which were variants of concern (Table 1 and Supplemental Table S1).

There was great interest and urgency to unravel the architecture of the SARS-CoV-2 genome. 1 Given the severity of the pandemic, there is currently an explosion of viral sequencing bringing along the concerns of data accessibility and ownership, 41, 42 data integrity/quality, 31 and inequality of sequencing effort/data collection among countries. 43, 44 Nonetheless, the availability of a vast amount of sequences has allowed the scientific community to track the changes in the SARS-CoV-2 genome as the pandemic is progressing. Numerous studies have shown the emergence and dynamics of new variants, although the main emphasis was on non-synonymous substitutions at the genomic level. 12, 14, 16, 18, 20, 29 Indel variants were underexplored and unappreciated due to their relative rarity, 12,17,23 but it is becoming evident that they play key roles in the SARS-CoV-2 genome. 24, 25 Indels are even less explored at the proteomic level because they are primarily found in untranslated regions. 
12 In this work, we showed that there is an incursion of short indels at numerous positions in the SARS-CoV-2 spike glycoprotein. Of these, 2 very common deletions (ΔH69/ΔV70, primarily found in UK variants) were well known and shown to have recurrent emergence and transmission. 45 While deletions facilitate antibody escape, it was found that BNT162b2 vaccine-elicited sera can still neutralise the 69/70 deletion variant. 46 One reason could be concurrent substitutions that offset this effect. For example, the D614G substitution, also found in the deletion variant (clade G, prevalent in Europe), was shown to increase SARS-CoV-2 susceptibility to neutralisation. 19 Some independent deletions of 5 to 7 residues were known to occur in and near the furin-like cleavage site (around residue position 681). It was hypothesised that those deletions might be involved in viral infection. 47 However, at present, the functional implications of numerous other indels are completely unexplored/unknown.

The proportion of proteins/sequences with indels has risen sharply over time. Currently, 78.4% of the viral variants have indels; and while Δ69-70 and Δ144 were present in the majority of the variants, there also seems to be an increasing trend for longer indels. Thus, indels seem to have a selective advantage, although random drift cannot be ignored. Recurrent deletions in the SARS-CoV-2 spike glycoprotein are known to drive antibody escape. 27 For example, recurrent deletions (Δ141-144 and Δ146, and Δ243-244) in spike N-terminal domain (NTD) abolished its binding with neutralising antibody 4A8, and Δ140 caused a 4-fold reduction in neutralisation titre. 48 The emergence of novel indels leading to variants of concern could be a challenge to vaccines and COVID-19 management. [49] [50] [51] This could be further exacerbated as sequencing efforts in many (developing) countries were minimal, 43, 44 but there seem to be disproportionally more variants, for example, in India. 
The viral diversity/variants may only be fully appreciated if there is better sequencing effort in these countries, many of which are reporting the emergence of variants of concern. For example, Resende et al 52 have found convergent indels in the NTD of spike/SARS-CoV-2 lineages with mutations of concern circulating in Brazil, while Tegally et al 53 found Δ242-244 in the SARS-CoV-2 variant of concern in South Africa. The recurrent emergence of insertions (between R214 and D215) in the NTD, and their progressive increase in multiple lineages, including VOC have been recently documented. 54 It is important to note that many indel positions are highly variable due to independent/multiple origins of indels 27, 53 and/or nearby substitutions that affect alignment. Indels have special relevance as they can fine-tune the 3D structure beyond point mutations and are known to occur in surface-exposed loops. 26 As the SARS-CoV-2 will be with us forever, 55 there is a need, equitably across countries to monitor the dynamics of variants, including indels. Point mutations (or substitutions) tend to accumulate near indels. 56, 57 In fact, indels are the driving forces as heterozygosity of indels was proposed as mutagenic to surrounding sequences. 56 As indels are less constrained and have higher structural influence than substitutions, 58 they are frequently under positive selection, for example, in cancer. 57 While the spike mutations in 'variants of concern' (VOC) were known to occur near indels, 29 here, we showed a large-scale relationship between indels and point mutations. GISAID clades are based on a statistical distribution of genome distances in phylogenetic clusters. 59 They have the evolutionary relationship of S > L (and O, V) > G > GH > GR (and GV) > GRY. 60 Advanced clades seem to have comparatively more mutations. Overall, mutations were more frequent in sequences with indels. 
This relation holds true even in variants of concern that already have extensive mutations. 61 It is interesting and important to note that sequences with indels have several differentially abundant point mutations in VOC Delta, posing a global challenge. Mutations in RBD, in particular, were shown to affect ACE2 interaction. For example, deep mutational scanning of RBD found that most of the 3804 individual mutations were deleterious for ACE2 binding. 62 However, RBD with mutations such as V367F, Y453F, and N501Y showed stronger interaction (faster association and slower dissociation rate) with ACE2 63,64 and possibly had minimal effect on antibody neutralisation. 65 Shah et al 66 showed that insertion of Gly at 482 hinders antibody neutralisation. However, the effect of indels on RBD and ACE2 interaction is not explored.

Despite numerous instances of indels, there were only a few instances of alterations of N-glycosylation sites in the spike protein. While some existing sites were modified, there were a few more instances of gain and loss of sites. Interestingly, all gains of sites in spike were of NXT type, which were known to be preferred by viral glycoproteins. 4, 35, 67 However, given that some gains of sites were buried in the 3D structure, they are unlikely to get selected/fixed. Proteins of other enveloped viruses, for example, haemagglutinin (HA) of influenza virus A/H1N1 (since 1918), A/H3N2 (since 1968), and recent A/H5N1 are all accumulating more N-glycosylation sites 4 and/or modifying the existing sites. 7, 35 It is important to watch the dynamics of N-glycosylation sites in spike as SARS-CoV-2 transforms the vulnerabilities of its glycan shield. 68 For instance, the spike protein has 25% glycans by weight, which shield approximately 40% of the surface 69 as against 50% glycans by weight which shield 71% to 97% in gp120 of HIV-1, countering vaccine development and/or neutralisation by antibody. 
70 On the other hand, loss of N-glycosylation sites has a selective disadvantage as removal of N331 and N343 drastically reduced infectivity, revealing the importance of glycosylation for viral infectivity. 20 While the SARS-CoV-2 spike utilises a glycan shield, it also modulates conformational dynamics of the receptor-binding domain by glycosylation. For example, deletion of sites by N165A and N234A mutations reduces spike binding to its receptor ACE2. 3

To mention a key limitation of this study, we looked at the patterns of indels only in spike protein. Further, we do not give any explicit mechanistic insights into the emergence/dynamics of indels and their relationship with substitutions.

In conclusion, we show that SARS-CoV-2 is fine-tuning the spike with numerous indels. There seems to be a selective advantage as the proportions of indel variants steadily increased over time with similar trends across countries/variants. As many as 420 unique indel positions and 447 unique combinations of indels were present. Indels and point mutations are positively correlated and sequences with indels had significantly more point mutations. Despite their frequency, indels resulted in only minimal alteration of N-glycosylation sites.

RSPR initiated the work and wrote the paper. All authors contributed and were involved in the revision.

The work is in compliance with ethical standards. No ethical clearance was necessary.

R Shyama Prasad Rao https://orcid.org/0000-0002-2285-6788

The SARS-CoV-2 sequences and metadata used in this work are available upon registration, as per the terms of the Database Access Agreement, at GISAID (https://www.gisaid.org/).
Addendum: a pneumonia outbreak associated with a new coronavirus of probable bat origin
Beyond shielding: the roles of glycans in the SARS-CoV-2 spike protein
Darwinian selection for sites of Asn-linked glycosylation in phylogenetically disparate eukaryotes and viruses
HIV-1 and SARS-CoV-2: patterns in the evolution of two pandemic pathogens
Survival of enveloped and non-enveloped viruses on inanimate surfaces
Subtle evolutionary changes in the distribution of N-glycosylation sequons in the HIV-1 envelope glycoprotein 120
Evolutionary dynamics of N-glycosylation sites in hemorrhagic fever viral envelope proteins
Tracking global patterns of N-linked glycosylation site variation in highly variable viral glycoproteins: HIV, SIV, and HCV envelopes and influenza hemagglutinin
SARS-CoV genome polymorphism: a bioinformatics study
Coronavirus genomics and bioinformatics analysis
Genomic and proteomic mutation landscapes of SARS-CoV-2
Emergence of SARS-CoV-2 through recombination and strong purifying selection
Exploring the genomic and proteomic variations of SARS-CoV-2 spike glycoprotein: a computational biology approach
SARS-CoV-2 one year on: evidence for ongoing viral adaptation
Emerging SARS-CoV-2 mutation hot spots include a novel RNA-dependent-RNA polymerase variant
Geographic and genomic distribution of SARS-CoV-2 mutations
Tracking changes in SARS-CoV-2 spike: evidence that D614G increases infectivity of the COVID-19 virus
D614G spike mutation increases SARS CoV-2 susceptibility to neutralization
The impact of mutations in SARS-CoV-2 spike on viral infectivity and antigenicity
Predicting the functional effect of amino acid substitutions and indels
Effects of short indels on protein structure and function in human genomes
An initial map of insertion and deletion (INDEL) variation in the human genome
Indels in SARS-CoV-2 occur at template-switching hotspots
Novel indel mutation in the N gene of SARS-CoV-2 clinical samples that were diagnosed positive in a commercial RT-PCR assay
Protein expansion is primarily due to indels in intrinsically disordered regions
Recurrent deletions in the SARS-CoV-2 spike glycoprotein drive antibody escape
Naturally occurring indels in multiple coronavirus spikes
Spike protein mutations in novel SARS-CoV-2 'variants of concern' commonly occur in or near indels
GISAID: global initiative on sharing all influenza data - from vision to reality
Issues with SARS-CoV-2 sequencing data. Virological. 2020. Accessed
CD-HIT suite: a web server for clustering and comparing biological sequences
Biopython: freely available Python tools for computational molecular biology and bioinformatics
BLAST: At the core of a powerful and diverse set of sequence analysis tools
Do N-glycoproteins have preference for specific sequons? Bioinformation
A series of PDB-related databanks for everyday needs
MoRFchibi SYSTEM: software tools for the identification of MoRFs in protein sequences
A survey on similarity measures in text mining
An Introduction to Categorical Data Analysis
Controlling the false discovery rate: a practical and powerful approach to multiple testing
Why some researchers oppose unrestricted sharing of coronavirus genome data
Scientists call for fully open sharing of coronavirus genome data
Global discrepancies between numbers of available SARS-CoV-2 genomes and human development indexes at country scales
Are all nations doing enough on SARS-CoV-2 sequencing? Clearly not. Down to Earth
Recurrent emergence of SARS-CoV-2 spike deletion ΔH69/ΔV70 and its role in the alpha variant B.1.1.7
Neutralization of SARS-CoV-2 spike 69/70 deletion, E484K and N501Y variants by BNT162b2 vaccine-elicited sera
Identification of common deletions in the spike protein of severe acute respiratory syndrome coronavirus 2
SARS-CoV-2 variants, spike mutations and immune escape
Effects of SARS-CoV-2 variants on vaccine efficacy and response strategies
Will SARS-CoV-2 variants of concern affect the promise of vaccines?
Evidence of escape of SARS-CoV-2 variant B.1.351 from natural and vaccine-induced sera
The ongoing evolution of variants of concern and interest of SARS-CoV-2 in Brazil revealed by convergent indels in the amino (N)-terminal domain of the spike protein
Detection of a SARS-CoV-2 variant of concern in South Africa
Emergence of a recurrent insertion in the N-terminal domain of the SARS-CoV-2 spike glycoprotein
The coronavirus is here to stay - here's what that means
Single-nucleotide mutation rate increases close to insertions/deletions in eukaryotes
Important role of indels in somatic mutations of human cancer genes
The combined effects of amino acid substitutions and indels on the evolution of structure within protein families
Phylogenetic clustering by linear integer programming (PhyCLIP)
Geographical and temporal distribution of SARS-CoV-2 clades in the WHO European Region
Antibody resistance of SARS-CoV-2 variants B.1.351 and B.1.1.7
Deep mutational scanning of SARS-CoV-2 receptor binding domain reveals constraints on folding and ACE2 binding
V367F mutation in SARS-CoV-2 spike RBD emerging during the early transmission phase enhances viral infectivity through increased human ACE2 receptor binding affinity
N501Y mutation of spike protein in SARS-CoV-2 strengthens its binding to receptor ACE2
The SARS-CoV-2 Y453F mink variant displays a pronounced increase in ACE-2 affinity but does not challenge antibody neutralization
Mutations in the SARS-CoV-2 spike RBD are responsible for stronger ACE2 binding and poor anti-SARS-CoV mAbs cross-neutralization
Distribution of N-glycosylation sequons in proteins: how apart are they?
Vulnerabilities in coronavirus glycan shields despite extensive glycosylation
Analysis of the SARS-CoV-2 spike protein glycan shield reveals implications for immune recognition
Structure and immune recognition of trimeric prefusion HIV-1 env

Supplemental material for this article is available online.