key: cord-0979085-o7brl98h
authors: Yuan, Fangfeng; Wang, Liping; Fang, Ying; Wang, Leyi
title: Global SNP analysis of 11,183 SARS-CoV-2 strains reveals high genetic diversity
date: 2020-12-08
journal: Transbound Emerg Dis
DOI: 10.1111/tbed.13931
sha: 2808ae348ef82beb32b48b90cbf0274a50c3efcc
doc_id: 979085
cord_uid: o7brl98h

Since it was first identified in December 2019, COVID-19 has spread around the world within a few months, and case numbers are still surging in most countries. The causative agent, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), adapts and evolves rapidly in nature. Of the 16,092 SARS-CoV-2 full genomes available in GISAID as of 13 May, we removed the poor-quality genomes and performed mutational profiling of the remaining 11,183 viral genomes. Global analysis of all sequences identified single nucleotide polymorphisms (SNPs) across the whole genome, including critical SNPs with high mutation frequency that contribute to the five-clade classification of global strains. A total of 119 SNPs were found: 74 non-synonymous mutations, 43 synonymous mutations and two mutations in intergenic regions. Analysis of the geographic pattern of mutational profiles across the whole genome reveals differences between continents. The C-to-T transition is the most common mutation type across the genome, suggesting rapid evolution and adaptation of the virus in the host. Amino acid (AA) deletions and insertions found across the genome result in changes in viral protein length and potentially alter function. Mutational profiling of each gene shows that the nucleocapsid gene has the highest mutational frequency, followed by Nsp2, Nsp3 and the spike gene.
We further focused on the distribution of non-synonymous mutations in four key viral proteins: spike with 75 mutations, RNA-dependent RNA polymerase with 41 mutations, 3C-like protease with 22 mutations and papain-like protease with 10 mutations. The results show that non-synonymous mutations at critical sites of these four proteins pose a great challenge for the development of anti-viral drugs and other countermeasures. Overall, this study improves our understanding of the genetic diversity/variability of SARS-CoV-2 and offers insights for the development of anti-viral therapeutics.

[...] East Respiratory Syndrome CoV and SARS-CoV-2) in the β genus (Ye et al., 2020). SARS-CoV-2 was reported to be the causative agent of the novel respiratory disease, COVID-19. The disease was declared a pandemic by WHO early this year and has led to more than 32 million infections and 981,000 deaths. The SARS-CoV-2 RNA genome encodes 16 non-structural proteins (Nsp) and at least 10 structural and accessory proteins, including spike (S), ORF3a, envelope (E), membrane (M), open reading frame 6 (ORF6), ORF7a, ORF7b, ORF8, nucleocapsid (N) and ORF10 (Cagliani et al., 2020; Kim et al., 2020). The S protein contains the receptor-binding domain (RBD), which binds directly to the human receptor angiotensin-converting enzyme 2 (ACE2) and induces a neutralizing antibody response against SARS-CoV-2 (Lan et al., 2020). Previous studies showed that the antibody response against SARS-CoV-2 is directed mainly against the S and N proteins (Erasmus et al., 2020). RNA viruses have a high genomic mutation rate and readily adapt to changing environmental conditions (Elena & Sanjuán, 2005). Thus, a swarm of variants exists in RNA virus populations. Systematic tracking of SARS-CoV-2 mutations allows circulating strains to be monitored around the world (Guan et al., 2020) and provides guidance for the development of countermeasures. Since the first report of SARS-CoV-2, whole-genome sequences of the virus have been uploaded to the publicly available database GISAID.
Nextstrain introduced a nomenclature that designates SARS-CoV-2 clades, labelling well-defined clades that have reached significant frequency and geographic spread. Major clades are named by the year in which they emerged plus a letter. Current clades on the Nextstrain tree include 19A, 19B, 20A, 20B and 20C (Hadfield et al., 2018). Another clade definition, used in GISAID, is based on genetic markers and defines six clades: S, L, V, G, GH and GR. L was split into G and V in March. To characterize the mutational patterns and distributions across the whole genome, we performed a large-scale analysis of 11,183 high-quality sequences from GISAID as of 13 May. The geographical distribution of mutations was analysed, and we further focused on four key viral proteins: S, RNA-dependent RNA polymerase (RdRp), 3C-like protease (3CLpro) and papain-like protease (PLpro). Potential functional impacts of mutations were evaluated. This study provides more evidence of SARS-CoV-2 genetic diversity, and mutations in key viral proteins may affect the development of anti-viral therapeutics.

As of 13 May, 16,092 high-coverage full genomes were available in GISAID. All were downloaded, and 4,909 were removed because of poor assembly quality, leaving 11,183 complete genomes for subsequent analysis. MAFFT was used to align the sequences against the Wuhan-hu-1 reference strain (MN908947.3). Alignment results were further processed and analysed with CLC Genomics Workbench 11 (QIAGEN) and UGene (http://ugene.net). Statistical analysis was performed in Excel (Microsoft) and GraphPad Prism (GraphPad Software, Inc.). To determine viral diversity and the credibility of mutations across the genome, the entropy of the nucleotide sequences was calculated using BioEdit version 7.0.9.0 (Hall, 1999). Protein structures for RdRp, S and 3CLpro were obtained from the Protein Data Bank (PDB).
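The per-site entropy mentioned above can be illustrated with a short sketch. It assumes the Shannon form H = -Σ f ln f over the bases observed in one alignment column, with the natural logarithm; whether this matches BioEdit's exact treatment of gaps and ambiguous bases is an assumption, and the function name `column_entropy` is hypothetical.

```python
import math
from collections import Counter

def column_entropy(column, alphabet="ACGT"):
    """Shannon entropy (natural log) of one alignment column.

    Characters outside `alphabet` (gaps, N, ambiguity codes) are ignored;
    this handling is an assumption of the sketch, not a documented rule.
    """
    counts = Counter(base for base in column.upper() if base in alphabet)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log(n / total) for n in counts.values())

# Illustrative two-allele column: 7,590 of 11,183 genomes carry G, the rest A.
example = "G" * 7590 + "A" * (11183 - 7590)
h = column_entropy(example)
```

For this idealized two-allele column the sketch gives an entropy of about 0.63, in the same range as the 0.63444 reported for A23403G in the results; the small difference would be expected if the real column also contains rarer bases.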
For the SARS-CoV-2 PLpro structure, homology modelling was carried out with I-TASSER (Yang et al., 2015), based on the SARS-CoV PLpro structure. The structural model with the highest C-score was selected for analysis. Protein structures were visualized in PyMOL (PyMOL Molecular Graphics System, version 1.7; Schrödinger, LLC).

A total of 16,092 complete genomes with high coverage as of 13 May were downloaded from GISAID. After removal of 4,909 problematic sequences using a stringent exclusion criterion (any N in the genome), 11,183 sequences were included in the analysis. Since a large number of sequences lack authentic or high-quality sequence for both the 5' and 3' untranslated regions (Singh et al., 2020), the terminal sequences at both ends were removed and only the region from the polyprotein to the last open reading frame (266-29674 nt) (Bal et al., 2020) was included. Alignment against the reference strain, Wuhan-hu-1 (MN908947.3), was performed using MAFFT (Katoh et al., 2017; Rozewicki et al., 2019). For the global set of sequences, an initial threshold of 1% (>111 sequences) was used to identify the classified clades around the globe (Table 1). A lower threshold of 0.3% (>33 sequences) was also set to identify sites of interest (Table S1). The 0.3% threshold was also applied to countries/regions with more than 333 sequences; for countries/regions with fewer than 333 sequences, single nucleotide polymorphisms (SNPs) present in at least two sequences were recorded. Globally, with a threshold above 0.3%, we observed a total of 119 SNPs across the genome: 74 non-synonymous mutations, 43 synonymous mutations and two mutations in intergenic regions (Table 1 and Table S1). A new major clade can be proposed if it reaches 20% frequency globally. Five major clades (19A, 19B, 20A, 20B and 20C) are classified based on the nomenclature data provided by Nextstrain (Figure S1).
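The filtering and thresholding steps described above can be sketched as follows. This is a minimal illustration on toy data, not the authors' pipeline (which used MAFFT, CLC Genomics Workbench and UGene): the function names are hypothetical, the sequences are assumed to be already aligned column-for-column with the reference, and indels are ignored.

```python
from collections import Counter

def passes_quality(seq):
    """Exclusion rule described in the text: drop any genome containing N."""
    return "N" not in seq.upper()

def count_snps(reference, aligned_genomes, offset=1):
    """Count substitutions relative to the reference in aligned sequences.

    Assumes every sequence has the same length as the reference and that
    alignment columns map directly to reference coordinates. Gaps ('-')
    are skipped, so deletions and insertions are not counted here.
    Returns a Counter keyed by labels such as 'A23403G'.
    """
    snps = Counter()
    for genome in aligned_genomes:
        for i, (ref_base, base) in enumerate(zip(reference, genome)):
            if base != ref_base and "-" not in (ref_base, base):
                snps[f"{ref_base}{i + offset}{base}"] += 1
    return snps

def frequent_snps(snps, n_genomes, fraction):
    """Keep SNPs whose count exceeds floor(fraction * n_genomes)."""
    cutoff = int(fraction * n_genomes)
    return {snp: n for snp, n in snps.items() if n > cutoff}

# Toy data (not real genomes): a 6-nt reference and five aligned sequences.
reference = "ACGTAC"
genomes = ["ACGTAC", "ATGTAC", "ATGTAC", "ACGNAC", "ACGTAT"]
kept = [g for g in genomes if passes_quality(g)]
snps = count_snps(reference, kept)
common = frequent_snps(snps, len(kept), 0.3)
```

With 11,183 genomes, `int(0.01 * 11183)` is 111 and `int(0.003 * 11183)` is 33, which reproduces the ">111" and ">33" cut-offs quoted in the text. If the alignment is trimmed to start at nucleotide 266, passing `offset=266` keeps the labels in reference coordinates.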
As shown in Table 1 and File S1, the SNPs with the highest counts include A23403G in the S gene (Clade 19A, count: 7,590, entropy: 0.63444 [...] (Saha et al., 2020), and the pattern of entropy was found to be consistent with that of the SNP count (Figure [...]

The high frequency of C-to-T mutations found in this study further demonstrates CpG deficiency and suggests that SARS-CoV-2 has adapted to a new host with high zinc-finger anti-viral protein expression and has evolved new ways of evading immunity. More than a third of the SNPs across the genome are synonymous mutations (43), and among the non-synonymous mutation sites, 9 changed from T to I, 6 from A to V and 6 from S to L (data not shown). Although a synonymous mutation does not change the amino acid sequence, accumulation of such mutations can erase the characteristic compactness imprint of single-stranded viral RNA genomes (Tubiana et al., 2015). We also summarized the SNPs in each gene of the viral genome. As shown in Figure 2, the N gene has 15 mutated nucleotide positions, followed by nsp2 (13), nsp3 (13), the S gene (10), nsp14 (8), nsp12 (7), ORF3a (7), nsp13 (6) and nsp5 (5). To illustrate the SNP landscape in each country/region, we further analysed countries/regions with the number of sequences [...] and J (A23403G) contribute to the 19A clade, and E (C14408T) contributes to the 20A clade (Table 1) [...] Figure S2). We looked for a genetic determinant that could explain the different case-fatality rates among countries, but did not find one. According to a CDC report, the clinical outcomes of COVID-19 patients relate to a variety of factors, such as age, gender, poverty, medical conditions and even blood type (Ellinghaus et al., 2020; Li et al., 2020).

Genetic variation/SNPs contribute to alterations in protein translation. We observed multiple deletions and insertions across the genome in different countries/regions (Table 2). A three-nucleotide deletion at 1605-1607 nt results in deletion of the amino acid N at position 267 of nsp2.
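The distinction between synonymous and non-synonymous SNPs used above can be made concrete with a short sketch. It relies on the standard genetic code and, as labelled assumptions, on an S gene start at nucleotide 21563 of MN908947.3 and on GAT being the reference codon that contains A23403G; the helper names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built in TCAG order ('*' marks a stop codon).
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def codon_number(genome_pos, cds_start):
    """1-based codon index of a 1-based genome position within a CDS."""
    return (genome_pos - cds_start) // 3 + 1

def classify_substitution(ref_codon, pos_in_codon, alt_base):
    """Return (ref_aa, alt_aa, label) for a single-base change in a codon.

    pos_in_codon is 0, 1 or 2.
    """
    alt_codon = ref_codon[:pos_in_codon] + alt_base + ref_codon[pos_in_codon + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    label = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return ref_aa, alt_aa, label

S_START = 21563                               # assumed S gene start (MN908947.3)
codon = codon_number(23403, S_START)          # codon containing A23403G
offset = (23403 - S_START) % 3                # position of the SNP inside that codon
result = classify_substitution("GAT", offset, "G")  # GAT is an assumed reference codon
```

Under these assumptions the sketch places A23403G in codon 614 of S and classifies it as a non-synonymous D-to-G change, consistent with the D614G label used in the text. Genes translated through the ribosomal frameshift in ORF1ab would need an extra coordinate adjustment that this sketch does not attempt.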
A nine-nucleotide deletion at 686-694 nt, found in 29 sequences, leads to the deletion of three amino acids in the nsp1 region. Another deletion of six nucleotides (515-520 nt) also occurs in the nsp1 region, resulting in the loss of two amino acids (72V, 73M). A deletion of three nucleotides was also found in the S gene at 21991-21993 nt; accordingly, a single amino acid (Y) is missing at position 144 of S [...]

S: The SARS-CoV-2 S protein is a major target of neutralizing antibodies and mediates ACE2 binding and entry into host cells. SNPs in the S gene potentially affect protein antigenicity and cellular tropism. In this study, a total of 75 non-synonymous mutations were found in the spike protein ([...] countries. An increasing trend of D614G was observed globally, and it was reported that this mutation leads to reduced S1 shedding and increased viral infectivity. However, its impact on therapeutic and vaccine design is limited. Rather than lying in the receptor-binding domain (RBD), D614G is located in the interface between the spike [...]

[...] fingers (10 SNPs), palm (7 SNPs) and thumb (4 SNPs) (Table 5). Notably, SNP C14408T (P323L), with 7,517 counts, is located in the interface domain and was found in 27 countries. The interface domain is still poorly studied and presumably interacts with other proteins that regulate the catalytic activity of RdRp. In most cases, spike D614G was accompanied by RdRp P323L. Structural analysis shows that the P323L mutation results in considerable changes in secondary structure at this site, and the substitution of proline by leucine could damage the structural integrity conferred by proline (Figure 5). Similarly, substitution by valine, which has a larger side chain, at position 97 changes the secondary structure of RdRp. It has been reported that A97V and P323L alter protein stability and intramolecular interactions, thus affecting RdRp functions (Chand et al., 2020).
Studies have focused more on the impact of spike D614G, whereas RdRp P323L may also play a role in viral genome replication and transcription. Other more frequent [...]

3CLpro: 3CLpro serves as a potential target of anti-viral inhibitors because of its crucial cleavage activity and its functions in viral replication. As shown in Table 6 [...] Mutations were not found at these two sites, suggesting that pharmacological inhibitors of 3CLpro may still serve as therapeutics for SARS-CoV-2. However, with multiple high-frequency mutations found in 3CLpro, especially G15S, K90R and D248E, more studies of their impact on cleavage activity and on the efficacy of 3CLpro drugs are needed.

PLpro: Like 3CLpro, PLpro also mediates proteolytic processing of the polyprotein. In this study, a total of 10 non-synonymous SNPs were found in the PLpro region ([...]

By analysing 11,183 whole genomes of SARS-CoV-2, we demonstrated high genetic variability between regions and produced a detailed mutational profile across the genome and for key viral proteins (S, RdRp, 3CLpro and PLpro). In the present study, 60 of the 119 SNPs are nucleotide substitutions from C to T, the most abundant transition. Consistent with previous studies, this bias increases the frequency of codons for hydrophobic amino acids and provides evidence of potential anti-viral editing mechanisms driven by the host (Matyasek & Kovarik, 2020; Mercatelli & Giorgi, 2020; Simmonds, 2020). On the other hand, more C-to-T transitions imply lower CpG abundance, which results from cytosine methylation and deamination to T. This mutational pattern was also observed in bat RaTG13 and other coronaviruses, indicating rapid adaptation and evolution of the virus in the host (Matyasek & Kovarik, 2020; Simmonds, 2020). Among all known betacoronaviruses, SARS-CoV-2 shows the most extreme CpG deficiency, which contributes to evasion of host anti-viral defence mechanisms (Xia, 2020).
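The two quantities discussed above, the share of C-to-T changes among SNPs and CpG depletion, can be computed with a few lines of code. The sketch below uses the common observed/expected CpG ratio, N_CG · L / (N_C · N_G); that this is the exact index used in the cited studies is an assumption, the sequence is a toy string rather than a viral genome, and the function names are hypothetical.

```python
def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: N_CG * L / (N_C * N_G)."""
    seq = seq.upper().replace("U", "T")
    n_c, n_g = seq.count("C"), seq.count("G")
    if n_c == 0 or n_g == 0:
        return 0.0
    n_cg = seq.count("CG")  # 'CG' cannot overlap itself, so count() is exact
    return n_cg * len(seq) / (n_c * n_g)

def c_to_t_fraction(snp_labels):
    """Fraction of SNP labels such as 'C14408T' that are C-to-T changes."""
    hits = sum(1 for s in snp_labels if s[0] == "C" and s[-1] == "T")
    return hits / len(snp_labels)

toy = "ACGTCGTTAACCGG"                 # toy sequence, not real data
ratio = cpg_obs_exp(toy)
share = c_to_t_fraction(["C14408T", "A23403G", "C3037T", "G28881A"])
```

A ratio well below 1 indicates CpG depletion relative to base composition. For the counts given in the text, 60 C-to-T changes out of 119 SNPs correspond to a share of roughly 50%.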
The SARS-CoV-2 mutational pattern varies between regions, with North American and European countries showing more stability and Asian countries more variability (Figure 3). In addition, we did not observe a consistent mutational pattern that contributes to the case mortality/morbidity rate, although some countries such as France, Belgium and the UK have a much higher fatality rate while countries such as Singapore and Iceland have a much lower one (Figure S2). Multiple factors have been reported to affect the course of the COVID-19 pandemic. Stringent measures such as quarantine, social distancing and isolation of infected patients were implemented in China and resulted in successful containment of the epidemic (Anderson et al., 2020). Differing social and economic factors among countries also influence the spread and outcomes of the disease (Qiu et al., 2020). In addition, according to WHO, mortality is higher in people older than 65 years and in those with underlying comorbidities, such as serious heart conditions, chronic lung disease, high blood pressure, obesity and diabetes (Lai et al., 2020; Ruan, 2020; Weiss & Murdoch, 2020). [...] (Gallaher, 2020; Paraskevis et al., 2020; Sashittal et al., 2020).

We also observed that the N gene has 15 mutated nucleotide positions, followed by nsp2 and nsp3 (13), the S gene (10), nsp14 (8), nsp12 and ORF3a (7), nsp13 (6) and nsp5 (5). This pattern is consistent with previous results reporting that mutations in ORF1a, ORF1b, S and the N gene were detected at high frequency (Kim et al., 2020). N is the most abundant protein expressed from the viral genome and induces a high level of antibody response, which eases serological diagnosis (Azkur et al., 2020). Non-synonymous mutations in the N gene (C28311T, C28854T, G28881A and G28883C), especially G28881A and G28883C, which account for the vast majority of counts and contribute to clade classification, may affect the antigenicity of the N protein. Further studies are needed to determine these impacts.
We also observed that nsp2 and nsp3 have a high mutation frequency (Figure 2). SARS coronavirus nsp1 and nsp2 are the most variable proteins (Graham et al., 2005). Previous research found that nsp2 is dispensable for SARS-CoV replication, although its deletion attenuates viral growth and genome synthesis (Graham et al., 2005). Nsp3 contains the PLpro domain, which has protease-cleavage activity and serves as a target for anti-viral development (Rut et al., 2020). Given its high variability and high-frequency mutations, including G2891A, C3037T, C3177T and C6312A, caution is needed in the development of anti-viral therapeutics. Several single-nucleotide mutations change codons to start or stop codons, which alters protein length (Table 3). Mutations at TRS sites may also affect viral RNA transcription and thus protein expression. Amino acid deletions and insertions were also observed (Table 2), and these may change protein functions.

A detailed mutational profile was produced for several key viral proteins: S, RdRp, 3CLpro and PLpro (Tables 4-7 and Figure 5). The S protein mediates virus binding and entry into host cells and elicits a high level of neutralizing antibody response (BalcioGlu et al., 2020; P. Liu, Cai, et al., 2020; Schmidt et al., 2020). Monoclonal antibodies (mAbs) that target the RBD region have shown promising results as therapeutics and are currently in clinical trials for COVID-19 patients (Alsoussi et al., 2020; Chi et al., 2020; Shi et al., 2020). RdRp, 3CLpro and PLpro are conserved among all strains and play critical roles in viral genome replication and in polyprotein cleavage to form functional viral proteins (Aftab et al., 2020; Chand et al., 2020; Gao et al., 2020; Rut et al., 2020; Ul Qamar et al., 2020; Yin et al., 2020).
Because of their critical polymerase and protease functions, the structures of RdRp and 3CLpro have been solved (Ul Qamar et al., 2020; Yin et al., 2020). Antiviral drugs targeting these proteins are currently under development. Here, we described a detailed mutational profile of these four proteins. Critical mutations that potentially affect protein function were observed and are shown on the protein structures (Figure 5). Although the counts for some of the mutations are not high, they suggest that SARS-CoV-2 may adapt to environmental changes and gain replicative advantages/fitness that allow it to escape anti-viral treatment and become drug-resistant. Thus, further studies are needed to determine whether mutations at key sites affect viral replication and infectivity.

In summary, this study describes a detailed mutational profile of SARS-CoV-2. The landscape of genome-wide mutations across countries provides insights into SARS-CoV-2 transmission and adaptation, as different regions have different mutational patterns. Mutations with high frequency contribute to the clade classification of SARS-CoV-2 strains. This study provides more evidence of SARS-CoV-2 genomic diversity around the globe and of rapid evolution/adaptation of the virus. The detailed mutational profiles of the key viral proteins S, RdRp, 3CLpro and PLpro also give some guidance for better design of anti-viral therapeutics to tackle the disease.

The authors declare that there are no competing interests.

An ethical statement is not applicable since no handling or gathering of human/animal samples was involved in this study.

The data used to support the findings of the manuscript are included within the article.

Fangfeng Yuan https://orcid.org/0000-0001-9310-0382
Leyi Wang https://orcid.org/0000-0001-5813-9505

Analysis of SARS-CoV-2 RNA-dependent RNA polymerase as a potential therapeutic drug target using a computational approach
How will country-based mitigation measures influence the course of the COVID-19 epidemic?
Potential inhibitors against papain-like protease of novel coronavirus (SARS-CoV-2) from FDA approved drugs
Immune response to SARS-CoV-2 and mechanisms of immunopathological changes in COVID-19
Molecular characterization of SARS-CoV-2 in the first COVID-19 cluster in France reveals an amino acid deletion in nsp2 (Asp268del)
SARS-CoV-2 neutralizing antibody development strategies
Coding potential and sequence conservation of SARS-CoV-2 and related animal viruses. Infection
Potent neutralizing antibodies against SARS-CoV-2 identified by high-throughput single-cell sequencing of convalescent patients' B cells
CpG-creating mutations are costly in many human viruses. bioRxiv
Identification of novel mutations in RNA-dependent RNA polymerases of SARS-CoV-2 and their implications on its protein structure
Prediction of the SARS-CoV-2 (2019-nCoV) 3C-like protease (3CLpro) structure: virtual screening reveals velpatasvir, ledipasvir, and other drug repurposing candidates
A neutralizing human antibody binds to the N-terminal domain of the Spike protein of SARS-CoV-2
Evolving geographic diversity in SARS-CoV2 and in silico analysis of replicating enzyme 3CL(pro) targeting repurposed drug candidates
The D614G mutation in SARS-CoV-2 Spike increases transduction of multiple human cell types
The heterogeneous landscape and early evolution of pathogen-associated CpG and UpA dinucleotides in SARS-CoV-2
Adaptive value of high mutation rates of RNA viruses: Separating causes from consequences
Genomewide association study of severe Covid-19 with respiratory failure
An Alphavirus-derived replicon RNA vaccine induces SARS-CoV-2 neutralizing antibody and T cell responses in mice and nonhuman primates
Coronaviruses: An overview of their replication and pathogenesis
A palindromic RNA sequence as a common breakpoint contributor to copy-choice recombination in SARS-COV-2
Structure of the RNA-dependent RNA polymerase from COVID-19 virus
The nsp2 replicase proteins of murine hepatitis virus and severe acute respiratory syndrome coronavirus are dispensable for viral replication
Making sense of mutation: What D614G means for the COVID-19 pandemic remains unclear
The genomic variation landscape of globally-circulating clades of SARS-CoV-2 defines a genetic barcoding scheme. bioRxiv
Nextstrain: Realtime tracking of pathogen evolution
BioEdit: A user-friendly biological sequence alignment editor and analysis program for Windows 95/98/NT
Phylogenetic Clustering by Linear Integer Programming (PhyCLIP)
Structure of replicating SARS-CoV-2 polymerase
The D614G mutation of SARS-CoV-2 spike protein enhances viral infectivity and decreases neutralization sensitivity to individual convalescent sera. bioRxiv
MAFFT online service: Multiple sequence alignment, interactive sequence choice and visualization
Genome-Wide Identification and characterization of point mutations in the SARS-CoV-2 genome. Osong Public Health and Research Perspectives
Tracking changes in SARS-CoV-2 Spike: Evidence that D614G increases infectivity of the COVID-19 virus
Spike mutation pipeline reveals the emergence of a more transmissible form of SARS-CoV-2. bioRxiv
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and coronavirus disease-2019 (COVID-19): The epidemic and the challenges
Structure of the SARS-CoV-2 spike receptor-binding domain bound to the ACE2 receptor
Association between ABO blood groups and risk of SARS-CoV-2 pneumonia
Human Coronavirus-229E, -OC43, -NL63, and -HKU1. Reference Module in Life Sciences
Dynamic surveillance of SARS-CoV-2 shedding and neutralizing antibody in children with COVID-19
Mutation patterns of human SARS-CoV-2 and bat RaTG13 coronavirus genomes are strongly biased towards C>U transitions, indicating rapid evolution in their hosts
Geographic and genomic distribution of SARS-CoV-2 mutations
Protein structure analysis of the interactions between SARS-CoV-2 spike protein and the human ACE2 receptor: From conformational changes to novel neutralizing antibodies
The D614G mutation in the SARS-CoV2 Spike protein increases infectivity in an ACE2 receptor dependent manner
Naturally mutated spike proteins of SARS-CoV-2 variants show differential levels of cell entry. bioRxiv
Full-genome evolutionary analysis of the novel corona virus (2019-nCoV) rejects the hypothesis of emergence as a result of a recent recombination event
Impacts of social and economic factors on the transmission of coronavirus disease 2019 (COVID-19) in China
A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology
MAFFT-DASH: Integrated protein sequence and structural alignment
Likelihood of survival of coronavirus disease 2019. The Lancet Infectious Diseases
Activity profiling and structures of inhibitor-bound SARS-CoV-2-PLpro protease provides a framework for anti-COVID-19 drug design
Genome-wide analysis of Indian SARS-CoV-2 genomes for the identification of genetic mutation and SNP. Infection
Characterization of SARS-CoV-2 viral diversity within and across hosts. bioRxiv
Measuring SARS-CoV-2 neutralizing antibody activity using pseudotyped and chimeric viruses
A human neutralizing antibody targets the receptor-binding site of SARS-CoV-2
Rampant C->U hypermutation in the genomes of SARS-CoV-2 and other coronaviruses: Causes and consequences for their short- and long-term evolutionary trajectories. mSphere
Structure-based drug repositioning over the human TMPRSS2 protease domain: Search for chemical probes able to repress SARS-CoV-2 Spike protein cleavages
On the origin and continuing evolution of SARS-CoV-2
Temporal profiles of viral load in posterior oropharyngeal saliva samples and serum antibody responses during infection by SARS-CoV-2: An observational cohort study. The Lancet Infectious Diseases
Synonymous mutations reduce genome compactness in icosahedral ssRNA viruses
Structural basis of SARS-CoV-2 3CL(pro) and anti-COVID-19 drug discovery from medicinal plants
Structure, function, and antigenicity of the SARS-CoV-2 spike glycoprotein
Clinical course and mortality risk of severe COVID-19
Inhibition of SARS-CoV-2 (previously 2019-nCoV) infection by a highly potent pan-coronavirus fusion inhibitor targeting its spike protein that harbors a high capacity to mediate membrane fusion
Extreme genomic CpG deficiency in SARS-CoV-2 and evasion of host antiviral defense
The I-TASSER Suite: Protein structure and function prediction
Zoonotic origins of human coronaviruses
Structural basis for inhibition of the RNA-dependent RNA polymerase from SARS-CoV-2 by remdesivir
SARS-CoV-2 Spike protein variant D614G increases infectivity and retains sensitivity to antibodies that target the receptor binding domain
The D614G mutation in the SARS-CoV-2 spike protein reduces S1 shedding and increases infectivity
A Novel Coronavirus from Patients with Pneumonia in China