key: cord-297786-jz1d1m2e authors: Hasan, Md. Mahbub; Das, Rasel; Rasheduzzaman, Md.; Hussain, Md Hamed; Muzahid, Nazmul Hasan; Salauddin, Asma; Rumi, Meheadi Hasan; Rashid, S M Mahbubur; Siddiki, AMAM Zonaed; Mannan, Adnan title: Global and Local Mutations in Bangladeshi SARS-CoV-2 Genomes date: 2020-08-26 journal: bioRxiv DOI: 10.1101/2020.08.25.267658 sha: doc_id: 297786 cord_uid: jz1d1m2e Corona Virus Disease-2019 (COVID-19) warrants comprehensive investigations of publicly available Severe Acute Respiratory Syndrome-CoronaVirus-2 (SARS-CoV-2) genomes to gain new insight about their epidemiology, mutations and pathogenesis. Nearly 0.4 million mutations were identified so far in ∼60,000 SARS-CoV-2 genomic sequences. In this study, we compared 207 of SARS-CoV-2 genomes reported from different parts of Bangladesh and their comparison with 467 globally reported sequences to understand the origin of viruses, possible patterns of mutations, availability of unique mutations, and their apparent impact on pathogenicity of the virus in victims of Bangladeshi population. Phylogenetic analyses indicates that in Bangladesh, SARS-CoV-2 viruses might arrived through infected travelers from European countries, and the GR clade was found as predominant in this region. We found 2602 mutations including 1602 missense mutations, 612 synonymous mutations, 36 insertions and deletions with 352 other mutations types. In line with the global trend, D614G mutation in spike glycoprotein was predominantly high (95.6%) in Bangladeshi isolates. Interestingly, we found the average number of mutations in ORF1ab, S, ORF3a, M and N of genomes, having nucleotide shift at G614 (n=459), were significantly higher (p≤0.001) than those having mutation at D614 (n=215). Previously reported frequent mutations such as P4715L, D614G, R203K, G204R and I300F were also prevalent in Bangladeshi isolates. Additionally, 87 unique amino acid changes were revealed and were categorized as originating from different cities of Bangladesh. The analyses would increase our understanding of variations in virus genomes circulating in Bangladesh and elsewhere and help develop novel therapeutic targets against SARS-CoV-2. Severe Acute Respiratory Syndrome-CoronaVirus-2 (SARS-CoV-2) has become an etiological agent of the disease called CoronaVirus Disease-2019 (COVID- 19) . As of 21 August 2020, globally there have been 22,492,312 confirmed cases of COVID-19, including 788,503 deaths, reported to WHO. To explore the viral pathogenesis, modern genomics tools are highly crucial and has been employed by researchers around the world. Hundreds of virus whole genomes are now submitted in publicly accessible databases from different parts of the globe everyday. It is hightime to analyze the variations among those sequences which will help future strategic efforts for its preventive measures such as vaccine design and therapeutics. SARS-CoV-2 consists of positive-sense single-stranded RNA with a genome size ranging from ~27 to 34 Kb. It contains a variable number (6) (7) (8) (9) (10) (11) of open-reading frames (ORFs). The first ORF is almost two-third of the whole genome and encodes four structural, 16 non-structural and eight accessory proteins [1, 2] . According to a recent study on 48,635 of SARS CoV-2 genome sequences, 353,341 mutation events have been observed globally in comparison to the reference genome of Wuhan. Among these sequences, India, Congo, Bangladesh and Kazakhstan have significantly high numbers of mutations per sample compared to the global average [3] . Out of these mutations, D614G mutation (causing aspartate to glycine in S protein position 614) is reported to be the most prevalent mutations reported from Europe, Oceania, South America, Africa [4] . Zhang et al. (2020) reported that the level of angiotensin-converting enzyme 2 (ACE2) expression was distinctly higher by the retroviruses pseudotyped with G614 compared to that of D614 [4, 5] . The functional properties of the (D614) and (G614) were compared in this study and G614 was found to be more stable than D614 with more transmission efficiency, supporting the previous epidemiological data [5] . Another reported mutation of ORF1ab is P4715L linked with D614G, that has been reported to have a strong relationship with higher fatality rates in 28 countries and 17 states of the United States [3, 6] . Three other mutations namely C14408T (ORF1b), C241T (5' UTR), C3037T (ORF1a) is reported to be common and co-occurring in the same genome while G11083T has been found mostly in Asian countries [6] . Hassan et al. (2020) investigated the accumulation of ORF3a mutations of SARS-CoV-2 from India where they revealed four types of mutations (Q>H, D>Y and S>L) near TRAF, ion channel and caveolin binding domain, respectively [7] . Notable that all these mutations might have implications in maintaining the virulence of the virus and NLRP3 inflammasome activation. In Bangladesh, as of 21 August, 2020, nearly 287,959 people are infected and 3822 people have died due to COVID-19 (https://iedcr.gov.bd/). Among them, 329 of SARS-CoV-2 genome sequences were deposited in GISAID database (https://www.gisaid.org). Analysis of these sequences is sparsely reported in the literature. analyzed 64 sequences and identified the presence of 180 mutations in the coding regions of the viruses and mutations at nsp2 was the most prevalent [11] . Due to the small numbers of genome sequence analyzed, most of these findings were not conclusive and representative. Further comprehensive analyses is therefore necessary to better understand the circulating virus in the country. In this study, we first compared 207 genome sequences isolated from Bangladesh with time-resolved phylogenetic analysis and investigated the origin of imported COVID-19 cases to Bangladesh. Then, we studied the variants present in different isolates of Bangladesh to investigate the pattern of mutations, identify UMs, and discuss the pseudo-effect of these mutations on the structure and function of encoded proteins, with their role in pathogenicity. Most interestingly, we found 87 UMs with a total count of 113 in Bangladeshi isolates which will increase our understanding of distribution of SARS-CoV-2 virus in different regions and associated pathogenicity. As of 30 June, 2020, a total of 35,723 whole genomic RNA sequences of SARS-CoV-2 had been submitted to GISAID. From these downloaded sequences, a custom python script was used to retrieve unique sequences. The same script also removed any sequence containing "N" and other ambiguous IUPAC codes [12] . This resulted in a total of 8723 complete genomic sequences. To select representative sequences from curated 8723 sequences and make a comparison against sequences from Bangladesh, priorities were given to those countries that had a higher number of infections in each continent (source: https://www.worldometers.info/coronavirus/). We selected these sequences in such a way that each continent must contain at least one sequence from each GISAID clade. The number of sequences selected from a country was based on the total number of unique sequences retrieved. This resulted in a total of 467 unique representative sequences from 42 countries (see Supporting File Table S1 ). From Bangladesh, 215 whole-genome RNA sequences of SARS-CoV-2 were uploaded to GISAID as of 15 July, 2020. Only high coverage complete sequences (n=207) were kept for analysis. All these 207 sequences were aligned with the previously selected 467 representative sequences along with that of Wuhan-1 (Accession ID MN908947) as a reference sequence [13] . To ensure comparability, the flanks of all the sequences were truncated to the consensus range from 56 to 29,797 [14] , with nucleotide position numbering according to the Wuhan 1 reference sequence, prior to alignment. Multiple Sequence Alignment (MSA) and phylogenetic tree construction were carried out using Molecular Evolutionary Genetic Analysis (MEGA X) software [15] . All selected sequences were aligned using MUSCLE software tool [16] . Later, an NJ (Neighbor-Joining) phylogenetic tree [17] was constructed using the Tamura-Nei method. Tree topology was assessed using a fast bootstrapping function with 1000 replicates. Tree visualization and annotations were performed in the Interactive Tree of Life (iTOL) v5 [18] . The Genome Detective Coronavirus Typing Tool Version 1.13 was used for variant analyses of SARS-CoV-2 genome which is specially designed for this virus analysis (https://www.genomedetective.com/app/typingtool/cov/) [19] . For analysis of (UM) among 207 genomic sequences from Bangladesh, we used a CoV server hosted in GISAID server (https://www.gisaid.org/epiflu-applications/covsurver-app/). The server analyzed our dataset against all available genomic sequences of SARS-CoV-2 including the Wuhan reference sequence deposited on GISAID until July 16, 2020. Descriptive and inferential statistics were used to analyze different mutations and their correlation with different categorical variables. For correlation, we used one-way analysis of variance using SPSS Statistics 25 (IBM, Armonk, New York) licensed to King's College London. To understand the SARS-CoV-2 viral transmission in Bangladesh, we performed phylogenetic analysis on the selected 207 viral genomes reported from different districts of Bangladesh along with selected 467 globally submitted sequences as reported from 42 countries and 6 continents ( Figure 1 ). This apparently represents the overall clade distribution of all global sequences along with sequences from Bangladeshi isolates. GR clade was found predominant in Bangladesh as about 84% of the sequences were grouped to this clade followed by GH and G with ~6 and ~5%, respectively. Similar clade distribution has been found in isolates submitted from European countries. We also attempted to compare the sequence data among different districts of Bangladesh from where patient samples were collected for sequencing, Looking at the districtwise distribution of clade (Figure 2) , it was found that the sequences from three districts, namely Dhaka, Narayanganj and Rangpur primarily belong to GR clade. Conversely, only sequences from Chattogram district were from S and GH clades. On another note, phylogenetic analyses of the clade distribution of isolates from countries like Saudi Arabia, where GH clade was predominant [3] it is highly likely that the introduction of GH and S clades in Bangladesh could be of Middle-Eastern origin. Based on the datsets, we hypothesized that Bangladeshi SARS-CoV-2 isolates belonging to different clades might have critical implications concerning viral transmission rate, virulence, severity and other aspects of disease pathogenesis. In addition, the presence of different clades of SARS-CoV-2 strains in different districts could also have implications in the accuracy of diagnostic tests that are underway. Our analyses revealed a total of 2602 mutations observed among 207 Bangladeshi genomic sequences that constitutes a number of 1602 missense, 612 synonymous, 36 insertion/deletions and 352 other mutations (Table 1) . We identified mutations like I300F (nsp2), P4715L (nsp12), D614G (S glycoprotein), R203K and G204R (N protein) are the most frequently occurring common mutations found in Bangladesh with a frequency of 138, 199, 198, 177 and 177, respectively ( Figure 3 ). Notable that, no particular mutations occurred at any specific time period rather they have been observed over the whole period of disease incidence. Firstly, 1163A>T (I300F), can be a destabilizing factor for nsp2 and thus modulating the strategy of host cell survival [11, 20] . This mutation can lead to a reduction of conformational entropy due to the presence of the side chains that can result in charge neutralization of the phosphorylated serine residues [21] . Secondly, RNA dependent RNA polymerase (nsp12) is significant for replication and transcription of the viral RNA genome. Therefore, P > L at 4715 in nsp12 may have some effects on RNA transcription. This mutation was also observed in most of the USA states (28 out of 31). The same mutation was prevalent in European countries like Spain, France etc. This alteration could affect the pathogenesis triggered by antibody escape variants with the epitope loss [22, 23] . Thirdly, Korber et al. (2020) stated that the G614 type might have originated either in Europe or China [24] . They also reported that the original Wuhan D614 form was also predominant in Asian samples. Meanwhile, the G614 form had clearly established and started expanding in countries outside of China. We also noticed 95.6% of genomes from Bangladesh have D614G mutation, which is also dominant in the world. However, the average number of mutations per ORF is varied among D614 and G614 containing genomes that we have studied (n=674) as revealed by Table 2 . The average number of mutations in ORF1ab, S, ORF3a, M and N of genomes having mutated G614 (n=459) are significantly higher (p≤0.001) than those having wild D614 (n=215) in S glycoprotein (Table 2; Figure 1 ). Interestingly, the average mutation number is declined in ORF8 of genomes having G614 mutation (p<0.001). This correlation indicates that the genomes containing D614G mutation are more prone to bear other mutations which may facilitate the notion that the link of this mutation with the transmission and pathogenesis of SARS-CoV-2 [4, 12, 25] . Finally, R203K and G204R mutations in N protein were previously reported in Indian, Spanish, Italian, and French samples [21, 26] . These mutations are located in the site of the SR-rich region which has been reported to be intrinsically disordered [27] . This region further incorporates a few phosphorylation sites [28] , including the GSK3 phosphorylation at Ser202 and a CDK phosphorylation site at Ser206 which are located close to the position of this mutation. The 'SRGTS' (202-206) and 'SPAR' (206-209) sequence motifs are dependent on GSK3 and CDK phosphorylation motifs, respectively. Other variations (28881G>A and 28882G>A) together convert polar to non-polar amino acid (R203K) and 28883G>C variation converts nonpolar to polar amino acid (G204R). We observed 87 unique shifts from the different proteins of SARS-CoV-2 genomes found in Bangladesh (Table S2 ). Details of their pseudo-effects on viral replication, assembly, transmissivity and pathogenicity are corroborated in Table S2 . Surprisingly, most of these UM were localized except a few exceptions. For example, the UMs E8D (E), S180T (N), L139J and is involved in the transcription and replication of CoV RNAs. We observed 4 UMs in nsp1, but the location of these amino acids was not in the KH domain (K164 and H165 of nsp1), which binds with ribosome 40s subunit [30] . However, nsp1 acts as a primary virulence factor in SARS-CoV-2 infection, and mutation in this protein could affect the structure and functional properties, thereby altering its virulence properties. Nine UMs were seen in nsp2, but their effects on host cells were merely reported in the literature. Since nsp2 interacts with the host proteins and disrupts the host cell survival signaling pathway [31] , any mutation in nsp2 may play a crucial role in SARS-CoV infections. In a recent study, it has been found that, compared to Bat SARS-CoV, SARS-CoV-2 has a stabilizing mutation at amino acid position 501, T501Q, which alters the viral pathogenicity and makes the virus more contagious [32] . We did not observe this mutation in our study. The mutations of nsp3 are responsible for affecting the virus assembly and hence their replication. It is due to the disruption of replicase polyprotein processing into nsps. These nsps assemble with cellular membranes and facilitate virus replication. The UMs found in nsp3 might have some probable effects. Firstly, the UMs (A889V, R883G, S1038F and V843F) were found in the main domain of nsp3 that is important for processing endopeptidases from coronaviruses. Secondly, many UM (e.g. A1803V, A1819S, G1691C, I1672S, A602S, D782B, K462R, L373M, N1337S, N51D, R883G, T1363Is, Y246C and Y272H etc) are found in topological (cytoplasmic) domain rather than transmembrane domain which could interfere less on cytoplasmic double-membrane vesicle formation, necessary for viral replication. Thirdly, we observed three UMs (L373M, Y246C and Y272H) in ADP-ribose-1′-phosphatase (ADRP) or (Macro) domain. It has been shown that mutations of the ADRP domains does not diminish virus replication in mice, but reduces the production of the cytokine IL-6, which is an important pro-inflammatory molecule [33] . However, we did not observe any mutation in the active sites and zinc finger motif, attributing normal catalytic activity of nsp3. The general opinion is that SARS-CoV PL-PRO domain is important for the development of antiviral drugs and of the actual role of this enzyme in the biogenesis of the COVID-19 replicase complex is yet not explored. It is proposed that the proteins nsp3, nsp4 and nsp6, through their transmembrane domains, are involved in the replicative and transcription complex [34] . In our study, we observed only 3 UMs in nsp4 and nsp6, respectively. Meanwhile, nsp5 encodes 3C-like proteinase which cleaves the C-terminus of replicase polyprotein at 11 sites. The five UMs that we found in nsp5 did not fall in its active sites (3304 and 3408 in ORF1ab) requiring further investigation with large number of sequence datasets. The present global outbreak of COVID-19, caused by SARS-CoV-2, has already taken ~0.8 million lives. To combat this deadly disease, we need a greater understanding of the pathobiology of the virus. Hence, it is essential to minimize the translational gap between viral genomic information and its clinical consequences for developing effective therapeutic strategies. In this study, we have attempted to explore genomic variations of Bangladeshi SARS-CoV-2 viral isolates while comparing with a large cohort of global isolates. Our analyses will facilitate the understanding of the origin, mutation patterns and their possible effect on viral pathogenicity. This study tries to address the importance of the variations in the viral genomes and their necessity for therapeutic interventions. The unique insights from this study will undoubtedly be supportive for a better understanding of SARS-CoV-2 molecular mechanism and to draw an end to the current life-threatening pandemic. Emerging coronaviruses: genome structure, replication, and pathogenesis Geographic and Genomic Distribution of SARS-CoV-2 The D614G mutation in the SARS-CoV-2 spike protein reduces S1 shedding and increases infectivity Polymorphism and selection pressure of SARS-CoV-2 vaccine and diagnostic antigens: implications for immune evasion and serologic diagnostic performance SARS-CoV-2 genomic variations associated with mortality rate of COVID-19 Molecular conservation and Differential mutation on ORF3a gene in Indian SARS-CoV2 genomes Emergence of European and North American mutant variants of SARS-CoV-2 in South-East Asia Genetic analysis of SARS-CoV-2 isolates collected from Bangladesh: insights into the origin, mutation spectrum Genome Analysis of SARS-CoV-2 Isolate from Bangladesh In silico comparative genomics of SARS-CoV-2 to determine the source and diversity of the pathogen in Bangladesh Spike mutation pipeline reveals the emergence of a more transmissible form of SARS-CoV-2, bioRxiv A new coronavirus associated with human respiratory disease in China Phylogenetic network analysis of SARS-CoV-2 genomes MEGA X: molecular evolutionary genetics analysis across computing platforms MUSCLE: multiple sequence alignment with high accuracy and high throughput The neighbor-joining method: a new method for reconstructing phylogenetic trees Interactive Tree Of Life (iTOL) v4: recent updates and new developments Genome Detective Coronavirus Typing Tool for rapid identification and characterization of novel coronavirus genomes Complete Genome Sequence of a Novel Coronavirus (SARS-CoV-2) Isolate from Bangladesh Mutations in SARS-CoV-2 viral RNA identified in Eastern India: Possible implications for the ongoing outbreak in India and impact on viral structure and host susceptibility Mutational spectra of SARS CoV 2 orf1ab polyprotein and signature mutations in the United States of Non-synonymous Mutations of SARS-Cov-2 Leads Epitope Loss and Segregates its Varaints Spike: evidence that D614G increases infectivity of the COVID-19 virus Making sense of mutation: what D614G means for the COVID-19 pandemic remains unclear Variant analysis of SARS-CoV-2 genomes The SARS coronavirus nucleocapsid protein-forms and functions The severe acute respiratory syndrome coronavirus nucleocapsid protein is phosphorylated and localizes in the cytoplasm by 14-3-3-mediated translocation Emergence of RBD and D614G Mutations in Spike Protein: An Insight from Indian SARS-CoV-2 Genome Analysis Structural basis for translational shutdown and immune evasion by the Nsp1 protein of SARS-CoV-2, bioRxiv Mutational screening of the proteome of Sars-Cov-2 isolates: mutability of ORF3a COVID 2019: the role of the nsp2 and nsp3 in its pathogenesis Murine coronavirus ubiquitin-like domain is important for papainlike protease stability and viral pathogenesis SARS-CoV-2: virus mutations in specific European populations