key: cord-0817215-ze33iu2v authors: Islam, Ovinu Kibria; Al‐Emran, Hassan M.; Hasan, Md. Shazid; Anwar, Azraf; Jahid, Md. Iqbal Kabir; Hossain, M. Anwar title: Emergence of European and North American mutant variants of SARS‐CoV‐2 in Southeast Asia date: 2020-07-23 journal: Transbound Emerg Dis DOI: 10.1111/tbed.13748 sha: c4cb55bd994d47a62f392dae9057435f54511170 doc_id: 817215 cord_uid: ze33iu2v The SARS‐CoV‐2 coronavirus is responsible for the current COVID‐19 pandemic, with an ongoing toll of over 5 million infections and 333 thousand deaths worldwide within the first 5 months. Insight into the phylodynamics and mutation variants of this virus is vital to understanding the nature of its spread in different climate conditions. The incidence rate of COVID‐19 is increasing at an alarming pace within subtropical Southeast Asian nations with high temperatures and humidity. To understand this spread, we analyzed 444 genome sequences of SARS‐CoV‐2 available on the GISAID platform from 6 Southeast Asian countries. Multiple sequence alignments and maximum likelihood phylogenetic analyses were performed to analyze and characterize the non‐synonymous mutant variants circulating in this region. Global mutation distribution analysis showed that the majority of the mutations found in this region are also prevalent in Europe and North America, and the concurrent presence of these mutations at a high frequency in other countries indicate possible transmission routes. Unique spike protein and non‐structural protein mutations were observed circulating within confined area of a given country. We divided the circulating viral strains into 4 major groups and 3 sub‐groups on the basis of the most frequent non‐synonymous (NS) mutations. Strains with a unique set of 4 co‐evolving mutations were found to be circulating at a high frequency within India, specifically. Group 2 strains characterized by two co‐evolving NS mutants which alter in RdRp (P323L) and spike (S) protein (D614G), were found to be common in Europe and North America. These European and North American variants have rapidly emerged as dominant strains within Southeast Asia, increasing from a 0% prevalence in January to an 81% by May 2020. These variants may have an evolutionary advantage over their ancestral types and could present a large threat to Southeast Asia for the coming winter. The SARS-CoV-2 ssRNA virus was initially detected in December 2019 within China's Hubei province and its associated COVID-19 pandemic is currently ongoing, with a global toll of over 5 million infections and 333 thousand deaths so far (WHO, Corona virus disease (COVID-19) situation report-124, published on 23 rd May, 2020). While China encountered the pandemic first, North American and European nations have also suffered crippling blows from this pandemic. Moreover, the infection numbers within Southeast Asian, South American and African nations are growing. The This article is protected by copyright. All rights reserved number of infections in resource-poor, low-income countries may be underestimated due to their limited testing capacity. A number of earlier studies indicated that high temperatures and high humidity could decrease the spread of the virus (Demongeot, Flet-Berliac & Seligmann, 2020; Sajadi et al., 2020; Wang, Tang, Feng, & Lv, 2020) . Despite having such climate conditions, cases are increasing at an alarming rate in relatively hot and humid subtropical Southeast Asian countries. As of 22 nd May, a total of 182,278 cases of SARS-CoV-2 infected cases were identified in Southeast Asia, with a death toll of over 5,500 (WHO, Corona virus disease (COVID-19) situation report-124, published on 23 rd May, 2020). Scientists from all over the world are making an unprecedented effort to expose the genomic profiles, and characterize the mutation variants of the circulating virus to get insight into its evolutionary patterns and driving force. Tang, X.L. et al. analyzed 103 genomic sequences and indicated that the circulating SARS-CoV-2 strains have two major lineages, one with a synonymous mutation (NSP4_S75S) and the other with a non-synonymous mutation (NS8_L84S) . In another study, Forster, Peter et al. found 3 central variants by analyzing 160 sequences and claimed that B type viruses (with amino acid (aa) substitution, NS8_L84S) were common in East Asia, whereas A type (ancestral lineage) and C type (NS3_G251V variant) were prevalent in Europe and North America (Forster, Forster, Renfrew, & Forster, 2020) . Changchuan Yin analyzed 558 genome sequences and found 15 high frequency single SNP genotypes (Yin, 2020). He suggested four major groups with either single or co-evolving mutations; group 1 with NSP6_L37F, group 2 with NS3_G251V, group 3 with co-evolving mutation at 2 sites (NSP4_S75S and NS8_L84S) and group 4 with co-evolving mutations at 4 sites (241C > T-leader sequence, NSP3_F105F, NSP12_P323L and S_D614G). In addition, GISAID differentiated COVID-19 into three major clades; Clade S (prevalent in North America), Clade V (prevalent in Asia and Europe) and Clade G (prevalent in Europe), having non-synonymous mutations with aa substitutions at NS8_L84S, NS3_G251V and S_D614G respectively (Fuertes et al., 2020) . Later on they divided the G clade into GR and GH clade that contain N_203-204: RG> KR and NS3_Q57H aa substitutions, respectively. Finally, based on genomic data available from the GISAID, Nextstrain's SARS-CoV-2 global genomic epidemiology analysis show 10 major clades for this virus. In this study, we investigated 444 sequenced genomes of This article is protected by copyright. All rights reserved SARS-CoV-2 available from Southeast Asia to provide an insight into the genomic divergence, phylogeny and recurrent mutations in this region. In Southeast Asia, a total of 444 high coverage SARS-CoV-2 genomes from India (n=327), Bangladesh (n=20), Indonesia (n=8), Thailand (n=83), Sri Lanka (n=4), Nepal (n=1) and Myanmar (n=1) have been submitted to GISAID (www.gisaid.org) as of 23 rd May, 2020. This study analyzed all 444 genome sequences using hCoV19 / Wuhan / WIV04 / 2019 (Accession: EPI_ISL_ 402124) as a reference genome for phylogenetic and mutation analysis. Genome sequences of all 444 SARS-CoV-2 were aligned using the MAFFT (Multiple Alignment using Fast Fourier Transform) algorithm (Katoh, Misawa, Kuma, & Miyata, 2002) This article is protected by copyright. All rights reserved The S protein sequences were compared with the reference S protein to identify non-synonymous mutation sites. The 3D structure of the S protein was constructed using Phyre 2 (Kelley, Mezulis, Yates, Wass, & Sternberg, 2015) and the mutation sites were visualized with PyMOL(TM) 2.4.0 software (Schrodinger, 2017) The aligned sequences were used for the construction of a phylogenetic tree using the maximumlikelihood method in Molecular Evolutionary Genetics Analysis (MEGA X) software (Kumar, Stecher, Li, Knyaz, & Tamura, 2018) . Interactive Tree Of Life (iTOL) v5 (Letunic & Bork, 2019 ) was used to adjust the branch and label color of the phylogenic tree. A mutation-time plot for Southeast Asia was prepared by analyzing all 444 sequences available in GISAID. A proportion of mutation based clusters were plotted monthly from 1 st of January until 23 rd of May, 2020. Additionally, the numbers of COVID-19 infections and deaths were added in the same time plot collected from WHO situation reports. Transmission patterns (performed on 23 nd May, 2020) were observed in Southeast Asia using country filters on Nextstrain, an online based real-time pathogen evolution tracking tool (Hadfield et al., 2018) . Nextstrain server filtered 329 sequences from 444 available sequences to create the transmission map. The frequencies of non-synonymous mutations found after in depth analysis of 60 genomes were analyzed globally using GISAID based SARS CoV-2 mutation statistics server. Countries with a high frequency of those specific mutations were marked on a geographic heat map using the online tool Maptive. This article is protected by copyright. All rights reserved Preliminary mutation analysis with 60 of 444 SARS-CoV-2 sequences from Southeast Asia revealed 78 non-synonymous mutations. Most of them (n=52) were found in non-structural proteins. Other mutations with amino acid (aa) substitutions were present in S protein (n=13), N protein (n=9), M protein (n=3) and E protein (n=1). The Nepal SARS-CoV-2 genome, which was sequenced early, had no non-synonymous mutations compared to our reference sequence (Table-1 ). This subset of data identified 21 NS mutations including 4 aa alterations in S proteins that solely observed in Southeast Asia ( Table 2 ). The majority of these unique mutations (n=15) have arisen only once to-date. The remaining 6 were present more than once, but each of these variants circulated in a specific country or region. Moreover, we found 13 mutations with aa replacement in S protein. We also identified the 10 most frequent mutations with aa substitution (N_R203K, N_G204R, N_P13L, This article is protected by copyright. All rights reserved S494P), two of them resided in critical receptor binding motif. All the aa substitution sites were marked in 3D structure of S protein (Figure 3 ).In the cluster based time-plot, 444 available sequences were studied from January until May 2020. The group 2 cluster was not observed until February (0%, n=0), but was36% (n=58) in March, 48% (n=84) in April and 81% (n=61) in May. The group 3 cluster was 0% to 7% in those months, with a sudden increase of more than 40% in March and April. The infections increased from 17 cases in January up to 128,257 in May, 2020 (Figure 4) .During those months, the number of deaths increased up to 3468 cases by May. A geographic heat map ( Figure 5 ) revealed that most of the 78 NS mutations found from 60 sequences were also common in Europe and North America (according to global mutation statistics database, supplementary data-3). The transmission map also revealed that Group 2 sequences (A2clade of Nextstrain) from this study were found to be dominant among viruses circulating in Southeast Asia ( Figure 6 ). This clade contain characteristic S D614G mutation also found to be prevalent in Europe and North America (Cao et al., 2020; Gonzalez-Reiche et al., 2020) . The SARS-CoV-2 pandemic has cost more than 333 thousand lives already, with many more deaths being reported each day. During this period, the global community has yet to predict its virulence, seasonal variation, transmission properties and immunity. However, it is apparent that the fatality rate varies by region and that the degree of virulence varies from person to person (Khafaie & Rahim, 2020; Yao et al., 2020) . Some regions in Europe and North America were affected the more than others, while most of Asia, Africa and Australia remain less affected. Analysis of the ssRNA genome is now crucial to understand the pathogenesis and transmission of the virus. This study has characterized the SARS-CoV-2 virus circulating in Southeast Asia into 4 major groups and 3 sub-groups by studying common non-synonymous mutations. Group 1 consists of 40 strains that do not have the recurrent mutations. Those strains are mostly found in January and February in this subcontinent (Figure 2a and 2b) . Group 2 involves 46% (n=203) of the variants in this study. Strains belonging to this group coevolved with characteristic NS mutation, NSP12_P323L and S_D614G. These variants were initially prevalent in Europe and North America, and now constitute 68% of the virus all over the world. A recent study analyzed 95 sequences and also found NSP12_P323L variants to be at a higher frequency, and reported that this variant was mostly found This article is protected by copyright. All rights reserved outside of Wuhan, China (Khailany, Safdar, & Ozaslan, 2020) . Another study suggests that RNA dependent RNA-polymerase (RdRp) aa substitution at the 323 rd position (NSP12_P323L) causes RdRp fidelity, which, in turn, increases the number of mutations within the virus and causes coevolved mutations (Pachetti et al., 2020) . NSP12_P323L was co-evolved with S_D614G; this particular non-silent S protein mutation generates an additional elastase cleavage site near the S1-S2 junction and thus facilitates fusion and cell entry (Koyama, Weeraratne, Snowdon, & Parida, 2020) . This variant (S_D614G) was first observed in January 28, 2020 and was initially prevalent in Europe. Within 4 months, this variant has now rapidly outcompeted its ancestral subtype all over the world (Bhattacharyya et al., 2020) . These studies explain the frequency of Group 2 variants in Southeast Asia and why these variants have subdivided into additional sub-groups involving co-evolving mutations. We differentiated Group 2 into 2 subgroups, 2a and 2b, which involve N_203-204: RG> KR and Unlike the others, Group 3a was unique, with 4 coevolving mutations. Of these, the NSP6_L37F mutation variant was common (Mercatelli & Giorgi, 2020) ; this mutation variant has also been frequently found in the UK, USA, Australia and India. The other 3 mutations are relatively less common and found mostly in India and Australia. Group 4, on the other hand, consists of a characteristic NS8_L84S mutation variant, which was declared as S type by Tang, X.L. et al. . This mutation was later reported as C type by another group (Forster, Forster, Renfrew, & Forster, 2020) and were clustered as S clade by GISAID (Fuertes et al., 2020) . Group 4 included 45 variants prevalent mostly in Thailand (43%).The analysis of S proteins revealed 49 aa substitution sites. Among them, 4 sites reside in receptor binding domain. The most abundant aa substitution This article is protected by copyright. All rights reserved D614G was found 203 times in all countries except Nepal. The effect of D614G mutation is described earlier, and the effect of other S protein mutations is yet to be evaluated. A recent study conducted with 10,014 sequences identified 13 frequent non-synonymous mutations (Mercatelli & Giorgi, 2020) , while we found only 7 of them, along with 3 less common mutation, at high frequency in this region (Figure 2a) . Most of the S protein mutations identified in this region (supplementary data-2) were also observed in Europe and North America according to the GISAID global mutation database. S protein mutations with aa substitution at 614 position, found in 46% of the studied strains in this region, were also prevalent in Europe and North America. On the contrary, the aa substitutions found at the 1109th position of the S protein found in one Bangladeshi strain was found in another strain from Switzerland. We observed another aa substitution in the S protein at the 76th (S_T76I) position in an Indonesian strain, which was also found in two strains from West Bengal, India (data non shown). This specific aa substitution was identified on 55 occasions according to the global database. Among them, 49 were from Australia, suggesting that this variant might have been transmitted from Australia. Additionally, global mutation distribution statistics showed that S_A829T mutation was observed in 31 sequences, all of them from Thailand (Table 2 , supplementary data-3). NSP2_I120F mutation was found in 9 of the 12 cases from Dhaka, Bangladesh and NSP2_D92G mutation was present in 4 out of the 5 sequences from Chittagong, Bangladesh (Data not shown). These cities are separated by a distance of 250 kilometers, suggesting that those viruses carrying novel mutations were circulating in an area-specific manner. NS mutation and phylogenetic analysis conducted through the Nextstrain database was particularly useful in getting a closer look at mutation variants and their possible routes of transmission. We found a common N_203-204: RG> KR aa substitution (9 out of 12 strains) in Dhaka, Bangladesh. However instead of the common N_203-204: RG> KR aa substitution, a less common aa substitution was observed at the 202 nd position of N protein (N_S202N) among the most (5 out of 7) strains of Chittagong. The mutation distribution database showed that strains having trinucleotide block mutation in N protein were prevalent in Europe and that the N_S202N mutant was found more commonly in recent strains of Saudi Arabia. Phylogenetic analysis by Nextstrain also revealed that the Chittagong strains (belongs to Nextstrain B4 clade and group 4 of our study) have close This article is protected by copyright. All rights reserved relationship with the Saudi Arabian strains, while Dhaka strains (A2 clade in nextstrain, group 2 in this study) are similar to the European ones. The geographical heat-map ( Figure 6 ) of these non-synonymous mutations indicate that most of these mutations were also frequently found in the UK, USA, Australia, Saudi Arabia and other European countries, revealing possible transmission routes to Southeast Asia. Phylogenetic analysis by Nextstrain produced a similar transmission route map ( Figure 5 ). This study also confirmed, through phylogenetic and mutation analysis, that a high percentage of Group 2 strains are linked to European and North American strains (A2 clade in Nextstrain analysis) in India and Bangladesh. We could not analyze the strains from Maldives, Bhutan and Timor-Leste because they do not have whole genome sequence data of the virus at the time of our analysis. Among the sevent countries with available genome sequences, only India, Bangladesh and Indonesia have reported a higher number of SARS-CoV-2 infections. The frequencies of infection have increased exponentially from mid-April, 2020. In our study, it was also shown that Group 2 variants containing NSP12_P323L and S_D614G mutations were not found earlier than March. The time plot data (Figure 4 ) delineates that this Group 2 cluster is emerging rapidly from 0% in January and February, to 81% in May 2020, suggesting that the European and North American strains are the most recent predominant strains in this subcontinent. A study conducted in early March reported that NSP12_P323L (14408C>T) and S_D614G (23403A>G) mutations were recurrent in Europe and had not been detected in Asia until then, supporting our statement (Pachetti et al., 2020) . Along with other co-evolving mutations, NSP12_P323L and S_D614G probably provide variants with an evolutionary advantage over their ancestral types, allowing them to survive and circulate in this densely populated region. Time plot is also indicating that, the infection and death cases increased along with the increased presence of group 2 strains in this region. Manuel Becerra-Flores and Timothy Cardozo recently reported that prevalence of D614G mutant correlates with higher case fatality rate in different states of USA (Becerra-Flores & Cardozo, 2020) . In our study, we focused on the non-synonymous mutations found in SARS-CoV-2 variants in Southeast Asia. More experimental work is required to determine the biological effect of these variants on transmission and pathogenesis of this virus. A number of earlier studies hypothesized that high temperatures and high humidity could result in reduced SARS-CoV-2 transmission. A recent Yao, H.-P., Lu, X., Chen, Q., Xu, K., Chen, Y., Cheng, L., . . . Jin, C. (2020). Patient-derived mutations impact pathogenicity of SARS-CoV-2. CELL-D-20-01124. This article is protected by copyright. All rights reserved Genomic characterization of a novel SARS-CoV-2 Emergence of drift variants that may affect COVID-19 vaccine development and antibody treatment MEGA X: molecular evolutionary genetics analysis across computing platforms Interactive Tree Of Life (iTOL) v4: recent updates and new developments A "One-Health" approach for diagnosis and molecular characterization of SARS-CoV-2 in Italy. One Health Geographic and Genomic Distribution of SARS-CoV-2 Mutations Emerging SARS-CoV-2 mutation hot spots include a novel RNA-dependent-RNA polymerase variant Temperature and latitude analysis to predict potential spread and seasonality for COVID-19 The PyMol molecular graphics system, version 2.0. Schrödinger, LLC On the origin and continuing evolution of SARS-CoV-2 High temperature and high humidity reduce the transmission of COVID-19 Accepted Article