key: cord-254636-3lr008th authors: Shishir, Tushar Ahmed; Naser, Iftekhar Bin; Faruque, Shah M. title: In silico comparative genomics of SARS-CoV-2 to determine the source and diversity of the pathogen in Bangladesh date: 2020-08-16 journal: bioRxiv DOI: 10.1101/2020.07.20.212563 sha: doc_id: 254636 cord_uid: 3lr008th The COVID19 pandemic caused by SARS-CoV-2 virus has severely affected most countries of the world including Bangladesh. We conducted comparative analysis of publicly available whole-genome sequences of 64 SARS-CoV-2 isolates in Bangladesh and 371 isolates from another 27 countries to predict possible transmission routes of COVID19 to Bangladesh and genomic variations among the viruses. Phylogenetic analysis indicated that the pathogen was imported in Bangladesh from multiple countries. The viruses found in the southern district of Chattogram were closely related to strains from Saudi Arabia whereas those in Dhaka were similar to that of United Kingdom and France. The 64 SARS-CoV-2 sequences from Bangladesh belonged to three clusters. Compared to the ancestral SARS-CoV-2 sequence reported from China, the isolates in Bangladesh had a total of 180 mutations in the coding region of the genome, and 110 of these were missense. Among these, 99 missense mutations (90%) were predicted to destabilize protein structures. Remarkably, a mutation that leads to an I300F change in the nsp2 protein and a mutation leading to D614G change in the spike protein were prevalent in SARS-CoV-2 genomic sequences, and might have influenced the epidemiological properties of the virus in Bangladesh. The pandemic of coronavirus disease referred to as COVID-19 pandemic, which originated in Wuhan, China in December 2019 is ongoing and has now spread to 213 countries and territories. As of July 2020, the pandemic has caused about 16 million cases and over half a million death. This novel virus of the Coronaviridae family and Betacoronavirus genus (1, 2) designated as severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), is the causative agent of COVID-19. Previously two other coronaviruses, namely SARS-CoV and MERS-CoV have demonstrated high pathogenicity and caused epidemic with mortality rate ~10% and ~34% respectively affecting more than 25 countries each time (3) (4) (5) . However, SARS-CoV-2 has proven to be highly infectious and caused pandemic spread to over 213 countries and territories. Besides its devastating impact in North America and Europe, the disease is now rapidly spreading in South America including Brazil, and in South Asian countries particularly India, Pakistan and Bangladesh (6) . The virus was first detected in Bangladesh in March 2020 (7) . Although infections remained low until the end of March it began to rise steeply in April. By the end of June, new cases in Bangladesh grew to nearly 150,000 and the rate of detection of cases compared to the total number of samples tested increased to about 22% which was highest in Asia (6) . SARS-CoV-2 is a positive-sense single-stranded RNA virus with a genome size of nearly 30kb. The 5' end of the genome codes for a polyprotein which is further cleaved to viral nonstructural proteins whereas the 3' end encodes for structural proteins of the virus including surface glycoproteins spike (S), membrane protein (M), envelop protein (E) and nucleocapsid protein (N) (8) . Like other RNA viruses, SARS-CoV-2 is also inherently prone to mutations due to high recombination frequency resulting in genomic diversity (9) (10) (11) . Due to the rapid evolution of the virus, development of vaccines and therapies may be challenging. To monitor the emergence of diversity, it is important to conduct comparative genomics of viruses isolated over time and in various geographical locations. Comparative analysis of genome sequences of various isolates of SARS-CoV-2 would allow to identify and characterize the variable and conserved regions of the genome and this knowledge may be useful for developing effective vaccines, as well as in molecular epidemiological tracking. Thousands of SARS-CoV-2 virus genomes have been sequenced and submitted in public databases for further study. This include 66 SARS-CoV-2 genomic sequences submitted from Bangladesh in the Global initiative on sharing all influenza data (GISAID) database, till 11th June 2020 (12) . We conducted comparative analysis of publicly available genome sequences of SARS-CoV-2 from 27 countries to predict the origin of viruses in Bangladesh by studying a time-4 resolved phylogenetic relationship. Later, we analyzed the variants present in different isolates of Bangladesh to understand the pattern of mutations in relation to the ancestral Wuhan strain, find unique mutations, and possible effect of these mutations on the stability of encoded proteins, and selection pressure on genes. A total of 435 whole genome sequences of SARS-CoV-2 including 64 sequences isolated in Bangladesh (detail information in S1 Table) , and that of 5 isolates of each month between January to May, 2020 isolated in 27 different countries with high frequency of infection were included in this analysis. Source and number of sequence are presented in Table 1 (detail information are provided in S2 Table) . However, since only number of sequences were reported from different African countries, we included all sequences from the African countries and categorized collectively as African sequence (12) . Reference sequence included in various analysis was the sequence of the ancestral strain from Wuhan, China (NC_045512.2) (13). Selected sequences were annotated by Viral Annotation Pipeline and iDentification (VAPiD) tool (14) , and multiple sequence alignment was carried out using Mafft algorithm (15) . Maximum likelihood phylogenetic tree was constructed with IQ-TREE (16), the generated tree was reconstructed based on time-calibration by TreeTime (17) , and visualized on iTOL server (18) . For analysis of mutations, sequence were mapped with minimap2 (19) , and variants were detected using SAMtools (20) and snp-sites (21) . A haplotype network was generated based on mutations in genome using PopArt (22) . Sequences were then put into different clades based on specific mutations proposed in GISAID (23) and further classified as D614G type (24, 25) . Subsequently, another phylogenetic tree and haplotype network containing only SARS-CoV-2 sequences from Bangladeshi was constructed and categorized using the same tools, and additionally one step further clustered with TreeCluster (26) . The direction of selection in sequences from Bangladesh was calculated by the SLAC algorithm (27) in the Datamonkey server (28) . Finally, the effects of the mutations on protein stability were predicted using DeepDDG (29). A total of 435 Genomic Sequences of SARS-CoV-2 reported from various countries (Table 1) which included 64 sequences from Bangladesh and the sequence of the ancestral SARS-CoV-2 isolated in Wuhan, China were analyzed in the time-resolved phylogenetic tree. Sequences from Bangladesh belonged to three different clusters of which one cluster carried 43 of the 64 sequences, and shared the same node with sequence from Germany while they had a common ancestry with isolates from the United Kingdom (Fig 1) . The remaining two clusters of SARS-CoV-2 sequences contained 4 and 5 sequences respectively from Bangladesh, and they shared the same node with sequence of SARS-CoV-2 reported from India, and also shared a common ancestry with isolates from Saudi Arabia. Besides, 12 lone sequences that did not belong to any of these clusters were found to have similarity with sequences from Europe including United Kingdom, Germany, France, Italy, and Russia. One of these sequences was closely related to sequence reported from the USA. Subsequently, all SARS-CoV-2 sequences from representative countries were clustered based on some specific mutations sustained, into 7 different clades as mentioned by GISAID. In this analysis, the sequences from Bangladesh were found to be distributed in all clades except V (Figs 1 and 2) . classification. In order to understand the evolutionary relationship and possible transmission dynamics of SARS-CoV-2 in Bangladesh at a higher resolution, another time-resolved phylogenetic tree carrying only sequences of the pathogen isolated in various regions of Bangladesh was generated using the sequence of the first SARS-CoV-2 reported from Yuhan, China as a reference. Of the three clusters produced in this analysis, cluster-1 included mostly isolates from Chattogram and one isolate from Dhaka, cluster-2 included isolates from Dhaka, Narayanganj and Chattogram districts, whereas cluster-3 included isolates from Chattogram only. As mentioned above, the isolates from Bangladesh were found to be distributed in all 7 GISAID clades based on specific mutations, except in clade V (Fig 2) . Most isolates of Dhaka and Narayanganj (47 of 52) belonged to the GR clade, whereas those of Chattogram belonged to five different GSID clades (G, GH, GR, O, and S). The major international airport in Bangladesh is situated in the capital city Dhaka, whereas the major seaport is located in Chattogram. Based on the phylogenetic analysis, all isolates of Dhaka were the descendant of SARS-CoV-2 found in European countries, more specifically France and the United Kingdom. On the other hand, most isolates of Chattogram were found closely related to Saudi Arabian isolates. Moreover, considering the GSID clades, the presence of S clade was absent among Dhaka whereas most isolates of Chattogram was found to belong to the S clade. Clearly these two genomic variants of SARS-CoV-2 were initially imported by travelers from different countries, and the two variants initially spread in the two areas. That the isolates of Narayanganj and two isolates of Dhaka are closely related, indicates that the SARS-CoV-2 strain imported initially through international traveler to Dhaka later spread to Narayanganj, which is a densely populated city with river ports and large business centres. The SARS-CoV-2 sequences were also categorized according to D614G type mutation (Fig 1) . This particular subtype with a non-silent (Aspartate to Glycine) mutation at 614th position of the Spike protein is presumed to have rapidly outcompeted other preexisting subtypes, including the ancestral strain. The D614G mutation generates an additional serine protease (Elastase) cleavage site near the S1-S2 junction of the Spike protein (25) . All but one sequence from Dhaka and Narayanganj were found to be of G type which carries Glycine at position 614 whereas sequences of Chattogram carry sequences of both types (Fig 2) . In addition, the first sequence from Bangladesh carried G614 type of surface glycoprotein, which indicate that this dominant variant was present since the first isolation of SARS-CoV-2 in Bangladesh and the mutant virus might have been imported to the country from Europe, and the presence of the mutation might have facilitated viral transmission. Relationships among DNA sequences within a population are often studied by constructing and visualizing a haplotype network. We constructed a haplotype network by the Median Joining algorithm and found that 338 of 434 SARS-CoV-2 sequences from representative countries were alike, therefore formed a large haplo group (Fig 3A) . However, there were presences of a significant number of unique lineages too consisting of a single or multiple SARS-CoV-2 sequences (Fig 3A) . This network demonstrated the closeness of the sequences and their pattern of mutation beyond the geographical boundary. Several SARS-CoV-2 isolates appeared to have sustained certain common mutations along with certain unique mutations. Although a large proportion of sequences from Bangladesh belonged to the common cluster (Fig 3A) , there was a significant number of unique nodes as well due to mutations overtime subsequent to being carried into Bangladesh (Fig 3A) . Therefore analysis of the sequences from Bangladesh provided further insight of their mutation patterns. The haplotype network revealed that viruses isolated in Bangladesh had certain unique mutations in them, and as a result they belonged to different haplo groups and no significant cig group (Fig. 3A) . Most of the isolates sustained a significant number of mutations compared with each other. In addition it further confirms that most isolates from Chattogram ( Fig 3B) were not directly related to those isolated in Dhaka or Narayanganj. We detected the presence of 209 point mutations in 64 SARS-CoV-2 isolates from Bangladesh when compared to the reference sequence from Yuhan, China. In addition, 19 isolates were found to have lost significant portions of their genome, and as a result lost sequences for some non-structural proteins such as ORF7 and ORF8 while other deletions were upstream or downstream gene variants (S3 Table) . Among the point mutations, 29 mutations were in the non-coding region of the genome and 180 were in coding regions. Ten of the 29 non-coding mutations were in upstream non-coding region and rest was in downstream non-coding region of the genome. Seventy mutations in the coding region were synonymous and 110 mutations predicted substituted amino acids. Among twelve predicted ORFs, ORF1ab which comprises approximately 67% of the genome encoding 16 nonstructural proteins had more than 60 percent of the total mutations while gene E encoding envelope protein and ORF7b were conserved and did not carry any mutation. Though ORF1 harbored the highest number of mutations, mutation density was highest in ORF10 considering ORF lengths. Details and distribution of the mutations are presented in Table 2 and full analysis report is placed in S3 table. In sequences from Bangladesh, 241C>T and 3037C>T changes were the two most abundant mutations found in 58 out of 64 isolates, and often found simultaneously (Table 3) . Position 241 is located in the non-coding region whereas the mutation in position 3037 was synonymous. On the other hand, 57 sequences were found to harbor 14408C>T and 23403A>G mutations which altered amino acid Pro>Leu and Asp>Gly respectively, and these two mutations were found to be present simultaneously as well. In addition, other co- Fig 4) . ORF6 was predicted to have dN/dS value of 17.8 due to the presence of higher number of missense than synonymous mutations. This finding indicates that ORF6 is rapidly evolving and is highly divergent. The ORF6 protein is an accessory protein whose function is yet to be fully elucidated (30) . The ORF7a and Nucleocapsid phosphoprotein had dN/dS values 1.08 and 1.55 respectively which confer their strong evolution to cope up with challenges under positive selection pressure. ORF10 is predicted to be conversed with dN/dS value 0 while envelope protein and ORF7b did not harbor any mutation and was conserved. On the other hand, ORF3a and surface glycoprotein might approach toward positive selection pressure and evolve but rests of the proteins were under negative selection pressure. Table 4) . None of the mutations in structural proteins were predicted to increase stability. All mutations were synonymous and found this gene conserved In summary, mutation analysis revealed point mutations as well as deletion of base pairs. Deletions of the base pairs were associated with missing non-structural proteins and predictably affected certain viral properties since ORF7a protein is the growth factor of the coronavirus family, induce apoptosis, and promotes viral encapsulation (31) (32) (33) while ORF8 is associated with viral adaptation by playing role in host-virus interaction (34, 35) . Furthermore, we have found that some genes are under positive selection pressure indicating that the virus is fast-evolving presumably to evade host cell's innate immunity; which should be taken into special consideration prior to vaccine development or other treatment strategies. Finally, a missense mutation at 1163A>T changing the amino acid isoleucine to phenylalanine in Nsp2 protein was found uniquely among 44 isolates in Bangladesh. Nsp2 is a methyltransferase like domain that interacts with PHB and PHB2, and modulates the host cell survival strategy by affecting cellular differentiation, mitochondrial biogenesis, and cell cycle progression to escape from innate immunity (35, 36) . This unique and high-frequency mutation might be a further interest of study, considering death rate against the infection rate in Bangladesh. Virus taxonomy: The database of the International Committee on Taxonomy of Viruses (ICTV) Epidemiology and cause of severe acute respiratory syndrome (SARS) in Guangdong, People's Republic of China SARS and MERS: recent insights into emerging coronaviruses SARS and other coronaviruses as causes of pneumonia Webmeter. Coronavirus Age, Sex, Demographics (COVID-19) -Worldometer. 2020. Available from: www.worldometers.info COVID-19 Outbreak Situations in Bangladesh: An Empirical Analysis Origin and evolution of pathogenic coronaviruses Coronaviruses: An RNA proofreading machine regulates replication fidelity and diversity Two Mutations Were Critical for Bat-to-Human Transmission of Middle East Respiratory Syndrome Coronavirus The proximal origin of SARS-CoV-2 Global initiative on sharing all influenza data -from vision to reality A new coronavirus associated with human respiratory disease in China VAPiD: A lightweight cross-platform viral annotation pipeline and identification tool to facilitate virus genome submissions to NCBI GenBank MAFFT multiple sequence alignment software version 7: Improvements in performance and usability IQ-TREE: A fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies TreeTime: Maximum-likelihood phylodynamic analysis Interactive Tree Of Life (iTOL) v4: recent updates and new developments Minimap2: Pairwise alignment for nucleotide sequences The Sequence Alignment/Map format and SAMtools SNP-sites: rapid efficient extraction of SNPs from multi-FASTA alignments. Microb genomics POPART: Full-feature software for haplotype network construction Clade and lineage nomenclature aids in genomic epidemiology studies of active hCoV-19 viruses Tracking changes in SARS-CoV-2 Spike: evidence that D614G increases infectivity of the COVID-19 virus The D614G mutation in the SARS-CoV-2 spike protein reduces S1 shedding and increases infectivity Clustering biological sequences using phylogenetic trees Not so different after all: A comparison of methods for detecting amino acid sites under selection Datamonkey: Rapid detection of selective pressure on individual sites of codon alignments DeepDDG: Predicting the Stability Change of Protein Point Mutations Using Neural Networks AputativediacidicmotifintheSARS-CoVORF6proteininfluencesitssubcellularlocalizationandsuppressionofexpressionofco -transfectedexpressionconstructs SARS-CoV accessory protein 7a directly interacts with human LFA-1 Severe Acute Respiratory Syndrome Coronavirus Gene 7 Products Contribute to Virus-Induced Apoptosis Severe Acute Respiratory Syndrome Coronavirus ORF7a Inhibits Bone Marrow Stromal Antigen 2 Virion Tethering through a Novel Mechanism of Glycosylation Interference The ORF8 Protein of SARS-CoV-2 Mediates Immune Evasion through Potently Downregulating MHC-I. bioRxiv The Proteins of Severe Acute Respiratory Syndrome Coronavirus-2 (SARS CoV-2 or n-COV19), the Cause of COVID-19 COVID-2019: The role of the nsp2 and nsp3 in its pathogenesis