key: cord-0787429-gygi11gk
authors: Hasan, Saam; Khan, Salim; Ahsan, Giasuddin; Hossain, Muhammad Maqsud
title: Genome Analysis of SARS-CoV-2 Isolate from Bangladesh
date: 2020-05-15
journal: bioRxiv
DOI: 10.1101/2020.05.13.094441
sha: 15ee59993a2c9f7edd7c2001e469ae9334319fff
doc_id: 787429
cord_uid: gygi11gk

Recently the first genome sequence for a severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) isolate from Bangladesh became available. The sequencing was carried out by the Child Health Research Foundation and provided the first insight into the genetic details of the viral strain responsible for the SARS-CoV-2 infections in Bangladesh. Here we carried out a comparative study in which we explored the phylogenetic relationship between the Bangladeshi isolate and other isolates from different parts of the world. Afterwards we identified single nucleotide variants in the Bangladeshi isolate, using the Wuhan virus reference sequence. We found a total of 9 variants in the Bangladeshi isolate using 2 separate tools. All but 2 of these variants were also observed in isolates from other countries. Most of the variants occurred in the ORF1ab gene. Another noteworthy finding was a sequence of three consecutive variants in the N protein gene that were observed in other isolates as well. Lastly, the phylogenetic analysis revealed a close relationship between the Bangladeshi isolate and those from Taiwan, Kazakhstan, Greece, California, Spain, Israel, and Sri Lanka.

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is the causative agent behind the ongoing COVID-19 pandemic. The virus, which primarily infects the respiratory tract, has spread to almost every country in the world. It has infected over 4 million individuals worldwide and led to the deaths of over 283,000 [1]. A considerable body of research on the virus has already accumulated. Its genome has been sequenced in many different parts of the world.
In addition, considerable effort has already gone into identifying genetic variants. Several studies have analysed and speculated on the genetic variability of the virus and whether or not this gives it a survival advantage [2, 3, 4, 5]. Recently, the first Bangladeshi isolate of the virus was sequenced and its data made publicly available (EPI_ISL_437912). The virus was first reported in the country in March 2020. Since then, the number of SARS-CoV-2 infections has risen steadily, currently standing at over 17,800 infected and 269 dead [6]. Given the ongoing situation, it is important for researchers to analyse and study the genome for any possible clues regarding its evolution, mutation capacity and any changes in pathogenic potential. Here, we carried out a whole genome analysis of the local viral genome sequence in order to explore its relationship with other isolates from around the world, as well as to see how it differs from them. We tried to answer two very basic questions in this study. First, what is the phylogenetic relationship between this isolate and others from around the world? This can shed light on the route taken by the virus from one country to another, as viral isolates located close to one another in a phylogenetic tree are more likely to share a common origin. Second, are there any specific genetic variants that differentiate the Bangladeshi isolate from others?

This study had two main parts. The first was the phylogeny analysis. We selected a total of 18 genomes for this, including the Bangladeshi isolate and the SARS-CoV-2 reference sequence. The rest of the genomes were chosen so as to cover as wide a geographic range as possible. We chose isolates from the USA, Australia, South Korea, Japan, Israel, South Africa, Taiwan, Kazakhstan, Italy, and Spain. The second part was the variant identification.
The primary goal for this component was to identify variants unique to the Bangladeshi isolate and to identify common variants that do or do not occur in the Bangladeshi isolate, so that we can understand how this strain differs from the rest.

We used NCBI Nucleotide to obtain the SARS-CoV-2 genomes. The search term was "SARS-CoV-2" and the filter parameters were set to "Genomic DNA/RNA" for sequence type and 28000-29903 base pairs for sequence length. Only complete genomes were selected. An initial BLAST [7] run was also carried out with the Bangladeshi isolate genome (EPI_ISL_437912), and the top nine hits were automatically included in this analysis. These were isolates from the USA (California, Arizona), South Africa, Spain (Barcelona), India (Hyderabad), Kazakhstan, Sri Lanka and Taiwan. Supplementary table 1 shows these hits. Along with these, we randomly selected another 8 genomes from different parts of the world. These were from India (Maharashtra), Italy, United States (Michigan), United States (Washington), Israel, Japan, South Korea and Australia. Supplementary table 2 shows the BLAST alignment summary for all these isolates, including the Bangladeshi one, with the SARS-CoV-2 RefSeq.

After selecting our genomes, we carried out a pairwise alignment for each of them against the Wuhan virus reference sequence. This was done by solving the Needleman-Wunsch (global) alignment problem for these sequences, using the Biostrings package in R [8]. A multiple sequence alignment was also carried out using the MAFFT program of the NGPhylogeny.fr tool [9]. The alignment file generated by MAFFT was then used to create the phylogenetic tree for all these viral isolates. This was done using the MEGA tool for phylogenetic analysis [10]. The maximum likelihood (ML) heuristic method was Nearest-Neighbour-Interchange (NNI), the substitution model was the Tamura-Nei model, and the starting tree was BioNJ.
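The pairwise step can be illustrated with a minimal Python sketch of Needleman-Wunsch global alignment, followed by a tabulation of substitutions relative to the reference. This is not the authors' code: the study used the Biostrings package in R, and the function names (`needleman_wunsch`, `mismatch_table`), the scoring values (match +1, mismatch -1, linear gap penalty -2) and the toy sequences below are all assumptions chosen for illustration.

```python
from typing import List, Tuple


def needleman_wunsch(ref: str, query: str, match: int = 1,
                     mismatch: int = -1, gap: int = -2) -> Tuple[str, str, int]:
    """Globally align query to ref; return (aligned_ref, aligned_query, score).

    Scoring values are illustrative assumptions, not those used in the study.
    """
    n, m = len(ref), len(query)
    # score[i][j] = best score for aligning ref[:i] with query[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if ref[i - 1] == query[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # (mis)match
                              score[i - 1][j] + gap,      # gap in query
                              score[i][j - 1] + gap)      # gap in reference

    # Trace back from the bottom-right corner to recover one optimal alignment.
    a_ref: List[str] = []
    a_qry: List[str] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if ref[i - 1] == query[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                a_ref.append(ref[i - 1])
                a_qry.append(query[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            a_ref.append(ref[i - 1])
            a_qry.append("-")
            i -= 1
        else:
            a_ref.append("-")
            a_qry.append(query[j - 1])
            j -= 1
    return "".join(reversed(a_ref)), "".join(reversed(a_qry)), score[n][m]


def mismatch_table(aligned_ref: str, aligned_query: str) -> List[Tuple[int, str, str]]:
    """List substitutions as (1-based reference position, ref base, query base)."""
    rows = []
    ref_pos = 0
    for r, q in zip(aligned_ref, aligned_query):
        if r != "-":
            ref_pos += 1  # only reference bases advance the coordinate
        if r != "-" and q != "-" and r != q:
            rows.append((ref_pos, r, q))
    return rows


# Toy sequences (hypothetical, not real genome data).
toy_ref = "ACGTTGCA"
toy_query = "ACATTGTA"
aligned_ref, aligned_query, total = needleman_wunsch(toy_ref, toy_query)
print(mismatch_table(aligned_ref, aligned_query))  # [(3, 'G', 'A'), (7, 'C', 'T')]
```

The full dynamic-programming matrix grows with the product of the two sequence lengths, so this pure-Python version is only suitable for short toy inputs; aligning genomes of roughly 30,000 bases calls for an optimized implementation such as the one the authors used.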
For identifying SNPs, we used a combination of the mismatchTable function from Biostrings and the Base-By-Base tool from the Viral Bioinformatics Research Centre [11]. We used the results from both tools in our subsequent comparisons.

A total of 9 variants were identified in the Bangladeshi isolate by the Biostrings method and an identical number by the Base-By-Base method. Both tools gave identical variants in terms of position and base change for the Bangladeshi isolate, lending more credibility to the variant calls. Table 1 shows the position and base changes associated with these 9 variants. Seven of these 9 variants were also found in other isolates. Among the most common variants, the C to T variant at position 241, the C to T at position 3037, and the A to G at position 23403 had the highest occurrence; all three were present in 12 of the 18 isolates. Other common variants included the G to T variant at position 11083 (two isolates) and a consecutive series of three variants at positions 28881, 28882, and 28883 (G to A, G to A, and G to C respectively) that were found in 6 isolates. The Bangladeshi virus showed the most similarity with the viral isolate from Sri Lanka.

A total of 63 unique variants were identified across the two runs: 41 were identified with Base-By-Base and another 22 unique variants were found using Biostrings. Table 3 shows all the variants found with Biostrings. Figure 1 displays a binary heatmap of all the variants identified with Base-By-Base.

Table 2: Complete list of variants identified using Biostrings. A total of 63 unique variants were identified across the 18 isolates. The variants at positions 241, 3037, and 23403 occurred in the greatest number of genomes, a total of 12.
Isolate                   SequenceNuc  QUAL  RefStart  RefEnd  RefNuc  Gene
MT308700.1-USA-Michigan   T            7     241       241     C       Intergenic
[remaining rows truncated in the source]

As far as the gene-specific distribution of variants is concerned, the Bangladeshi isolate, much like all the others, contained most of its variants in the ORF1ab gene. This is to be expected, as it is the largest gene in the SARS-CoV-2 genome, spanning over half the genome length. The isolate also contained the previously discussed 3 variants in the N protein gene and 1 variant in the S protein gene. The latter, an A to G mutation at position 23403, has already been implicated in a number of previous studies. Previously it was believed that this particular mutation is common to the viral strains in Europe [12]. However, our analysis showed that isolates from other Asian countries (India, Kazakhstan, Taiwan, Sri Lanka, Israel), in addition to the Bangladeshi one, also contained this particular variant. Figures 2A and 2B show the number of variants per gene for all the isolates overall and for the Bangladeshi isolate alone, respectively.

Figure 2A: The number of variants per gene for all the isolates analyzed. ORF1ab had the greatest number of variants, as should be expected because of its larger size. Interestingly, the N protein had a comparatively high number of variants (22). This is somewhat unexpected for a structural protein, for which the organism should not be able to tolerate a high mutation rate.

If we observe the phylogenetic tree from the Wuhan RefSeq onwards, the first divergence event appears to give rise to the Japanese isolate, followed by another that gives rise to three separate clades.
One contains the Italian, South Korean and Australian isolates, another consists of the Indian (Maharashtra) and American (Washington) isolates, and a third later gives rise to two more clusters. The first of these two clusters contains the isolates from Arizona and Michigan, while the second includes the Bangladeshi isolate. The observation that the Arizona and Michigan isolates and the Bangladeshi isolate arise from the same divergence event, together with the fact that the two American isolates appear to predate the Bangladeshi one, would seem to suggest that the virus first arrived in Bangladesh from the USA. This seems to indicate that the initial speculation of the virus reaching Bangladesh through individuals arriving from Italy was perhaps incorrect.

Our goal behind this study was to try to understand the phylogenetic relationship between the Bangladeshi virus and those causing infections in other parts of the world, as well as to characterize the unique genetic signature of this local strain. The Bangladeshi isolate appears to share a close sequence similarity with isolates from Taiwan, California and Spain and the ones we analysed arose much more recently. This is something that requires further analysis using more genomes from these countries. Hence it is wise not to correlate the statistics in those areas with the other countries mentioned. In summation, we believe the relationship between the number of mutations and the weakening of the virus should be investigated, and our study provides at least some correlational evidence in its support.

The three N protein variants are another curious aspect of the Bangladeshi strain. They only appeared in closely related isolates such as the Bangladeshi, Kazakh, Californian, Greek and Sri Lankan ones. The Israeli isolate also contained these.
The percentage of fatalities among infected patients is currently a little over 1% for Israel, comparable with the Bangladeshi fatality percentage and considerably lower than the global figure. This raises the possibility that the three consecutive variants are a characteristic feature of a weaker strain of the virus. Again, the evidence right now is purely correlational, but this remains another interesting avenue to explore in future.

Lastly, the Spike protein variant at position 23403 was also found in the Bangladeshi virus. We are as yet uncertain what this could mean, especially since, contrary to previous claims, we also found this variant in other Asian isolates.

In conclusion, we believe the apparently higher rate of mutation in the Bangladeshi SARS-CoV-2 strain may hold clues pertaining to a mutation-induced weakening of the virus. In addition, we believe that the most likely route the virus took to reach Bangladesh was via either Michigan or Arizona in the USA. Sri Lanka also remains an outside possibility, but we believe the more likely scenario is that Sri Lanka and Bangladesh got the virus from the same source, and that it arrived in Sri Lanka earlier than in Bangladesh. Future studies should focus more on how these variants affect the protein functions of the virus, so as to shed better light on how the pathogen is changing and what that means for our fight to bring the pandemic to an end.

Who.int. 2020. Coronavirus Disease (COVID-2019) Situation Report-113
Whole genome and phylogenetic analysis of two SARS-CoV-2 strains isolated in Italy in
Genomic characterization of a novel SARS-CoV-2
Phylogenetic network analysis of SARS-CoV-2 genomes
Genetic diversity and evolution of SARS-CoV-2. Infection
Covid-19 Status Bangladesh
Basic local alignment search tool
fr: new generation phylogenetic services for non-specialists
Molecular Evolutionary Genetics Analysis across Computing Platforms
Base-By-Base Version 3: New Comparative Tools for Large Virus Genomes
Emerging SARS-CoV-2 mutation hot spots include a novel RNA-dependent-RNA polymerase variant