key: cord-0831048-vvu3ad43 authors: Saif, Rashid; Mahmood, Tania; Ejaz, Aniqa; Zia, Saeeda; Qureshi, Abdul Rasheed title: Whole Genome Comparison of Pakistani Corona Virus with Chinese and US Strains along with its Predictive Severity of COVID-19 date: 2020-06-30 journal: bioRxiv DOI: 10.1101/2020.05.01.072942 sha: ab4fc318aaa6a70fd0dac1c3c586c2b6907a3363 doc_id: 831048 cord_uid: vvu3ad43
A total of 784 recently submitted SARS-nCoV2 whole genome sequences from the NCBI Virus database were used to construct a phylogenetic tree and examine their similarities. In the cladogram constructed for this article, the Pakistani strain MT240479 (Gilgit1-Pak) was found in close proximity to MT184913 (CruiseA-USA), while the second Pakistani strain MT262993 (Manga-Pak) neighbored the MT039887 (WI-USA) strain. Afterward, four whole genome SARS-nCoV2 sequences, those that appeared as nearest relatives in an earlier cladogram constructed a week before, were taken for variant calling analysis. The two Pakistani strains, each of 29,836 bases, were compared against MT263429 (WI-USA) of 29,889 bases and MT259229 (Wuhan-China) of 29,864 bases. We identified 31 variants in total across both Pakistani strains (Manga-Pak vs USA = 2 del + 7 SNPs, Manga-Pak vs Chinese = 2 del + 2 SNPs, Gilgit1-Pak vs USA = 10 SNPs, Gilgit1-Pak vs Chinese = 8 SNPs), which altered the ORF1ab, ORF1a and N genes, whose functions are viral replication and translation, host innate immunity and viral capsid formation, respectively. These novel variants are assumed to be responsible for the low mortality in Pakistan, with 385 deaths compared to 63,871 in the USA and 4,633 in China by May 01, 2020. However, the functional effects of these variants need further confirmatory studies. Moreover, the mutated N and ORF1a proteins of the Pakistani strains were also analyzed by 3D structure modelling, which gives another dimension for comparing these alterations at the amino acid level.
In a nutshell, these novel variants are assumed to be linked with the reduced mortality of COVID-19 in Pakistan, along with other influencing factors. They would also be useful for understanding the virulence of this virus and for developing indigenous vaccines and therapeutics.
The SARS pandemic opened new avenues for identifying variations in this animal-derived virus, Severe Acute Respiratory Syndrome novel Corona virus 2 (SARS-nCoV2), and for understanding how the human receptor Angiotensin-converting enzyme 2 (ACE2) became ideally compatible with the spike region of this virus, as a result of which COVID-19 spread through the human population globally [1]. In the current century, the first wave of transmission started with SARS CoV in Guangdong, China, and thereafter disseminated worldwide, resulting in 916 fatalities [2]. reported infected with SARS-nCoV2 strain in Wuhan-Hubei, China [4]. This infection became widespread, and as of May 01, 2020, 2,036,770 active cases, 234,279 (7.06%) deaths and 1,048,807 (31.59%) recoveries have been reported worldwide [5]. This strain belongs to the lineage B betacoronaviruses [6, 7]. It has been observed that the mortality rate varies from country to country, which prompted scientists to look into the linkage between different variants of SARS-nCoV2 and its severity, along with other influencing factors, e.g. temperature, testing facilities, lockdown measures, age and hygiene practices. A recently published genomic characterization study speculated on the proximal origin of the human Corona virus from bat (Rhinolophus affinis) and pangolin (Manis javanica) strains, either by natural selection in an animal host before zoonotic transfer or by natural selection in humans following zoonotic transfer, based on the novel variants observed in the Receptor Binding Domain (RBD) and polybasic cleavage site of the Spike region [8].
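
The worldwide percentages quoted above for May 01, 2020 can be reproduced with a short calculation. The sketch below assumes that both percentages are taken over the sum of active, deceased and recovered cases; the text does not state the denominator, so this is an inference from the numbers rather than something reported by the source. The variable and function names are illustrative only.

```python
# Worldwide counts quoted in the text for May 01, 2020.
active = 2_036_770
deceased = 234_279
recovered = 1_048_807

# Assumption: percentages are computed over the sum of the three categories.
total = active + deceased + recovered


def share(count, total):
    """Return count as a percentage of total, rounded to two decimals."""
    return round(100 * count / total, 2)


deceased_pct = share(deceased, total)
recovered_pct = share(recovered, total)
print(total, deceased_pct, recovered_pct)
```

Under this assumption the computed shares agree with the 7.06% and 31.59% given in the text.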
In this comparative genomics study, phylogenetic analysis was carried out with 784 whole genome sequences available from the NCBI Virus database in order to trace the closest homologues of the Pakistani Corona virus and then subject those closest strains to downstream processing. In an earlier analysis, the two Pakistani strains Gilgit1 MT240479 and Manga MT262993 were found in close proximity to the USA MT263429 and China MT259229 strains. Variant calling analysis was then performed on these 4 strains using the Galaxy platform to predict the effect of the novel variants of the Pakistani strains on the severity of COVID-19. SWISS-MODEL 3D structural modelling was also performed to compare the mutant proteins of the Pakistani strains and gain insight into their potential role in the pathogenicity of SARS-nCoV2.
A rectangular cladogram was constructed using the online NCBI Virus database phylogenetic tool [9], based on GenBank sequence type and whole genome sequence data (taxid: 2697049) of 784 SARS-nCoV2 strains from all around the globe. FASTA sequences of the two Pakistani strains (Gilgit1 MT240479 and Manga MT262993), the China (Wuhan) strain MT259229 and the USA (Washington) strain MT263429 were retrieved from NCBI GenBank [10]. The Pakistani FASTA sequences were converted to FASTQ format using the FASTA-to-Tabular-to-FASTQ tools (Galaxy Version 1.1.0) [11] and then mapped against the reference genome using BWA MEM v 0.7.17.1 [12]. Mapped reads were coordinate sorted using the SortSam feature, and duplicate sequences were marked using the MarkDuplicate feature of the Picard tool. Aligned sequencing reads were processed for per-position variant calling using Naive Variant Caller (NVC) v 0.0.3 [13]. SnpSift Variant type [14] and SnpEff eff were used to annotate variants after custom building of reference sequence databases using SnpEff build v 4.3+T.galaxy4 ( Figure.
Proteins bearing variation in the Pakistani strains were subjected to homology-based structure modelling using ProMod3 on the SWISS-MODEL platform [16, 17].
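
The FASTA-to-FASTQ conversion step mentioned above can be illustrated with a minimal sketch. This is not the Galaxy tool's implementation. It only shows the general idea that an assembled sequence, which carries no base qualities, is written out as a single FASTQ record with a placeholder quality string. The function name `fasta_to_fastq`, the toy record and the placeholder quality character `I` (Phred 40 in Sanger encoding) are assumptions made for illustration.

```python
def fasta_to_fastq(fasta_text, quality_char="I"):
    """Convert FASTA records to FASTQ, giving every base a placeholder quality.

    quality_char is an assumed constant score; a FASTA file has no qualities.
    """
    records = []
    name, chunks = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))

    # One four-line FASTQ record per sequence: header, bases, '+', qualities.
    return "".join(
        f"@{name}\n{seq}\n+\n{quality_char * len(seq)}\n" for name, seq in records
    )


# Toy input with a wrapped sequence; not a real genome.
example = ">toy_genome\nACGT\nTTGA\n"
fastq = fasta_to_fastq(example)
print(fastq, end="")
```

Because the whole genome becomes one long "read" with uniform quality, downstream per-position calls reflect only sequence differences from the reference, not sequencing depth or base confidence.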
MT259229 (Chinese) and MT263429 (USA) were first subjected to a search for homologous templates, from which a predicted 3D model was built. Next, the PDB files of these models were used as templates to build the 3D models of the Pakistani mutant proteins (N and ORF1a) based on target-template alignment, with the Quaternary Structure Quality Estimate (QSQE) score complementing the GMQE for tertiary structure evaluation, which is obtained with a built-in supervised "Support Vector Machine" (SVM) algorithm [18, 19].
The cladogram constructed with the NCBI Virus database showed that the Pakistani strain MT240479 (Gilgit1) is related to MT184913 (CruiseA-USA); the two share the same internal node, from which the distance of MT240479 is 0.002342 and that of MT184913 is 0.002381. MT262993 (Manga-Pak) was adjacent to MT039887 (WI-USA); the distance of MT262993 from its internal node is 0.000919 and that of MT039887 is 0.000865 ( Figure.
Pakistani strain MT262993 (29,836 bp) is more closely related to the virus strain MT259229 (Table 1), while the MT263429 (29,889 bp) strain from the USA differs at 9 variant loci, at a rate of 1 variant per 2,988 bp (Table 2). (Table 5). and the graphical representation of variant impacts is shown in (Figure 3). [20]. It binds RNA tightly and packages the viral genome into the capsid (a ribonucleoprotein) [21]. ORF1ab (Replicase polyprotein 1) encodes 7096 amino acids, is located at the 5' end and is involved in replication and translation of viral RNAs [22]. It interacts with the host innate immune response and is responsible for host virulence. Results also indicated that ORF1a polyprotein (4405 aa) shows a rate of nonsynonymous substitutions usually [23]. protein.
Around 70% of human pathogens are of zoonotic origin, including all Corona virus types, one of which is SARS-nCoV2, the cause of COVID-19, the ongoing pandemic [24].
SARS-nCoV2 might also have affected its interaction with the Angiotensin-converting enzyme 2 (ACE2) receptors in humans, causing low virulence, but functional studies are needed to validate this assumption. We also performed structural analysis and modelled the 3D structures of the mutant proteins from target-template alignment, using the N and ORF1a proteins of the USA and Chinese strains as templates and both Pakistani N and ORF1a proteins as target sequences, to visualize the differences between the studied strains at the amino acid level. We conclude our discussion by proposing that the N and ORF1a gene variants in the Pakistani Corona virus might be associated with a functional phenotype causing the low mortality rate in Pakistan compared with the USA and Chinese strains; however, no variants were found in the RBD and polybasic cleavage site of the spike region, which is the more critical region for the virulence of this virus. This hypothesis still needs more research for validation and to determine the association of these mutant genes and other influencing factors with the pathogenicity of this virus.
A complete sequence and comparative analysis of a SARS-associated virus (Isolate BJ01)
SARS: The First Pandemic of the 21st Century
Identification of a novel coronavirus causing severe pneumonia in human: a descriptive study
Virus taxonomy: the database of the International Committee on Taxonomy of Viruses (ICTV)
Functional assessment of cell entry and receptor usage for SARS-CoV-2 and other lineage B betacoronaviruses
The proximal origin of SARS-CoV-2
NCBI Virus
Manipulation of FASTQ data with Galaxy
Aligning sequence reads, clone sequences and assembly contigs with
Dissemination of scientific software with Galaxy ToolShed
Drosophila melanogaster as a model for genotoxic chemical mutational studies with a new program
A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3
SWISS-MODEL: homology modelling of protein structures and complexes
QMEANDisCo-distance constraints applied on model quality estimation
Modeling protein quaternary structure of homo- and hetero-oligomers beyond binary interactions by homology
The SARS coronavirus nucleocapsid protein-forms and functions
The molecular biology of coronaviruses
Genetic diversity and evolution of SARS-CoV-2. Infection
SARS coronavirus replicase proteins in pathogenesis
Characterization of spike glycoprotein of SARS-CoV-2 on virus entry and its immune cross-reactivity with SARS-CoV