key: cord-0853071-ed7i2wpi authors: Ghorbani, Abozar; Samarfard, Samira; Ramezani, Amin; Izadpanah, Keramatollah; Afsharifar, Alireza; Eskandari, Mohammad Hadi; Karbanowicz, Thomas P.; Peters, Jonathan R. title: Quasi-species nature and differential gene expression of severe acute respiratory syndrome coronavirus 2 and phylogenetic analysis of a novel Iranian strain date: 2020-09-13 journal: Infect Genet Evol DOI: 10.1016/j.meegid.2020.104556 sha: 5ebe3d237c786b90eb55be9b11d8fad33c087194 doc_id: 853071 cord_uid: ed7i2wpi A novel coronavirus related to severe acute respiratory syndrome virus, (SARS-CoV-2) is the causal agent of the COVID-19 pandemic. Despite the genetic mutations across the SARS-CoV-2 genome being recently investigated, its transcriptomic genetic polymorphisms at inter-host level and the viral gene expression level based on each Open Reading Frame (ORF) remains unclear. Using available High Throughput Sequencing (HTS) data and based on SARS-CoV-2 infected human transcriptomic data, this study presents a high-resolution map of SARS-CoV-2 single nucleotide polymorphism (SNP) hotspots in a viral population at inter-host level. Four throat swab samples from COVID-19 infected patients were pooled, with RNA-Seq read retrieved from SRA NCBI to detect 21 SNPs and a replacement across the SARS-CoV-2 genomic population. Twenty-two RNA modification sites on viral transcripts were identified that may cause inter-host genetic diversity of this virus. In addition, the canonical genomic RNAs of N ORF showed higher expression in transcriptomic data and RT-qPCR compared to other SARS-CoV-2 ORFs, indicating the importance of this ORF in virus replication or other major functions in virus cycle. Phylogenetic and ancestral sequence analyses based on the entire genome revealed that an Iranian strain and a Turkish isolate of SARS-COV-2 are closely related to each other and confirming that SARS-CoV-2 is possibly derived from a recombination event between SARS-CoV and Bat SARS-like CoV. Ancestor analysis of the isolates from different locations including Iran suggest shared Chinese ancestry. These results propose the importance of potential inter-host level genetic variations to the evolution of SARS-COV-2, and the formation of viral quasi-species. The RNA modifications discovered in this study may cause amino acid sequence changes in polyprotein, spike protein, product of ORF8 and nucleocapsid (N) protein, suggesting further insights to understanding the functional impacts of mutations in the life cycle and pathogenicity of SARS-CoV-2. The coronavirus pandemic of 2019-20 (COVID-19) has rapidly become an international crisis in mid-December 2019 that was first reported from China , Zhou et al., 2020 . Severe acute respiratory syndrome-related coronavirus (SARS-CoV-2), COVID-19 clinically presents the symptoms with fever and mild respiratory illness, with ~20% of cases requiring hospital intervention, of which 5% require intensive support (Wu and McGoogan, 2019) . As of May 2020, there are 5,701,337 reported cases globally with 357,688 deaths (WHO, 2020) . Coronaviruses (COVs) are the largest group of nonesegmented, single-stranded, positive-sense ribonucleic acid (+ssRNA) viruses in the order Nidovirales (Siddell et al., 2019) . They are classified within the Coronaviridae family, subfamily Coronavirinae and cause enzootic infections in a broad range of vertebrates including mammals and birds. The genome applies different transcriptional and (post-) translational mechanisms to regulate viral propagation in host cells (de Wit et al., 216; Su et al., 2016) . Phylogenetic analysis of full-length genomes of various coronavirus species has revealed that SARS-CoV-2 fell within the subgenus Sarbecovirus of the genus single nucleotide polymorphisms (SNPs) within a population. These SNPs may be the origin of viral quasi-species formation that can lead to resistance-breaking strains in some hosts (Domingo et al., 2012; Seguin et al., 2014) . In acute RNA viruses, mutation hotspots mainly occur in immunogenic sites to enable the virus to escape the host antiviral immune system, therefore, from a public-health perspective, understanding the mutation rate of the SARS-CoV-2 at interhost levels as it spreads through the population is imperative. (Rogozin and Pavlov, 2003) . This study uses SNP prediction based on RNA-Seq data and the extent of molecular divergence between SARS-CoV-2 specific reads mapped to RefSeq to understand mutational factors driving the evolution of SARS-CoV-2 and the pattern of spread of SNPs in the viral population via SARS-CoV-2 transcriptomic data analyses. The expression levels (transcripts per Kilobase Million, TPM) of SARS-Cov-2 ORFs in RNA-seq data were also explored to provide new insights into the virus replication strategy. This study also provides a phylogenetic analysis of SARS-CoV-2 including an Iranian isolate that was recently sequenced. The SARS-CoV-2 transcriptome was previously delineated by Illumina MiniSeq runs on iSeq 100 sequencer using total RNA extracted from hCov-19 infected patients' throat swab. (Fang et al., 2020) . Raw RNA sequencing (RNA-Seq) data for the four throat swab samples (Fang et al., 2020) was downloaded from the National Centre for Biotechnology Information, Sequence Read Archive (NCBI SRA) under BioSample SRA: PRJNA616446 (Table 1) . Data analysis on fastq files were conducted with CLC Genomics Workbench (version 12, QIAGEN, Venlo, The Netherlands). The reads were adaptor trimmed (Illumina MiniSeq) and quality trimmed (using the default parameters: bases below 15 nt were trimmed, ambiguous nucleotides maximal 2). For validation of differential expression of SARS-CoV-2 genes, total RNA was extracted from 70 COVID-19 infected patients using ExgeneTM Viral DNA/RNA (GeneAll, Korea) following the manufacturer's instructions. UltraPlex™ 1-Step ToughMix® (Quantabio, USA) and specific primers and probes for N and RdRp genes, were used for expression analysis. Data were expressed relative to the expression of the Human RNase P as an internal control gene. The threshold cycle (Ct) number was calculated from log scale amplification curves by ABI QuantStudio (Applied Biosystems, USA) software. Amplification efficiencies were calculated and included in data normalization. Data normalization was performed using pfaffl formula (Balotf et al., 2012) . Virus-mapped reads were used for variant discovery using CLC genomic workbench 12. Low-frequency variant detection tools from CLC genomic workbench 12 were used to obtain SNPs from the viral population. Minimum frequency was selected on 5 for the whole of the virus population in four samples. SNPs were validated, with their position in each ORF annotated using Geneious Prime 2019 (Biomatters, New Zealand). The SNPs that cause changes to amino acids were determined. All SARS-CoV-2 whole genome sequences obtained from human hosts with geographical annotations were obtained from NCBI and Global initiative on sharing all influenza data (GISAID). SARS-CoV-2 whole-genome sequences were aligned with the ClustalW method and a phylogenetic tree was constructed using MEGA7 following the neighbor-joining method, maximum composite likelihood-parameter distance matrix, bootstrap values of 1,000 replicates and with a 50% threshold score (Kumar et al., 2016) . The ancestral sequence of the virus was constructed by using the maximum likelihood (ML) algorithm based on the generated phylogenetic tree in MEGA 7. Using RNA-Seq data, this study was able to determine the SNPs on an unprecedented scale for the SARS-CoV-2 population. Identification of these SNPs is critical to understanding J o u r n a l P r e -p r o o f Journal Pre-proof SARS-CoV-2 variation capacity. The abundance of reads from total RNA-Seq that mapped to the SARS-CoV-2 reference genome provided about 98-100 % coverage across the viral genome, with robust sequencing depth (3,383-126,390 reads) of the whole viral genome (Table 1 ). The average fold coverage of viral reads was sufficient to detect low-abundance microbial species genome from metagenomic datasets, implying the reliability of the assembled draft viral genome for SNP determination and further analyses (Albertsen et al., 2013) . A total of 199,809 of 32,865,178 reads, from four biological replicates were mapped to the SARS-CoV-2 genome (Table 1) . Single nucleotide polymorphisms were determined by annotating ORFs for the virus population in four samples. The SNP profiling revealed that the ORFs encoding the RdRp and S proteins, ORF8 and nucleoprotein (N) of SARS-CoV-2 have undergone mutations within this population at inter-host level. At least 22 sites displayed substantial differences with frequency ranging from 5.06% to 99.8% across the mapped reads when applied with a threshold of 5% for SNP detection, suggesting potential RNA modifications (Table 2 and Fig.1 ). Twenty-one low-and higher-frequency SNPs were identified, with variable density across the viral genome and a replacement mutation at a low frequency of 6.4% (Table 2 ). The positions of SNPs in the coding regions of SARS-CoV-2 annotated ORFs were in RdRp ORF (14 SNPs), S ORF (4 SNPs), ORF8 (2 SNPs), N ORF (1 SNP) ( Table. 2 and Fig.1) . The SNP C → T with variant frequency ranging from 5% to 99.8% was the most abundant SNP type that was positioned only in polyprotein and S protein regions of the genome ( Table 2 ). The only polymorphisms that cause changes in the deduced encoded amino acid sequence were positioned at 5122, 9512, 10843, 11206 and 11876 of RdRp ORF and the whole of SNPs in S, ORF8 and N ORFs (Table 2 and Fig.1 ). The ORF1/ab encodes polyproteins are likely to be involved in the viral transcription and replication (Chan et al., 2020) , but it is not known if the identified amino acid change in ORF1/ab affects these viral processes in the variant of SARS-CoV-2 isolated from throat swab samples. This study observed four SNPs in the spike glycoprotein (S protein), located in the peptide signal and S2 domain, indicating that some sites of S protein might be subjected to positive selection (Lv et al., 2020) . The S glycoprotein of COVs plays a crucial role in the binding and attachment of the virion to the cell membrane, through the host ACE2 receptor (Xiao et al., 2003) . Therefore, SARS-CoV-2 S glycoprotein is an important determinant of tissue and cell tropism as well as host range (Millet et al., 2015) . Although the real functional impact of the low-frequency SNPs (C > T) at S protein sequence level remains unclear, the mutation 24 positioned in signal peptide domain and other SNPs located in S2 domain could J o u r n a l P r e -p r o o f Journal Pre-proof modify the viral tropism, suggesting new hosts or increasing SARS-CoV-2 pathogenesis (Shang et al., 2020; Millet et al., 2015) . Among these hotspots, one low-frequency SNP in position 610 is located within the N protein which is probably associated with an overall increased mutation rate. The N protein of SARS-CoV-2 is vital for packaging the positivestrand viral RNA genome into helical ribonucleocapsid (RNP) over its interactions with the viral genome and membrane protein M (Chan et al., 2020) . The N protein provides a potential vaccine antigen, as it is important in viral immune response. Determining the high-frequency SNPs encoding N genes of different SARS-CoV-2 strains will be key in developing a potential vaccine (Zhao et al., 2005) . Throughout viral replication, hundreds of viral progenies are produced that vary at least at one position and the subsequent rounds of replication generate a more complex mutant distribution that includes variants lying farther away from each other in the viral sequence frame (Lauring and Andino, 2010) . This group of mutants forms a ''cloud'' of variants called quasispecies. RNA viruses have a quasispecies nature with a high mutation rate within infected hosts (Lauring and Andino, 2010) . A viral quasispecies includes large numbers of genome variants forming the population structure of viruses that are genetically linked over mutation, interact with each other at a functional level, and jointly contribute to form the main characteristics of a viral population (Lauring and Andino, 2010) . The high mutation rate in RNA viruses leads to a high level of intra-host variants (Ni et al., 2016; Domingo et al., 2012) . The SNPs and substitution observed in quasispecies of SARS-CoV-2 reflect the interpatient capacity of the polymorphic quasispecies which may increase rapidly during the outbreak and cause viral immunological escape, resistance to anti-viral drugs and affect the sensitivity of the molecular diagnostics assays. In this study, those SNPs that were detected at low frequency implies viral variations of low impact on the functionality of the genome. However, the frequencies of SNPs in a viral population can be largely affected by the virus population size and epidemic characteristics (Noh et al., 2017; Karamitros et al., 2016) . Based on the previous studies SARS-CoV shared 99.8% sequence homology with SARS-like CoV, with a total of 202 single-nucleotide (nt) variations (SNVs) identified across the genome . The SARS-CoV-2 mutations at inter-host level may attribute to the viral survival and immune evasion in infected cells because the human immune system is found to be less responsive to RNAs with SNPs (Kariko et al., 2005) . It is yet to be examined if the SNPs detected in ORFs and aa sequence modifications detected in the current study are unique to SARS-CoV-2 or conserved in other taxonomically related coronaviruses. J o u r n a l P r e -p r o o f At least one Gig of clean data from each sample obtained from transcriptome sequencing, was used as a query to analyze the viral gene expression level after quality control, mapping reads to the SARS-CoV-2 Refseq and viral ORFs annotations. Differentially expressed genes (DEGs) were characterized for each sample (adjusted-value p < 0.01) (Fig.2) . Based on the number of specific transcripts (TPM) identified in RNA-Seq, SARS-CoV-2 N which is positioned towards the 3′ terminus of the genome was the most highly expressed gene than the 5′ terminal genes encoding polyproteins and S protein (Fig. 3) . Quantitative comparison of TPM reads showed that the N RNA is the most abundantly expressed transcript, followed by ORFs 10, 8, M, 7a, 1a/b ant other SARS-Cov-2 ORFs (Fig. 3) . One-way ANOVA p < 0.01 revealed that there is no significant difference in viral transcript levels between genes encoding polyprotein, ubiquitin ligases (ORF10), M, and accessory proteins (ORF7a), whereas their expression levels were significantly higher than those of S, E and other auxiliary proteins (ORFs 3a, 7b, and 6) (Fig.2) . There are no results on the SARS-CoV gene expression level in patient tissue, and it is unclear which accessory genes of SARS-CoV-2 are highly expressed. In this study, ORF10 encoding a putative ubiquitin ligase is the most highly expressed accessory gene of SARS-CoV-2 when compared with the other viral auxiliary genes. The viral RNA replication has been evolved toward an equilibrium at which a heterogeneous population of viral RNAs (quasispecies) is reproduced with high efficiency (Holland et al., 1992; Domingo et al., 12) . The viral RdRP is up regulated during the evolution of RNA viruses, since RdRP is the central enzyme during this process and has the optimal combination of RNA synthesis efficiency and nucleotide incorporation fidelity (Holland et al., 1982) . Earlier reports have indicated that upon viral entry, the N-protein of the SARS-CoV is primarily distributed in the cytoplasm, after which they expressed heterologously or in infected cells (Surjit et al., 2005; You et al., 2005; Rowland et al., 2005) . The co-expression of the M and N viral proteins was reported as a minimal requirement for the formation of SARS-COV virus-like particles in transfected human 293 renal epithelial cells (Huang et al., 2004) . Association of M and N proteins stabilizes the nucleocapsid (N protein-RNA complex), and the internal core of virions to promote the completion of viral assembly (Chan et al., 2020; Fehr and Perlman, 2015; Narayanan et al., 2000) . The higher expression level of M gene than those of E and S genes implies the association and binding of M and N for the completion of viral assembly in infected cells. Assembly and release of virions are the last stages of the virus life cycle (Chan et al., 2020) . It has been reported that coronaviruses use an RdRP processivity factor to expedite replication of their RNA genome. The function of SARS-COV RNA polymerase was revealed to be linked with the capability of an up-regulated nsp7/nsp8/nsp12 complex that associates with the activity of nsp14-exonuclease for removing terminal mismatches from an RNA duplex (Bouvet et al., 2012) . Since the N gene of COVs is highly immunogenic, unglycosylated, and is highly expressed in infected cells, it may be an ideal target for developing diagnostic tools that can detect the SARS-CoV-2, as only one SNP was observed in this part of the viral genome based on the present data analysis (Shi et al., 2003; Guan et al., 2004) . The whole-genome-based phylogenetic trees for selected SARS-CoV-2 strains reported from different locations are deduced using the NJ method. The consensus tree was derived by bootstrapping values of 1,000 replicates and with a 50% threshold score based on the NJ algorithm (Kumar et al., 2016) . The robustness of the tree topology was estimated by branch support and the placement of root of the SARS-CoV-2 NJ tree was considered by introducing different out-groups including SARS-COV and Bat SARS-like virus. Since the branching order of the NJ tree shows location of virus linkage, in principle the NJ tree provides the details that COVID-19 spread from one place to another. The phylogenetic tree shows that the branches of the NJ tree are mainly located in the same geographical continent and includes two distinct phylogenetic clades of SARS-CoV-2 isolates (Fig.4) , corresponding to subgroups I and II. The clade I mostly include the Asian SARS-CoV-2 sequences. Whilst the phylogenetic clade II is more diverse than clade I in terms of geographic origins of viral strains and includes two clusters particularly with the root distributed in Europe (Fig 4) . The Iranian genome sequence (MT281530) appeared, in contrast, to be in a different cluster of the clade I including a Turkish genome sequence (EPIISL417413) (Fig 4) . Based on the position and the evolutionary relationship of strains presented in the SARS-CoV-2 NJ tree, as well as the presence of Chinese isolates in both clades, it is deduced that the SARS-CoV-2 may have begun to spread to several regions of the world before its outbreak in Wuhan (Gao and Luo, 2020) . To trace back the potential time of the evolution involving ancestral lineages of SARS-CoV-2, this hypothesis was tested further by joint maximum likelihood reconstruction of ancestral sequences based on the entire genome for all SARS-CoV-2 strains presented in the NJ tree and compared with the whole genome phylogeny (Fig.5) . The findings presented did not support the hypothesis, indicating that the COVID-19 may have begun to spread to several regions in the world before its outbreak in Wuhan. Based on the reconstruction of ancestral sequences SARS-CoV-2 may have been derived from a recombination event between two different COVs including SARS-CoV and Bat SARS-like CoV. Furthermore, the most recent common ancestor in the recombinant region of the clade leading to the Chinese SARS-CoV-2 isolates (MN975262 and MN938384) is the ancestor of other isolates from different locations. As more SARS-CoV-2 isolate sequences become available, stronger lineage variation may To conclude, this data provides a platform with new directions to investigate the mechanisms that play a key role in the pathogenicity of SARS-CoV. All authors report no conflict of interest related to the submitted work. J o u r n a l P r e -p r o o f RNA 3'-end mismatch excision by the severe acute respiratory syndrome coronavirus nonstructural protein nsp10/nsp14 exoribonuclease complex Genomic characterization of the 2019 novel human-pathogenic coronavirus isolated from a patient with atypical pneumonia after visiting Wuhan SARS and MERS: recent insights into emerging coronaviruses Viral quasispecies evolution. Microbiol Why are RNA virus mutation rates so damn high? Genome-wide data inferring the evolution and population demography of the novel pneumonia coronavirus Coronaviruses: an overview of their replication and pathogenesis The conserved coronavirus macrodomain promotes virulence and suppresses the innate immune response during severe acute respiratory syndrome coronavirus infection Recombinant protein-based enzyme-linked immunosorbent assay and immunochromatographic tests for detection of immunoglobulin G antibodies to severe acute respiratory syndrome (SARS) coronavirus in SARS patients Phylogenetic study of 2019-nCoV by using alignment-free method Clinical features of patients infected with 2019 novel coronavirus in Wuhan Genetic Diversity of RNA Viruses Rapid evolution of RNA genomes A contaminantfree assessment of Endogenous Retroviral RNA in human plasma Suppression of RNA recognition by Toll-like receptors: the impact of nucleoside modification and the evolutionary origin of RNA Quasispecies theory and the behavior of RNA viruses Functional assessment of cell entry and receptor usage for SARS-CoV-2 and other lineage B betacoronaviruses Rapid reversion of sequence polymorphisms dominates early human immunodeficiency virus type 1 evolution Molecular basis of binding between novel human coronavirus MERS-CoV and its receptor CD26 Host cell proteases: Critical determinants of coronavirus tropism and pathogenesis Characterization of the coronavirus M protein and nucleocapsid interaction in infected cells Intra-host dynamics of Ebola virus during Intracellular localization of the severe acute respiratory syndrome coronavirus nucleocapsidprotein: absence of nucleolar accumulation during infection and after expression as a recombinant protein in Vero cells De novo reconstruction of consensus master genomes of plant RNA and DNA viruses from siRNAs Potential maternal and infant outcomes from (Wuhan) coronavirus 2019-nCoV infecting pregnant women: Lessons from SARS, MERS, and other human coronavirus infections Structure of mouse coronavirus spike protein complexed with receptor reveals mechanism for viral entry Diagnosis of severe acute respiratory syndrome (SARS) by detection of SARS coronavirus nucleocapsid antibodies in an antigen-capturing enzyme-linked immunosorbent assay Additional changes to taxonomy ratified in a special vote by the International Committee on Taxonomy of Viruses The nonstructural proteins directing coronavirus RNA synthesis and processing Cross-host evolution of severe acute respiratory syndrome coronavirus in palm civet and human Continuous and discontinuous RNA The severe acute respiratory syndrome coronavirus nucleocapsid protein is phosphorylated and localizes in the cytoplasm by 14-3-3-mediated translocation Quasispecies diversity determines pathogenesis through cooperative interactions in a viral population Characteristics of and important lessons from the coronavirus disease 2019 (COVID-19) outbreak in China: summary of a report of 72 314 cases from the Chinese Center for Disease Control and Prevention Analysis of therapeutic targets for SARS-CoV-2 and discovery of potential drugs by computational methods A new coronavirus associated with human respiratory disease in China Moral imperative for the immediate release of 2019-nCoV sequence data CoV S glycoprotein: expression and functional characterization Subcellular localization of the severe acute respiratory syndrome coronavirus nucleocapsid protein Immune responses against SARS-coronavirus nucleocapsid protein induced by DNA vaccine Discovery of a novel coronavirus associated with the recent pneumonia outbreak in humans and its potential bat origin A novel coronavirus from patients with pneumonia in China The corresponding author evinces her deep gratitude to the Shiraz University of Medical Sciences for providing samples and facilities for the completion of this work. Albertsen, M., Hugenholtz, P., Skarshewski, A., Nielsen, K.L., Tyson, G.W., Nielsen, P.H.,