key: cord-302608-fw4pmaoc authors: Huang, Jiao-Mei; Jan, Syed Sajid; Wei, Xiaobin; Wan, Yi; Ouyang, Songying title: Evidence of the Recombinant Origin and Ongoing Mutations in Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) date: 2020-03-19 journal: bioRxiv DOI: 10.1101/2020.03.16.993816 sha: doc_id: 302608 cord_uid: fw4pmaoc The recent global outbreak of viral pneumonia designated as Coronavirus Disease 2019 (COVID-19) by coronavirus (SARS-CoV-2) has threatened global public health and urged to investigate its source. Whole genome analysis of SARS-CoV-2 revealed ~96% genomic similarity with bat CoV (RaTG13) and clustered together in phylogenetic tree. Furthermore, RaTGl3 also showed 97.43% spike protein similarity with SARS-CoV-2 suggesting that RaTGl3 is the closest strain. However, RBD and key amino acid residues supposed to be crucial for human-to-human and cross-species transmission are homologues between SARS-CoV-2 and pangolin CoVs. These results from our analysis suggest that SARS-CoV-2 is a recombinant virus of bat and pangolin CoVs. Moreover, this study also reports mutations in coding regions of 125 SARS-CoV-2 genomes signifying its aptitude for evolution. In short, our findings propose that homologous recombination has been occurred between bat and pangolin CoVs that triggered cross-species transmission and emergence of SARS-CoV-2, and, during the ongoing outbreak, SARS-CoV-2 is still evolving for its adaptability. The family Coronaviridae is comprised of large, enveloped, single stranded, and positivesense RNA viruses that can infect a wide range of animals including humans Guan et al., 2003) . The viruses are further classified into four genera: alpha, beta, gamma, and delta coronavirus (King et al., 2012) . So far, all coronaviruses (CoVs) identified in human belong to the genera alpha and beta. Among them betaCoVs are of particular importance. Different novel strains of highly infectious betaCoVs have been emerged in human populations in the past two decades that have caused severe health concern all over the world. Severe acute respiratory syndrome coronavirus (SARS-CoV) was first recognized in 2003, causing a global outbreak (Zhong, 2004; Peiris et al., 2004; Cherry, 2004) . It was followed by another pandemic event in 2012 by a novel strain of coronavirus designated as Middle East respiratory syndrome coronavirus (MERS-CoV) (Lu et al., 2013) . Both CoVs were zoonotic pathogens and evolved in animals. Bats in the genus Rhinolophus are natural reservoir of coronaviruses worldwide, and it is presumed that both SARS-CoV and MERS-CoV have been transmitted to human through some intermediate mammalian hosts (Li et al., 2005a; Bolles et al., 2011; Al-Tawfiq and Memish, 2014) . Recently, emergence of another pandemic termed as Coronavirus Disease 2019 (COVID-19) by World Health Organization (WHO) caused by a novel severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has been reported (Zhu et al., 2020) . To date, more than 174,000 people are infected and over 6,600 death tolls, having transmission clusters worldwide including China, Italy, South Korea, Iran, Japan, USA, France, Spain, Germany and several other countries causing alarming global health concern. The large trimeric spike glycoprotein (S) located on the surface of CoVs is crucial for viral infection and pathogenesis, which is further subdivided into N-terminal S1 subunit and C-terminal S2 domain. The S1 subunit is specialized in recognizing receptors on host cell, comprising of two separate domains located at N-and C-terminal which can fold independently and facilitate receptor engagement (Masters, 2006) . Receptor-binding domains (RBDs) of most CoVs are located on S1 C-terminus and enable attachment to its host receptor (Li et al., 2005b) . The host specificity of virus particle is determined by amino acid sequence of RBD and is usually dissimilar among different CoVs. Therefore, RBD is a core determinant for tissue tropism and host range of CoVs. This article presents SARS-CoV-2 phylogenetic trees, comparison and analysis of genome, spike protein, and RBD amino acid sequences of different CoVs, deducing source and etiology of COVID-19 and evolutionary relationship among SARS-CoV-2 in human. To determine the evolutionary relationship of the SARS-CoV-2, phylogenetic analysis was performed on whole genomic sequences of different CoVs from various hosts. The Maximum-likelihood (ML) phylogenetic tree is shown in Figure 1 , which illustrates four main groups representing four genera of CoVs, alpha, beta, gamma, and delta. In the phylogenetic tree, strains of SARS-CoV-2 (red colored) are cluster together and belong to the genera Betacoronavirus. Among Beta-CoVs, SARS-CoV, Civet SARS CoV, Bat SARS-like CoVs, bat/RaTG13 CoV, and SARS-CoVs-2 clustered together forming a discrete clade from MERS-CoVs. The clade is further divided into two branches and one of the branches comprises all SARS-CoV-2 strains clustered together with Bat/Yunnan/RaTG13 CoV forming a monophyletic group. Bat/Yunnan/RaTG13 exhibited ~96% genomic similarity with SARS-CoV-2. This specifies that SARS-CoV-2 is closely related to Bat/Yunnan/RaTG13 CoV. The ML phylogenetic tree demonstrates that CoVs from bat source are found in the inner joint or neighboring clade of SARS-CoV-2. This indicates that bats CoVs particularly Bat/Yunnan/RaTG13 are the source of SARS-CoV-2, and they are emerged and transmitted from bats to humans through some recombination and transformation events in intermediate host. To explore the emergence of SARS-CoV-2 in humans, we investigated CoVs S-protein and its RBD as they are responsible for determining the host range ( Table 1) . The S-protein amino acid sequence identity between SARS-CoV-2 and related beta-CoVs showed that bat/Yunnan/RaTG13 shares highest similarity of 97.43%. However, the amino acid sequence identity of RBD of SARS-CoV-2 with bat/Yunnan/RaTG13 is 89.57%. On the other hand, Beta-CoVs from pangolin sources (pangolin/Guandong/1/2019 and pangolin/Guangdong/lung08) revealed highest RBD amino acid sequence identity of 96.68% and 96.08% respectively with SARS-CoV-2. These indication shows the existence of homologous recombination events within the S-protein gene between bat and pangolin CoVs. Similarity plot analysis of CoVs genome sequences from bat, pangolin and human also indicated a possible recombination within S-protein of SARS-CoVs-19 ( Figure S1 ). The amino acid residues change in S-protein of SARS-CoV-2 was further analyzed with SARS-CoV, pangolin and bat CoVs including pangolin/Guandong/1/2019, pangolin/Guangdong/lung08, and bat/Yunnan/RaTG13 (Figure 2) . Regardless of low homology between SARS-CoV-2 (Wuhan-Hu-1_MN908947) and SARS-CoV (SARS_AAR07630), they had many homologues areas in S-protein. The five key amino acid residues of S-protein at positions 442,472, 479,480, and 487 of SARS-CoVs are described to be at the angiotensin-converting enzyme-2 (ACE2) receptor complex interface and supposed to be crucial for human to human and cross-species transmission (Li et al., 2005b; Wu et al., 2012) . Figure 2b and Table S1 describe that all key amino acid residues of RBD (except two positions) are completely homologues between SARS-CoV-2 (Wuhan-Hu-1_MN908947) and pangolin CoVs (pangolin/Guandong/1/2019 and pangolin/Guangdong/lung08), supporting our postulation of recombination event in S-protein gene. Even though, all five crucial amino acid residues of SARS-CoV-2 for binding to ACE2 are different from SARS-CoV, their hydrophobicity and polarity are similar, having same S-protein structural confirmation and identical RBD 3-D structure (Xu et al., 2020) . In addition, six critical key residues in MERS-CoV RBD binding to its receptor dipeptidyl peptidase 4 (DPP4) are all different in SARS-CoV and SARS-CoV-2 related coronavirus (Figure 2a) . We also investigated some of the important evolutionary and phylogenetic aspects of SARS- Table 2 . Among different orfs of SARS-CoV-2, orf1a was most variable segment with total number of 44 dissimilar amino acid substitutions. It was followed by spike segment S orf with 13 amino acid residue substitutions. However, orf6 and orf7b are the most conserved regions without amino acid changes. In addition, orf10, E, M and orf7a have tended to be more conserved, with only one or two amino acid substitutions. With the global spread of SARS-CoV-2, its amino acid sequence is also significantly varied (Figure 3) . Usually, RNA viruses have high rate of genetic mutations, which leads to evolution and provide them with increased adaptability (Lin et al., 2019) . To further explore SARS-CoV-2 evolution in human, we have performed phylogenetic analysis based on the aforementioned SARS-CoV-2 in correspondence with their amino acid substitution. One hundred and twenty-five newly sequenced SARS-CoV-2 complete genomes were obtained from Global Initiative on Sharing All Influenza Data EpiFluTM database (GISAID EpiFlu TM ) and GenBank. Closely related beta-CoVs genomes sequences from different hosts were also collected and analyzed together with SARS-CoV-2. Open reading frames (orfs) of CoVs genomes were predicted using ORFfinder (v0.4.3) with default parameters ignoring nested orfs. Raw pair-end reads of pangolin dataset sample (SRR10168377) obtained from NCBI were filtered with bbmap.sh (v38.79) by removing adaptors, trimming low quality reads from both sides (quality value < 20), and reads length less than 50 nt were ignored. Host reference genome (pangolin ManJav1.0, GCF_001685135.1) contaminant reads were removed by bowtie2 (v2.3.5.1) [13] . Pangolin CoV genome fragments were assembled via MEGAHIT (v1.2.9) (Li et al., 2015) . The sequences of CoVs were aligned using multiple sequence alignment MAFFT (v7.450) (Katoh et al., 2019) . Aligned sequences were visualized with Jalview (v2.10.3) (Waterhouse et al., 2009) . Poorly aligned regions and gaps were removed by trimAL (v1.4.rev22) (CapellagutiƩrrez et al., 2009) . Maximum likelihood (ML) phylogenetic trees of whole genome sequences were constructed in IQ-TREE (v1.6.12) (Nguyen et al., 2015) . Support for inferred relationships in the phylogenetic tree was assessed by bootstrap analysis with 1000 replicates and the best-fit substitution model was determined by IQ-TREE model test. to be involved in recombination. Similarity scores between genomic sequences were generated by Simplot (v3.5.1) (Lole et al., 1999) . Red color highlights interaction positions of SARS-CoV-2 and pangolin CoVs with different amino acids residues. Middle East respiratory syndrome coronavirus: transmission and phylogenetic evolution SARS-CoV and emergent coronaviruses: viral determinants of interspecies transmission trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses The chronology of the 2002-2003 SARS mini pandemic Isolation and characterization of viruses related to the SARS coronavirus from animals in southern China MAFFT online service: multiple sequence alignment, interactive sequence choice and visualization Virus taxonomy. Ninth report of the International Committee on Taxonomy of Viruses Fast gapped-read alignment with Bowtie 2 MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph Structure of SARS coronavirus spike receptor-binding domain complexed with receptor Bats are natural reservoirs of SARS-like coronaviruses Naturally occurring mutations in PB1 affect influenza A virus replication fidelity, virulence, and adaptability Full-length human immunodeficiency virus type 1 genomes from subtype C-infected seroconverters in India, with evidence of intersubtype recombination Middle East respiratory syndrome coronavirus (MERS-CoV): challenges in identifying its source and controlling its spread The molecular biology of coronaviruses IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies Severe acute respiratory syndrome From SARS coronavirus to novel animal and human coronaviruses Jalview Version 2-a multiple sequence alignment editor and analysis workbench Mechanisms of host receptor adaptation by severe acute respiratory syndrome coronavirus Evolution of the novel coronavirus from the ongoing Wuhan outbreak and modeling of its spike protein for risk of human transmission Management and prevention of SARS in China A novel coronavirus from patients with pneumonia in China The authors have declared that no competing interests exist.