key: cord-311445-b6bc6vwd authors: Bansal, Kanika; Patil, Prabhu B. title: Codon pattern reveals SARS-CoV-2 to be a monomorphic strain that emerged through recombination of replicase and envelope alleles of bat and pangolin origin date: 2020-10-12 journal: bioRxiv DOI: 10.1101/2020.10.12.335521 sha: doc_id: 311445 cord_uid: b6bc6vwd
Viruses are dependent on the host tRNA pool, and an optimum codon usage pattern (CUP) is a driving force in their evolution. Systematic analysis of the CUP of the replicase (rdrp), spike (S), envelope (E), membrane glycoprotein (M), and nucleocapsid (N) encoding genes of SARS-CoV-2 from reported diverse lineages suggests a one-time host jump of a SARS-CoV-2 isolate into the human host. In contrast to human isolates, a high degree of variation in the CUP of these genes in isolates from bats, pangolins, and dogs suggests that these animals are natural reservoirs of diverse strains. At the same time, our analysis suggests that dogs are not a source of SARS-CoV-2. Interestingly, the CUP of rdrp displays conservation with two bat SARS isolates, RaTG13 and RmYN02. The CUP of the SARS-CoV-2 E gene is also conserved with bat and pangolin isolates, with variations for a few amino acids. This suggests a role for allele replacement in these two genes involving SARS strains of at least two hosts. At the same time, the relatively conserved CUP of replicase and envelope across hosts suggests that they are ideal targets in antiviral development for SARS-CoV-2.
The origin and success of the novel SARS coronavirus (SARS-CoV-2) (betacoronavirus) causing the COVID-19 pandemic have been a topic of intense discussion. In the two decades since the first outbreak of SARS in 2002, several SARS-related coronaviruses have been reported from bats, which were speculated to be significant reservoirs for possible future outbreaks (Cui et al., 2019; Ge et al., 2013; Hu et al., 2017; Li et al., 2005).
Bats, the only flying mammals, represent 20% of known mammalian species and are critical natural reservoirs of many zoonotic viruses such as Nipah virus, Hendra virus, rabies virus, and Ebola virus (Halpin et al., 2000; Leroy et al., 2005; Mackenzie et al., 2001). Besides bats, a considerable number of wild animals have played a pivotal role in zoonotic transfers (Bengis et al., 2004). Reports published before the SARS-CoV-2 pandemic assessed a high risk of SARS coronavirus infection from wild animals such as bats, civets, pangolins, snakes, tigers, and primates in China due to human interventions (Bell et al., 2004; Gottlieb, 2003; Tang et al., 2006). The human-wildlife interface, whether as a part of culture or of globalization, poses risks for zoonotic transfers followed by disease outbreaks such as the coronavirus outbreaks SARS (2002, 2003), MERS (2012), and SARS-CoV-2 (2019). Animal reservoirs for such outbreaks are inferred from the genome similarity of the outbreak virus with SARS viruses already reported from diverse animals. For instance, the SARS 2003 outbreak virus had 99.6% genome similarity with the virus from palm civets, indicating civets to be a direct source. Just 0.4% divergence from the animal reservoir indicates its recent transfer into the masked palm civet population (Shi and Hu, 2008). Despite the genetic divergence of bat SARS-CoV, bats were ultimately found to be the source, because the pathogen was not prevalent in the wild civet population and civets, unlike bats, manifested clinical symptoms (Li et al., 2005). However, in the current pandemic, there are several theories of the origin of SARS-CoV-2: from bat, pangolin, dog, or some intermediate host (Paraskevis et al., 2020; Zhang et al., 2020). The closest match to SARS-CoV-2 is RaTG13 (96% identity), isolated from the Rhinolophus affinis bat (Zhou et al., 2020), followed by pangolin SARS viruses with 91% identity (Zhou et al., 2020).
As the closest match is just 96%, no direct animal source can be identified, which has opened a heated debate in the scientific community about the origin of the virus. According to genome similarity, SARS-CoV-2 differs from its closest relative, the bat coronavirus RaTG13, by 4% and from its next closest relative, pangolin SARS coronavirus, by 9%. This indicates that the virus evolved before infecting humans and that there is a missing link between bat/pangolin and humans, which further fuels the argument on the animal source. Another study, based on CpG deficiency in SARS-CoV-2 and canine coronavirus (alphacoronavirus), suggested that dogs may have provided a cellular environment for the evolution of SARS-CoV-2 into a CpG-deficient virus (Xia, 2020). Hence, that study claims dog to be a direct source of the current pandemic, raising a continuing debate (Pollock et al., 2020) (https://www.linkedin.com/pulse/where-dog-laymans-version-my-mbe-paper-xuhua-xia/). However, most other RNA viruses, such as pestivirus, as well as bat and pangolin SARS-CoV, are also depleted in CpG and were not included in that study. Further, the spike receptor-binding domain of SARS-CoV-2 is more similar to that of pangolin SARS strains than to that of bat SARS strains. CpG deficiency is not a unique feature of dog coronavirus, and Pollock et al. have concluded that there is no direct evidence for the role of dogs as intermediate hosts. Hence, we need to address this issue with other fundamental evidence.
For host jump events, viral codon optimization based on the host tRNA pool is critical (Khandia et al., 2019; Tian et al., 2018; Van Weringh et al., 2011). In the present study, we have focused on the codon usage pattern (CUP) of SARS coronaviruses from the different hosts under debate (bat, pangolin, and dog) as a probable origin for SARS-CoV-2. Usage patterns of synonymous codons are a critical feature in the adaptation of organisms, as viruses are dependent on the host tRNA pool for replication and disease manifestations.
For instance, codon adaptation indices were studied for retroviruses infecting humans, including HIV-1 (RoyChoudhury and Mukherjee, 2013). Once the viral genome enters the host translational machinery, genes with codons optimized for the host are translated faster, resulting in higher fitness of the virus (Carbone, 2008). Hence, an optimum CUP is vital for viral evolution and probable host jumps. Codon optimization also results in synonymous changes in the viral genome, which are not revealed by protein mutational studies. A codon usage study of SARS-CoV-2 found that high AU content influences its codon usage and its better adaptation to humans (Dilucca et al., 2020). Another study, comparing the codon usage pattern of SARS-CoV-2 with those of other betacoronaviruses, suggested that the current pandemic coronavirus is subject to different evolutionary pressures (Gu et al., 2020). However, systematic insights into CUP are required to understand the origin and phenomenal success of emergent viruses like SARS-CoV-2.
The genome of SARS-CoV-2 is a single-stranded, positive-sense RNA of approximately 30,000 nucleotides. The major structural proteins are the spike protein (S), envelope protein (E), membrane glycoprotein (M), and nucleocapsid protein (N); the RNA-dependent RNA polymerase (rdrp) is non-structural. As these five viral proteins are common amongst betacoronaviruses (Woo et al., 2010), they are useful for studying coronaviruses from diverse hosts, i.e., human, bat, pangolin, and dog. Here, we have analyzed the CUP of the genes encoding these five proteins in SARS coronaviruses; these proteins are considered important targets for vaccine and antiviral development for SARS-CoV-2 (Aftab et al., 2020; Du et al., 2009; Huang et al., 2020; Wu et al., 2020). CUP calculations should be based on highly redundant amino acids rather than amino acids having one or two codons.
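As an illustration of a calculation restricted to highly redundant amino acids, the following minimal Python sketch computes, for one in-frame coding sequence, the percentage of synonymous codons that end in G or C for each of the eight amino acids encoded by at least four codons in the standard genetic code. It is not the program used in this study; the function name `gc_ending_percent` and the decision to silently skip codons outside the eight families (including any codon with an ambiguous base) are assumptions made for this example.

```python
from collections import defaultdict

# Codon families (standard genetic code) for the eight amino acids
# that have at least four synonymous codons.
CODON_FAMILIES = {
    "Gly": ["GGT", "GGC", "GGA", "GGG"],
    "Val": ["GTT", "GTC", "GTA", "GTG"],
    "Thr": ["ACT", "ACC", "ACA", "ACG"],
    "Leu": ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"],
    "Arg": ["CGT", "CGC", "CGA", "CGG", "AGA", "AGG"],
    "Ser": ["TCT", "TCC", "TCA", "TCG", "AGT", "AGC"],
    "Pro": ["CCT", "CCC", "CCA", "CCG"],
    "Ala": ["GCT", "GCC", "GCA", "GCG"],
}
CODON_TO_AA = {c: aa for aa, codons in CODON_FAMILIES.items() for c in codons}


def gc_ending_percent(cds):
    """Return {amino acid: % of its codons ending in G or C} for one CDS.

    Hypothetical helper: `cds` is assumed to be an in-frame coding
    sequence (DNA or RNA alphabet). Amino acids that do not occur in
    the sequence are left out of the result.
    """
    seq = cds.upper().replace("U", "T")
    total = defaultdict(int)
    gc_ending = defaultdict(int)
    # Walk the sequence codon by codon, ignoring a trailing partial codon.
    for i in range(0, len(seq) - len(seq) % 3, 3):
        codon = seq[i:i + 3]
        aa = CODON_TO_AA.get(codon)
        if aa is None:
            continue  # codon outside the eight families, or ambiguous base
        total[aa] += 1
        if codon[2] in "GC":
            gc_ending[aa] += 1
    return {aa: 100.0 * gc_ending[aa] / total[aa]
            for aa in CODON_FAMILIES if total[aa] > 0}
```

Calling `gc_ending_percent` on each gene of each genome would give one value per amino acid and gene, which is the kind of profile compared across isolates in this work.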
Here, we have calculated the percentage of GC-biased synonymous codons for amino acids having at least four synonymous codons (Glycine, Valine, Threonine, Leucine, Arginine, Serine, Proline and Alanine) (Patil and Sonti, 2004). We have compared the CUP for all five genes amongst 134 SARS-CoV-2 genomes representing all the sub-lineages of the two major lineages A and B (supplementary table 1). Genes containing ambiguous nucleotides were omitted from the analysis. The CUPs of all the genes were uniform for all SARS-CoV-2 isolates reported from the human population worldwide, irrespective of ancestral or most recent lineages (supplementary figures 1, 2, 3, 4, and 5), indicating a single event of zoonotic transfer of a thriving strain of SARS. However, the five genes have CUPs distinct from one another, indicating distinct selection pressures (figure 1).
Since the horseshoe bat SARS strains RaTG13 and RmYN02 are known to be the closest strains to SARS-CoV-2, we have compared the CUP of these two strains with that of SARS-CoV-2 (figure 2). Here, the rdrp and E patterns overlapped for all three strains, with slight variations in E for serine and valine. However, the three strains have distinct patterns for the S, M, and N genes. This reaffirms previous studies showing that RaTG13 and RmYN02 are close to SARS-CoV-2 (Zhou et al., 2020), while also indicating that SARS-CoV-2 is distinct from them. The unique S, M, and N patterns indicate that SARS-CoV-2 has evolved to adapt to humans. To look for the possibility of intermediate hosts, we analyzed other known SARS viruses from hypothesized hosts. Unlike in SARS-CoV-2 isolates from humans, CUP is variable in isolates from non-human hosts such as bat, pangolin, and dog (supplementary table 2), depicting ongoing adaptation and evolution of SARS in these hosts (figure 3). Interestingly, variations in the CUP of the S, M, and N genes are more marked across non-human hosts than those of the rdrp and E genes.
For instance, rdrp CUP in pangolin and dog isolates does not show that degree of variation and is host-specific, in contrast to bat rdrp. Bat CUP for rdrp is variable across all the strains under study, correlating with the fact that bat is a reservoir of SARS coronaviruses. In contrast, the CUP of E is conserved in bat and pangolin. Interestingly, the rdrp CUP of SARS-CoV-2 shows a slight overlap in pattern with all pangolin SARS strains, but the overlap is not as prominent as with the two bat strains (figure 2). Among bat SARS strains, the pattern matches just the two previously suggested strains, i.e., RaTG13 and RmYN02. In contrast, the SARS-CoV-2 CUP of E overlaps with that of all bat isolates except JTMC15, apart from the serine and valine values. Interestingly, all pangolin strains share precisely the same pattern, which also overlaps with that of SARS-CoV-2 except for valine. Unlike the rdrp and E CUP of SARS-CoV-2, the S, M, and N CUP does not have any similarity with the non-human hosts studied till now. Further, the absence of similarity in the CUP of all five genes with dog isolates rules out a role of dog in SARS-CoV-2 evolution.
The above analysis reveals a single pattern for each of the five genes across the different lineages of SARS-CoV-2, which affirms a single event of host jump of a codon-optimized SARS strain from its animal reservoir. However, amongst the five SARS-CoV-2 genes, distinct CUPs are observed, pointing towards different evolutionary forces acting on the structural and non-structural genes. On comparing the CUP of the current pandemic strain with those from all the hypothesized hosts, bat RaTG13 and RmYN02 were found to be the sources of the rdrp and E genes but not of the S, M, and N genes. This contradicts the previous study correlating the rdrp, S, and N proteins with the RaTG13 isolate (Gu et al., 2020). Though RaTG13 is the closest match to the current pandemic strain, some genes of SARS-CoV-2 do not share CUP with it. This indicates codon optimization in SARS-CoV-2 according to the new host (human) as compared to its animal reservoir.
Further, no gene has any overlap with dog coronavirus strains, ruling out the possibility of dog as an animal reservoir. Nevertheless, the S, M, and N genes, whose source remains unknown, play pivotal roles in host interactions and disease manifestations. Gene adaptations in S, M, and N have been pivotal in the emergence of the SARS viral pool in the hosts and in the current pandemic, suggesting that S, M, and N have been under drastic variation both within and across hosts. Either hyper-recombination events or selection pressure can explain the disparity in CUP conservation between rdrp and E on the one hand and S, M, and N on the other. Hence, mutation is not the dominant force behind the high variability across hosts. We can conclude that evolution in the host interaction steps (S, M, and N) has more significance than evolution at the level of replication (rdrp) and packaging (E). In addition, out of the five genes in bat, only E shows a unified CUP, overlapping with pangolin and SARS-CoV-2. This indicates that rdrp and E are under purifying selection and could be better targets for antiviral development. High selection pressure on the S, M, and N genes indicates that antiviral development attempts targeting these may be futile.
For centuries, bats have been exploited as medicines and food products in southern China and Asia (Mickleburgh et al., 2002). Hence, wildlife-human interactions have been critical for the evolution of human SARS coronavirus outbreaks. The high diversity of CUP in bats compared to humans and pangolins mirrors the high genetic diversity of SARS in bats (Li et al., 2005). Further, some of the SARS-related coronaviruses are found to have broad species tropism (Ge et al., 2013). This implies that bats have provided a playground for SARS to mutate or recombine, resulting in a strain that might be more successful in other hosts (in this case, humans) than in bats.
The resulting strain (SARS-CoV-2) might be an accidental strain that originated from recombination in bat but was unfit in bat or in its unknown animal reservoir, and therefore left no evolutionary footprint. A strain highly similar to the present pandemic strain could not be found, which warrants further studies to explore the diversity of SARS-like coronaviruses.
For each gene, the frequency of codon usage for the different amino acids was calculated using a web-based program (www.bioinformatics.org). Further, eight amino acids that have at least four synonymous codons, i.e., Glycine, Valine, Threonine, Leucine, Arginine, Serine, Proline and Alanine, were selected, and the percentage of synonymous codons that end with G or C was calculated for each amino acid and gene. The pattern for a group of genes was obtained by plotting the mean values ± SD corresponding to each amino acid.
Zhang, T., Wu, Q., and Zhang, Z. (2020). Probable pangolin origin of SARS-CoV-2 associated with the COVID-19 outbreak.
Zhou, P., Yang, X.-L., Wang, X.-G., Hu, B., Zhang, L., Zhang, W., Si, H.-R., Zhu, Y., Li, B., and Huang, C.-L. (2020). A pneumonia outbreak associated with a new coronavirus of probable bat origin. Nature 579, 270-273.
Analysis of SARS-CoV-2 RNA-dependent RNA polymerase as a potential therapeutic drug target using a computational approach.
Animal origins of SARS coronavirus: possible links with the international trade in small carnivores.
The role of wildlife in emerging and re-emerging zoonoses. Revue scientifique et technique-office international des epizooties 23.
Evolutionary origins of the SARS-CoV-2 sarbecovirus lineage responsible for the COVID-19 pandemic. bioRxiv.
Codon bias is a major factor explaining phage evolution in translationally biased hosts.
Origin and evolution of pathogenic coronaviruses.
Codon Usage and Phenotypic Divergences of SARS-CoV-2 Genes.
The spike protein of SARS-CoV - a target for vaccine and therapeutic development.
Isolation and characterization of a bat SARS-like coronavirus that uses the ACE2 receptor.
Chinese scientists must test wild animals to find the host of SARS.
Multivariate analyses of codon usage of SARS-CoV-2 and other betacoronaviruses.
Isolation of Hendra virus from pteropid bats: a natural reservoir of Hendra virus.
Discovery of a rich gene pool of bat SARS-related coronaviruses provides new insights into the origin of SARS coronavirus.
Pharmacological therapeutics targeting RNA-dependent RNA polymerase, proteinase and spike protein: from mechanistic studies to clinical trials for COVID-19.
Analysis of Nipah virus codon usage and adaptation to hosts.
Fruit bats as reservoirs of Ebola virus.
Bats are natural reservoirs of SARS-like coronaviruses.
Emerging viral diseases of Southeast Asia and the Western Pacific.
A review of the global conservation status of bats.
Full-genome evolutionary analysis of the novel corona virus (2019-nCoV) rejects the hypothesis of emergence as a result of a recent recombination event.
Variation suggestive of horizontal gene transfer at a lipopolysaccharide (lps) biosynthetic locus in Xanthomonas oryzae pv. oryzae, the bacterial leaf blight pathogen of rice.
Viral CpG deficiency provides no evidence that dogs were intermediate hosts for SARS-CoV-2.
A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology.
Complex codon usage pattern and compositional features of retroviruses. Computational and mathematical methods in medicine.
A review of studies on animal reservoirs of the SARS coronavirus.
The adaptation of codon usage of +ssRNA viruses to their hosts.
HIV-1 modulates the tRNA pool to improve translation efficiency.
Coronavirus genomics and bioinformatics analysis.
Vaccines and therapies in development for SARS-CoV-2 infections.
Extreme genomic CpG deficiency in SARS-CoV-2 and evasion of host antiviral defense.
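As a supplementary sketch of the group pattern described in the methods (mean ± SD of the G- or C-ending codon percentage per amino acid across a set of genes), the following Python fragment may be useful. It is an illustration only, not the authors' workflow: the function name `cup_pattern` and the input layout (one dictionary of percentages per gene copy) are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not state which estimator was plotted.

```python
from statistics import mean, stdev


def cup_pattern(per_gene_percentages):
    """Summarise a group of genes as {amino acid: (mean, SD)}.

    `per_gene_percentages` is a list with one dict per gene copy,
    mapping an amino acid name to its percentage of G- or C-ending
    codons (hypothetical layout chosen for this example).
    """
    amino_acids = sorted({aa for d in per_gene_percentages for aa in d})
    pattern = {}
    for aa in amino_acids:
        values = [d[aa] for d in per_gene_percentages if aa in d]
        # Assumption: sample SD; a single value is reported with SD 0.0.
        sd = stdev(values) if len(values) > 1 else 0.0
        pattern[aa] = (mean(values), sd)
    return pattern


# Made-up percentages for two copies of the same gene, for illustration only.
example = cup_pattern([
    {"Gly": 40.0, "Ala": 50.0},
    {"Gly": 60.0, "Ala": 50.0},
])
```

Under this reading, a group of isolates with a uniform pattern would show SD values close to zero for every amino acid, whereas a variable group would show wide SDs.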