key: cord-0858468-3wt7m0n6
authors: Das, Jayanta Kumar; Roy, Swarup
title: Comparative Analysis of Human Coronaviruses Focusing on Nucleotide Variability and Synonymous Codon Usage Pattern
date: 2020-07-28
journal: bioRxiv
DOI: 10.1101/2020.07.28.224386
sha: f4573845601f8ee99e6a53d53a2bf082f89b3675
doc_id: 858468
cord_uid: 3wt7m0n6

The prevailing worldwide pandemic caused by SARSCoV-2 is drawing great attention towards discovering its evolutionary origin. We perform an exploratory study to understand the variability of the whole coding region of possible proximal evolutionary neighbours of SARSCoV-2. We consider seven human coronavirus strains from six different species as candidates for our study. First, we observe considerable variability of nucleotides across the candidate strains. We do not find significant variation in GC content across the strains at the first and second codon positions. However, we see large variability in GC content at the third codon position (GC3), and the mean GC contents of the pairs (SARSCoV, MERSCoV) and (SARSCoV-2, hCoV229E) are quite close. Examining the relative abundance of dinucleotides, we find a shared genetic pattern, i.e., high usage of the GC and CT nucleotide pairs at the first two positions (P12) and the last two positions (P23) of codons, respectively. We also observe a low abundance of the CG pair, which might help in their evolutionary process. Secondly, considering the RSCU score, we find substantial similarity among the mild-class coronaviruses, i.e., hCoVOC43, hCoVHKU1, and hCoVNL63, based on their codons with a high RSCU value (≥ 1.5); the minimum number of such codons (nine) is observed for MERSCoV. We see seven codons, ATT, ACT, TCT, CCT, GTT, GCT and GGT, with a high RSCU value that are common to all seven strains. These codons mostly encode amino acids from the aliphatic and hydroxyl groups. A phylogenetic tree built using RSCU features reveals proximity between hCoVOC43 and hCoV229E (mild).
Thirdly, we perform linear regression analysis between the GC content at different codon positions and the ENC value. We observe a strong correlation (significant p-value) between GC2 and GC3 for SARSCoV-2, hCoV229E and hCoVNL63, and between GC1 and GC3 for hCoV229E, hCoVNL63 and SARSCoV. We believe that our findings will help in understanding the mechanisms of human coronaviruses.

proteins compose the spikes on the viral surface and bind to host receptors [11]. However, many of these se-

The frequency of dinucleotides is also important, as it may affect codon usage [19]. In this context,

in Table 1, which are utilized in our subsequent analysis.

Similarly, we calculate the nucleotide-pair composition at a particular position p as follows:

where $f_x$ and $f_y$ represent the individual frequencies of nucleotides x and y, respectively, and $f_{xy}$ is the frequency of the dinucleotide (xy) in the same sequence. We

means that the codon is used more frequently.

where $X_i$ is the number of occurrences of the jth codon for the ith amino acid, which is encoded by $n_i$ synony-

where $F_i$ denotes the average homozygosity for the

We observe that the content of A is high for SARSCoV-2 and low for MERSCoV and hCoVNL63; the con-

Table 4. RSCU values with a high score (≥ 1.5) are highlighted for all seven strains. We observe the max-

[34]. To understand the evolutionary mechanism, we con-

all seven strains of coronaviruses is shown in Figure 8. The solid line in the figure represents the regression line.
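The dinucleotide relative abundance and RSCU quantities defined above can be illustrated with a minimal Python sketch. The formulas themselves did not survive in this text, so the code assumes the conventional forms that match the surviving symbol definitions: the odds ratio $f_{xy} / (f_x f_y)$ for a dinucleotide, and, for RSCU, the observed codon count divided by the mean count over the synonymous codons of the same amino acid. The function names (`relative_abundance`, `rscu`, `codons`) are hypothetical, and the dinucleotide count is taken over the whole sequence rather than per codon position (P12, P23), which is a simplification.

```python
from collections import Counter
from itertools import product

# Standard genetic code (DNA alphabet), built from the canonical TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: aa for (a, b, c), aa in zip(product(BASES, repeat=3), AMINO)
}


def codons(seq):
    """Split a coding sequence into complete, unambiguous codons."""
    seq = seq.upper()
    return [
        seq[i:i + 3]
        for i in range(0, len(seq) - 2, 3)
        if seq[i:i + 3] in CODON_TABLE
    ]


def relative_abundance(seq, x, y):
    """Assumed odds-ratio form f_xy / (f_x * f_y) over the whole sequence."""
    seq = seq.upper()
    n = len(seq)
    if n < 2:
        raise ValueError("sequence must contain at least two nucleotides")
    f_x = seq.count(x) / n
    f_y = seq.count(y) / n
    if f_x == 0 or f_y == 0:
        return 0.0  # the pair cannot occur if either nucleotide is absent
    # Count overlapping occurrences of the dinucleotide xy.
    pairs = sum(1 for i in range(n - 1) if seq[i:i + 2] == x + y)
    f_xy = pairs / (n - 1)
    return f_xy / (f_x * f_y)


def rscu(seq):
    """Assumed conventional RSCU: count * n_i / total count of the family."""
    counts = Counter(codons(seq))
    families = {}
    for codon, aa in CODON_TABLE.items():
        if aa != "*":  # stop codons are excluded
            families.setdefault(aa, []).append(codon)
    scores = {}
    for family in families.values():
        total = sum(counts[c] for c in family)
        if total == 0:
            continue  # amino acid absent from this sequence
        for c in family:
            scores[c] = counts[c] * len(family) / total
    return scores


if __name__ == "__main__":
    toy = "GCTGCTGCTGCC"  # toy sequence, not real viral data
    high = {c: s for c, s in rscu(toy).items() if s >= 1.5}
    print(high)
    print(relative_abundance(toy, "C", "G"))
```

Under these assumptions, a codon with a score above 1 is used more often than expected under uniform synonymous usage, and filtering on `s >= 1.5` mirrors the high-RSCU threshold used in the text.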
The details of the regression line, with the statistically significant p-value and $R^2$ value, are shown in

Coronaviruses post-SARS: update on replication and pathogenesis
Coronavirus as a possible cause of severe acute respiratory syndrome
Commentary: Middle East respiratory syndrome coronavirus (MERS-CoV): announcement of the coronavirus study group
MERS-CoV: understanding the latest human coronavirus threat
Emerging viruses in human populations
Nsp3 of coronaviruses: Structures and functions of a large multi-domain protein
Cryo-EM structure of the SARS coronavirus spike glycoprotein in complex with its host cell receptor ACE2
Recent evidence for evolution of the genetic code., Microbiology and Molecular
Synonymous but not the same: the causes and consequences of codon bias
Forces that influence the evolution of codon bias
Moderate mutation rate in the SARS coronavirus genome and its implications
Analysis of the codon usage pattern in Middle East respiratory syndrome coronavirus
Genetic diversity among SARS-CoV2 strains in South America may impact performance of molecular detection, medRxiv
Understanding the origin of 'batcovratg13', a virus closest to sars
The proximal origin of SARS-CoV-2
Codon pair bias is a direct consequence of dinucleotide bias
An evolutionary perspective on synonymous codon usage in unicellular organisms

The neutral analysis of GC3 against GC1/GC2/GC12 for all seven strains of coronaviruses. The solid line represents the regression line.
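The regression of GC3 against GC12 described in the caption can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: it assumes one point per gene, takes GC12 as the mean of GC1 and GC2, and fits GC12 on GC3 by ordinary least squares. The function names (`gc_by_position`, `fit_line`, `neutrality_fit`) are hypothetical. The p-value is not computed here; `scipy.stats.linregress` could supply it if SciPy is available.

```python
def gc_by_position(cds):
    """Return (GC1, GC2, GC3): GC fraction at each codon position."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # drop an incomplete trailing codon
    if usable == 0:
        raise ValueError("sequence is shorter than one codon")
    fractions = []
    for pos in range(3):
        bases = cds[pos:usable:3]
        fractions.append(sum(b in "GC" for b in bases) / len(bases))
    return tuple(fractions)


def fit_line(xs, ys):
    """Ordinary least squares fit y = slope * x + intercept, plus R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0:
        raise ValueError("need at least two distinct x values")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 0.0
    return slope, intercept, r_squared


def neutrality_fit(genes):
    """Fit GC12 (y) against GC3 (x), one point per coding sequence."""
    gc12, gc3 = [], []
    for cds in genes:
        g1, g2, g3 = gc_by_position(cds)
        gc12.append((g1 + g2) / 2)  # assumption: GC12 is the mean of GC1 and GC2
        gc3.append(g3)
    return fit_line(gc3, gc12)


if __name__ == "__main__":
    toy_genes = ["GCAGCA", "ATGATG", "GTAACG"]  # toy sequences, not real data
    slope, intercept, r2 = neutrality_fit(toy_genes)
    print(slope, intercept, r2)
```

In a neutrality plot of this kind, a slope near 1 is usually read as mutation pressure dominating, while a slope near 0 suggests stronger selective constraint. That interpretation is the conventional one and is not stated in the surviving text.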