key: cord-263825-g8p2lsr0
authors: Maldonado, Lucas L.; Kamenetzky, Laura
title: Molecular features similarities between SARS-CoV-2, SARS, MERS and key human genes could favour the viral infections and trigger collateral effects
date: 2020-06-25
journal: bioRxiv
DOI: 10.1101/2020.06.23.167072
sha:
doc_id: 263825
cord_uid: g8p2lsr0

In December 2019, a rising number of pneumonia cases caused by a novel β-coronavirus (SARS-CoV-2) occurred in Wuhan, China, and the virus has since spread rapidly worldwide, causing thousands of deaths. The WHO declared the SARS-CoV-2 outbreak a public health emergency of international concern, and many scientists are therefore dedicated to the study of the new virus. Since human viruses have codon usage biases that match highly expressed proteins in the tissues they infect, and depend on the host cell machinery for replication and co-evolution, we selected the genes that are highly expressed in human lung tissue to perform computational studies comparing their molecular features with those of SARS, SARS-CoV-2 and MERS genes. We analysed 91 molecular features for 339 viral genes and 463 human genes, comprising 677,873 codon positions. We found that the A/T bias in viral genes could favour viral infection through a host-dependent specialization that uses the host cell machinery of only some genes. The envelope protein E, the membrane glycoprotein M and ORF7 could have further benefited from a high rate of A/T in the third codon position. Thereby, the mistranslation or de-regulation of protein synthesis could produce collateral effects, as a consequence of viral occupancy of the host translation machinery due to molecular similarities with viral genes.
Furthermore, we provided a list of candidate human genes whose molecular features match those of SARS-CoV-2, SARS and MERS genes, which should be considered for incorporation into genetic population studies to evaluate the susceptibility to respiratory viral infections caused by these viruses. The results presented here lay the basis for further research in the field of human genetics associated with the new viral infection, COVID-19, caused by SARS-CoV-2, and for the development of antiviral preventive methods.

Since its initial outbreak at Huanan Seafood Wholesale Market in Wuhan, China, in late 2019, COVID-19 has affected more than 4 million people and caused more than 300 thousand deaths all around the world.
Since then, scientists have focused not only on studying the biology and dissemination of COVID-19 to control transmission and design proper diagnostic tools and treatments, but also on racing to design a vaccine that could prevent the infection caused by the coronavirus SARS-CoV-2. This virus belongs to the genus Betacoronavirus (β-coronavirus) of the family Coronaviridae, which also comprises three more genera: Alphacoronavirus (αCoV), Gammacoronavirus (γCoV) and Deltacoronavirus (δCoV) (Chen et al., 2020a). Viruses from this family possess a single-stranded, positive-sense RNA genome that ranges from 26 to 32 kb (Su et al., 2016). Coronaviruses have been identified in several host species including humans, bats, civets, mice, dogs, cats, cows and camels (Cavanagh, 2007; Clark, 1993

In order to contribute to solving the health emergency, here we provide a thorough and comprehensive analysis that could help to understand the viability of the virus as well as the susceptibility of the human host to the viral infection, based on the molecular patterns of their genes. The main goals of our work were to study the molecular and evolutionary aspects of the human coronaviruses SARS-CoV-2, SARS and MERS, and to determine the level of similarity in codon usage and molecular features between the genes of human coronaviruses and human genes, in order to identify the factors responsible for codon selection in the viruses. Moreover, we aimed to identify the viral genes essential for viral replication and the human genes whose translation machinery is involved in favouring viral replication, in order to determine whether population genetic variability could be involved in shaping the gene features and therefore contribute to human susceptibility to viral infections.

Up to late April, a total of 500 SARS-CoV-2 β-coronavirus genomes had become available.
The total available sequences of β-coronavirus were downloaded from the NCBI (https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/), including the reference genomes of MERS (NC_019843), SARS (NC_004718) and SARS-CoV-2 (NC_045512), and were classified according to their host. SARS-CoV-2 isolates from different countries were pre-analysed, but only reference genomes were retained due to the low variability of the data. Genome quality was assessed and genomes containing more than 10 gaps were discarded. CDS of representative viruses from the previous classification were selected and analysed. Since human viruses have codon usage biases (CUB) that match highly expressed proteins in the tissues they infect (Miller et

(Sueoka, 1988), the three stop codons (UAA, UAG, and UGA) were excluded in the calculation of P3, and the two single codons for methionine (AUG) and tryptophan (UGG) were excluded from P1, P2, and P3.

The following codon indices were calculated: relative synonymous codon usage (RSCU) (Sharp and Li, 1987), the effective number of codons (ENc) (Wright, 1990), codon adaptation index (CAI) (Lee et al., 2010; Sharp and Li, 1987), codon bias index (CBI) (Bennetzen and Hall, 1982), the frequency of optimal codons (Fop) (Ikemura, 1981), grand average of hydropathicity (GRAVY) (Sharp and Li, 1987), aromaticity (Aromo) (Lobry and Gautier, 1994), GC content at the first, second and third codon positions (GC1, GC2 and GC3), the frequency of either a G or C at the third codon position of synonymous codons (GC3s), the average of GC1 and GC2 (GC12) and translational selection (TrS2).

ENc indicates the degree of codon bias for individual genes. Over a range of values from 20 to 61, lower values indicate higher codon bias, while an ENc equal to 61 means that all codons are used with equal probability (Novembre, 2002; Wright, 1990).
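As an illustration of two of the indices listed above, the following minimal Python sketch computes RSCU and GC3s for a single coding sequence. It is not the software used in this study: the function names (`split_codons`, `rscu`, `gc3s`) and the toy sequence are hypothetical, and the standard genetic code is assumed. RSCU is taken here as the observed count of a codon divided by the count expected if all synonymous codons of that amino acid were used equally, and GC3s as the fraction of G or C at the third position of codons that have synonymous alternatives (Met, Trp and stop codons excluded).

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order (assumption).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def split_codons(seq):
    """Split an in-frame coding sequence into codons (RNA 'U' is read as 'T')."""
    seq = seq.upper().replace("U", "T")
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def rscu(seq):
    """RSCU per codon: observed count * family size / family total."""
    counts = Counter(c for c in split_codons(seq) if c in CODON_TABLE)
    families = {}
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families.setdefault(aa, []).append(codon)
    values = {}
    for family in families.values():
        total = sum(counts[c] for c in family)
        # Skip single-codon amino acids (Met, Trp) and amino acids not observed.
        if len(family) < 2 or total == 0:
            continue
        for c in family:
            values[c] = counts[c] * len(family) / total
    return values


def gc3s(seq):
    """Fraction of G or C at the third position of synonymous codons."""
    syn = [c for c in split_codons(seq) if CODON_TABLE.get(c, "*") not in "*MW"]
    return sum(c[2] in "GC" for c in syn) / len(syn) if syn else float("nan")


toy = "GCTGCTGCTGCCAAAAAG"  # hypothetical sequence: Ala x4, Lys x2
toy_rscu = rscu(toy)
toy_gc3s = gc3s(toy)
```

For the toy sequence, alanine is encoded three times by GCT and once by GCC, so GCT receives an RSCU of 3.0 and GCC an RSCU of 1.0, while the unused GCA and GCG receive 0. Values above 1 indicate codons used more often than expected under equal synonymous usage.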
CAI values measure the extent of bias toward preferred codons in highly expressed genes. CAI values range between 0 and 1.0, with higher CAI values indicating higher expression and higher CUB (Lee et al., 2010; Sharp and Li, 1987), under the assumption that translational selection optimizes gene sequences according to their expression levels.

CBI is another measure of directional codon bias, based on the degree of preferred codons used in a gene, similar to the frequency of optimal codons. It measures the extent to which a gene uses a subset of optimal codons. In genes with extreme codon bias, CBI will be equal to 1, whereas in genes with random codon usage the CBI values will be equal to 0 (Bennetzen and Hall, 1982).

Fop is a species-specific measure of bias towards particular codons that appear to be translationally optimal in a particular species. It can be calculated as the ratio between the number of optimal codons and the total number of synonymous codons. Its values range from 0, if a gene contains no optimal codons, to 1, if a gene is entirely composed of optimal codons (Ikemura, 1981). The determination of optimal codons was carried out based on the axis 1 ordination: the top and bottom 5% of genes were regarded as the high and low bias datasets, respectively. Codon usage in the two datasets was compared using chi-square tests, with the sequential Bonferroni correction to assess significance according to Peden (1999). Optimal codons were defined as those that are used at significantly higher frequencies (p-value < 0.01) in highly expressed genes compared with the frequencies in genes expressed at low levels.

The determination of codon pair biases in coding sequences was performed using CPBias (https://rdrr.io/github/alex-sbu/CPBias/), developed in R, as described by Coleman et al. (2008).
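The CAI and Fop definitions given above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this study. The reference set of highly expressed genes and the set of optimal codons are supplied by the caller, the standard genetic code is assumed, and codons absent from the reference set are given a count of 0.5 so that the logarithm is defined (tools differ on this choice). All function names and toy sequences are hypothetical.

```python
import math
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order (assumption).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def split_codons(seq):
    """Split an in-frame coding sequence into codons (RNA 'U' is read as 'T')."""
    seq = seq.upper().replace("U", "T")
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def relative_adaptiveness(reference_seqs):
    """w = codon count / count of the most used synonymous codon in the reference set."""
    counts = Counter(c for s in reference_seqs for c in split_codons(s))
    w = {}
    for aa in set(CODON_TABLE.values()) - set("*MW"):
        family = [c for c, a in CODON_TABLE.items() if a == aa]
        top = max(counts[c] for c in family)
        if top == 0:
            continue  # amino acid absent from the reference set
        for c in family:
            # Assumption: unseen codons are counted as 0.5 to avoid log(0).
            w[c] = max(counts[c], 0.5) / top
    return w


def cai(seq, w):
    """Geometric mean of w over the codons of seq that have a defined w."""
    logs = [math.log(w[c]) for c in split_codons(seq) if c in w]
    return math.exp(sum(logs) / len(logs)) if logs else float("nan")


def fop(seq, optimal_codons):
    """Optimal codons divided by all synonymous codons (Met, Trp, stops excluded)."""
    syn = [c for c in split_codons(seq) if CODON_TABLE.get(c, "*") not in "*MW"]
    return sum(c in optimal_codons for c in syn) / len(syn) if syn else float("nan")


# Hypothetical reference set: GCT x3, GCC x1, AAA x2, AAG x1.
w = relative_adaptiveness(["GCTGCTGCTGCCAAAAAAAAG"])
cai_preferred = cai("GCTAAA", w)   # uses only the most frequent codons
cai_rare = cai("GCCAAG", w)        # uses only less frequent codons
fop_toy = fop("GCTAAAATGGCC", {"GCT", "AAA"})
```

A gene built only from the most used codons of the reference set reaches a CAI of 1, and any use of rarer synonymous codons lowers the geometric mean. In the Fop example, ATG is ignored because methionine has no synonymous codon, so two of the three remaining codons are optimal.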
The codon pair score (CPS) is defined as the natural logarithm of the ratio of the observed over the expected number of occurrences of a particular codon pair in all protein-coding sequences of a species. The codon pair bias (CPB) was used as an index and also to determine the bias in CPS among the virus and host genes. The expected number of codon pair occurrences estimates how many times a codon pair would be present if there were no association between the two codons that form it. It is also calculated to be independent of codon bias and amino acid frequency (Coleman et al., 2008).

(Greenacre, 1984). The data were normalized according to Sharp and Li (1987) in order to define the relative adaptiveness of each codon (Peden, 1999; Suzuki et al., 2005); the codon usage indices described above were also included as variables. PCA analyses were performed using the "factoextra" R package (https://cloud.r-project.org/web/packages/factoextra/index.html).

showed ENc values that ranged from 49.08 to 55.30, with the human MERS gene, followed by the bat MERS gene, presenting the lowest values.

The genes that encode the spike protein S presented ENc values that ranged from 44.16 to 47.68. The human SARS-CoV-2 gene showed the lowest ENc value, followed by the genes of bat and pangolin SARS-CoV-2. ORF genes presented ENc values that ranged from 26.60 to 57.89, with human SARS-CoV-2 presenting both the lowest and the highest ENc values, for ORF7 and ORF10 respectively.

distantly. This analysis also showed that CPB is highly related to the dinucleotide bias. In this cluster, the ORF genes showed low CPB values, the lowest of all the clusters. Conversely, the genes that encode the spike protein S presented high CPB values.
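For reference, the codon pair score and codon pair bias discussed here can be sketched as follows. This is a minimal Python illustration of the definition of Coleman et al. (2008), not the CPBias R package used in this study. The expected count of a codon pair AB encoding the amino acid pair XY is taken as (N_A × N_B) / (N_X × N_Y) × N_XY, which makes the score independent of codon bias and amino acid frequency, and the CPB of a gene is taken as the mean CPS over its codon pairs. Function names and the toy sequence are hypothetical, and the standard genetic code is assumed.

```python
import math
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order (assumption).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def split_codons(seq):
    """Split an in-frame coding sequence into codons (RNA 'U' is read as 'T')."""
    seq = seq.upper().replace("U", "T")
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def is_sense(codon):
    """True for the 61 amino-acid-encoding codons."""
    return CODON_TABLE.get(codon, "*") != "*"


def codon_pair_scores(cds_list):
    """CPS = ln(observed / expected) for every codon pair seen in cds_list."""
    codon_n, aa_n, pair_n, aa_pair_n = Counter(), Counter(), Counter(), Counter()
    for cds in cds_list:
        cods = split_codons(cds)
        for c in cods:
            if is_sense(c):
                codon_n[c] += 1
                aa_n[CODON_TABLE[c]] += 1
        # Adjacent codons form a pair; pairs touching a stop codon are ignored.
        for a, b in zip(cods, cods[1:]):
            if is_sense(a) and is_sense(b):
                pair_n[a, b] += 1
                aa_pair_n[CODON_TABLE[a], CODON_TABLE[b]] += 1
    scores = {}
    for (a, b), observed in pair_n.items():
        x, y = CODON_TABLE[a], CODON_TABLE[b]
        expected = codon_n[a] * codon_n[b] / (aa_n[x] * aa_n[y]) * aa_pair_n[x, y]
        scores[a, b] = math.log(observed / expected)
    return scores


def codon_pair_bias(cds, scores):
    """CPB of one gene: arithmetic mean of the CPS of its scored codon pairs."""
    cods = split_codons(cds)
    vals = [scores[p] for p in zip(cods, cods[1:]) if p in scores]
    return sum(vals) / len(vals) if vals else float("nan")


# Hypothetical one-gene "genome": GCT AAA GCT AAA GCC AAG.
scores = codon_pair_scores(["GCTAAAGCTAAAGCCAAG"])
cpb_toy = codon_pair_bias("GCTAAA", scores)
```

A positive CPS marks a codon pair that occurs more often than expected from the usage of its two codons and of the encoded amino acid pair, and a negative CPS marks an under-represented pair. In practice the scores would be computed from all coding sequences of the host and then averaged over each viral or human gene.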
found that the total gene repertoire had a similar ENc average that differs by only 1 unit from that of the non-human host they come from, reflecting the molecular features of their original host. Furthermore, as demonstrated in our clustering analysis, codon pair usage seems to depend on the dinucleotide bias, and the CPB was higher for human genes than for viral genes, as previously reported (Kames et al., 2020; Kunec and Osterrieder, 2016).

Moreover, our analyses allowed us not only to distinguish the main factors that contribute to the distribution of the genes along the axes in the PCA, but also to determine particular features that differ between human and non-human viruses in specific genes and that could be important for explaining the evolution of the viral infection. In contrast to SARS-CoV-2 of bats and pangolins, human SARS-CoV-2 exhibited a differential distribution in particular genes that depended mostly on the A/T content in the third

Two viral genes that also present high CPB are ORF1a/b, which encodes the replicase complex (polyproteins pp1a and pp1ab), and the gene for the spike protein S, which participates in the early viral infection by attaching to the host receptor ACE2 and mediating the internalization of the virus (Guo et al., 2020). In our studies, ORF1a/b grouped with the gene that encodes the nucleocapsid protein N, indicating that their molecular features are highly conserved and are also present in several human genes. This result is in concordance with previous works that proposed these genes as candidates for deoptimization in the design of attenuated vaccines due to their high positive CPB values (Kames et al., 2020). Instead, the gene that encodes the spike protein S grouped with ORF7 (involved in viral pathogenesis and apoptosis induction), which also presents high and similar positive CPB values.
For all of them, a higher rate of A/T composition in the third codon position was observed. Changes in the third position produce synonymous substitutions that could have led to codon optimization in human cells, using the host machinery that translates only genes whose molecular features match the viral needs. Some viral genes seem to have been favoured for increased viral replication in humans and optimized by using or mimicking particular molecular patterns of human genes. But only some genes, such as the envelope E, ORF6 and ORF8, could be the key to an exacerbated viral pathogenesis. Furthermore, because of these molecular and codon usage similarities between some highly expressed human genes and viral genes that occupy the same clusters, the translation machinery of the host could favour the translation of viral genes to the detriment of human gene expression in lung tissues. Indeed, mistranslation or de-regulation of protein synthesis has been reported as a consequence of tRNA miss-

In our study, we described the main factors that shape CUB in SARS-CoV-2, SARS and MERS in comparison with highly expressed genes in human lung tissue, and revealed matching features with human genes that could have favoured the virus for an increased pathogenesis. Furthermore, we provided a list of candidate human genes that could be involved in the viral infection and had not yet been described, which could be the key to explaining collateral effects and human susceptibility to viral infections and should be considered for incorporation into genetic population studies.

6. Declarations

CoV-2 (NC_045512) to 4) using a hierarchical method of viral genes for SARS (NC_004718) of the human host and human genes based on the molecular features.
CPB correlation is included on the left for each cluster, relating the CPB of human genes (horizontal axis) and the CPB of the viral genes

to 6) using a hierarchical method of viral genes for SARS (NC_038294) of the human host and human genes based on the molecular features. CPB correlation is included on the left for each cluster, relating the CPB of human genes (horizontal axis) and the CPB of the viral genes

Comparative genomic analysis MERS CoV isolated from humans and camels with special reference to virus encoded helicase
SARS-CoV-2 codon usage bias downregulates host expressed genes with similar codon usage
Bats and coronaviruses
Codon selection in yeast
Chromosome Architecture and Genome Organization
Neurologic complications of COVID-19
Coronavirus avian infectious bronchitis virus
Genomic characterization of the 2019 novel human-pathogenic coronavirus isolated from a patient with atypical pneumonia after visiting Wuhan
Emerging coronaviruses: Genome structure
Emerging coronaviruses: Genome structure
Bovine coronavirus
Virus Attenuation by Genome-Scale Changes in Codon Pair Bias
Origin and evolution of pathogenic coronaviruses
Correlations between the compositional properties of human genes, codon usage, and amino acid composition of proteins
Modulation of host cell death by SARS coronavirus proteins
Neurologic manifestations in an infant with COVID-19
Canine parvovirus type 2 (CPV-2) and Feline panleukopenia virus (FPV) codon bias analysis reveals a progressive adaptation to the new niche after the host jump
Codon usage in bacteria: Correlation with gene expressivity
Theory and applications of correspondence analysis
COVID-19) outbreak - an update on the status
Bats may be SARS reservoir
Selection intensity for codon bias
The Structure of Viruses
Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respective codons in its protein genes: A proposal for a
synonymous codon choice that is optimal for the E
Sequence analysis of SARS-CoV-2 genome reveals features important for vaccine design
Codon Pair Bias Is a Direct Consequence of
PartitionFinder 2: New Methods for Selecting Partitioned Models of Evolution for Molecular and Morphological Phylogenetic Analyses
Pathways to disease from natural variations in human cytoplasmic tRNAs
Relative codon adaptation index, a sensitive measure of codon usage bias
Prevalence and impact of cardiovascular metabolic diseases on COVID-19 in China
Bats are natural reservoirs of SARS-like coronaviruses
Hydrophobicity, expressivity and aromaticity are the major trends of amino-acid usage in 999 Escherichia coli chromosome-encoded genes
Coronavirus genomic RNA packaging
Human viruses have codon usage biases that match highly expressed proteins in the tissues they infect
Live attenuated influenza virus vaccines by computer-aided rational design
How China sees America
Attenuation of human respiratory syncytial virus by genome-scale codon-pair deoptimization
Accounting for Background Nucleotide Composition When Measuring Codon Usage Bias
Sequence comparison of the N genes of five strains of the coronavirus mouse hepatitis virus suggests a three domain structure for the nucleocapsid protein
Analysis of codon usage
Severe acute respiratory syndrome
Fasttree: Computing large minimum evolution trees with profiles instead of a distance matrix
Analysis of codon usage bias of Crimean-Congo hemorrhagic fever virus and its adaptation to hosts
The codon adaptation index - a measure of directional synonymous codon usage bias, and its potential applications
Large-scale recoding of an arbovirus genome to rebalance its insect versus mammalian preference
Fast, scalable generation of high quality protein multiple sequence alignments using
Host influence in the genomic composition of flaviviruses: A multivariate approach
Genetic Recombination, and Pathogenesis of Coronaviruses
Directional mutation pressure and neutral molecular evolution
A problem in multivariate analysis of codon usage data and a possible solution
The adaptation of codon usage of +ssRNA viruses to their hosts
A comprehensive analysis of genome composition and codon usage patterns of emerging coronaviruses
Codon Usage Pattern of Genes
Factors influencing codon usage of mitochondrial ND1 gene in pisces, aves and mammals
Recoding of the vesicular stomatitis virus L gene by computer-aided design provides a live, attenuated vaccine candidate
Review of bats and SARS
The need for urogenital tract monitoring in COVID-19
Covid 19 and the digestive system
The "effective number of codons" used in a gene
Focus on the Crosstalk Between
Deliberate reduction of hemagglutinin and neuraminidase expression of influenza virus leads to an ultraprotective live vaccine in mice
MERS, SARS and other coronaviruses as causes of pneumonia
Isolation of a Novel Coronavirus from a Man with Pneumonia in Saudi Arabia
COVID-19 and the cardiovascular system
Fatal swine acute diarrhoea syndrome caused by an HKU2-related coronavirus of bat origin

XP_005268686    LRRK2    leucine-rich repeat serine/threonine-protein kinase 2 isoform X1
XP_011510923    COL6A5    collagen alpha-5(VI) chain isoform X1
XP_011513433    ABCA13    ATP-binding cassette sub-family A member 13 isoform X2
XP_011513434    ABCA13    ATP-binding cassette sub-family A member 13 i