key: cord-305581-0bqxwh1o
authors: Hassan, Sk. Sarif; Choudhury, Pabitra Pal; Roy, Bidyut
title: Molecular phylogeny and missense mutations of envelope proteins across coronaviruses
date: 2020-09-12
journal: Genomics
DOI: 10.1016/j.ygeno.2020.09.014
sha:
doc_id: 305581
cord_uid: 0bqxwh1o

The envelope (E) protein is one of the structural viroporins (76–109 amino acids) present in coronaviruses. Sixteen distinct E protein sequences were observed across a total of 4917 complete SARS-CoV2 genomes available in the NCBI database as of 18 June 2020. Missense mutations in the envelope protein across various coronaviruses of the β-genus were analyzed to identify the immediate parental origin of the envelope protein of SARS-CoV2. The evolutionary origin is also supported by phylogenetic analysis of the envelope proteins, comparing sequence homology as well as amino acid conservation.

A novel coronavirus has been causing the ongoing, life-threatening pandemic that the world has been experiencing since December 2019 [1]. [...] non-structural, accessory, etc.), among which two major structural proteins of the coronaviruses [...] maintenance of epithelial polarity in mammals [14]. Almost all the proteins encoded by SARS-CoV2 have been mutating, as evidenced over the past few months [15, 16, 17]. It is hard to infer whether mutations in the E protein lead to people being infected and sickened differentially by COVID-19. To understand the effect of mutations on the various proteins, one needs to collect all the mutations in these proteins from the wide variety of SARS-CoV2 genomes available worldwide. On the other hand, the most unsettled and controversial issue is the source, or proximal origin, of SARS-CoV2. The pattern of genetic differences and the motifs of the proteins present in SARS-CoV2 distinguish it from any other known coronavirus, and this is what makes the discovery of its origin hard [18, 19]. Zhang, Wu et al. 
show that the natural reservoirs of SARS-CoV2 are Bat and Pangolin [20]. Recently, a study based on genomic and protein sequences from a few coronaviruses of different hosts, including human, reported that the Pangolin may not be an intermediate host for transmission of the coronavirus from bat to human [21]. Here, we address the transmission issue by analyzing mutations in one of the most conserved proteins (the envelope protein) across SARS-CoV2 and other host-CoV genomes. In this study, using protein sequences from a large number of coronaviruses from different hosts, including human, we analyzed the phylogenetic relationship among the different hosts. A comparative investigation of the envelope (E) protein of CoVs of the β-genus, including SARS-CoV2, was performed from the perspective of missense mutations as well as the molecular organization of the amino acids in the envelope proteins, in order to gain insight into the intermediate hosts.

This study considered all the envelope proteins of coronaviruses from different hosts, viz. Bat, Camel, Cat, Cattle, Pangolin, Chimpanzee and human (SARS-CoV2). Table 1 presents the total number of available CoV genomes for each host as well as the number of distinct envelope proteins. Among these 72 distinct envelope proteins, the most common envelope proteins of the respective host CoVs are listed below.

Here we present the missense mutations of the E protein of the CoVs of Bat, Camel, Cat (Feline), Cattle (Bovine) and human SARS-CoV2. The envelope (E) proteins of the Pangolin and Chimpanzee CoVs are 100% conserved, as presented in Table 1, and consequently no mutations were found in them. To detect the missense mutations, we performed multiple sequence alignment of the E protein sequences using the Clustal Omega server [24, 25]. Table 4 describes the amino acid residues and their respective colors and properties. 
These notations are used in Fig. 2.

Table 4: Amino acid residues and their respective color and property used in Fig. 2. [...] conservation between groups of weak similarity [25].

Among the 79 available complete Bat-CoV genomes, twenty-five sequences possess various mutations in the three domains of the E protein, as presented in Fig. 2. The missense mutations in the E proteins of Bat-CoV, with their respective domains, are described in Table 5. There is a variety of mutations in the envelope proteins of Bat-CoV. Most of the frame-shift mutations occurred in the C-terminal domain of the protein. There are also mutations in the other two domains, viz. the TMD and the N-terminal domain. Clearly, changes in the R-group property of the amino acid residues in the three domains of the E protein, from hydrophobic/acidic to hydrophilic/basic, may affect the function of the envelope protein. Notably, the envelope protein sequences QDF43841, YP_009273007, AIA62348 and ATQ39391 possess mutations at cysteine residues, namely C40V, C40I, C44V and C44I, respectively. The E protein sequences AIA62357, ASL68958, AHY61342 and AUM60029 contain the mutation C44A. These missense mutations at cysteine residues would affect virus growth, release, entry, protein transport, and stability [26]. An important mutation, V25C, is found in the TMD of the E protein in the genome YP_009273007; it might stop the ion channel activity and lead to in vivo attenuation. The TMD of the E protein of the Bat-CoV sequences AIA62348, ASL68947, AIA62357, ASL68958 and ATQ39391 contains the mutation F26T, which may also stop the ion channel activity [27, 28, 29]. Mutations in the motif 'DFLV' might also affect its binding to the PALS1 protein and accordingly may influence replication and/or infectivity of the virus [30].

Among the 269 available complete Camel-CoV genomes, only 9 possess mutations, as presented in Fig. 3. Table 6 [...] 
It is worth noting that although the variability of the E proteins is comparatively very high, the N-terminal of each E protein is absolutely conserved. The envelope proteins of the cattle CoV are highly conserved, as shown in Fig. 5. It is noted that there are two frame-shifts in the N-terminal sequence.

The E protein is present in all 4917 SARS-CoV2 genomes available in the NCBI database as of 18 June 2020. There are only sixteen distinct E proteins across these 4917 genomes. The mutations of the E proteins (presented in Table 7) were determined through multiple sequence alignment, as shown in Fig. 6. Note that the mutations in the C-terminal domain of the E protein from SARS-CoV to SARS-CoV2 have already been described in an unpublished article [31]. [...] be mentioned that the SARS-CoV2 E protein is very close to that of the Pangolin-CoV from the variability perspective. This closeness is also supported by sequence-based homology.

Here we illustrate the phylogenetic relationship among the E proteins (the most common E protein of each host, as mentioned in Table 3) across different CoVs based on sequence homology, as shown in Fig. 7. From the phylogeny in Fig. 7, it follows that, among the E proteins of all the host CoVs, the E proteins of Pangolin-CoV and SARS-CoV2 are very close to each other. To obtain a more detailed phylogenetic relationship among the E proteins of the host CoVs, we further constructed an amino acid frequency-based phylogeny. We determined the amino acid frequencies of the most common E protein of each host CoV, as tabulated in Table 8. Based on the frequency vector of each E protein, pairwise Euclidean distances were calculated, and the phylogeny was derived from them (Fig. 8). The sequence-based homology of the 72 distinct E proteins across the different host CoVs is presented in Fig. 9.
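The frequency-vector comparison described above can be illustrated with a minimal Python sketch. It assumes that the frequency vector holds raw residue counts over the 20 standard amino acids (the text does not state whether counts or relative frequencies were used). The function names are hypothetical, and the two short sequences are made-up placeholders, not real E proteins.

```python
from collections import Counter
from itertools import combinations
import math

# The 20 standard amino acids, in a fixed order so vectors are comparable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_frequency_vector(seq):
    """Count each standard amino acid in seq (assumption: raw counts)."""
    counts = Counter(seq.upper())
    return [counts.get(aa, 0) for aa in AMINO_ACIDS]


def euclidean(u, v):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pairwise_distances(named_seqs):
    """Map each unordered pair of names to the distance of their vectors."""
    vectors = {name: aa_frequency_vector(s) for name, s in named_seqs.items()}
    return {
        (a, b): euclidean(vectors[a], vectors[b])
        for a, b in combinations(sorted(vectors), 2)
    }


if __name__ == "__main__":
    # Placeholder sequences for illustration only (not real E proteins).
    toy = {"host_A": "MYSFVSEE", "host_B": "MFLLPRKG"}
    for pair, dist in pairwise_distances(toy).items():
        print(pair, round(dist, 3))
```

The resulting distance table could be passed to any distance-based tree-building method; the text does not specify which one was used for Fig. 8.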
In Table 9, the frequency of each amino acid is computed for each E protein of the CoVs, which yields the amino acid conservation-based phylogeny (Fig. 10).

Table 9: Frequency of amino acids over the envelope proteins across the seven different host-CoVs

Table 10 [...]

Based on the amino acid frequency vector of each protein, the Shannon entropy (SE) was computed and is tabulated in Table 10. The SE of the amino acid conservation of the E protein suggests molecular-level closeness of the E proteins. This phylogenetic relationship is supported by the amino acid conservation and the associated SE given in Table 10. We also observed a phylogenetic relationship among the E proteins of Bat-CoV, Pangolin-CoV and SARS-CoV2 (Fig. 11). This relationship was drawn using the amino acid conservation and the associated SE (Table 10).

Here, we performed phylogenetic analysis using the protein sequences of coronaviruses from different hosts, although other investigators have also performed phylogenetic analysis using the genomic and protein sequences of a few coronaviruses from different hosts [21]. However, phylogenetic analysis using a large number of protein sequences may provide a better picture of the relationship among hosts, as far as the intermediate host between bat and human is concerned, since the protein is the functional unit in the cell. This study, using protein sequence variations, may therefore provide a clue as to why a few hosts are resistant or sensitive to COVID-19. We observed variations in the E protein sequences of human SARS-CoV2, Bat-CoV, Camel-CoV, etc., but found no variation in the E protein sequences of Pangolin-CoV and Chimpanzee-CoV. Based on mutation characteristics and amino acid conservation in the E proteins across various host CoVs, this report predicts that the potential close kin of human SARS-CoV2 are the Pangolin-CoV and Bat-CoV, as was also reported in a recent study [21].
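The Shannon entropy (SE) of an amino acid frequency vector, as tabulated in Table 10, can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: it uses relative frequencies p_i = count_i / length, the standard definition SE = -sum(p_i * log2(p_i)), and a base-2 logarithm, which the text does not specify. The function name is hypothetical.

```python
from collections import Counter
import math


def shannon_entropy(seq):
    """Shannon entropy (in bits) of the residue composition of seq.

    Assumption: base-2 logarithm; the source does not state the base.
    """
    if not seq:
        raise ValueError("sequence must be non-empty")
    n = len(seq)
    counts = Counter(seq.upper())
    # Only residues that occur contribute, so log2(0) is never evaluated.
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


if __name__ == "__main__":
    # Placeholder sequence for illustration only (not a real E protein).
    print(shannon_entropy("MYSFVSEE"))
```

Under these assumptions, a sequence made of a single repeated residue has entropy 0, and the entropy grows as the residue composition becomes more even, so proteins with similar composition have similar SE values.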
That Pangolin-CoV is the closest kin of SARS-CoV2 is also confirmed by the analysis made in this study. The missense mutations of the E protein across various host CoVs may impair the usual functions of the envelope protein, and consequently the virus may become weaker in infectivity. It is our belief that various missense mutations in the E protein could weaken SARS-CoV2 and would help us get rid of COVID-19 in the future.

The COVID-19 infection and rheumatoid arthritis: Faraway, so close!
Structural insights into SARS coronavirus proteins
Biochemical and functional characterization of the membrane association and membrane permeabilizing activity of the severe acute respiratory syndrome coronavirus envelope protein
Subcellular location and topology of severe acute respiratory syndrome coronavirus envelope protein
Severe acute respiratory syndrome coronavirus envelope protein ion channel activity promotes virus fitness and pathogenesis
SARS coronavirus E protein forms cation-selective ion channels
Structural flexibility of the pentameric SARS coronavirus envelope protein ion channel
Channel-inactivating mutations and their revertant mutants in the envelope protein of infectious bronchitis virus
Exploring the formation and the structure of synaptobrevin oligomers in a model membrane
Importance of conserved cysteine residues in the coronavirus envelope protein
44-amino-acid E5 transforming protein of bovine papillomavirus requires a hydrophobic core and specific carboxyl-terminal amino acids
The human and simian immunodeficiency virus envelope glycoprotein transmembrane subunits are palmitoylated
Expression of SARS-coronavirus envelope protein in Escherichia coli cells alters membrane permeability
The SARS coronavirus E protein interacts with PALS1 and alters tight junction formation and epithelial morphogenesis
Moderate mutation rate in the SARS coronavirus genome and its implications
Molecular conservation and differential mutation on ORF3a gene in Indian SARS-CoV2 genomes
Rare mutations in the accessory proteins ORF6, ORF7b and ORF10 of the SARS-CoV2 genomes
The proximal origin of SARS-CoV-2
A genomic perspective on the origin and emergence of SARS-CoV-2
Probable pangolin origin of SARS-CoV-2 associated with the COVID-19 outbreak
Relative von Neumann entropy for evaluating amino acid conservation
Large multiple sequence alignments with a root-to-leaf regressive method
The EMBL-EBI search and sequence analysis tools APIs in 2019
Mutational effects on protein-protein interactions
Coronavirus envelope protein: current knowledge
In-silico approaches to detect inhibitors of the human severe acute respiratory syndrome coronavirus envelope protein ion channel
The infectious bronchitis coronavirus envelope protein alters Golgi pH to protect the spike protein and promote the release of infectious virus
Enhanced binding of SARS-CoV-2 envelope protein to tight junction
SARS-CoV2 envelope protein: non-synonymous mutations and its consequences