key: cord-273882-tqdcb3oo
authors: Pratibha,; Shaju, C.; Kamal,
title: Ubiquitous Forbidden Order in R-group classified protein sequence of SARS-CoV-2 and other viruses
date: 2020-08-21
journal: bioRxiv
DOI: 10.1101/2020.08.21.261289
sha:
doc_id: 273882
cord_uid: tqdcb3oo

Each amino acid in a polypeptide chain has a distinctive R-group associated with it. We report here a novel method of species characterization based upon the order of these R-group classified amino acids in the linear sequence of the side chains associated with the codon triplets. In an otherwise pseudo-random sequence, we search for forbidden combinations of kth order. The results indicate what nature has decided not to do rather than what to do. We applied this method to analyze the available protein sequences of various viruses including SARS-CoV-2. We found that these ubiquitous forbidden orders (UFO) are unique to each of the viruses we analyzed. This unique structure of the viruses may provide an insight into the viruses' chemical behavior and the folding patterns of their proteins. This finding may have a broad significance for the analysis of coding sequences of species in general.

Coding sequences of species were downloaded from the National Centre for

The codon triplets of four bases (A, G, C, and T) in the sequence can have 64 possible combinations, which encode 20 amino acids.
Each of these amino acids is associated with a side chain that controls the folding patterns of the protein and its chemical behavior. Based on the chemical properties of the side chain, each codon triplet, through the amino acid it encodes, can be classified as non-polar (N), uncharged polar (P), acidic (A), or basic (B). A MATLAB code reads the sequence and replaces each amino acid with the symbol of its R-group side chain. A new sequence of the four symbols N, B, A, and P is thus created from the protein sequence of the species. It looks something like this: NBNNPAAABBNPNNNPABBABABAA……………, where each letter represents an amino acid with one of the chemical properties listed above. This R-group sequence is used to obtain CGR plots of any given protein sequence of a species (Figure 1). Theoretically, all combinations of N, B, A, and P are possible in this sequence. In this study, we look at the sequence from a different perspective: instead of studying what is present in the protein sequence, we analyze what is absent from it.

CGR of a driven IFS: A protein sequence X(k) can be considered as a string composed of N, P, A, and B. We consider a unit square U and name its corners C_i (i = 1, 2, 3, 4) as N, B, A, and P respectively, which correspond to the possible values of X(k). The initial point P(0) is the midpoint of the square. The second point P(1) is the midpoint between P(0) and C_{X(1)}. In general, P(k) is plotted as the midpoint between P(k-1) and C_{X(k)} [6]. After plotting the sequence X in the unit square U, the unit square is divided into 2^k x 2^k sub-squares; each sub-square represents a unique sub-sequence of length k. An example of the movement of points in CGR is shown with the first eight members of the data sequence (PNAABNPA….) in Figure 1a.
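The classification step and the midpoint rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' MATLAB program. Two things in it are assumptions: the assignment of the 20 amino acids to N, P, A, and B (the text does not list it, so one common textbook grouping is used), and the corner coordinates (the text fixes only the corner order N, B, A, P). The function names `classify` and `cgr_points` are hypothetical.

```python
# Assumed grouping of one-letter amino acid codes into R-group classes.
# The source does not list its assignment; this is one common textbook scheme.
R_GROUP = {}
for letters, symbol in (("GAVLIMFWP", "N"), ("STCYNQ", "P"), ("DE", "A"), ("KRH", "B")):
    for aa in letters:
        R_GROUP[aa] = symbol

# Assumed corner coordinates of the unit square (only the order N, B, A, P
# comes from the text).
CORNERS = {"N": (0.0, 0.0), "B": (0.0, 1.0), "A": (1.0, 1.0), "P": (1.0, 0.0)}


def classify(protein):
    """Map a one-letter amino acid string to N/P/A/B symbols, skipping unknown letters."""
    return "".join(R_GROUP[aa] for aa in protein.upper() if aa in R_GROUP)


def cgr_points(symbols):
    """Return P(1)..P(n), where P(k) is the midpoint of P(k-1) and the corner of symbol k."""
    x, y = 0.5, 0.5  # P(0): the midpoint of the unit square
    points = []
    for s in symbols:
        cx, cy = CORNERS[s]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points
```

For instance, `cgr_points("PNAABNPA")` traces the eight-symbol example sequence mentioned in the text, under the assumed corner layout.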
An example of addresses of the sub-squares for different orders (k = 1, 2, 3, and 4) is given in Figure 1b.

PC-plots: To make these plots, the percentage of points plotted in each sub-square is calculated. This percentage value represents the intensity of points in that sub-square. After plotting points by CGR and dividing the unit square into 2^k x 2^k sub-squares, each sub-square is color-filled based on its calculated intensity value. Figure 1c shows the percentage-CGR plot made for the Human Adenovirus for k = 6. The existing literature presents studies of these plots in phylogenetic analyses of species [6, 7, 9, 10, 12]. We take a step further but in the opposite direction and look for those combinations of N, B, A, and P that are ubiquitously forbidden by nature for a given length (k).

Figure 1d, e, f. Ubiquitous Forbidden Order (UFO) plots of Human Adenovirus for combinations of order (k) 4, 5, and 6. The vertices in the squares are the same as depicted in Figure 1a. The red color indicates that the corresponding address is forbidden. For example, in Figure 1d, the address ABAB is forbidden.

The forbidden order in the plots can only be visualized for lower orders. Figures 1e and 1f become increasingly chaotic as the value of k increases and are difficult to analyze. Next, we analyzed protein sequences of 26 viruses (Figures 2, 3, and Supplementary Figures 1-4) to search for a ubiquitous forbidden order in each one of them. Our purpose was to find some clues for uniquely analyzing SARS-CoV-2 to handle the COVID-19 pandemic. Figure 2 shows the 4th level UFO plots of seven coronaviruses infecting humans. From the UFO plots, the viruses seem to be getting optimized with time. The earlier coronaviruses 229E, OC43, NL63, and HKU1 (Figure 2d, e, f, and g), which are also the mild coronaviruses, have a lot of forbidden addresses in the amino acid polypeptide chain.
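The intensity calculation and the search for forbidden combinations of order k can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' procedure: instead of binning plotted CGR points, it counts length-k windows of the N/B/A/P string directly, relying on the statement that each sub-square represents a unique sub-sequence of length k. Words are written in sequence order, and their mapping onto plot addresses is left aside. The function names are hypothetical.

```python
from collections import Counter
from itertools import product

ALPHABET = "NBAP"


def kmer_percentages(symbols, k):
    """Percentage of length-k windows that equal each of the 4**k possible words."""
    total = len(symbols) - k + 1
    if total < 1:
        raise ValueError("sequence is shorter than k")
    counts = Counter(symbols[i:i + k] for i in range(total))
    words = ("".join(w) for w in product(ALPHABET, repeat=k))
    return {w: 100.0 * counts[w] / total for w in words}


def forbidden_words(symbols, k):
    """Words of length k that never occur in the sequence (zero intensity)."""
    return {w for w, pct in kmer_percentages(symbols, k).items() if pct == 0.0}
```

With a short toy string such as `"NBAPNBAP"` and `k = 2`, only four of the sixteen possible words occur, so `forbidden_words` returns the other twelve. On a toy string this reflects length rather than any real constraint, which is why the analysis is meaningful only for long sequences and modest k.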
With evolution, the structure seems to be getting simpler for MERS, SARS-CoV-1, and SARS-CoV-2 (Figure 2a, b, and c). It appears that nature prefers to be simple and less complex to be able to survive and evolve. Although the UFO plot is unique for each virus, the viruses seem to optimize their evolution in time. A close examination of the UFO plots of SARS-CoV-2 (Figure 2a) and SARS-CoV-1 (Figure 2b) reveals that the evolution of SARS-CoV-2 from SARS-CoV-1 is quite straightforward. One just has to prohibit a particular order, BAPB, in the SARS-CoV-1 protein sequence to get a SARS-CoV-2 strain. This opens up a point of discussion whether this formation is possible in a laboratory setting or not. As non-biologists, we cannot comment on their origin, though our results provide an alternative approach for further exploration by the subject specialists.

Among the 26 viruses studied, we noted that the forbidden order BPAB is unique to SARS-CoV-2, Rubella, and Avian IB (Figure 3). A forbidden order BPAB in the UFO plot means that the 4th order combination BAPB is prohibited by nature, i.e., a basic side chain (B), followed by an acidic side chain (A), followed by an uncharged polar side chain (P), cannot be followed by a basic side chain (B). This rule is found to be followed only by the SARS-CoV-2, Rubella, and Avian Infectious Bronchitis (AIB) sequences among all viruses studied by us. The whole forbidden structure of SARS-CoV-2 is an inherent part of the Rubella plot and partly of the AIB plot. This feature in the SARS-CoV-2 protein sequence may be a pointer to support the idea of using the MMR vaccine in COVID-19, as floated by Fidel and Noverr [15], and the use of recombinant ACE2 by Kruse [16] as a preventive measure to reduce the inflammation in COVID-19 patients. The use of drugs or vaccines for existing viruses in managing untreatable viral diseases has been suggested by others also [17, 18], including for COVID-19 [19].
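The comparisons made above (one virus differing from another by a single additional forbidden order, or one forbidden structure being contained in another) amount to set operations on the forbidden words of two sequences. A minimal sketch follows, assuming the R-group strings are already available. The function names and the toy strings in the usage note are hypothetical and are not viral data.

```python
from itertools import product


def forbidden_words(symbols, k, alphabet="NBAP"):
    """Words of length k over the alphabet that never occur in the sequence."""
    present = {symbols[i:i + k] for i in range(len(symbols) - k + 1)}
    return {"".join(w) for w in product(alphabet, repeat=k)} - present


def compare_forbidden(first, second, k):
    """Set relations between the order-k forbidden words of two sequences."""
    f1, f2 = forbidden_words(first, k), forbidden_words(second, k)
    return {
        "only_in_first": f1 - f2,         # forbidden in first, allowed in second
        "only_in_second": f2 - f1,        # forbidden in second, allowed in first
        "first_subset_of_second": f1 <= f2,
    }
```

For the toy strings `"NBAPN"` and `"NBAP"` with `k = 2`, the second string forbids exactly one extra word, `PN`, and everything forbidden in the first is also forbidden in the second. This is the same shape of relation the text describes at order 4, on a much smaller scale.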
Next, we examined the studies on the evolutionary origin of SARS-CoV-2 and its comparison to the bat coronavirus RaTG13 isolate [20, 21]. We found that the R-group classified sequences of N, B, A, and P in these two samples are identical up to level 4 of the amino acid ordering in the protein structures (Figures 4a, d, g). However, at the next level, i.e. k = 5, the differences between the two protein sequences start to emerge (Figures 4b, e, h) and become quite clear at the 6th level of ordering (Figures 4c, f, i). This supports the findings of Wrobel et al. [21].

Figure 4. [...] indicate the addresses that are forbidden in the second plot but are allowed in the first one. Note the increase in complexity with the increase in order. There is no evident difference between the SARS-CoV-2 and its closest relative [20] RaTG13 structures at the 4th order (Figure 4g), but the higher-order comparison (Figures 4h, i) reveals key differences between these two viruses.

Recently, Flies et al. [22] emphasized shifting the focus of immunological research to new models and interdisciplinary studies. We have looked at the amino acid sequences, as non-biologists, from a different angle based on the chemical properties of the side chain of an amino acid in a coding sequence of a species. Among the viruses studied by us, the forbidden order BAPB is found only in SARS-CoV-2, Rubella, and Avian Infectious Bronchitis: a basic side chain, followed by an acidic side chain, followed by an uncharged polar side chain, cannot be followed by a basic side chain. This study of the forbidden order of R-group side chains in a protein sequence opens up new directions for microbiologists to study coding sequences. The consequences of the forbidden order for the properties of a protein are yet to be ascertained.
[1] DNA Sequence Alignment by Microhomology Sampling during Homologous Recombination
[2] Gapped BLAST and PSI-BLAST: a new generation of protein database search programs
[3] A novel method of characterizing genetic sequences: genome space with biological distance and applications
[4] Numerical encoding of DNA sequences by chaos game representation with application in similarity comparison
[5] The number of k-mer matches between two DNA sequences as a function of k and applications to estimate phylogenetic distances
[6] Chaos game representation of gene structure
[7] Genomic signature: characterization and classification of species assessed by chaos game representation of sequences
[8] Analysis of genomic sequences by Chaos Game Representation
[9] Similarity analysis for DNA sequences based on chaos game representation case study: The albumin
[10] Alignment-free genomic sequence comparison using FCGR and signal processing
[11] A categorization of COVID-19 treatment strategies: A modified chaos game representation (CGR) analysis of genome sequences for thirty-two pathogens. COVID-19 virtual conference
[12] Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study
[13] A novel numerical representation for proteins: Three-dimensional Chaos Game Representation and its Extended Natural Vector
[14] Chaos game representation of protein sequences based on the detailed HP model and their multifractal and correlation analyses
[15] Could an unrelated live attenuated vaccine serve as a preventive measure to dampen septic inflammation associated with COVID-19 infection?
[16] Therapeutic strategies in an outbreak scenario to treat the novel coronavirus originating in Wuhan, China. F1000Res. 9:72
[17] Developing the concept of beneficial nonspecific effect of live vaccines with epidemiological studies
[18] Non-specific effects of BCG vaccine on viral infections
[19] A SARS-CoV-2 protein interaction map reveals targets for drug repurposing
[20] Evolutionary origins of the SARS-CoV-2 sarbecovirus lineage responsible for the COVID-19 pandemic
[21] SARS-CoV-2 and bat RaTG13 spike glycoprotein structures inform on virus evolution and furin-cleavage effects
[22] Rewilding immunology