key: cord-0682672-4yyk43xz authors: Rout, Ranjeet Kumar; Hassan, Sk Sarif; Sheikh, Sabha; Umer, Saiyed; Sahoo, Kshira Sagar; Gandomi, Amir H. title: Feature-extraction and analysis based on spatial distribution of amino acids for SARS-CoV-2 Protein sequences date: 2021-11-10 journal: Comput Biol Med DOI: 10.1016/j.compbiomed.2021.105024 sha: daa58650dc32b393ddc131dd22298b7485fc80dc doc_id: 682672 cord_uid: 4yyk43xz BACKGROUND AND OBJECTIVE: The world is currently facing a global emergency due to COVID-19, which requires immediate strategies to strengthen healthcare facilities and prevent further deaths. To achieve effective remedies and solutions, research on different aspects, including the genomic and proteomic level characterizations of SARS-CoV-2, are critical. In this work, the spatial representation/composition and distribution frequency of 20 amino acids across the primary protein sequences of SARS-CoV-2 were examined according to different parameters. METHOD: To identify the spatial distribution of amino acids over the primary protein sequences of SARS-CoV-2, the Hurst exponent and Shannon entropy were applied as parameters to fetch the autocorrelation and amount of information over the spatial representations. The frequency distribution of each amino acid over the protein sequences was also evaluated. In the case of a one-dimensional sequence, the Hurst exponent (HE) was utilized due to its linear relationship with the fractal dimension (D), i.e. [Formula: see text] , to characterize fractality. Moreover, binary Shannon entropy was considered to measure the uncertainty in a binary sequence then further applied to calculate amino acid conservation in the primary protein sequences. RESULTS AND CONCLUSION: Fourteen (14) SARS-CoV protein sequences were evaluated and compared with 105 SARS-CoV-2 proteins. 
The simulation results demonstrate the differences in the collected information about the amino acid spatial distribution in the SARS-CoV-2 and SARS-CoV proteins, enabling researchers to distinguish between the two types of CoV. The spatial arrangement of amino acids also reveals similarities and dissimilarities among the important structural proteins, E, M, N and S, which is pivotal to establish an evolutionary tree with other CoV strains. The novel coronavirus (COVID-19) has rapidly become a major global emergency that has and continues to affect all lives around the globe. [1] [2] [3] Presently, this disease, a pandemic as announced by the WHO, is a major health concern.[4] [5] Currently, the largest genome (of size approximately 30 kb) for RNA viruses is known as severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). [6] [7] . Coronaviruses (CoVs) are classified into three different classes, including -CoV, -CoV, and -CoV, based on genetic and antigenic criteria. [8] [9] . The SARS-CoV-2 is classified as -CoV [10] and has received widespread research attention across the world [11] [12] [13] . Every day, new genome sequences, as well as primary protein sequences of SARS-CoV-2, are being added to databases, such as the NCBI virus database [14] [15] As of this writing, no antiviral drugs with proven efficacy nor vaccines for CoV2 prevention have been reported [16] [17] , while researchers have yet to attain a complete understanding of the molecular biology of SARS-CoV-2 infection [18] [19] As a result, COVID-19 cases increase and have reached a global pandemic level, thus urgently requiring in-depth knowledge, infection mechanism, and other aspects of the virus-like forecasting its progression [18] [20] . 
Although various protein-protein interactions (PPIs) of the virus and host are known, its viral infection mechanism is not fully understood [21] [22] Therefore, identifying interactions between the SARS-CoV-2 virus proteins and host proteins will largely help to understand this mechanism and further develop treatments and vaccines [23] . As a first step, it is critical to gain clarity of SARS-CoV-2 proteins and PPIs between the virus and host proteins [24] . It is known that the protein fold depends on the number, spatial arrangement, and topological connectivity of secondary structure elements (SSEs) [25] , yet the spatial arrangement of secondary structure elements (SSEs) is not well-understood [26] . Because the geometric threedimensional structure of a protein depends on the spatial arrangement of the SSEs [27] [28] , both the spatial distribution and presence/absence of different amino acids over a primary protein sequence of SARS-CoV-2 are significant. It is also pertinent to mention that the spatial arrangement uncovers the rules that govern the folding of polypeptide chains, and the primary sequence of a protein reveals the molecular events in evolution [29] [30] . Specifically, the alternation and spatial arrangement of amino acids over the primary sequence appear to affect the function and conformability of the protein, respectively [31] [32] [33] . In the present study, the spatial composition of 20 amino acids across the primary proteins of SARS-CoV-2 was examined according to the Hurst exponent and Shannon entropy. A frequency analysis of the amino acids was also conducted and further compared to a similar analysis for 89 genomes of SARS-CoV-2 [34] . The usability of Shanon entropy and Hurst exponent for analysis of protein sequences is reported in [29] which is to find out correlation among all these sequences. As of 24 March 2020, there are 944 known primary protein sequences of SARS-CoV-2 in the NCBI Virus Database ( ) [35] . 
Of these sequences, only 105 are distinct, although the sequence data were collected from a wide range of geographic locations around the world. The complete list of the 105 distinct sequences, denoted N1, N2, …, N105, with their corresponding accessions is provided at the end of the article in Appendix C. These 105 distinct protein sequences were considered in this study.

Like SARS-CoV and MERS-CoV, the SARS-CoV-2 genome comprises 12 open reading frames (ORFs). Genes encoding the structural proteins spike (S), membrane (M), envelope (E), and nucleocapsid (N) are present in the remaining one-third of the genome, spanning from the 5′ to the 3′ terminal, along with several genes encoding non-structural proteins (NSPs) and accessory proteins scattered in between, as shown in Figure 1 [36].

Herein, the 20 amino acids are denoted $A_1, A_2, \dots, A_{20}$, corresponding to A, C, F, G, H, I, L, M, N, P, Q, S, T, V, W, Y, D, E, K, and R, respectively. Each primary protein sequence was decomposed into 20 different binary sequences of 0s and 1s according to the following rule: given a primary protein sequence of SARS-CoV-2, for every amino acid $A_i$, where $i = 1$ to $20$, put 1 wherever $A_i$ is present and put 0 elsewhere. Consequently, for every primary protein sequence $N_j$, $j = 1, \dots, 105$, there are 20 binary sequences corresponding to the 20 amino acids $A_i$, $i = 1, \dots, 20$.

The lengths of the 105 primary protein sequences vary widely, from 13 to 7097. One complete SARS-CoV-2 protein sequence, N99, has the smallest length of 13, and one protein sequence, N26, has the largest length of 7097. There are 6, 3, 8,

Translation of this ssRNA results in the formation of two polyproteins, pp1a and pp1ab, which are further cleaved to generate numerous non-structural proteins (NSPs). The remaining ORFs encode various structural and accessory proteins that help in assembling the viral particle and evading the immune response.
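The decomposition rule described above (one 0/1 indicator sequence per amino acid) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `binary_sequences` and `absent_amino_acids` are hypothetical, the toy sequence is made up and is not one of the 105 accessions, and the study itself reports a MATLAB implementation.

```python
# Order of the 20 amino acids A1..A20 as listed in the text.
AMINO_ACIDS = "ACFGHILMNPQSTVWYDEKR"


def binary_sequences(protein):
    """Return 20 indicator sequences for one primary sequence (hypothetical helper).

    For each amino acid a, position j holds 1 if protein[j] == a, else 0.
    """
    return {a: [1 if residue == a else 0 for residue in protein]
            for a in AMINO_ACIDS}


def absent_amino_acids(protein):
    """List the amino acids whose indicator sequence is all zeros."""
    return [a for a, bits in binary_sequences(protein).items() if not any(bits)]


toy = "MKFLVLLFNI"  # made-up 10-residue example, not a real accession
seqs = binary_sequences(toy)
print(seqs["L"])                # indicator sequence for leucine
print(absent_amino_acids(toy))  # amino acids that never occur in toy
```

Every position of the protein contributes exactly one 1 across the 20 indicator sequences, so the sequences jointly carry the same information as the original string.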
To characterize the amino acid spatial distribution over the primary protein sequences of SARS-CoV-2, the Hurst exponent and Shannon entropy were applied as parameters, and an amino acid density/frequency analysis was performed.

Unsupervised machine learning has mostly been used for the analysis of gene and genome sequences, and also for intra-protein analysis. Markov Clustering and Affinity Propagation procedures were compared directly with the methods described in [41] [42] and with the K-means clustering techniques in [43]. The K-means algorithm is reported to be better suited to inter- and intra-class analysis of protein sequences [44]. A recent application of minimum-variance cluster analysis, a hierarchical agglomerative clustering technique, performed well, as discussed in [45], and identified groups of molecular systems that enhance insight into peptide dynamics. The K-means clustering algorithm partitions the data into homogeneous subclasses, so that the data points in each cluster are as similar as possible according to a widely used distance measure, the Euclidean distance. Owing to its performance and applicability, K-means is one of the most commonly used simple clustering techniques [42] [46]. In this paper, the K-means clustering algorithm was used to generate 10 clusters for the respective amino acids from the 105 SARS-CoV-2 datasets. The spatial feature extraction was implemented in MATLAB 2016a on Microsoft 2010 OS. The statistical analysis of these spatial features, presented in the upcoming sections, was carried out with STATISTICA 10.0. The following section briefly describes these methods with reference to similar works [47] [48] [49].

The HE lies in the interval $(0, 1)$, where HE is strictly less than $0.5$ for rough anti-correlated sequences and lies in the range $0.5$ to $1$ for positively correlated sequences. If HE $= 0.5$, the sequence is random, resembling white noise [50] [51] [52]. The HE of a binary sequence $\{x_i\}_{i=1}^{n}$ is defined as given in Eq. 1, where $n$ is the length of the sequence:

$$\left(\frac{n}{2}\right)^{HE} = \frac{R(n)}{S(n)} \qquad (1)$$

where $S(n) = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - m)^2}$, $R(n) = \max_{1 \le i \le n} Y(i, n) - \min_{1 \le i \le n} Y(i, n)$, $Y(i, n) = \sum_{j=1}^{i}(x_j - m)$, and $m = \frac{1}{n}\sum_{i=1}^{n} x_i$.

The autocorrelation of the binary representations of each amino acid over the SARS-CoV-2 protein sequences was obtained by measuring the Hurst exponent.

Two kinds of Shannon entropy were considered in the present study.

• Binary Shannon entropy: The entropy of a Bernoulli process with the probabilities of its two outcomes (0/1) is defined in Eq. 2:

$$SE = -\sum_{i=1}^{2} p_i \log_2(p_i) \qquad (2)$$

where the frequency probabilities of 1's and 0's are $p_1 = \frac{k}{l}$ and $p_2 = \frac{l - k}{l}$, respectively; $l$ is the length of the binary sequence; and $k$ is the number of 1's in the binary sequence of length $l$ [53]. The binary Shannon entropy is a measure of the uncertainty in a binary sequence. When the probability $p = 0$, the event is certain never to occur, so there is no uncertainty and the entropy is $0$. When $p = 1$, the result is certain; thus the entropy must again be $0$. When $p = 0.5$, the uncertainty is at a maximum and, consequently, the SE is $1$.

• Amino acid conservation Shannon entropy: Protein Post Translational Modification (PTM) is an important biological mechanism for expanding the genetic code [54] [55]. To find the conservation of amino acids in primary protein sequences, Shannon entropy is deployed. For a given protein sequence, the SE is calculated as follows:

$$SE = -\sum_{i=1}^{20} p_{A_i} \log_{20}(p_{A_i}) \qquad (3)$$

where $p_{A_i}$ represents the occurrence frequency of amino acid $A_i$ in the sequence.

Over the primary protein sequences of SARS-CoV-2, we aimed to explore the amino acid frequency distributions and the corresponding statistical descriptions [11] [56]. The density of the amino acids over a primary protein sequence can also be found using the following formula:

$$\rho_{A_i}(N_j) = \frac{f_{A_i}(N_j)}{l}$$

where $A_i$ is an amino acid present in the primary protein sequence $N_j$; $l$ is the length of sequence $N_j$; and $f_{A_i}(N_j)$ is the frequency of amino acid $A_i$ in sequence $N_j$. This amino acid density clarifies the richness of essential amino acids in contrast to the others.
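The measures defined in this section can be sketched in Python as follows. This is a hedged illustration, not the authors' MATLAB code: the function names are hypothetical, the Hurst exponent is taken as the single-window rescaled-range form $(n/2)^{HE} = R(n)/S(n)$ (an assumption), the conservation entropy uses a base-20 logarithm so that its maximum is 1 (an assumption), and the density is expressed as a fraction of the sequence length (an assumption).

```python
import math

ALPHABET = "ACFGHILMNPQSTVWYDEKR"


def binary_entropy(bits):
    """Binary Shannon entropy (base 2) of a 0/1 sequence."""
    l, k = len(bits), sum(bits)
    if k == 0 or k == l:
        return 0.0  # amino acid absent (or present everywhere): no uncertainty
    p = k / l
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def conservation_entropy(protein):
    """Shannon entropy of residue frequencies, base 20 (assumed normalization)."""
    total = 0.0
    for a in ALPHABET:
        p = protein.count(a) / len(protein)
        if p > 0:
            total -= p * math.log(p, 20)
    return total


def density(protein, amino_acid):
    """Fraction of positions occupied by one amino acid (assumed scaling)."""
    return protein.count(amino_acid) / len(protein)


def hurst_rs(bits):
    """Single-window rescaled-range estimate, assuming (n/2)**HE = R(n)/S(n)."""
    n = len(bits)
    if n < 3:
        return None  # log(n/2) would be zero or negative
    m = sum(bits) / n
    s = math.sqrt(sum((x - m) ** 2 for x in bits) / n)
    if s == 0:
        return None  # constant sequence (e.g. amino acid absent): HE undefined
    cumulative, running = [], 0.0
    for x in bits:
        running += x - m          # Y(i, n): cumulative deviation from the mean
        cumulative.append(running)
    r = max(cumulative) - min(cumulative)
    return math.log(r / s) / math.log(n / 2)


alternating = [0, 1] * 50        # strictly anti-correlated toy sequence
clustered = [1] * 50 + [0] * 50  # strongly persistent toy sequence
print(hurst_rs(alternating), hurst_rs(clustered))
```

Under this estimator the alternating toy sequence sits at the anti-correlated end of the range and the clustered one at the persistent end, which mirrors the interpretation of HE below and above 0.5 given in the text. Returning `None` for an all-zero sequence matches the later convention of leaving the HE blank when an amino acid is absent.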
Herein, the positive/negative trend of the spatial distribution of the 20 amino acids over the SARS-CoV-2 protein sequences based on the Hurst exponent and Shannon entropy is reported. As mentioned earlier, the Hurst exponent implies the fractality (organized non-linearity) of the spatial representations. Also, the amount of uncertainty in the presence/absence of amino acids over the protein sequences was determined through Shannon entropy measurements, which provide conservation information about the amino acids. Based on the frequency distributions of all amino acids over the SARS-CoV-2 protein sequences, 14 SARS-CoV protein sequences were subsequently compared with 105 SARS-CoV-2 proteins. For is not present in the protein sequences N3, N80, N97, N98 and N99 of SARS-CoV-2. The spatial organization of amino acid H is random (neither trending nor negatively autocorrelated) in the protein sequences N5, N15, N88, N89, N90, N91, N92, N93, N94, and N95, which belong to cluster 2 as shown in Table 6 (Appendix A). Cluster 2 contains ten sequences (N68, N88, N89, N90, N91, N92, N93, N94, N95, and N99) with no HE (*), which indicates that the corresponding binary sequences are completely free from amino acid $A_2$ (C). Protein sequences N68 and N81 lack amino acid $A_4$ (G) (conditionally essential), as can be seen in Table 5 (Appendix A), while N99 is the only sequence that does not have essential amino acid $A_6$ (I). The spatial distribution of amino acid $A_6$ (I) over the protein sequence N102 is close to random since the HE is 0.509, whereas the other 104 sequences are trending with HEs greater than 0.5. The spatial arrangements of amino acid $A_7$ (L) over these proteins are neither random nor trending as the HE is greater than 0.5 but less than 0.6. The HE of the binary representation of the amino acids forming eight clusters ranges from to with a standard deviation between 0.04 and 0.111.
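The clusters of HE values discussed above come from K-means with Euclidean distance. A minimal one-dimensional sketch of that procedure (Lloyd's algorithm) is given below. It is an assumption-laden illustration: `kmeans_1d` is a hypothetical helper, the deterministic initialization is a convenience chosen for reproducibility and not the study's setting, and the HE values are invented for the example.

```python
def kmeans_1d(values, k, iters=100):
    """Lloyd's algorithm on scalars; distance is |v - centroid| (Euclidean in 1-D)."""
    ordered = sorted(values)
    n = len(ordered)
    # Deterministic start: spread the initial centroids over the sorted values.
    centroids = [ordered[(2 * c + 1) * n // (2 * k)] for c in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        # Assignment step: each value joins its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c]))
                  for v in values]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            updated.append(sum(members) / len(members) if members else centroids[c])
        if updated == centroids:
            break  # assignments are stable
        centroids = updated
    return labels, centroids


# Invented HE values for eight sequences of one amino acid (illustration only).
he_values = [0.50, 0.52, 0.51, 0.70, 0.72, 0.71, 0.90, 0.91]
labels, centroids = kmeans_1d(he_values, k=3)
print(labels)
print(centroids)
```

Per-cluster ranges and standard deviations of the kind quoted in the text can then be computed from the members of each label.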
The binary representation of the spatial organization of nonessential amino acid A 12 The protein sequences of different lengths, ranging from 13 to 419, are provided below. Table 4 lists the amino acid(s) that are not present in the sequences. The protein sequence N99 of length 13 does not contain some essential, conditionally essential, and nonessential amino acids, including C, H, M, P, T, W, Y, E, K and R. The largest sequences N88, N89, N90, N91, N92, N93, N94, N95 of length 419 do not contain amino acid C. It is noted that amino acid M is present over all the protein sequences, except N99, which has the smallest length of 13. Also, it is has been observed that the essential amino acids L, M, F and V as well as non-essential amino acids A, D, N and S are present in all the protein sequences of SARS-CoV-2. In addition, the six conditionally essential amino acids were not found to be essential for all the proteins of SARS-CoV-2. Proteins that have a length greater than 419 contain all 20 amino acids. It is reported that the presence of amino acid I, G and V is of primordial importance, in this study it has also been found that N99 does not contain I and amino acid G is not present in N68, N81 sequences. It is also noted that amino acid H is randomly spatially distributed over protein sequences N5, N15, N88, N89, N90, N91, N92, N93, N94 and N95, as observed in the previous subsections. The essential hydroxyl amino acid M is randomly arranged over proteins N80 and N102. Also, amino acid L is distributed over the protein sequence N102 randomly, while only amino acid K is randomly spread over N104. In sequences N98 and N102, amino acid R is distributed with a negative trend ( ). Also, the amino acids K, Y, S, Q, N, and F are negatively trending over the protein sequences N103, N80, N7, N100, N2, and N5, respectively. Therefore, amino acids C, G, P, T, W, and E are distributed over all 105 proteins with positive autocorrelation (positively trending). 
Here, we explore the correlation (of trending behaviors) of the amino acid distributions over the 105 proteins of SARS-CoV-2. The correlation matrix of ten amino acids, A, C, F, G, H, I, L, M, N and P, versus the other ten amino acids, Q, S, T, V, W, Y, D, E, K and R, is presented below. Based on the HEs, the spatial distribution of amino acid A is positively correlated with the spatial distributions of amino acids Q, T, V, W, and Y, as shown in Table 5. The correlations based on the HEs of the spatial distributions are also demonstrated in the graphs in Fig. 4. It is worth mentioning that the correlation matrix (presented in Table 5) also displays negative correlations of the spatial distributions. An example of the correlation (correlation coefficient r: 0.443) between the spatial distribution (autocorrelation) of amino acid M and the spatial distribution of amino acid Y is given below in Fig. 3.

[Figure panels (c) to (t): two panels per amino acid.]

The following subsection discusses the amount of uncertainty/certainty of the presence of amino acids over the protein sequences.

For amino acids , the Shannon entropy (SE) was determined for the 105 binary sequences. The SE of the binary representation of the amino acids forming five clusters ranges from to with a standard deviation between 0.0448 and 0.0919.
The SE of the spatial distribution of amino acid in protein sequence N68 was determined to be 0.121, which is the lowest amount of uncertainly compared to the SE of other amino acids. In clusters 4 and 1, almost all the protein sequences had an SE less than 0.5, indicating the definite presence and absence of a particular amino acid over the protein sequences. The amount of uncertainly is high for protein sequences N3 and N99 with lengths of 198 and 13, respectively. Amino acids and are absent from protein sequence N99, with an SE less than 0.5, as shown in Tables 35 and 36, respectively. The amino acid (V) is present over all 105 proteins, and hence, none of the binary representations has SE = 0. For the amino acid V, the SE of N74 and N77 is 0.391, which implies the presence of this amino acid over the proteins has good certainty, and N96 and N97 have the maximum uncertainty of SE = 0.665. Cluster 1 contains five protein sequences, in which amino acid is absent, and hence, SE = 0. Also, SE = 0 for the binary spatial representations of N99 and N103 for amino acid , N80 and N99 (belonging to cluster 2) for amino acid , N80, N81 and N99 for amino acid , and N81 and N99 amino acid due to the absence of these amino acids. It is pertinent to note that amino acids and are present over all 105 proteins with certainty ( . Most of the proteins in the largest cluster 2 including other clusters contain amino acid that is spatially distributed with certainty. The SE of the binary representation of the amino acids forming six clusters ranges from to with a standard deviation between 0.0749-0.852. Amino acid is absent from the primary protein sequences N68 and N81, and consequently, SE = 0 implies no uncertainty. Similarly, SE = 0 for the binary spatial representations of protein sequence N99 for amino acid , sequences N81, N99 and N103 for amino acid (P), and sequences N96 and N97 for amino acid (Q). 
Amino acid is spread spatially with certainty over the proteins N2 (length of 138) and N89, N90, N91, N92, N93, N94 and N95 (lengths of 419) in cluster 3. Clusters 1 and 5 for amino acid and cluster 1 for amino acids and contain the majority of the protein sequences, where the presence of these amino acids is spread over the proteins with almost certainty. Comparatively, clusters 2 and 6 contain five protein sequences, where the absence of the amino acid is spread with almost certainty. Cluster 3 contains one protein sequence N80 where the spatial distribution has SE = 0.562, which indicates that the absence of amino acid over the protein is without uncertainty. The SE of the binary representation of the amino acids forming seven clusters each ranges from to with a standard deviation between 0.0667 to 0.0765. It was found that SE = 0 for the spatial distribution of amino acid in the protein sequences N68, N88, N89, N90, N91, N92, N93, N94, N95 and N99, which indicates the amount of uncertainty is zero. In other words, the absolute absence of amino acid over these proteins and the spatial presence of amino acid C over the protein sequences of other clusters have low uncertainty (high certainty). The SE is greater than 0.5 for the binary representations of amino acid over the proteins N81 and N99, and consequently, the amount of uncertainty is lowering. In other clusters containing the other protein sequences, the spatial presence of amino acid over the protein sequences has low uncertainty (high certainty). The SE of the binary representation of the amino acids forming eight clusters ranges from to with a standard deviation between 0.0459 to 0.0749. Because amino acid is absent from proteins N3, N80, N97, N98 N99 and amino acid is absent from N99 (smallest length of 13), SE = 0 for the amino acids, implying there is no uncertainty. 
In addition, SE = 0.078 for the spatial representation of the presence and absence of amino acid over the proteins N88, N89, N90, N91, N92, N94 and N95 (lengths of 419) belonging to cluster 4); hence, the spatial distribution is more certain/orderly. All the clusters except cluster 6 contain only protein sequences over which amino acid is spatially distributed with certainty, whereas cluster 6 contains two sequences N81 (length of 43) and N68 (length of 61), where the absence of the amino acid dominates the presence with certainty. It is pertinent to mention that SE = 0 for the binary representations of amino acid that is absent from protein sequence , which has been demonstrated in this study. It was also observed that maximum SE was obtained for the spatial distribution of amino acids over lengthy sequences, such as N99, N80, etc. Interestingly, for some given amino acid , the same SE was obtained for some spatial distributions of some protein sequences , irrespective of their lengths, for many values of . This essentially suggest that the probability of the presence of amino acid over these protein sequences is the same. Further, we explored the correlation in the amount of uncertainty between the spatial distributions of the 20 amino acids over the proteins of SARS-CoV-2. Table 6 presents the correlation matrix of ten amino acids (A, C, F, G, H, I, L, M, N and P) versus another ten amino acids (Q, S, T, V, W, Y, D, E, K and R). Table 6 . Correlation matrix of SEs of present amino acids over the protein sequences. Based on the SEs, the spatial distribution of amino acid A was found to be positively correlated with the distributions of amino acids Q, S, D, K and R, as shown in Table 6 . Likewise, the spatial distribution of amino acid C is positively correlated with amino acids T, V, Y and E. 
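The correlation matrices described above pair each of ten amino acids with each of the other ten, using one value (HE or SE) per protein sequence. A minimal sketch is shown below, assuming the reported coefficient r is the Pearson coefficient (the text does not name the estimator). The helper names are hypothetical and the SE values are invented for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when one series is constant
    return cov / math.sqrt(vx * vy)


def correlation_matrix(table, rows, cols):
    """Rows-versus-columns matrix, mirroring the ten-by-ten layout of the tables."""
    return {r: {c: pearson(table[r], table[c]) for c in cols} for r in rows}


# Invented SE values of four amino acids over five sequences (illustration only).
se_table = {
    "A": [0.40, 0.35, 0.42, 0.30, 0.38],
    "C": [0.10, 0.00, 0.12, 0.08, 0.11],
    "Q": [0.41, 0.33, 0.44, 0.29, 0.37],
    "S": [0.20, 0.31, 0.18, 0.35, 0.22],
}
matrix = correlation_matrix(se_table, rows=["A", "C"], cols=["Q", "S"])
print(matrix["A"]["Q"], matrix["A"]["S"])
```

Sequences in which an amino acid is absent contribute SE = 0 to such a series, so they take part in the coefficient like any other value; for HEs, which are left blank in that case, those sequences would have to be excluded pairwise.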
Similarly, the positive correlations between the spatial distributions of amino acids F, G, H, I, L, M, N and P and the other amino acids are established in the correlation matrix in Table 6, which also shows negative correlations.

Figure 5. Plots of SEs and histograms for all the binary sequences, panels (a) to (t), two panels per amino acid.

The correlation of the spatial distribution of amino acid R with the spatial distribution of amino acid P is given in Fig. 6.

… were formed, and the respective SE plots and histograms for the 105 protein sequences are provided in Table 7. It can be observed that the Shannon entropy of amino acid conservation along the protein sequences of SARS-CoV-2 ranges from 0.7 to 0.982. Since the SE is close to 1, meaning the uncertainty is near its maximum, the amino acids are almost uniformly distributed over the protein sequences. More than 50% of the protein sequences (54), belonging to cluster 2 of SARS-CoV-2, have SE = , which further implies that the amino acids are almost uniformly spread over the sequences. The frequency analysis of the amino acids over the proteins is given in the following subsection.

In this section, the frequencies of the amino acids in the 105 SARS-CoV-2 protein sequences are statistically compared, as shown in Figs. 9 and 10. A correlation matrix between the frequency distributions of amino acids over the 105 SARS-CoV-2 protein sequences is provided in Table 8, and the respective correlation graphs are illustrated in Fig. 11. It can be observed that the correlation coefficients are very close to 1, which indicates strong correlations between the frequencies of the amino acids over the proteins.
For instance, the correlation coefficient between the frequency distributions of amino acids A (aliphatic) and K (basic) is 1, as illustrated in Fig. 12, indicating a strong correlation. Overall, it is observed that protein sequences of the same length have very similar frequency distributions of the twenty amino acids.

The SARS-CoV proteins (S1, S2, S11) with their accessions are given in Table 9. It is noted that the protein with the accession ACU31032 (S14) is a spike protein of length 1241, as mentioned in the NCBI database. The spike protein (S-protein) is a large type I transmembrane protein of length not exceeding 1400 amino acids. The spike protein has an important function in the case of SARS-CoV [58] [59]. Among all the proteins of SARS-CoV, the spike protein is the main antigenic component responsible for inducing host immune responses, neutralizing antibodies, and/or protective immunity against virus infection [60]. We therefore examine here the spatial representations of the amino acids over the spike protein together with the other 13 proteins mentioned in Table 10. The HE, SE, and frequency distributions are given in the following and compared with those of the SARS-CoV-2 proteins.

… protein S12, which is a hypothetical protein. It is noted that the HE is left blank in the cases where the spatial distribution of an amino acid is entirely a sequence of zeros, i.e., the amino acid is absent from the protein. In Table 11, we derive the correlation coefficients of the HEs of the spatial representations of the amino acids over the 14 SARS-CoV proteins. It is noted that the SE turns out to be zero in the cases where the spatial distribution corresponds to an amino acid that is absent from a protein.
The spatial distribution of amino acids over the proteins of SARS-CoV is all without much uncertainty except for three cases where the SEs are greater than 0.5 where the absence of amino acids dominates in terms of certainty. The correlation coefficients of the SEs of the spatial distributions of the amino acids over the 14 SARS-CoV proteins are given in Table 12 . It is observed that the correlations among the SEs of the spatial distributions of the amino acids over the proteins are not significantly up as tabulated in Table 12 . The highest positive correlation based on SEs of the spatial distributions of the amino acid C with that of Y is turned up as 0.572. Previous reports state that the genomes of SARS-CoV and SARS-CoV-2 exhibit similar protein sequences. However, we found that the spatial arrangement of amino acids over the studied protein sequences is certainly different, contributing to differences between proteins. This study reveals the hidden spatial arrangement of the amino acids of SARS-CoV-2 and SARS-CoV1. Specifically, the spatial arrangements of amino acids over the Authors' Contribution: SH had initiated the problem for the study, and RKR and SH executed the results from the data. SH, RKR, SS, SU, KSS, and AHG analyzed and interpreted the results. SH was a major contributor in writing the manuscript. All authors read and approved the final manuscript. 
Appendix B: Hurst Exponent of 105 number of SARS-CoV-2 binary sequences, for the amino acid A3 (F), for the amino acid A4 (G), for the amino acid A5 (H), for the amino acid A6 (I), for the amino acid A7 (L), for the amino acid A8 (M), for the amino acid A9 (N), and for the amino acid A10 (P), for the amino acid A11 (Q), (c) and (d) for the amino acid A12 (S), (e) and (f) for the amino acid A13 (T), (g) and (h) for the amino acid A14 (V), (i) and (j) for the amino acid A15 (W), (k) and (l) for the amino acid A16 (Y), (m) and (n) for the amino acid A17 (D), (o) and (p) for the amino acid A18 (E), (q) and (r) for the amino acid A19 (K), (s) and (t) for the amino acid A20 (R). QIJ96471 38 15 P158 QIJ96501 38 16 P161 QIJ96521 38 17 P184 QII87803 38 18 P192 QII87791 38 19 P201 QII87827 38 20 P212 QII87815 38 21 P223 QII57176 38 22 P233 QII57276 38 23 P243 QII57346 38 24 P252 QII57226 38 25 P264 QII57286 38 26 P269 QII57236 38 27 P285 QII57186 38 28 P308 QII57336 38 29 P311 QII57216 38 30 P325 QII57206 38 31 P333 QII57326 38 32 P342 QII57306 38 33 P353 QII57316 38 34 P362 QII57296 38 35 P364 QII57196 38 36 P376 QIA98562 38 37 P387 QII57246 38 38 P394 QII57256 38 39 P413 QII57266 38 P741 QHU36836 75 261 P753 QHU36856 75 262 P761 QHU36866 75 263 P771 QHU36826 75 264 P778 QHU36846 75 265 P795 QHU79206 75 266 P802 QHR84451 75 267 P809 QHR63252 75 268 P819 QHR63262 75 269 P829 QHR63272 75 270 P839 QHR63282 75 271 P849 QHR63292 75 272 P856 QHO62113 75 273 P865 QHQ82466 75 274 P880 QHQ71965 75 275 P890 QHQ71975 75 276 P891 QHO62108 75 277 P900 QHO62879 75 278 P912 QHN73812 75 279 P921 QHN73797 75 280 P927 QHO60596 75 281 P939 QHD43418 75 282 P16 YP_009725303 83 283 P12 YP_009725305 113 284 P23 YP_009724395 121 285 P24 YP_009724396 121 286 P31 QIK50443 121 287 P35 QIK50444 121 288 P48 QIK50453 121 289 P49 QIK50454 121 290 P54 QIK50422 121 291 P57 QIK50423 121 292 P68 QIK50432 121 293 P69 QIK50434 121 294 P73 QIK02959 121 295 P74 QIK02960 121 296 P85 QIK02969 121 297 P86 QIK02970 
121 298 P96 QIK02949 121 299 P97 QIK02950 121 300 P107 QIJ96479 121 301 P109 QIJ96478 121 302 P116 QIJ96489 121 303 P118 QIJ96488 121

Clinical features of patients infected with 2019 novel coronavirus in
A Novel Coronavirus from Patients with Pneumonia in China
Consideration on the strategies during control
On spatial molecular arrangements of SARS-CoV2 genomes of Indian patients
Spatial Distribution of Amino Acids of the SARS-CoV2 Proteins
Another Decade, Another Coronavirus
A novel coronavirus outbreak of global health concern
Genomic variance of the 2019-nCoV coronavirus
Zoonotic origins of human coronaviruses
The species Severe acute respiratory syndrome-related coronavirus: classifying 2019-nCoV and naming it SARS-CoV-2
A Genomic Perspective on the Origin and Emergence of SARS-CoV-2
The proximal origin of SARS-CoV-2
On the origin and continuing evolution of SARS-CoV-2
Database resources of the National Center for Biotechnology Information
Virus Variation Resource-improved response to emergent viral outbreaks
Research and Development on Therapeutic Agents and Vaccines for COVID-19 and Related Human Coronavirus Diseases
COVID-19, an emerging coronavirus infection: advances and prospects in designing and developing vaccines, immunotherapeutics, and therapeutics, Hum. Vaccines Immunother
Explaining machine learning based diagnosis of COVID-19 from routine blood tests with decision trees and criteria graphs
Overlapping and discrete aspects of the pathology and pathogenesis of the emerging human pathogenic coronaviruses SARS-CoV, MERS-CoV, and 2019-nCoV
A multiple combined method for rebalancing medical data with class imbalances
Protein-protein interactions of human viruses
Prediction of human-virus protein-protein interactions through a sequence embedding-based machine learning method
Structural genomics of SARS-COV-2 indicates evolutionary conserved functional regions of viral proteins
A SARS-CoV-2-human protein-protein interaction map reveals drug targets and potential drug-repurposing
Protein structure comparison: implications for the nature of "fold space
Secondary-structure matching (SSM), a new tool for fast protein structure alignment in three dimensions
Classification of Mer Proteins in a Quantitative Manner
A geometric algorithm to find small but highly similar 3D substructures in proteins
UmerSaiyed, Intelligent Classification and Analysis of Essential Genes Using Quantitative Methods
New classification of supersecondary structures of sandwich-like proteins uncovers strict patterns of strand assemblage
Hydrophobic distribution and spatial arrangement of amino acid residues in membrane proteins
Intercalating amino acid guests into montmorillonite host
EightyDVec: a method for protein sequence
Markov clustering versus affinity propagation for the partitioning of protein interaction graphs, BMC Bioinforma
Unsupervised feature selection using an improved version of Differential Evolution
The global k-means clustering algorithm
An automatic tool to analyze and cluster macromolecular conformations based on self-organizing maps
Clustering algorithms applied on analysis of protein molecular dynamics
Validating clustering of molecular dynamics simulations using polymer models, BMC Bioinforma
The variations of human miRNAs and Ising like base pairing models
Ranking and clustering of Drosophila olfactory receptors using mathematical morphology
Analysis of Purines and Pyrimidines distribution over miRNAs of Human, Gorilla, Chimpanzee, Mouse and Rat
Fractal Analysis of Time Series and Distribution Properties of Hurst Exponent
Estimation of Hurst exponent revisited
Introducing fractal dimension algorithms to calculate the Hurst exponent of financial time series
Divergence Measures Based on the Shannon Entropy
The Shannon information entropy of protein sequences
Shannon information entropy in the canonical genetic code
The SARS-CoV S glycoprotein: Expression and functional characterization
Characterization of severe acute respiratory syndrome-associated coronavirus (SARS-CoV) spike glycoprotein-mediated viral entry
The spike protein of SARS-CoV - A target for vaccine and therapeutic development
Receptor-binding domain of SARS-CoV spike protein induces highly potent neutralizing antibodies: Implication for developing subunit vaccine
Treatment of SARS with human interferons

QIK50421 61 103 P60 QIK50431 61 104 P78 QIK02958 61 105 P87 QIK02968 61 106 P98 QIK02948 61 107 P104 QIJ96477 61 108 P111 QIJ96487 61 109 P127 QIJ96507 61 110 P138 QIJ96527 61 111 P149 QIJ96467 61 112 P160 QIJ96497 61 113 P165 QIJ96517 61 114 P191 QII87786 61 115 P195 QII87798 61 116 P203 QII87822 61 117 P217 QII87810 61 118 P221 QII57172 61 119 P231 QII57272 61 120 P240 QII57342 61 121 P251 QII57222 61 122 P262 QII57282 61 123 P276 QII57232 61 124 P281 QII57182 61 125 P291 QII57302 61 126 P307 QII57332 61 127 P317 QII57212 61 128 P323 QII57202 61 129 P329 QII57322 61 130 P347 QII57312 61 131 P358 QII57292 61 132 P366 QII57192 61 133 P377 QIA98558 61 134 P389 QII57242 61 135 P391 QII57252 61 136 P405 QII57262 61 137 P417 QHS34550 61 138 P430 QIA98587 61 139 P435 QIH55225 61 140 P447 QIH45027 61 141 P457 QIH45037 61 142 P462 QIH45047 61 143 P467 QIH45057 61 144 P487 QIG55998 61 145 P498 QIE07475
61 146 P510 QIE07465 61 147 P521 QIE07455 61 148 P531 QIE07485 61 149 P542 QID98798 61 150 P552 QID21052 61 151 P559 QID21072 61 152 P566 QID21062 61 153 P578 QIC53208 61 154 P593 QIC53217 61 155 P596 QIB84677 61 156 P607 QIA98600 61 157 P616 QIA98610 61 158 P636 QIA20048 61 159 P645 QHZ87586 61 160 P653 QHZ87596 61 161 P661 QHZ00393 61 162 P665 QHZ00362 61 163 P681 QHZ00403 61 164 P690 QHZ00383 61 165 P696 QHW06053 61 166 P702 QHW06063 61 167 P722 QHW06043 61 168 P732 QHU79198 61 169 P736 QHU36838 61 170 P752 QHU36858 61 171 P758 QHU36868 61 J o u r n a l P r e -p r o o f 172 P769 QHU36828 61 173 P775 QHU36848 61 174 P791 QHU79208 61 175 P798 QHR84453 61 176 P811 QHR63254 61 177 P821 QHR63264 61 178 P831 QHR63274 61 179 P841 QHR63284 61 180 P851 QHR63294 61 181 P867 QHQ82468 61 182 P876 QHQ71967 61 183 P882 QHQ71977 61 184 P902 QHO62881 61 185 P906 QHN73814 61 186 P918 QHN73799 61 187 P928 QHO60598 61 188 P941 QHD43420 61 189 P25 YP_009724392 75 190 P37 QIK50440 75 191 P43 QIK50450 75 192 P53 QIK50419 75 193 P67 QIK50429 75 QIG56001 419 701 P500 QIE07478 419 702 P508 QIE07468 419 703 P519 QIE07458 419 704 P526 QIE07488 419 705 P536 QID98801 419 706 P550 QID21055 419 707 P562 QID21075 419 708 P564 QID21065 419 709 P582 QIC53211 419 710 P584 QIC53221 419 711 P600 QIB84680 419 712 P609 QIA98602 419 713 P618 QIA98613 419 714 P638 QIA20052 419 715 P643 QHZ87589 419 716 P651 QHZ87599 419 717 P659 QHZ00396 419 718 P668 QHZ00365 419 719 P678 QHZ00386 419 720 P679 QHZ00406 419 721 P707 QHW06066 419 722 P708 QHW06056 419 723 P719 QHW06046 419 724 P729 QHU79201 419 725 P737 QHU36841 419 726 P751 QHU36861 419 727 P757 QHU36871 419 728 P768 QHU36831 419 729 P776 QHU36851 419 730 P788 QHU79211 419 731 P796 QHR84456 419 732 P815 QHR63258 419 733 P825 QHR63268 419 734 P835 QHR63278 419 735 P845 QHR63288 419 736 P855 QHR63298 419 737 P858 QHO62115 419 738 P864 QHQ82471 419 739 P873 QHQ71970 419 740 P881 QHQ71980 419 741 P895 QHO62110 419 742 P899 QHO62884 419 743 P911 QHN73817 
QHN73802 419 745 P926 QHO60601 419 746 P944 QHD43423 419 747 P4 YP_009725300 500 748 P7 YP_009725309 527 749 P10 YP_009725308 601 750 P18 YP_009725298 638 751 P9 YP_009725307 932 752 P418 QHS34546 1272 753 P28 YP_009724390 1273 754 P38 QIK50438 1273 755 P44 QIK50448 1273 756 P58 QIK50417 1273 757 P64 QIK50427 1273 758 P71 QIK02954 1273 759 P81 QIK02964 1273 760 P95 QIK02944 1273 761 P102 QIJ96473 1273 762 P119 QIJ96483 1273 763 P126 QIJ96503 1273 764 P136 QIJ96523 1273 765 P150 QIJ96463 1273 766 P156 QIJ96493 1273 767 P170 QIJ96513 1273 768 P178 QII87794 1273 769 P181 QII87806 1273 770 P182 QII87782 1273 771 P198 QII87818 1273 772 P229 QII57268 1273 773 P246 QII57168 1273 774 P254 QII57218 1273 775 P260 QII57278 1273 776 P268 QII57338 1273 777 P274 QII57228 1273 778 P282 QII57178 1273 779 P294 QII57161 1273 780 P304 QII57328 1273 781 P314 QII57208 1273 782 P321 QII57198 1273 783 P337 QII57318 1273 784 P339 QII57298 1273 785 P345 QII57308 1273 786 P363 QII57288 1273  The spatial representation and distribution frequency of amino acids were examined. The Hurst exponent and entropy were applied to fetch the autocorrelation. The simulation results enables to distinguish between the two types of CoV. The spatial arrangement reveals the important information of structural proteins. o This manuscript has not been submitted to, nor is under review at, another journal or other publishing venue.o The authors have no affiliation with any organization with a direct or indirect financial interest in the subject matter discussed in the manuscript o The following authors have affiliations with organizations with direct or indirect financial interest in the subject matter discussed in the manuscript:Author's name